huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

DiffusionDet: Diffusion models for object detection #1350

Open 345ishaan opened 1 year ago

345ishaan commented 1 year ago

Model/Pipeline/Scheduler description

Recent work that leverages diffusion models for the object detection task: https://arxiv.org/abs/2211.09788

Add the capability to run it through an HF diffusers pipeline and, if possible, also create benchmarks or comparisons on datasets like nuScenes.

Open source status

Provide useful links for the implementation

No response

patrickvonplaten commented 1 year ago

Model weights seem to be available as well, no? https://github.com/ShoufaChen/DiffusionDet#models

vvvm23 commented 1 year ago

@patrickvonplaten mind if I give this a try? This would be my first time contributing a model, so I might need a hand occasionally.

345ishaan commented 1 year ago

> @patrickvonplaten mind if I give this a try? This would be my first time contributing a model, so I might need a hand occasionally.

Sorry I forgot to assign the issue to myself while opening but was actually planning to look into this.

vvvm23 commented 1 year ago

Go for it! 🤗

345ishaan commented 1 year ago

> Go for it! 🤗

I am happy to collaborate if you want :) I have done it in the past and given I will be only working outside office-hours, things can move faster that way.

vvvm23 commented 1 year ago

Ordinarily I would say yes, but I don't think I can dedicate any time towards it until Tuesday at the earliest. So probably best for you to make a start yourself and if you have anything you want to hand off to me, I can chip in a bit 😅

patrickvonplaten commented 1 year ago

Also more than happy to help if needed :-)

345ishaan commented 1 year ago

@patrickvonplaten I plan to run through their code this weekend in inference mode. Let me know if you have a task checklist in mind that I should follow. Happy to split up the work if needed.

patrickvonplaten commented 1 year ago

Sure, I think the following would make sense:

  1. Get the pipeline working with the original codebase
  2. Add the core unet model to diffusers: a) first make sure the weights can be loaded correctly, b) then check the forward pass
  3. Add remaining components

Also happy to guide you through a PR :-)
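
A minimal sketch of what step 2a could look like in practice, assuming the original and ported models are each summarized as a mapping from parameter name to shape. The function name `compare_state_dicts` and all parameter names below are hypothetical and only illustrate the idea; they are not taken from DiffusionDet or diffusers.

```python
def compare_state_dicts(source_shapes, target_shapes):
    """Compare two {parameter_name: shape} mappings.

    Returns three sorted lists: names found only in the source,
    names found only in the target, and shared names whose shapes differ.
    """
    source_keys = set(source_shapes)
    target_keys = set(target_shapes)
    mismatched = sorted(
        name
        for name in source_keys & target_keys
        if tuple(source_shapes[name]) != tuple(target_shapes[name])
    )
    return {
        "unexpected": sorted(source_keys - target_keys),
        "missing": sorted(target_keys - source_keys),
        "shape_mismatch": mismatched,
    }


# Hypothetical parameter names and shapes, for illustration only.
original = {
    "head.weight": (256, 4),
    "head.bias": (256,),
    "time_mlp.weight": (1024, 256),
}
ported = {
    "head.weight": (256, 4),
    "head.bias": (4,),
    "time_embed.weight": (1024, 256),
}

report = compare_state_dicts(original, ported)
for category, names in report.items():
    print(category, names)
```

With PyTorch modules, such a mapping can be built with `{k: tuple(v.shape) for k, v in model.state_dict().items()}`. Once all three lists are empty, step 2b would be to feed the same input to both models and compare the outputs.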

345ishaan commented 1 year ago

I am able to run the original codebase here: https://colab.research.google.com/drive/1rA5SXuTx2pI6o7tWA6Ad5QRZn4a1ajMh#scrollTo=Sn5gWF3fhpf-

345ishaan commented 1 year ago

@patrickvonplaten This work was tried with two types of encoder: CNN-based (ResNet-style) and Transformer-based (Swin Transformer). Do you prefer the Transformer-based one? Also, is there an HF implementation of Swin Transformer that I should refer to?

patrickvonplaten commented 1 year ago

Think leveraging an existing transformers implementation could make a lot of sense here (also cc @ShoufaChen as the author of diffusionDet :-) )

And maybe @NielsRogge FYI

NielsRogge commented 1 year ago

Yes we do have Swin implemented here -> https://github.com/huggingface/transformers/blob/main/src/transformers/models/swin/modeling_swin.py. So you can do from transformers import SwinModel

ShoufaChen commented 1 year ago

Hello everyone,

Thanks for your efforts in integrating DiffusionDet into the awesome diffusers.

We provide the Swin-Base model here: https://github.com/ShoufaChen/DiffusionDet/releases/download/v0.1/diffdet_coco_swinbase.pth

ShoufaChen commented 1 year ago

Hi, @345ishaan ,

May I ask about your progress on this integration?

I am glad to offer help.

345ishaan commented 1 year ago

@ShoufaChen Sorry for the delay here. I have been very busy at work because of EOY launches (hopefully over by tomorrow). Last weekend, I was able to run your demo successfully in a standalone colab. This Fri-Sun, I am planning to do the following:

1) Create a diffusion pipeline for DiffusionDet.
2) Preload weights from the SwinTF encoder model into the one under huggingface.
3) Read and understand the detection decoder and try integrating it.

Happy to work alongside you here, as I guess we can do it much faster with you involved. Please let me know what suits you. I am definitely interested in pushing it to the finish line.
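
Preloading weights from one implementation into another usually comes down to renaming checkpoint keys. The following is a rough sketch of that idea, assuming a simple prefix-based mapping is enough. The helper `rename_keys` and every key and prefix shown are hypothetical placeholders, not the real names used by the DiffusionDet checkpoint or by the transformers Swin model.

```python
def rename_keys(state_dict, prefix_map):
    """Return a copy of state_dict with key prefixes rewritten.

    prefix_map maps an old prefix to its replacement; the first matching
    prefix wins. Keys that match no prefix are copied unchanged.
    """
    renamed = {}
    for old_name, value in state_dict.items():
        new_name = old_name
        for old_prefix, new_prefix in prefix_map.items():
            if old_name.startswith(old_prefix):
                new_name = new_prefix + old_name[len(old_prefix):]
                break
        if new_name in renamed:
            raise ValueError(f"two keys map to the same name: {new_name}")
        renamed[new_name] = value
    return renamed


# Hypothetical checkpoint keys; the string values stand in for tensors.
checkpoint = {
    "backbone.patch_embed.proj.weight": "w0",
    "backbone.layers.0.blocks.0.norm1.weight": "w1",
    "head.cls.weight": "w2",
}
# Hypothetical prefix mapping, for illustration only.
prefix_map = {
    "backbone.patch_embed.": "embeddings.patch_embeddings.",
    "backbone.layers.": "encoder.layers.",
}

converted = rename_keys(checkpoint, prefix_map)
for name in converted:
    print(name)
```

In a real conversion the mapping would be derived by listing the keys of both state dicts side by side, and some parameters may need more than a rename.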

ShoufaChen commented 1 year ago

Hi, @345ishaan ,

You can leave the most challenging part to me since I think I am more familiar with DiffusionDet (as the author of this work).

345ishaan commented 1 year ago

> Yes we do have Swin implemented here -> https://github.com/huggingface/transformers/blob/main/src/transformers/models/swin/modeling_swin.py. So you can do from transformers import SwinModel

@patrickvonplaten @NielsRogge I am guessing transformers and diffusers are maintained as separate libraries, so I branched the SwinTransformer implementation linked above into src/diffusers/models. Any suggestions on whether I should avoid that?

patrickvonplaten commented 1 year ago

Hey @345ishaan,

We're trying to leverage transformers as much as possible. So for the image encoding which is based on a SwinModel, please don't add the code to src/diffusers/models instead:

We only put actual diffusion models into src/diffusers - i.e. models that are called over and over again in the denoising process.

Does this make sense?
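
To illustrate the split described above, here is a toy sketch in which the image encoder is called once and the denoising component is called at every step. Both classes are hypothetical stand-ins that operate on plain Python numbers; they are not the DiffusionDet, transformers, or diffusers APIs, and the update rule is invented purely so the example runs.

```python
import random


class CountingEncoder:
    """Stand-in for an image backbone that would come from transformers."""

    def __init__(self):
        self.calls = 0

    def __call__(self, image):
        self.calls += 1
        # Toy "feature": the mean pixel value.
        return sum(image) / len(image)


class CountingDenoiser:
    """Stand-in for the diffusion model that would live in diffusers."""

    def __init__(self):
        self.calls = 0

    def __call__(self, boxes, features, t):
        self.calls += 1
        # Toy update: move each noisy value toward the feature.
        return [b + (features - b) / (t + 1) for b in boxes]


def run_pipeline(image, encoder, denoiser, num_steps=4, num_boxes=3, seed=0):
    rng = random.Random(seed)
    features = encoder(image)  # encoder runs exactly once
    boxes = [rng.gauss(0.0, 1.0) for _ in range(num_boxes)]  # start from noise
    for t in reversed(range(num_steps)):
        boxes = denoiser(boxes, features, t)  # denoiser runs every step
    return boxes


encoder = CountingEncoder()
denoiser = CountingDenoiser()
boxes = run_pipeline([0.2, 0.4, 0.6], encoder, denoiser, num_steps=4)
print("encoder calls:", encoder.calls)
print("denoiser calls:", denoiser.calls)
```

Under this reading, only the component inside the loop would belong in src/diffusers, while the component called once before the loop would be imported from transformers.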

345ishaan commented 1 year ago

> Hey @345ishaan,
>
> We're trying to leverage transformers as much as possible. So for the image encoding which is based on a SwinModel, please don't add the code to src/diffusers/models instead:
>
> We only put actual diffusion models into src/diffusers - i.e. models that are called over and over again in the denoising process.
>
> Does this make sense?

Thanks for the explanation @patrickvonplaten, will follow what you suggested.