Adding DETA to 🤗 Transformers

NielsRogge commented 1 year ago

Hi DETA authors,

As this work is very nice and it builds upon DETR and Deformable DETR, both of which are available in 🤗 Transformers, it was relatively straightforward to implement DETA as well (as the only difference is a tweak in the loss function + postprocessing).

Here's a notebook that illustrates inference with DETA models: https://colab.research.google.com/drive/1epI4ejrD0dbrSR9vRRhEPE7duoALqIk9?usp=sharing.

Now I'd also like to make a fine-tuning tutorial for people, illustrating how to fine-tune DETA on a custom dataset. For that I'm using my original DETR fine-tuning tutorial and tweaking it for DETA. However, I have a question: I'm fine-tuning on the "balloon" dataset, which consists of only 1 class (balloon). During inference, I'm getting an error stating that "topk is out of range". This is because of this line, which seems to select the top 10,000 scores. When you're fine-tuning on a single class, the number of queries × number of classes = 300 × 1 = 300. Hence this is smaller than 10,000 => so was wondering what the recommendation here is when fine-tuning on a dataset with only a single class (or more generally, for any custom dataset).

Also, I'm currently hosting the DETA checkpoints on my personal username on HuggingFace:

It would be cool if you could create an organization on the 🤗 Hub and host the checkpoints there (or under your own personal username if you prefer so). This way, you can also write model cards (READMEs) for those repositories etc. It seems there's already an org for the UT-data-bootcamp, but not sure we should host the checkpoints there.

Let me know what you think!

Open-sourcely yours,

Niels ML Engineer @ HF

jozhang97 commented 1 year ago

Hi Niels,

Thanks for this effort, I'm delighted to hear about it! Also glad to hear that it was relatively straightforward to implement.

> Hence this is smaller than 10,000 => so was wondering what the recommendation here is when fine-tuning on a dataset with only a single class (or more generally, for any custom dataset).

This pre-NMS top-k is important because it reduces the number of predictions fed into NMS. If the number of predictions is too high, NMS becomes too slow. In your case of 300 predictions, NMS runs fast enough, so this top-k step is not needed and can be removed. In general, for a custom dataset, it's fine to run something like score.topk(min(10000, len(score))).
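A minimal sketch of the clamping idea above, written in plain Python rather than PyTorch so it runs without dependencies. The helper name `clamped_topk` and the toy scores are made up for illustration; with a real score tensor the `score.topk(min(10000, len(score)))` call quoted above plays the same role.

```python
import heapq


def clamped_topk(scores, k=10000):
    """Return (values, indices) of the k highest scores.

    k is clamped to the number of available scores, so asking for the
    top 10,000 of 300 predictions returns all 300 instead of failing.
    """
    k = min(k, len(scores))  # the clamp suggested in the comment above
    top = heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])
    indices = [i for i, _ in top]
    values = [s for _, s in top]
    return values, indices


# 300 queries x 1 class = 300 flattened scores, fewer than the default 10,000.
num_queries, num_classes = 300, 1
scores = [((i * 37) % 101) / 101 for i in range(num_queries * num_classes)]

values, indices = clamped_topk(scores)
print(len(values))  # 300
```

With many classes the flattened score list can exceed 10,000 entries, in which case the clamp has no effect and the original behavior is kept.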

> It would be cool if you could create an organization on the 🤗 Hub and host the checkpoints there.

Cool! I've uploaded our models to three model repos: https://huggingface.co/jozhang97/deta-resnet-50, https://huggingface.co/jozhang97/deta-swin-l, and https://huggingface.co/jozhang97/deta-swin-l-o365. Is this what you mean?

Let me know if there is anything else I can do to help.

NielsRogge commented 1 year ago

Cool, I tried it by fine-tuning DETA-resnet-50 with the exact same training hyperparameters as my DETR tutorial (300 steps with a learning rate of 1e-4 for the Transformer and 1e-5 for the backbone, weight decay of 1e-4 and gradient clipping of 0.1). This is giving me the following result on one of the validation images:

So this might still need some tweaking (the results after fine-tuning DETR looked a lot better, see the bottom of this notebook). It might have to do with postprocessing settings or training settings.

However it seems basic training works, so I'll first integrate the model in the library, and you can then perhaps go over my fine-tuning tutorial and see if some things can be improved, if you're up for that of course.
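The hyperparameters mentioned above (1e-4 for the Transformer, 1e-5 for the backbone, weight decay 1e-4, gradient clipping 0.1) can be sketched as two parameter groups. This is a rough illustration only: the helper `build_param_groups`, the toy parameter names, and the rule "a parameter belongs to the backbone if its name contains `backbone`" are assumptions, not taken from the tutorial.

```python
def build_param_groups(named_params, lr=1e-4, lr_backbone=1e-5):
    """Split (name, parameter) pairs into two groups with separate learning rates."""
    backbone, rest = [], []
    for name, param in named_params:
        # Assumption: backbone parameters carry "backbone" in their name.
        if "backbone" in name:
            backbone.append(param)
        else:
            rest.append(param)
    return [
        {"params": rest, "lr": lr},
        {"params": backbone, "lr": lr_backbone},
    ]


# Stand-in names and values; with a real model this would be model.named_parameters().
toy_params = [
    ("model.backbone.conv1.weight", "w0"),
    ("model.encoder.layers.0.weight", "w1"),
    ("class_embed.0.bias", "b0"),
]

groups = build_param_groups(toy_params)
WEIGHT_DECAY = 1e-4   # weight decay from the comment above
MAX_GRAD_NORM = 0.1   # gradient clipping value from the comment above

# Assuming AdamW and norm-based clipping, the groups could be used in PyTorch as:
#   optimizer = torch.optim.AdamW(groups, weight_decay=WEIGHT_DECAY)
#   torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
for group in groups:
    print(len(group["params"]), group["lr"])
```

Keeping the backbone on a smaller learning rate is a common choice when it starts from pretrained weights, since it needs less adjustment than the detection head.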

jozhang97 commented 1 year ago

Ahh yes, I reckon it's due to the NMS postprocessing. NMS is not ideal in crowded scenes (see SoftNMS paper).

Hopefully it can be quickly fixed by tweaking the NMS box threshold at this line. If not, perhaps NMS variants, like SoftNMS, would do better. I'd be happy to take a stab at this.
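To make the effect of the NMS box threshold concrete, here is a small sketch of greedy NMS in plain Python. It is not the DETA post-processing code; the boxes, scores, and thresholds are invented for illustration, and boxes are assumed to be in (x1, y1, x2, y2) format.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold):
    """Greedy NMS: visit boxes by descending score and keep a box only if
    its IoU with every already-kept box is at most iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping boxes (IoU = 2/3) and one separate box.
boxes = [(0, 0, 10, 10), (2, 0, 12, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]

print(nms(boxes, scores, iou_threshold=0.7))  # overlap is below the threshold, both boxes survive
print(nms(boxes, scores, iou_threshold=0.5))  # overlap exceeds the threshold, the lower-scoring box is dropped
```

A lower threshold removes more duplicates but can also suppress genuinely distinct objects that sit close together, which is the crowded-scene weakness mentioned above.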

xingyizhou commented 1 year ago

Hi @NielsRogge, thanks for making the tutorial! A quick note on the demo results: DETA uses sigmoid activation for classification (see here), while the DETR visualization code above seems to use softmax. Please make sure this is changed accordingly.

Best, Xingyi
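The difference between the two activations can be seen on a single query's class logits. The sketch below is plain Python with made-up logits and is not taken from either codebase: sigmoid scores each class independently, while softmax makes the classes compete so the scores sum to 1.

```python
import math


def sigmoid_scores(logits):
    """Independent per-class probabilities, one sigmoid per logit."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def softmax_scores(logits):
    """Probabilities that compete across classes and sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, -3.0]  # hypothetical logits for one query over three classes
sig = sigmoid_scores(logits)
soft = softmax_scores(logits)

print([round(p, 3) for p in sig])
print([round(p, 3) for p in soft])
```

Applying softmax to a model trained with sigmoid targets changes the confidence values, so a score threshold tuned for one activation will not behave the same way with the other.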

NielsRogge commented 1 year ago

Hi @xingyizhou,

Yes I'm aware of that (the author of Conditional DETR said it to me when porting Conditional DETR to 🤗 Transformers 😄 ). The post_process_object_detection method in the notebook above is using sigmoid as can be seen here. I've removed the softmax part from the notebook since it wasn't used anyway.

NielsRogge commented 1 year ago

Hi @xingyizhou @jozhang97,

I'd like to merge DETA into HuggingFace Transformers, just wondering at which organization we can host the checkpoints. I see in the thread above that you uploaded the original checkpoints to your personal account, but was wondering where you'd like the HF-compatible DETA checkpoints to be hosted. Typically, they are hosted as part of an organization (like University of Texas).

Kindly let me know!

Kind regards,

Niels

jozhang97 commented 1 year ago

Hi @NielsRogge, let's use my personal one. I noticed you sent a PR today. Is there anything I need to do for that?

NielsRogge commented 1 year ago

Hi @jozhang97 yes I've uploaded the HF models to your personal account.

Maybe as a next step, would it be possible to look into fine-tuning of DETA on a custom dataset?

The relevant docs are here: https://huggingface.co/docs/transformers/tasks/object_detection (one would need to replace DetrForObjectDetection with DetaForObjectDetection, and swap the image processor as well).

NielsRogge commented 1 year ago

Hi,

We've created a demo for DETA: https://huggingface.co/spaces/hysts/DETA. However, the model often has low confidence even for seemingly easy examples, and the results don't look that impressive despite the 63 AP on COCO (YOLO models, for instance, would recognize all objects in the demo images above). Could you take a look?

jozhang97 commented 1 year ago

Hi,

Is this what you get from the demo for the Swin model? Judging from the turquoise knives, it does not look like NMS is run. Presumably, it uses sigmoid instead of softmax?

Could you share where I could run the YOLO model?

Not sure if I have the bandwidth to train DETA on a custom dataset. Sorry.

NielsRogge commented 1 year ago

Hi,

Both sigmoid and NMS are used. The app uses the post_process_object_detection method as seen here which is based on your NMSPostProcess method. It's an exact copy as can be seen in the code:

sigmoid is applied here
NMS is applied here
