ttengwang / PDVC

End-to-End Dense Video Captioning with Parallel Decoding (ICCV 2021)
MIT License

A question about object detection #31

Closed qt2139 closed 2 years ago

qt2139 commented 2 years ago

Thank you so much for this wonderful project. When I ran your code on my own validation videos, I ran into some problems. For example, in one video a cat runs out of a Christmas gift box, but the prediction is "a woman runs out of the Christmas gift box". Another video of mine shows some sheep walking, and the prediction is that some horses are walking. So the model seems to recognize the action but not the type of object. I think this may be a limitation of ActivityNet, because the animal category in the dataset only contains dogs and horses. Could you please provide pre-trained weights obtained after pre-training on ImageNet-22K? I think this could be really effective for the model when it comes to object detection. Finally, thank you for your contribution.

ttengwang commented 2 years ago

Sorry, I have not had enough time to extract the detection features with IN22K pre-trained models. I'd be happy to retrain the PDVC model if you could provide your extracted video features for ActivityNet Captions :)

qt2139 commented 2 years ago

Thank you for your reply. I only tried a few random videos to see how the model performs, so I don't have extracted video features. Because the model does not cover many categories (for example, the animal category only contains dogs and horses), object detection may go wrong when a video contains categories that are not in ActivityNet. But I noticed that the model works really well when ActivityNet's categories (e.g. dog and horse) are present in the video. To be honest, pre-training on ImageNet-22K is a very time-consuming process. But your model is really awesome. Thank you very much for your work. Finally, if you have time someday, you could try ImageNet-22K pre-training, which I believe would work better for the model.
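One rough way to check the guess about missing categories is to count how often a given noun appears in the training captions. The snippet below is only a sketch: it assumes the captions are available as a dict mapping a video id to an entry with a "sentences" list, and the helper name `count_nouns`, the toy data, and the crude plural handling are all made up for illustration, not part of PDVC.

```python
import re
from collections import Counter


def count_nouns(captions, nouns):
    """Count occurrences of each noun of interest across all caption sentences.

    `captions` is assumed to map video id -> {"sentences": [str, ...]}.
    Plurals are handled crudely by also matching a token with a trailing "s".
    """
    targets = {n.lower() for n in nouns}
    counts = Counter()
    for entry in captions.values():
        for sentence in entry.get("sentences", []):
            for token in re.findall(r"[a-z]+", sentence.lower()):
                if token in targets:
                    counts[token] += 1
                elif token.endswith("s") and token[:-1] in targets:
                    counts[token[:-1]] += 1
    # Report every requested noun, including those that never appear.
    return {n: counts[n] for n in sorted(targets)}


# Toy stand-in for a real annotation file (hypothetical ids and sentences).
toy_captions = {
    "v_example1": {"sentences": ["A dog runs across the yard.",
                                 "Two dogs chase a ball."]},
    "v_example2": {"sentences": ["A woman rides a horse."]},
}

print(count_nouns(toy_captions, ["dog", "horse", "cat", "sheep"]))
```

A noun with a count of zero (or close to zero) in the real training captions would be one the captioning model has had little chance to learn, whatever visual features it is given.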

ttengwang commented 2 years ago

Agreed! Thanks for your insightful comments; I will consider this as future work.