ttengwang / PDVC

End-to-End Dense Video Captioning with Parallel Decoding (ICCV 2021)
MIT License

Comparison with Base Transformer on YouCook2 #18

Closed by anilbatra2185 2 years ago

anilbatra2185 commented 2 years ago

Hi @ttengwang

Thank you for sharing the code.

I am wondering whether you trained the base Transformer + LSTM on the YouCook2 dataset, i.e., a setting similar to Rows 1 and 2 in Table 7(a).

I am also wondering whether the current code supports training the base Transformer.

Thanks

ttengwang commented 2 years ago

Sorry, I never tried this setting, and this code does not support the base Transformer. In my early experiments, I tried the original DETR on ActivityNet Captions but found that the predicted captions were almost the same. I then moved to Deformable DETR, which has a prior that constrains the distribution of attention weights.
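To illustrate the point about the attention prior, here is a minimal sketch in plain Python. It is not taken from the PDVC code, and it is only one possible reading of the remark above. It assumes 1-D toy features, flat (untrained) attention logits, and nearest-neighbour sampling; the function names `dense_attention` and `local_attention` are hypothetical. With flat logits, dense attention gives every query the same output, while sampling a few points around a per-query reference point, as Deformable DETR does in spirit, already separates the queries.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dense_attention(logits, feats):
    # Standard attention: the query can attend to all T positions.
    w = softmax(logits)
    return sum(wi * f for wi, f in zip(w, feats))

def local_attention(ref, offsets, logits, feats):
    # Simplified deformable-style attention (an assumption, not the real operator):
    # the query only sees K positions near its reference point `ref` in [0, 1].
    T = len(feats)
    w = softmax(logits)
    out = 0.0
    for wi, off in zip(w, offsets):
        idx = min(T - 1, max(0, round(ref * (T - 1) + off)))
        out += wi * feats[idx]
    return out

feats = [float(t) for t in range(100)]  # toy 1-D "frame features"
flat = [0.0] * 100                      # flat logits, as in an untrained model

# Two queries with flat logits get the same dense-attention output.
dense_a = dense_attention(flat, feats)
dense_b = dense_attention(flat, feats)

# Different reference points give different local outputs, even with flat logits.
offsets = [-2, -1, 1, 2]
local_a = local_attention(0.2, offsets, [0.0] * 4, feats)
local_b = local_attention(0.8, offsets, [0.0] * 4, feats)

print(dense_a, dense_b, local_a, local_b)
```

In this toy setup the locality comes entirely from the reference points, which is the kind of built-in constraint that the original DETR's global attention lacks.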

anilbatra2185 commented 2 years ago

Thanks @ttengwang for confirming. I am trying to train this setting, but the model is not training properly. I am wondering whether there is any important trick for training a simple DETR-style model. I would appreciate any thoughts or suggestions.

Thanks