v-iashin / MDVC

PyTorch implementation of Multi-modal Dense Video Captioning (CVPR 2020 Workshops)
https://v-iashin.github.io/mdvc

Order of training the captioning module vs. the proposal module (and whether training is end-to-end?) #2

Closed amanchadha closed 4 years ago

amanchadha commented 4 years ago

Hi Vladimir,

First, thanks for the great codebase - everything is neatly organized in the source files - a nice departure from what AI codebases released with papers usually look like :)

Some questions:

> **Train and Predict**
> Run the training and prediction script. It will, first, train the captioning model and, then, evaluate the predictions of the best model in the learned proposal setting.

From your comments on training, it is clear that the captioning module is trained first (on GT proposals?). However, it is not very clear when the proposal module is trained. Is the training end-to-end as in Zhou et al. [59], where both modules are trained in unison (so that the captioning module can influence the event proposal mechanism)? Could you explain this sequence clearly (maybe, for everyone's sake, by updating the README)? Thanks!

v-iashin commented 4 years ago

Hi! Thanks for the positive words and feedback.

Regarding your questions: no, training is not end-to-end as in Zhou et al. The captioning module is trained on ground-truth (GT) proposals, and the proposal module is not trained in this codebase at all; at evaluation time, the model is fed proposals produced by the pre-trained model of Wang et al. (2018).

Therefore, the sequence of training (a sketch of this flow follows the list below):

  1. Training the captioning module on GT proposals (it will evaluate the model on GT automatically)
  2. Evaluating the captioning module with the proposals from Wang et al. (2018)
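
For completeness, here is a minimal, self-contained sketch of that two-stage flow. Every name in it (`run_two_stage`, `train_epoch_on_gt`, `score_on_gt`, `score_on_learned`) is a hypothetical placeholder for illustration, not part of the MDVC codebase:

```python
import copy

import torch.nn as nn


def run_two_stage(captioner: nn.Module,
                  train_epoch_on_gt,    # callable: one training pass over GT-proposal batches
                  score_on_gt,          # callable: validation metric on GT proposals (e.g. METEOR)
                  score_on_learned,     # callable: the same metric, but with learned proposals
                  num_epochs: int = 30) -> float:
    """Stage 1: train the captioner on GT proposals, keeping the best
    checkpoint according to the GT-validation score. Stage 2: evaluate
    that checkpoint once in the learned-proposal setting. The proposal
    module itself is never trained here: its outputs come from a
    separate, pre-trained model (a stand-in for Wang et al. 2018)."""
    best_score = float('-inf')
    best_state = copy.deepcopy(captioner.state_dict())

    # Stage 1: training and model selection, both on GT proposals only.
    for _ in range(num_epochs):
        train_epoch_on_gt(captioner)
        score = score_on_gt(captioner)
        if score > best_score:
            best_score = score
            best_state = copy.deepcopy(captioner.state_dict())

    # Stage 2: evaluation only -- no gradients, no proposal training.
    captioner.load_state_dict(best_state)
    return score_on_learned(captioner)
```

The design point to take away: no gradient ever flows from the captioning module back into the proposal generator, which is exactly what distinguishes this setup from the end-to-end training of Zhou et al.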
amanchadha commented 4 years ago

Thank you - you've been very helpful!