v-iashin / MDVC

PyTorch implementation of Multi-modal Dense Video Captioning (CVPR 2020 Workshops)
https://v-iashin.github.io/mdvc

Requesting tensorboard log file for best model #12

Closed · VP-0822 closed this issue 4 years ago

VP-0822 commented 4 years ago

Hi,

First of all, great work on the codebase and the paper; the model architecture is explained very well. I am working on improvements using your work as a baseline, training on the dataset myself with a little refactoring of the code to fit my needs. Could you please share the tensorboard log file for the best_model.pt you have already provided in the repo? I am particularly interested in the validation-set results epoch by epoch. Currently, after a certain number of epochs, the prediction for 'videos_to_monitor' comes out as '', which is interesting, and I would like to see how training progressed for your best model.

VP-0822 commented 4 years ago

Here is my file after 16 epochs, events.out.tfevents.1592425758.04a1b10b6b08.125.zip

v-iashin commented 4 years ago

Hi. I am glad you liked the paper and the source code 🙂.

I inspected your tb (thanks for it, btw) and found it to be quite different from mine. I am afraid something went wrong along the way. In my case, training also starts with the model repeating itself (sometimes like yours), but after a couple of epochs the captions become more appealing. Your curves look promising, though!

I would inspect what your decoder (and generator) are doing. Also, check your loss design, as your model seems to receive a weird response for its predictions. Or maybe even the attention spans (if you are using them, of course). Another shot in the dark would be to check that your special tokens are encoded into the same integers (pad=1, start=2, end=3), as you may be masking out something other than the padding; you mentioned the unk token. A quick sanity check could look like the sketch below.
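Something along these lines (just a sketch; `caption_field` stands for your CAPTION_FIELD after build_vocab has been called, the naming is an assumption):

```python
# Sketch: after build_vocab, check which integers the special tokens get.
# `caption_field` is a placeholder for your torchtext CAPTION_FIELD object.
vocab = caption_field.vocab
for name, tok in [('pad', caption_field.pad_token),
                  ('start', caption_field.init_token),
                  ('end', caption_field.eos_token),
                  ('unk', caption_field.unk_token)]:
    print(name, vocab.stoi[tok])  # expect pad=1, start=2, end=3 (unk is often 0)
```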

Here is the tb generated during training the best model: events.out.tfevents.1573036460.3x2080-12432.38798.0.zip

_Tiny hint: in case you are wondering how to display the text summary for each epoch, try passing --samples_per_plugin=text=200 when starting tensorboard._
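For example (assuming your logs live under ./log):

```
tensorboard --logdir ./log --samples_per_plugin=text=200
```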

VP-0822 commented 4 years ago

Thanks for the inspection. After seeing the file you provided, I believe I have messed something up and need to debug it. If I understood the statement 'check out your loss design as your model seems to receive a weird response for its predictions' correctly, is it about how the graphs look and how the loss behaves weirdly at certain steps?

image

v-iashin commented 4 years ago

My pleasure!

The problem is likely that you are writing several tb summaries into one file. Hence, you see your curves break and restart at the 0th epoch while still being connected to the curves from the previous experiment.
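One way to avoid this is to give every run its own log directory (a minimal sketch; the timestamped path is just an example, and the same idea works with tensorboardX):

```python
import os
from datetime import datetime
from torch.utils.tensorboard import SummaryWriter

# One directory per experiment, so event files from different runs
# are never appended to one another.
log_dir = os.path.join('./log', datetime.now().strftime('%y%m%d%H%M%S'))
writer = SummaryWriter(log_dir=log_dir)
writer.add_scalar('val/loss', 0.0, global_step=0)  # example write
writer.close()
```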

What I meant by that statement is that your model converges to a state where predicting nothing is better than predicting anything at all, while your loss is still decreasing. It seems the loss might be receiving different ground truth than expected (always the end-token index, for example).

Also, check the variables in `next_word = preds[:, -1].max(dim=-1)[1].unsqueeze(1)` in `greedy_decoder` to see what the softmax (log_softmax) returns.
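For context, here is a stripped-down sketch of what that line does (not the repo's exact `greedy_decoder`, just the greedy step in isolation):

```python
import torch

def greedy_step(preds):
    # preds: (B, T, vocab_size) log-probabilities from the generator.
    # Take the distribution over the last time step and pick its argmax.
    next_word = preds[:, -1].max(dim=-1)[1]  # argmax over vocab -> (B,)
    return next_word.unsqueeze(1)            # (B, 1), appended to the input

# If this is always the end- or pad-token index, the model has collapsed
# to predicting "nothing"; inspect preds[:, -1] to see why.
preds = torch.log_softmax(torch.randn(2, 5, 10), dim=-1)
print(greedy_step(preds))
```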

VP-0822 commented 4 years ago

This is how ReversibleField for the caption returns the data that is input to the decoder:

image

Shouldn't the end_token (3) be the last element in the tensor? Because in your code, when you do `caption_idx, caption_idx_y = caption_idx[:, :-1], caption_idx[:, 1:]`, you are trying to remove the end_token from the caption to create the input tokens for the decoder, and removing the start_token to prepare the caption for calculating the loss.

Note: my caption field is

    self.CAPTION_FIELD = data.ReversibleField(
        tokenize='spacy', init_token=self.start_token,
        eos_token=self.end_token, pad_token=self.pad_token,
        lower=True, batch_first=True, is_target=True,
        unk_token=UNKNOWN_TOKEN)

v-iashin commented 4 years ago

> Shouldn't the end_token (3) be the last element in the tensor?

Well, it should. However, it is common to pad (with 1) up to the largest length in the batch (the 7th row in your case, if you are printing caption_idx[:, :-1]; please verify) and to mask the padding out in the attention and in the loss.
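In code, the masking typically looks something like this (a sketch, with pad index 1 as in this thread):

```python
import torch

pad_idx = 1

# Mask that is True at real tokens and False at padding; it is used both
# to block attention over padded positions and to exclude them from the loss.
caption_idx = torch.tensor([[2, 4, 19, 559, 12, 4, 131, 3, 1, 1]])
mask = (caption_idx != pad_idx)   # (B, S)
attn_mask = mask.unsqueeze(-2)    # (B, 1, S), broadcasts over query positions
print(mask)
```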

> You are trying to remove the end_token from the caption to create the input tokens for the decoder, and removing the start_token to prepare the caption for calculating the loss.

Ok, let me clarify this point a bit. Let's consider a sequence of tokens in the batch:

Ground Truth Sequence:   2   4  19 559  12   4 131   3   1   1   1

Then we need to construct the input sequence of previous caption words (`caption_idx`) and the ground-truth sequence of next words that the decoder will try to predict (`caption_idx_y`):

Ground Truth Sequence:   2   4  19 559  12   4 131   3   1   1   1
caption_idx:             2   4  19 559  12   4 131   3   1   1
caption_idx_y:           4  19 559  12   4 131   3   1   1   1 (caption_idx shifted left)

Then, given `caption_idx`, the decoder will generate a distribution for the next word at each position (p*):

pred:                   p1  p2  p3  p4  p5  p6 ...

Therefore, cross-entropy will compare the predicted distributions (p*) with the one-hot encoding (OHE) of each token from caption_idx_y.

# j-th caption
loss_1,j(p1, OHE(4)); loss_2,j(p2, OHE(19)); loss_3,j(p3, OHE(559)); ...
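Put together, a minimal version of that loss computation could look like this (a sketch using plain cross-entropy with the pad index ignored; the actual model may use a smoothed variant):

```python
import torch
import torch.nn.functional as F

pad_idx = 1

def caption_loss(pred, caption_idx_y):
    # pred: (B, S, V) unnormalised scores; caption_idx_y: (B, S) targets.
    # Flatten batch and time so each position is one classification example;
    # ignore_index makes padded positions contribute nothing to the loss.
    B, S, V = pred.shape
    return F.cross_entropy(pred.reshape(B * S, V),
                           caption_idx_y.reshape(B * S),
                           ignore_index=pad_idx)

pred = torch.randn(1, 10, 600)
caption_idx_y = torch.tensor([[4, 19, 559, 12, 4, 131, 3, 1, 1, 1]])
print(caption_loss(pred, caption_idx_y))
```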

> my caption field

Yep, it looks the same except for the unk argument. Hopefully, not much has changed between torchtext versions.


If you are ok with the provided tensorboard log, please close this issue, and open a separate one if you have other questions.

VP-0822 commented 4 years ago

Thank you so much for the detailed clarification.