salesforce / densecap


Mask loss #28

Open john2019-warwick opened 4 years ago

john2019-warwick commented 4 years ago

Hi, Luowei, thanks for your answer about the sampling. I have trained a model without the mask, and now I also want to train one with the mask. Should I just set mask_weight=1.0, or are there other parameters that should be changed? I tried mask_weight=1.0 and got the error below; could you explain a little about it? Thank you!

(Screenshot: error traceback, 2019-11-09)

LuoweiZhou commented 4 years ago

@john2019-warwick First, make sure you are using python3. Are you using multi-GPU training (distributed data parallel)?

john2019-warwick commented 4 years ago

Yes, I am using python3. Here are the parameters I set: (screenshots: train_para1, train_para2). Since world_size=1, no distributed training is involved; I only changed mask_weight from 0 to 1. By the way, about the parameter d_model: why does its help annotation say 'size of the rnn in number of hidden nodes in each layer'? Based on my understanding, you use an attention model (Transformer) rather than an RNN, right?

LuoweiZhou commented 4 years ago

It seems that even though you set world_size=1, your script can still see all 8 GPUs. Can you try adding CUDA_VISIBLE_DEVICES=0? d_model indicates the Transformer hidden size (the code used to work with the baseline LSTM-based models and we forgot to revise the help text; feel free to submit a PR).
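For example, a minimal sketch of what I mean (the variable must be set before torch initializes CUDA; equivalently, prefix the launch command with it):

```python
import os

# Pin the process to GPU 0; equivalent to prefixing the launch command
# with CUDA_VISIBLE_DEVICES=0. Must be set before torch touches CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch

print(torch.cuda.device_count())  # should now report 1 instead of 8
```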

john2019-warwick commented 4 years ago

I tried putting CUDA_VISIBLE_DEVICES=0 ahead of my train.py command, but it still doesn't work. (Screenshot: error traceback, 2019-11-15) By the way, I ran test.py and found a problem here: https://github.com/salesforce/densecap/blob/45b20bb0860b0c5a5fc4878284c1ab0e6892a68f/scripts/test.py#L216. You wrote 'python2', but the requirements and the usage notes for this evaluation script say python3 (or python). Which version should I use?

LuoweiZhou commented 4 years ago

@john2019-warwick Sorry for the delayed reply; I was traveling. The evaluation script requires python2 in particular. Is the problem you mentioned above resolved?
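For anyone hitting the same confusion, a small hypothetical check (not code from the repo) of what this implies:

```python
import shutil
import subprocess

# Hypothetical sanity check: scripts/test.py shells out to the evaluator
# with an explicit "python2", so that interpreter must exist on PATH even
# though training itself runs under python3.
if shutil.which("python2") is None:
    print("python2 not found; the evaluation subprocess in test.py will fail")
else:
    subprocess.check_call(["python2", "--version"])
```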

john2019-warwick commented 4 years ago

I have not solved that problem yet; I have been focusing on understanding your paper and on working around the Python-version switch in the subprocess call.

LuoweiZhou commented 4 years ago

Just to make sure: you ran something like CUDA_VISIBLE_DEVICES=0 python train.py..., right?

For the issue, can you print out mask_loss and total_loss first to see why the sizes are not matching?
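Something along these lines, with made-up tensors just to illustrate the check:

```python
import torch

# Illustration with made-up tensors: a size mismatch when summing losses
# usually means one term was never reduced to a scalar.
mask_loss = torch.rand(8)    # e.g. still one value per sample or per GPU
total_loss = torch.rand(())  # already a 0-dim scalar

print("mask_loss:", mask_loss.shape, "total_loss:", total_loss.shape)

# Reducing the offending term before the sum makes the shapes consistent:
total_loss = total_loss + mask_loss.mean()
print("combined:", total_loss.shape)  # torch.Size([])
```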

john2019-warwick commented 4 years ago

Yes, I have a single gpu0 on my server. I also think there is something wrong with mask_loss; I will check it soon! By the way, in your proposal decoder you learn three quantities: Pe, theta_c, and theta_l, so why is the output 4-dimensional? Here: https://github.com/salesforce/densecap/blob/45b20bb0860b0c5a5fc4878284c1ab0e6892a68f/model/action_prop_dense_cap.py#L203

LuoweiZhou commented 4 years ago

@john2019-warwick There is a fourth value called overlapping_score, which was used in our preliminary experiments but is deprecated now: https://github.com/salesforce/densecap/blob/45b20bb0860b0c5a5fc4878284c1ab0e6892a68f/model/action_prop_dense_cap.py#L174
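Concretely, something like the following sketch; the index order here is an assumption, so check the linked code for the actual layout:

```python
import torch

# Hypothetical illustration of the 4-dim proposal output: one deprecated
# overlapping_score plus the three learned quantities (Pe, theta_c, theta_l).
# The index order below is assumed, not taken from the repo.
pred = torch.rand(2, 100, 4)  # (batch, anchors, 4)

overlapping_score = pred[..., 0]  # deprecated, unused in the final model
p_e     = pred[..., 1]            # event/proposal score
theta_c = pred[..., 2]            # center offset
theta_l = pred[..., 3]            # length offset
```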

john2019-warwick commented 4 years ago

Thank you! So you simply adjusted the output dimension of the proposal network to 3 in your latest experiments, right? Also, after proposal learning, the visual features are fed into the encoder again and combined with the learned mask for language-decoder training, right? One more question: your window size is 480, and since this is a dense video task, there may be more than one annotation within that window. How do you handle that?

LuoweiZhou commented 4 years ago

i) We didn't make that change in the code, but feel free to do so and submit a PR. ii) Yes, the visual features pass through the entire Transformer for caption generation. iii) We treat each event independently, so we output one caption per event; it does not matter how many events fall inside the 480-frame window.
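In other words, something like this sketch (made-up data, not the repo's actual loader): a window with two annotated events simply yields two independent training pairs over the same features.

```python
import torch

# Made-up example of point (iii): each annotated event in a window becomes
# its own (features, segment, caption) training sample.
window_feat = torch.rand(480, 1024)  # visual features for one window

events = [
    {"segment": (10, 120),  "caption": "a man walks into the kitchen"},
    {"segment": (200, 460), "caption": "he starts cooking a meal"},
]

samples = [(window_feat, ev["segment"], ev["caption"]) for ev in events]
print(len(samples))  # 2 independent samples sharing the same features
```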

john2019-warwick commented 4 years ago

Thanks for your reply. For question 3, I still cannot understand. E.g., in a window of 480 frames, if there are two events, what do you actually feed in during training: different sentences paired with the same 480×1024 visual features? If so, then at test time, if two events overlap in time, is only one caption output?