john2019-warwick opened this issue 4 years ago
@john2019-warwick First, make sure you are using python3. Are you using multi-GPU training (distributed data parallel)?
Yes, I am using python3. Here are the parameters I set: world_size=1, so no distributed training is involved. I just changed mask_weight from 0 to 1. By the way, about the parameter d_model: why does the help annotation say 'size of the rnn in number of hidden nodes in each layer'? Based on my understanding, you use an attention model (Transformer) rather than an RNN, right?
It seems that even though you set world_size=1, your script can still see 8 GPUs. Can you try adding CUDA_VISIBLE_DEVICES=0? d_model indicates the Transformer hidden size (the code used to work with the baseline LSTM-based models and we forgot to revise the help string; feel free to submit a PR).
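For reference, a minimal sketch of pinning the run to a single GPU from inside the script (equivalent to prefixing the command with CUDA_VISIBLE_DEVICES=0; the print is only there to confirm the effect):

```python
import os

# Must be set before torch initializes CUDA, otherwise it has no effect.
# Equivalent to: CUDA_VISIBLE_DEVICES=0 python train.py ...
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
print(torch.cuda.device_count())  # should now report 1 instead of 8
```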
I have tried putting CUDA_VISIBLE_DEVICES=0 ahead of my train.py command, but it still doesn't work. By the way, when running test.py I noticed a problem here: https://github.com/salesforce/densecap/blob/45b20bb0860b0c5a5fc4878284c1ab0e6892a68f/scripts/test.py#L216: you wrote 'python2', but the requirements and the usage notes for this evaluation script say python3 (or python). Which version is correct?
@john2019-warwick Sorry for the delay due to my travels. The evaluation script requires python2 in particular. Is the problem you mentioned above resolved?
I have not solved that problem yet; I have been focused on understanding your paper and trying to work around the python version mismatch in the subprocess call.
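As a rough sketch of the pattern involved (the script name and arguments below are placeholders, not the actual call in scripts/test.py): the evaluation is launched in a subprocess with a hard-coded interpreter, which is why the 'python2' string matters even if everything else runs under python3.

```python
import subprocess

# Placeholder command: the real scripts/test.py call uses different
# arguments; the point is only that the interpreter is spelled out
# explicitly, so a 'python2' executable must exist on the machine.
subprocess.check_call(
    ["python2", "evaluate.py", "--submission", "results.json"]
)
```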
Just to make sure, you ran something like CUDA_VISIBLE_DEVICES=0 python train.py..., right? For the issue, can you print out mask_loss and total_loss first to see why the sizes are not matching?
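A minimal debugging sketch, assuming the variable names follow the discussion (the actual names in train.py may differ):

```python
# Print shapes right before the two losses are combined, to see which
# dimension disagrees when mask_weight > 0.
print("mask_loss:", mask_loss.shape, mask_loss)
print("total_loss:", total_loss.shape, total_loss)
```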
Yes, I have a single gpu0 on my server. I also think something is wrong with mask_loss; I will check it soon! By the way, in your proposal decoder you learn three parameters: P_e, theta_c, and theta_l, so why is the output 4-dimensional? Here: https://github.com/salesforce/densecap/blob/45b20bb0860b0c5a5fc4878284c1ab0e6892a68f/model/action_prop_dense_cap.py#L203
@john2019-warwick There is another value called overlapping_score, which was used in our preliminary experiments but is deprecated now: https://github.com/salesforce/densecap/blob/45b20bb0860b0c5a5fc4878284c1ab0e6892a68f/model/action_prop_dense_cap.py#L174
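A hedged sketch of how the extra value could simply be ignored downstream, assuming the proposal head emits four numbers per anchor in the order (P_e, theta_c, theta_l, overlapping_score); the actual tensor layout in the repo may differ:

```python
# prop_out: a (..., 4) tensor from the proposal decoder (assumed layout).
pe, theta_c, theta_l, _overlap_score = prop_out.split(1, dim=-1)
# _overlap_score is the deprecated overlapping_score and can be dropped.
```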
Thank you! So you simply adjusted the output dimension of the proposal network to 3 in your latest experiments, right? Also, after proposal learning, are the visual features fed into the encoder again and combined with the learned mask for language decoder training? One more question: your window size is 480; since this is a dense video captioning task, there may be more than one annotation within that window. How do you handle that?
i) We didn't make the change in the code, but feel free to do so and submit a PR. ii) Yes, the visual feature passes through the entire Transformer for caption generation. iii) We treat each event independently, so we output one caption per event; it doesn't matter how many events fall inside the 480 window.
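An illustrative (hypothetical) view of point iii): two events inside the same 480-frame window become two independent training samples that share the visual feature but carry different segments and target sentences.

```python
# window_feat: e.g. a (480, 1024) feature for the whole window (illustrative).
samples = [
    {"feat": window_feat, "segment": (12, 96),   "caption": "a man opens the fridge"},
    {"feat": window_feat, "segment": (100, 300), "caption": "he pours a glass of milk"},
]
# Each dict is treated as its own example, so the model outputs one
# caption per event regardless of how many events share the window.
```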
Thanks for your reply. Regarding question 3, I still don't understand. For example, in a window of 480 frames, if there are two events, what do you actually use for training? Different sentences paired with the same 480x1024 visual feature? If so, at test time, when events overlap in time, is only one caption output?
Hi Luowei, thanks for your answer about the sampling. I have trained a model without the mask and now I also want to train one with the mask. Should I just set mask_weight=1.0, or are there other parameters that should be changed? I tried mask_weight=1.0 and got this error with the mask; could you explain it a little? Thank you!