seoungwugoh / STM

Video Object Segmentation using Space-Time Memory Networks

Questions about training #17

Open masonwang513 opened 4 years ago

masonwang513 commented 4 years ago

From your previous answers:

  1. We only use 3 sampled frames for training. That's why we sample frames from videos.
  2. The gradient is computed from the 4 samples in the batch. Backpropagation is done after all the frames are processed.

I have two more questions about these answers:

  1. Why do you only use 3 frames for training? According to your paper, more previous frames do benefit model performance. What's more, at inference time more than 3 previous frames are used and added to memory, which causes an inconsistency between training and testing. So why not just train on longer clips in the main training?

  2. Is BP or BP-Through-Time (BPTT) used for gradient computation? For each sample, several frames are computed one by one, and each subsequent frame relies on the previous frames' activations and predictions. So are gradients computed each time a frame is forwarded (with previous activations detached), OR only after all frames' losses are accumulated? If the former, it is simple BP; otherwise it's BPTT, right?

seoungwugoh commented 4 years ago

Hi @masonwang513, Here are my answers:

  1. Yes, there is an inconsistency between training and testing. The reason we use only 3 frames for training is to reduce computation and accelerate training. We found that our model trained on very short clips performs well on long clips, because the attention mechanism we use is not sensitive to the size of the memory.
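The 3-frame sampling mentioned above might look like the following minimal sketch. The function name, the `max_skip` parameter, and the gap-sampling scheme are assumptions for illustration, not STM's actual code:

```python
import random

def sample_clip(num_frames, clip_len=3, max_skip=5):
    """Sample `clip_len` ordered frame indices from a video of `num_frames`,
    with a random temporal gap of at most `max_skip` between consecutive
    indices. (Illustrative only; names and defaults are assumptions.)"""
    span = (clip_len - 1) * max_skip
    start = random.randint(0, max(0, num_frames - 1 - span))
    idx = [start]
    for _ in range(clip_len - 1):
        # advance by a random skip, clamped to the last valid frame
        idx.append(min(idx[-1] + random.randint(1, max_skip), num_frames - 1))
    return idx
```

A larger `max_skip` exposes the model to bigger appearance changes between frames, which is one way short training clips can still prepare the model for long videos.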

  2. We tried both and found no big difference (detaching vs. not detaching). The important point is to make the second forward step use the output of the first step (so the model adapts to its own output).
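The detach-vs-BPTT distinction can be seen on a toy two-step recurrence (not STM code, just an illustration): step 2 reuses step 1's output, the way frame t+1's segmentation reads frame t's predicted mask from memory. With full BPTT the gradient flows through both steps; with detaching, the intermediate is treated as a constant:

```python
def grads(w, x):
    """Hand-computed gradients of L = y2 for y1 = w*x, y2 = w*y1 = w^2 * x."""
    y1 = w * x               # step 1 (like segmenting frame 1)
    y2 = w * y1              # step 2 reuses step 1's output
    full_bptt = 2 * w * x    # dL/dw through BOTH steps (chain rule)
    detached = y1            # dL/dw with y1 treated as a constant
    return full_bptt, detached

full, det = grads(w=2.0, x=3.0)
# full == 12.0 (gradient flows through both steps)
# det  ==  6.0 (gradient stops at the detached intermediate)
```

Either way, the key point from the answer holds: step 2's *forward* input is step 1's own prediction; only the *backward* path differs.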

ryancll commented 4 years ago

Hi @seoungwugoh, For your answer 2, did you mean the teacher forcing strategy is not suitable for training the STM model?

seoungwugoh commented 4 years ago

@ryancll I don't know what the teacher forcing strategy is. Can you describe it in more detail?

ryancll commented 4 years ago

@seoungwugoh During training, instead of feeding previously predicted masks into memory, we can sometimes feed the ground-truth masks into memory to guide the training process. This strategy is widely used in NLP seq2seq tasks, but I'm not sure if it is useful for STM.
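The idea above could be sketched as a single sampling decision per memory write. The function name, arguments, and `teacher_prob` schedule are hypothetical, not from STM:

```python
import random

def pick_memory_mask(gt_mask, pred_mask, teacher_prob=0.3):
    """With probability `teacher_prob`, write the ground-truth mask into
    memory instead of the network's own prediction (teacher forcing /
    scheduled sampling; names are illustrative)."""
    return gt_mask if random.random() < teacher_prob else pred_mask
```

In seq2seq training, `teacher_prob` is often annealed from 1.0 toward 0.0 (scheduled sampling) so the model gradually learns to rely on its own outputs, which matches the "adapt to its own output" point made earlier in this thread.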

seoungwugoh commented 4 years ago

@ryancll We did not use such a training technique in our work, but it seems like an interesting idea to try. I think it would be effective for very challenging training samples where the network fails to deliver a good result on the first estimation.