Closed: saicoco closed this 3 years ago
Hi @saicoco, it is a trick to reduce GPU memory usage. The `sequence_output` is split into parts of size `step_size=5`, the for loop processes each part in turn, and finally the partial similarity matrices are concatenated with `torch.cat()` (#L374).
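The split / loop / `torch.cat()` pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in `modeling.py`. The function name `chunked_similarity`, the tensor shapes, and the plain dot-product scoring are all assumptions, standing in for whatever per-chunk computation the repository actually performs.

```python
import torch

def chunked_similarity(sequence_output, visual_output, step_size=5):
    """Build a (num_text, num_video) similarity matrix one row-chunk at a time.

    Hypothetical helper: the dot product below is only a placeholder for the
    real per-chunk scoring.
    """
    # torch.split yields consecutive chunks of at most step_size rows.
    text_chunks = torch.split(sequence_output, step_size, dim=0)
    rows = []
    for chunk in text_chunks:
        # Only a (<= step_size, num_video) block is computed per iteration,
        # so intermediates created while scoring stay small.
        rows.append(chunk @ visual_output.t())
    # Stitch the partial blocks back together along the text dimension.
    return torch.cat(rows, dim=0)

torch.manual_seed(0)
text = torch.randn(12, 8)   # assumed: 12 text embeddings of width 8
video = torch.randn(7, 8)   # assumed: 7 video embeddings of width 8
sim = chunked_similarity(text, video, step_size=5)
```

In this sketch `step_size` only controls how many rows are scored per iteration, and the result matches the unchunked computation. The memory saving matters when the per-chunk scoring creates large intermediate tensors.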
Got it
Does 'step_size=5' mean that one video has five captions?
https://github.com/microsoft/UniVL/blob/main/modules/modeling.py#L346