I'm wondering how the function MultimodalMinibatchLoaderCaption:next_batch prepares the text data.
It seems that the tensor txt is created to hold a batch of chosen sentences and is finally returned as training data. But I'm confused by the assignment of txt inside the for-loop in lines 83 to 101: it is NOT assigned directly, the way img and lab are (lines 80 & 81).
So why is txt assigned this way, and how does it work? My guess is that the for-loop performs the sentence flipping that cmd:option -flip refers to, and that, as I understand it, this operation only adds some noise to the sentence data, much like image flipping. Is that right? A rough sketch of how I read the loop is below.
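To make my guess concrete, here is a minimal, self-contained sketch (my own code, not the author's) of the kind of per-position loop I think is happening: each chosen sentence, stored as character indices, is expanded position by position into a one-hot tensor, which would explain why txt cannot simply be copied in one assignment like img or lab. The sizes and the variable captions are assumptions for illustration:

```lua
-- Hypothetical sketch of a per-character fill loop (not the repo's code).
require 'torch'

local batch_size, doc_length, alphabet_size = 2, 201, 70

-- Pretend captions: each sentence stored as 1-based character indices.
local captions = torch.LongTensor(batch_size, doc_length):random(1, alphabet_size)

-- txt has to be filled entry by entry: each character index becomes a
-- one-hot vector, so a single direct assignment like img/lab won't do.
local txt = torch.zeros(batch_size, doc_length, alphabet_size)
for i = 1, batch_size do
  for j = 1, doc_length do
    txt[{i, j, captions[i][j]}] = 1
  end
end
print(txt:size())  -- batch_size x doc_length x alphabet_size
```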
P.S. I can run this code correctly on the CUB dataset, but it fails on the Wikipedia dataset, where, as the paper says, there is NO augmentation on images and each image has only 1 corresponding sentence. I preprocessed the data to match the shapes of the data provided by the author, i.e., images of shape (#samples, 1024, 1) and text of shape (#samples, 201, 1).
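For reference, this is the (hypothetical) shape check I run on my preprocessed Wikipedia tensors before handing them to the loader; the file names are placeholders, and only the expected shapes come from the description above:

```lua
-- Hypothetical sanity check; 'wiki_img.t7' / 'wiki_txt.t7' are placeholder names.
require 'torch'

local img = torch.load('wiki_img.t7')  -- expected (#samples, 1024, 1)
local txt = torch.load('wiki_txt.t7')  -- expected (#samples, 201, 1)

assert(img:dim() == 3 and img:size(2) == 1024 and img:size(3) == 1,
       'images should have shape (#samples, 1024, 1)')
assert(txt:dim() == 3 and txt:size(2) == 201 and txt:size(3) == 1,
       'text should have shape (#samples, 201, 1)')
print(('shapes ok: %d samples'):format(img:size(1)))
```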