reedscot / cvpr2016

Learning Deep Representations of Fine-grained Visual Descriptions
http://arxiv.org/abs/1605.05395
MIT License

util/MultimodalMinibatchLoaderCaption:next_batch #10

Open iTomxy opened 5 years ago

iTomxy commented 5 years ago

I wonder how this function, MultimodalMinibatchLoaderCaption:next_batch, prepares the text data. It seems the tensor txt is created to hold a batch of chosen sentences and is finally returned as training data, but I'm confused by the assignment of txt inside the for-loop on lines 83 to 101: it is NOT assigned directly the way img or lab are (lines 80 and 81). Why is txt assigned this way, and how does it work? My guess is that the for-loop implements the sentence flipping that the cmd:option -flip refers to, and that, as I understand it, this operation only adds some noise to the sentence data, analogous to image flipping?
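To make my reading of the loop concrete, here is a minimal Python/NumPy sketch of what I *assume* the per-example text assembly does (the actual repo code is Lua/Torch, and I'm guessing that -flip means reversing the character order of a caption; the function and parameter names below are hypothetical, not the author's):

```python
import numpy as np

rng = np.random.default_rng(0)

def next_batch_text(captions, batch_ids, doc_length=6, flip=True):
    """Hypothetical sketch: copy one randomly chosen caption per image
    into a batch tensor, optionally flipping it as augmentation.

    captions: int array of shape (num_samples, doc_length, num_captions)
    batch_ids: indices of the images chosen for this batch
    Returns txt of shape (batch_size, doc_length).
    """
    batch_size = len(batch_ids)
    txt = np.zeros((batch_size, doc_length), dtype=np.int64)
    for i, idx in enumerate(batch_ids):
        # pick one of the captions attached to this image at random
        cap_ix = rng.integers(captions.shape[2])
        sent = captions[idx, :, cap_ix].copy()
        # assumption: with probability 0.5, reverse the character order
        # ("flip"), a cheap augmentation analogous to image flipping
        if flip and rng.random() < 0.5:
            sent = sent[::-1].copy()
        txt[i] = sent
    return txt

# toy data: 4 samples, doc length 6, 2 captions per image
caps = rng.integers(1, 10, size=(4, 6, 2))
batch = next_batch_text(caps, [0, 2], doc_length=6)
print(batch.shape)  # (2, 6)
```

If this reading is right, the element-wise copy in the loop (rather than a direct slice assignment like img or lab) is needed because each row of txt comes from a different, randomly selected caption, possibly reversed.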

P.S. I can run this code correctly on the CUB dataset, but it fails on the Wikipedia dataset, where there is no augmentation on images (as the paper says) and each image has only one corresponding sentence. I preprocessed the data to match the shapes of the data provided by the authors, i.e., images of shape (#samples, 1024, 1) and text of shape (#samples, 201, 1).
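For reference, this is a small sanity check of the shapes I preprocessed the Wikipedia data into (a sketch under my own assumptions, with placeholder zero tensors; the axis meanings are my interpretation of the CUB files, not confirmed by the authors):

```python
import numpy as np

num_samples = 5  # placeholder count

# one 1024-d image feature per sample, trailing axis of size 1
# because there is no image augmentation (no multiple crops/flips)
img = np.zeros((num_samples, 1024, 1), dtype=np.float32)

# one 201-step character sequence per sample, trailing axis of
# size 1 because each image has only a single caption
txt = np.zeros((num_samples, 201, 1), dtype=np.int64)

assert img.shape == (num_samples, 1024, 1)
assert txt.shape == (num_samples, 201, 1)
print("shapes match the CUB layout")
```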