ShuangLI59 opened this issue 8 years ago
I would be interested in this too. I'd like to try out GANs for image generation on some datasets other than birds and flowers, but the roadblock for many GAN implementations seems to lead back to here and the text embedding, particularly for the latest and greatest, StackGAN. I've been poking at it and I can't figure out how exactly this data is supposed to be encoded for the training code to work.

All the data seems to be in .t7 and .txt files, judging by the multiloader code, but the array dimensions, like 10x67x1024, don't make any sense. The 10 refers to the 10 captions per image, and the variable dimension like 40 or 67 is the number of images in a particular class; I'm able to figure out that much. But the 1024 makes no sense, since none of the captions are longer than 400 characters, and viewing the raw data in th, the values are small floats between 0 and 1, not the integers you would expect from an ASCII encoding. I also thought the 1024 might be the images, since 1024px is a common dimension, but then there would need to be another three arrays for the RGB channels, no? The only mention of 1024 I see is in the paper, where the RNN embedding is 1024-dimensional, which would explain the small non-integer values, but it makes no sense for the .t7s to be storing the embedding for each caption, because this data is supposed to be used to train the embedding model in the first place! Right?

(And where on earth are the images in these .t7s? They all seem too small to be storing the images, yet the paper's method for joint training requires the image to be available for the convolutional part of training the embedding. They're not being read from the HDF5 files, because a grep for 'hdf' turns up nothing in the repo, so where are they coming from?)

So I am confused and have no idea how I would train an embedding on a different caption corpus than birds/flowers.
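For anyone else poking at these files, this is roughly how I've been inspecting them from the th REPL. The path is just an example and the field layout is a guess based on this thread, not anything confirmed by the repo:

```lua
-- Minimal sketch: inspect one of the .t7 files from th.
-- The path below is an example, not a confirmed location.
require 'torch'

local t = torch.load('cvpr2016_cub/class_00001.t7')
print(torch.type(t))                 -- Tensor or plain Lua table, depending on the file
if torch.isTensor(t) then
  print(t:size())                    -- e.g. 10 x 67 x 1024 as reported above
  print(t:min(), t:max())            -- small floats, consistent with CNN/RNN features
else
  for k, v in pairs(t) do            -- if it's a table, list its fields and their shapes
    print(k, torch.isTensor(v) and v:size() or v)
  end
end
```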
@reedscot, could you help all of us who are trying to implement StackGAN on a different dataset? Please let us know how we can prepare the dataset for training.
Would really appreciate your help on this.
The images/ folder contains a .t7 file for each class; for example, 'class_00001.t7' has dimensions 40 x 1024 x 10. I think the 40 means there are 40 images in class_00001; the paper (Section 5) says they extracted 1024-dimensional pooling units from GoogLeNet; and for the 10, the paper says they take several crops of each image, generating 10 views per image. So my guess is:

- 40: images in class_00001
- 1024: image feature extracted from GoogLeNet
- 10: cropped views of each image
But I'm curious about the cropped images and their positions. If we knew that, maybe we could reproduce the data stored in the .t7 files.
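If that reading is right, rebuilding a class_XXXXX.t7 would look roughly like the sketch below. Both make_crops() and extract_googlenet_features() are hypothetical placeholders (I don't see them exposed in the repo); the only thing the sketch pins down is the nImages x 1024 x 10 layout:

```lua
-- Rough sketch of assembling an nImages x 1024 x 10 tensor for one class.
-- make_crops() and extract_googlenet_features() are hypothetical placeholders,
-- standing in for the 10-view cropping and the GoogLeNet pooling-layer features.
require 'torch'
require 'image'

local function build_class_tensor(image_paths)
  local feats = torch.FloatTensor(#image_paths, 1024, 10)
  for i, path in ipairs(image_paths) do
    local img = image.load(path, 3, 'float')
    local crops = make_crops(img)                               -- assumed: returns 10 crops
    for j = 1, 10 do
      feats[{i, {}, j}] = extract_googlenet_features(crops[j])  -- assumed: 1024-dim vector
    end
  end
  return feats
end

-- torch.save('class_00001.t7', build_class_tensor(paths_for_class_1))
```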
@GaryLMS in Section 5, Experimental results:
For each image, we extracted middle, upper left, upper right, lower left and lower right crops for the original and horizontally-flipped image, resulting in 10 views per training image
Seems straightforward. Or am I missing something?
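For what it's worth, here is my reading of that paragraph as a small Torch image sketch. The crop size is an assumption (224 would match GoogLeNet's input), and flipping each crop should give the same set of 10 views as cropping the flipped image, since the corner positions swap under a horizontal flip:

```lua
-- Sketch of the 10-view cropping described in Section 5.
-- cropSize is an assumption; coordinates follow my reading of
-- "middle, upper left, upper right, lower left and lower right".
require 'image'

local function ten_crops(img, cropSize)
  local h, w = img:size(2), img:size(3)
  local coords = {
    {math.floor((w - cropSize) / 2), math.floor((h - cropSize) / 2)}, -- middle
    {0, 0},                                                           -- upper left
    {w - cropSize, 0},                                                -- upper right
    {0, h - cropSize},                                                -- lower left
    {w - cropSize, h - cropSize},                                     -- lower right
  }
  local views = {}
  for _, xy in ipairs(coords) do
    local crop = image.crop(img, xy[1], xy[2], xy[1] + cropSize, xy[2] + cropSize)
    table.insert(views, crop)
    table.insert(views, image.hflip(crop))  -- horizontal flip of the same crop
  end
  return views  -- 10 views per image
end
```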
Could you provide the code for generating 'cvpr2016_cub/images' or 'cvpr2016_cub/bow_c10'? Thank you very much!