jingliao132 opened this issue 6 years ago
Correct, the files in text_c10 seem to contain character-level encodings of the descriptions. There are ~70 possible integer values, which perhaps corresponds to a lowercase + uppercase alphabet plus punctuation. I haven't been able to figure out where the character -> index mapping lives... or whether we have to work it out ourselves.

On the other hand, the files in word_c10 contain word-level encodings of the descriptions. The vocab_c10.t7 file maps words to these integers (confirmed by manually translating some of the vectors in word_c10).
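That manual translation can be sketched as follows. Loading vocab_c10.t7 itself would require a Torch7 file reader (e.g. the torchfile package), so the dictionary and sample vector below are hypothetical stand-ins for illustration only:

```python
# Hypothetical stand-in for the word -> index mapping in vocab_c10.t7;
# the real mapping would be loaded from the .t7 file.
vocab = {"the": 1, "bird": 2, "has": 3, "a": 4, "red": 5, "breast": 6}

# Invert the mapping so integer vectors from word_c10 can be read back as words.
index_to_word = {idx: word for word, idx in vocab.items()}

def decode_word_vector(vector):
    """Translate a word-level integer vector back into a description."""
    return " ".join(index_to_word[i] for i in vector)

print(decode_word_vector([1, 2, 3, 4, 5, 6]))  # -> "the bird has a red breast"
```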
Hello scot! I am confused about how the .h5 files are built from the .txt files (in folder text_c10) in the CUB data downloaded from the link you provided. When I open an .h5 file, I find the keys are 'txt1', 'txt2', ..., 'txt10'. Since there are exactly 10 text descriptions in each .txt file, I guess each key corresponds to one description in the .txt file.

Next, I checked the value of 'txt1': it is a one-dimensional tensor of shape (90,): [116, 104, 101, ..., 46]. The 'txt2' value has shape (76,). The 90 and 76 are very close to the number of characters in the corresponding descriptions, so I guess each tensor is encoded from the character sequence (character-level). However, vocab_c10.t7 is a dictionary containing many words (word-level), which seems inconsistent.

How do you encode each text description from .txt into the .h5 file? And how do you generate the .t7 files (6020110 DoubleTensor) under /text_c10?
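For what it's worth, the sample values above line up with plain ASCII codes (116 = 't', 104 = 'h', 101 = 'e', 46 = '.'), so one guess is that the character-level tensors simply store the byte values of the raw text. A minimal decoding sketch under that assumption (the repository may instead use its own ~70-symbol alphabet, in which case this would not apply):

```python
def decode_char_tensor(values):
    """Decode a character-level vector, assuming each integer is an ASCII code.

    This is a guess based on the sample [116, 104, 101, ..., 46] from 'txt1';
    a custom character -> index alphabet would need a different lookup table.
    """
    return "".join(chr(v) for v in values)

print(decode_char_tensor([116, 104, 101, 46]))  # -> "the."
```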