Open violetguos opened 4 years ago
I had a look at the author's repo. Their data set does not store the corpus as raw text. Instead of
words words words...[punctuation] [symbols] other words
they built a hash table of index -> word, so every corpus has the form
[integer index] [integer index for another word] etc.
To reuse their code (even just the LSTM definition in torch), we would need to convert all of our data into that same form. I don't think it's worth investing that much time and effort.
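For a rough sense of what that conversion involves, here is a minimal Python sketch. It is not the author's preprocessing: the whitespace tokenization, the first-seen index assignment, and the helper names `build_vocab`, `encode`, and `decode` are all assumptions made for illustration.

```python
def build_vocab(tokens):
    """Assign each distinct token an integer index in order of first appearance."""
    word_to_index = {}
    for tok in tokens:
        if tok not in word_to_index:
            word_to_index[tok] = len(word_to_index)
    # Reverse table (index -> word), the direction described in the repo.
    index_to_word = {i: w for w, i in word_to_index.items()}
    return word_to_index, index_to_word


def encode(tokens, word_to_index):
    """Replace every token with its integer index."""
    return [word_to_index[t] for t in tokens]


def decode(indices, index_to_word):
    """Map integer indices back to tokens."""
    return [index_to_word[i] for i in indices]


# Assumption: punctuation is already split off as its own token.
tokens = "the cat sat on the mat .".split()
word_to_index, index_to_word = build_vocab(tokens)
encoded = encode(tokens, word_to_index)
print(encoded)
```

The mapping itself is small. Most of the effort would likely go into tokenizing our data the same way theirs was, and into matching whatever file layout their loader expects.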
If we have time: