nateraw / Lda2vec-Tensorflow

Tensorflow 1.5 implementation of Chris Moody's Lda2vec, adapted from @meereeum
MIT License

Data pre-processing #18

Closed ShuxinLin closed 5 years ago

ShuxinLin commented 5 years ago

Hi,

Does lowercasing change the modeling results? I found that the data sample 20_newsgroups.txt has not been pre-processed. In my case, I'd like to do some pre-processing work before feeding the data into Lda2vec. Thank you!

nateraw commented 5 years ago

You can do any preprocessing that you'd like before using the model, as long as you create the right variables.

doc_lengths.npy - 1-d array of length num_documents giving the number of tokens in each document (indices should match the document IDs)

embed_matrix.npy - Embedding matrix that you've pretrained (this is optional, but recommended)

freqs.npy - 1-d array of length vocab_size giving the frequency of each token. This must be converted to a Python list, since TensorFlow doesn't accept a numpy array for this parameter

idx_to_word.pickle - Python dictionary mapping embedding matrix indices to words

skipgrams.txt - File with x's, y's, and doc IDs (x's being context words, y's being target words, doc IDs being unique identifiers linking back to documents for the document matrix calculation)

word_to_idx.pickle - Python dictionary mapping words to embedding matrix indices

Note: You don't necessarily have to save these to .npy files, but it helps to avoid rerunning preprocessing every time!
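As a rough illustration only (this is not the repo's actual pipeline code), here is a minimal sketch of how these artifacts could be built from a toy tokenized corpus. The toy documents, the skip-gram window size, the tab-separated skipgrams format, and the random placeholder embedding matrix are all assumptions:

```python
import pickle
from collections import Counter

import numpy as np

# Hypothetical toy corpus; in practice these would be your own tokenized documents.
docs = [
    ["the", "quick", "brown", "fox"],
    ["the", "lazy", "dog", "sleeps"],
]

# Vocabulary sorted by descending frequency (ties broken alphabetically).
counts = Counter(tok for doc in docs for tok in doc)
vocab = sorted(counts, key=lambda w: (-counts[w], w))
word_to_idx = {w: i for i, w in enumerate(vocab)}
idx_to_word = {i: w for w, i in word_to_idx.items()}

# 1-d arrays indexed consistently with document IDs / vocab indices.
doc_lengths = np.array([len(d) for d in docs])
freqs = np.array([counts[w] for w in vocab])

# Skip-gram (context, target, doc_id) triples with a symmetric window of 1.
window = 1
rows = []
for doc_id, doc in enumerate(docs):
    for i, target in enumerate(doc):
        for j in range(max(0, i - window), min(len(doc), i + window + 1)):
            if j != i:
                rows.append((word_to_idx[doc[j]], word_to_idx[target], doc_id))

# Random placeholder for a pretrained embedding matrix (optional artifact).
embed_matrix = np.random.rand(len(vocab), 8)

# Persist everything so preprocessing doesn't need to be rerun each time.
np.save("doc_lengths.npy", doc_lengths)
np.save("freqs.npy", freqs)
np.save("embed_matrix.npy", embed_matrix)
with open("word_to_idx.pickle", "wb") as f:
    pickle.dump(word_to_idx, f)
with open("idx_to_word.pickle", "wb") as f:
    pickle.dump(idx_to_word, f)
with open("skipgrams.txt", "w") as f:
    for x, y, d in rows:
        f.write(f"{x}\t{y}\t{d}\n")

# freqs has to be handed to the model as a plain Python list, not a numpy array.
freqs_list = freqs.tolist()
```

The indexing convention is the important part: document ID i must index `doc_lengths` and the doc IDs in `skipgrams.txt`, and vocab index j must index `freqs` and the rows of `embed_matrix` consistently.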


I would suggest running the sample to see what these variables look like, then trying your own data if you're feeling up to it! My pipeline is really buggy right now, because I tried to do too much with it. The nlppipeline was meant to be a preprocessing pipeline for multiple use cases (not just Lda2vec). As for the effect of lowercasing, I haven't explored that too much.

Hope this helps!

ShuxinLin commented 5 years ago

Thanks for your speedy reply. This helps me a lot!