Closed: ShuxinLin closed this issue 5 years ago
You can do any preprocessing that you'd like before using the model, as long as you create the right variables.
doc_lengths.npy - 1-D array, one entry per unique document, giving the number of tokens in each document (indexes should match the document IDs)
embed_matrix.npy - Embedding matrix that you've pretrained (this is optional, but recommended)
freqs.npy - 1-D array of vocabulary length giving the frequency of each token. You must convert this to a Python list, since TensorFlow does not accept a NumPy array for this parameter
idx_to_word.pickle - Python dictionary mapping of embed matrix idxs to words
skipgrams.txt - File with x's, y's, and doc ids (x's being context words, y's being target words, doc IDs being unique identifiers to link back to documents for document matrix calculation)
word_to_idx.pickle - Python dictionary mapping of words to embed matrix idxs
Note: You don't necessarily have to save these to .npy files, but it saves you from rerunning preprocessing every time!
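To make the expected shapes concrete, here is a minimal, hypothetical sketch of building and saving all of these artifacts from a toy tokenized corpus. The variable names, the two-document corpus, and the tab-separated layout of skipgrams.txt are my assumptions for illustration, not code from this repo:

```python
import pickle
import numpy as np

# Toy tokenized corpus: one list of tokens per document (hypothetical data).
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["dogs", "bark", "at", "the", "cat"],
]

# word_to_idx.pickle / idx_to_word.pickle: vocabulary <-> index mappings.
vocab = sorted({tok for doc in docs for tok in doc})
word_to_idx = {w: i for i, w in enumerate(vocab)}
idx_to_word = {i: w for w, i in word_to_idx.items()}

# doc_lengths.npy: number of tokens per document, indexed by doc ID.
doc_lengths = np.array([len(doc) for doc in docs])

# freqs.npy: token frequencies aligned with the vocabulary indexes.
freqs = np.zeros(len(vocab), dtype=np.int64)
for doc in docs:
    for tok in doc:
        freqs[word_to_idx[tok]] += 1

# skipgrams.txt: (context word, target word, doc ID) triples, here
# generated with a simple symmetric window (window size is an assumption).
window = 2
with open("skipgrams.txt", "w") as f:
    for doc_id, doc in enumerate(docs):
        ids = [word_to_idx[t] for t in doc]
        for i in range(len(ids)):
            for j in range(max(0, i - window), min(len(ids), i + window + 1)):
                if i != j:
                    f.write(f"{ids[j]}\t{ids[i]}\t{doc_id}\n")

np.save("doc_lengths.npy", doc_lengths)
np.save("freqs.npy", freqs)
with open("word_to_idx.pickle", "wb") as f:
    pickle.dump(word_to_idx, f)
with open("idx_to_word.pickle", "wb") as f:
    pickle.dump(idx_to_word, f)

# As noted above, pass freqs to TensorFlow as a plain Python list:
freqs_list = freqs.tolist()
```

The pretrained embed_matrix.npy is omitted here since it is optional and depends on your embedding training setup.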
I would suggest running the sample to see what these variables look like, then trying your own data if you are feeling up to it! My pipeline is really buggy right now, because I tried to do too much with it. The nlppipeline was a pipeline I was creating for multiple use cases (not just lda2vec). As for the effect of lowercasing, I haven't explored that too much.
Hope this helps!
Thanks for your speedy reply. This helps me a lot!
Hi,
Does lowercasing change the modeling results? I found that the data sample 20_newsgroups.txt has not been pre-processed. In my case I'd like to do some pre-processing before feeding the data into lda2vec. Thank you!
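Since any preprocessing is allowed before building the vocabulary, a simple normalization pass like the one below could be applied to the raw text first. This is a hypothetical sketch (the `preprocess` function and its regex are my own, not part of the lda2vec repo), and whether lowercasing helps the model is an open question per the reply above:

```python
import re

def preprocess(text):
    # Lowercase, replace anything that isn't a letter, digit, or
    # whitespace with a space, then split into tokens.
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return text.split()

tokens = preprocess("Does Lowercasing change the MODELING results?")
# tokens == ['does', 'lowercasing', 'change', 'the', 'modeling', 'results']
```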