nateraw / Lda2vec-Tensorflow

Tensorflow 1.5 implementation of Chris Moody's Lda2vec, adapted from @meereeum
MIT License

doc_lengths values not purged when purging documents that are too short #30

Closed nateraw closed 5 years ago

nateraw commented 5 years ago

doc_lengths is used when visualizing topics with ldavis. Currently, trying to visualize the topics raises an error saying that the length of doc_lengths does not equal num_docs.

This stems from the fact that we purge documents in nlppipe.py if they are too short to create skipgrams. The document lengths corresponding to purged documents are never removed, so doc_lengths ends up with one entry per original input text instead of one entry per document we actually processed.
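A minimal sketch of how the mismatch can arise. This is not the code in nlppipe.py: `MIN_TOKENS`, `texts`, and `kept_docs` are hypothetical names, and the threshold is an assumed value chosen only for illustration.

```python
# Hypothetical stand-in for the pipeline; only doc_lengths and num_docs
# are names taken from the issue.
MIN_TOKENS = 3  # assumed threshold below which no skipgrams can be built

texts = [
    "topic models group words".split(),
    "hi".split(),
    "word vectors capture context well".split(),
]

# Lengths are recorded for every input text, before any purging.
doc_lengths = [len(tokens) for tokens in texts]

# Documents that are too short are dropped afterwards,
# but doc_lengths is never updated to match.
kept_docs = [tokens for tokens in texts if len(tokens) >= MIN_TOKENS]
num_docs = len(kept_docs)

print(len(doc_lengths), num_docs)  # 3 2
```

With one of the three texts purged, doc_lengths still has three entries while num_docs is two, which is the kind of size mismatch the visualization step complains about.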

nateraw commented 5 years ago

Should move the instantiation of doc_lengths in nlppipe.py into the get_skipgrams function, appending a document's length only if we are getting skipgrams from that document. That way, we won't ever have an issue with the sizes being mismatched.
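A rough sketch of that idea, assuming a simplified skipgram step. `get_skipgrams_sketch`, its `window_size` parameter, and the hand-rolled pair generation are hypothetical and are not the repository's actual get_skipgrams implementation. The point is only that doc_lengths is built in the same loop that decides whether a document is kept.

```python
def get_skipgrams_sketch(tokenized_docs, window_size=2):
    """Hypothetical simplification: build skipgrams and doc_lengths together."""
    skipgrams = []    # (pivot, target, doc_id) triples
    doc_lengths = []  # one entry per document that is actually kept
    doc_id = 0
    for tokens in tokenized_docs:
        # Pair every token with its neighbours inside the window.
        pairs = [
            (tokens[i], tokens[j])
            for i in range(len(tokens))
            for j in range(max(0, i - window_size),
                           min(len(tokens), i + window_size + 1))
            if i != j
        ]
        if not pairs:
            # Too short to produce skipgrams: purge the document
            # and record no length for it.
            continue
        doc_lengths.append(len(tokens))
        skipgrams.extend((pivot, target, doc_id) for pivot, target in pairs)
        doc_id += 1
    return skipgrams, doc_lengths


docs = [["topic", "models", "group", "words"], ["hi"], ["word", "vectors"]]
skipgrams, doc_lengths = get_skipgrams_sketch(docs)
num_docs = len({doc_id for _, _, doc_id in skipgrams})
print(doc_lengths, num_docs)  # [4, 2] 2
```

Because a length is appended only on the same code path that emits skipgrams, len(doc_lengths) and num_docs stay equal no matter how many documents are purged.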