uwdata / termite-data-server

Data Server for Topic Models
BSD 3-Clause "New" or "Revised" License
121 stars, 46 forks

Iterable gensim #6

Closed: piskvorky closed this 10 years ago

piskvorky commented 10 years ago

The strength of gensim is in processing large data, using lazy-loading streams.

I noticed your code puts all documents into RAM (as a plain list), so I changed it to use lazy iteration.
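For readers without the diff at hand, here is a rough sketch of what lazy iteration over a corpus can look like. It is not the code from this patch: the `StreamedCorpus` name and the one-document-per-line file layout are assumptions made for illustration. The point is that the object is re-iterable (a class with `__iter__`, not a one-shot generator), since a model may need several passes over the corpus.

```python
import os
import tempfile


class StreamedCorpus(object):
    """Hypothetical re-iterable corpus: one document per line, read lazily."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Reopen the file on every pass so the corpus can be iterated repeatedly.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                # Only the current line is held in memory.
                yield line.split()


# Demo with a throwaway two-document file (assumed layout, not the project's).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("human computer interaction\ngraph minors survey\n")

corpus = StreamedCorpus(tmp.name)
first_pass = list(corpus)   # materialized here only to inspect the demo
second_pass = list(corpus)  # a second pass works because __iter__ reopens the file
os.remove(tmp.name)
```

In real use the documents would be consumed one at a time rather than collected with `list(...)`, which is what keeps memory use independent of corpus size.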

Also, it seems you're not doing any preprocessing, and LDA can be picky about that. So I added rudimentary preprocessing: lowercasing words and removing stopwords.
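A minimal sketch of that kind of preprocessing, assuming whitespace tokenization and a tiny illustrative stopword list. The `preprocess` function and `STOPWORDS` set are hypothetical names, not the ones used in the patch, and a real run would use a much longer stopword list.

```python
# Illustrative stopword list only; not the list used in the patch.
STOPWORDS = frozenset(["the", "a", "an", "and", "of", "to", "in", "is"])


def preprocess(text, stopwords=STOPWORDS):
    """Lowercase the text, split on whitespace, and drop stopwords."""
    return [token for token in text.lower().split() if token not in stopwords]


tokens = preprocess("The strength of Gensim is in streams")
```

Applying this to each document as it is streamed keeps the preprocessing lazy as well, so it adds no memory overhead on large corpora.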

I tried to follow your non-PEP8 coding style, for visual consistency.

piskvorky commented 10 years ago

Running the demo on 20newsgroups with gensim produces only empty files (both before and after this patch), but I assume that's a separate issue.