Min word count while creating vocab

maciejkula / glove-python

Toy Python implementation of http://www-nlp.stanford.edu/projects/glove/

Apache License 2.0

1.25k stars 319 forks

Min word count while creating vocab #86

Open akanshajainn opened 6 years ago

akanshajainn commented 6 years ago

I am trying to train on the latest Wikipedia dump (about 15 GB), which obviously has a large corpus and token count (approx. 360M). Since the co-occurrence matrix needs to live in memory, I want to set a minimum frequency count for words when creating the vocab, which in turn determines the size of the co-occurrence matrix. I could not find any parameter for that. Also, the code is in Cython, so it's hard to understand for a noob like me. Any idea how I can create the vocab and the co-occurrence matrix in a memory-efficient way?

AzChaimae commented 5 years ago

I am still looking for a solution to this issue. Did you find a way to do this?
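
One possible workaround, sketched here under assumptions rather than taken from the library: count word frequencies in a first pass in plain Python, keep only words above a threshold, and hand the resulting word-to-id mapping to the corpus builder. `build_vocab` and `min_count` are hypothetical names made up for this sketch. The commented-out part assumes that `glove.Corpus` accepts a pre-built `dictionary` and that `fit` has an `ignore_missing` flag, so please check that against the version you have installed.

```python
from collections import Counter


def build_vocab(sentences, min_count=5):
    """Map every word seen at least `min_count` times to a contiguous id.

    `sentences` is any iterable of token lists; it is consumed once, so a
    generator that streams the dump from disk works and keeps memory low.
    """
    counts = Counter(token for sentence in sentences for token in sentence)
    # Most frequent words get the lowest ids; ties are broken alphabetically
    # so the mapping is deterministic.
    kept = sorted(
        (word for word, count in counts.items() if count >= min_count),
        key=lambda word: (-counts[word], word),
    )
    return {word: idx for idx, word in enumerate(kept)}


# Tiny in-memory example; a real run would stream tokenized lines instead.
sentences = [["a", "b", "a"], ["a", "c", "b"]]
vocab = build_vocab(sentences, min_count=2)

# Assumed glove-python usage (not verified here, check your installed version):
# from glove import Corpus
# corpus = Corpus(dictionary=vocab)
# corpus.fit(sentences, window=10, ignore_missing=True)
```

With a threshold like this, rare words never get a row or column in the co-occurrence matrix, which is where most of the memory goes for a Wikipedia-sized vocabulary. The corpus still has to be read twice (once to count, once to fit), so `sentences` should be something you can iterate over again, not a one-shot generator.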