Moredread opened this issue 8 years ago (status: Open)
Try decreasing the number of words used for the bigrams. Right now it's 20,000, but you can probably get away with far fewer, e.g. 10,000, and still get reasonable performance.
That helps a bit, but not enough to stay under 8 GB. Luckily, I have a bit more available on another machine.
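For a rough sense of why the word count matters, here is a back-of-the-envelope sketch. It assumes the bigram counts are held in a dense table with one 8-byte counter per ordered word pair; that is an assumption for illustration, not something confirmed about this project's code, and `dense_bigram_bytes` is a hypothetical helper. A sparse structure would use less memory, so treat the figures as an upper bound.

```python
def dense_bigram_bytes(vocab_size: int, bytes_per_count: int = 8) -> int:
    """Upper-bound memory for a dense bigram table.

    Assumes one counter per ordered word pair (vocab_size ** 2 entries);
    the real analysis may store counts differently.
    """
    return vocab_size * vocab_size * bytes_per_count


for vocab_size in (20_000, 10_000):
    gigabytes = dense_bigram_bytes(vocab_size) / 1e9
    print(f"{vocab_size:>6} words: {gigabytes:.1f} GB")
```

Under that assumption, memory grows with the square of the word count, so halving the vocabulary cuts the table to a quarter of its size.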
Hi,
I'm not sure if this is normal, but analyzing a corpus of 800 MB (ca. 16,000 articles) runs out of memory on my machine with 8 GB of RAM + 2 GB of swap. Can someone with a background in data analysis judge whether this is expected?
This might be the main obstacle to scaling the database to the physics section of arXiv, as I have only run the analysis on a small portion of it (less than a year for most sections, and not all of the relevant categories).
I'll try to profile the memory usage, but I hope the attempt isn't futile. :p
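One possible starting point for the profiling is a minimal sketch using Python's standard `tracemalloc` module. It assumes the analysis is written in Python; `peak_memory_mb` and `build_counts` are hypothetical names, with `build_counts` standing in for the real analysis step. `tracemalloc` only sees allocations made through Python's allocator, so memory held by native libraries may not show up.

```python
import tracemalloc


def peak_memory_mb(func, *args, **kwargs):
    """Run func and return (result, peak traced memory in MB)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / 1e6


def build_counts(n):
    """Hypothetical stand-in for the real analysis step."""
    return {i: i for i in range(n)}


result, peak_mb = peak_memory_mb(build_counts, 100_000)
print(f"peak traced memory: {peak_mb:.1f} MB")
```

Running this on a growing subset of the corpus (say 1,000, 2,000, then 4,000 articles) should indicate whether memory grows linearly with the number of articles or faster.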