karpathy / arxiv-sanity-preserver

Web interface for browsing, search and filtering recent arxiv submissions
http://www.arxiv-sanity.com/
MIT License

Analyzing uses too much memory #35

Open Moredread opened 8 years ago

Moredread commented 8 years ago

Hi,

I'm not sure if this is normal, but analyzing a corpus of 800MB (ca. 16000 articles) runs out of memory on my machine with 8GB of RAM + 2GB of swap. Can someone with a background in data analysis judge if this is expected?

This might be the main obstacle to scaling the database to the physics section of arXiv: so far I have only run the analysis on a small portion of it (less than a year for most sections, and not all of the relevant categories).

I'll try to profile the memory usage, but I hope the attempt isn't futile. :p
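One possible way to do that profiling, as a minimal sketch using only the standard library's `tracemalloc`. The helper `peak_memory_mib` and the stand-in workload `build_big_list` are hypothetical names for illustration and are not part of this repository. Note that `tracemalloc` only sees allocations that go through Python's memory allocator, so memory grabbed directly by C extensions may be under-counted.

```python
import tracemalloc


def peak_memory_mib(func, *args, **kwargs):
    """Run func and return (result, peak traced allocation in MiB)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / 2**20


def build_big_list(n):
    # Hypothetical stand-in for the real analysis step.
    return [float(i) for i in range(n)]


result, peak = peak_memory_mib(build_big_list, 1_000_000)
print(f"peak: {peak:.1f} MiB")
```

Wrapping each stage of the analysis separately in such a helper would show which stage dominates, which is usually more useful than a single number for the whole run.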

karpathy commented 8 years ago

Try decreasing the number of words for the bigrams. Right now it's 20,000, but you can probably get away with far fewer, e.g. 10,000, and still get reasonable performance.
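To see why the vocabulary size matters, here is a back-of-the-envelope sketch. It assumes (this is not stated in the thread) that the analysis at some point holds a dense document-by-feature matrix of floats. `dense_matrix_gib` is a hypothetical helper, and the 16,000-document figure is the corpus size reported above.

```python
def dense_matrix_gib(n_docs: int, n_features: int, bytes_per_value: int = 8) -> float:
    """Approximate size in GiB of a dense n_docs x n_features matrix."""
    return n_docs * n_features * bytes_per_value / 2**30


N_DOCS = 16_000  # corpus size reported in this issue

for n_features in (20_000, 10_000, 5_000):
    for name, width in (("float64", 8), ("float32", 4)):
        size = dense_matrix_gib(N_DOCS, n_features, width)
        print(f"{n_features:>6} features, {name}: {size:.2f} GiB")
```

Under these assumptions a single float64 matrix at 20,000 features is about 2.4 GiB, and halving the feature count halves it. Temporary copies and the raw text held in memory are not counted here, so the real peak can be several times larger than this estimate.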

Moredread commented 8 years ago

That helps a bit, but not enough to stay under 8GB. Luckily, I have a bit more available on another machine.