lintool / twitter-tools

Twitter Tools
twittertools.cc

IndexStatuses OOM for very large collections (i.e. Tweets2013) #17

Closed (isoboroff closed 11 years ago)

isoboroff commented 11 years ago

2013-04-17 07:16:46,041 [main] INFO IndexStatuses - 276300000 statuses indexed
2013-04-17 07:17:10,442 [main] INFO IndexStatuses - 276400000 statuses indexed
2013-04-17 07:17:30,239 [main] INFO IndexStatuses - Total of 276485008 statuses added
2013-04-17 07:17:30,239 [main] INFO IndexStatuses - Merging segments...
java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot flush
    at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:2908)
    at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:2901)
    at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1645)
    at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1621)
    at cc.twittertools.search.indexing.IndexStatuses.main(IndexStatuses.java:145)

After this error, the destination directory is empty, so we have to start from scratch.

Solution 1: bump up JVM settings in etc/run.sh
Solution 2: avoid OOM better?

isoboroff commented 11 years ago

Crash is at line 156 in IndexStatuses.java. Unsure why we are left with an empty directory.

isoboroff commented 11 years ago

Testing with -Xmx8g in run.sh
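One way to confirm that an -Xmx setting in run.sh actually reached the JVM, and to stop before indexing begins if it did not, is a startup check on the heap ceiling. The following is a minimal sketch, not part of twitter-tools: the class `HeapCheck`, its methods, and the 7 GiB threshold are all hypothetical. The threshold sits a little below 8 GiB because `Runtime.maxMemory()` can report slightly less than the -Xmx value, depending on the garbage collector.

```java
public final class HeapCheck {
    private HeapCheck() {}

    /** Maximum heap the JVM will attempt to use, in bytes (reflects -Xmx). */
    public static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** True if the given heap ceiling meets the required size. */
    public static boolean isSufficient(long maxHeapBytes, long requiredBytes) {
        return maxHeapBytes >= requiredBytes;
    }

    /** Fails fast with a readable message instead of an OOM much later. */
    public static void requireHeap(long requiredBytes) {
        long max = maxHeapBytes();
        if (!isSufficient(max, requiredBytes)) {
            throw new IllegalStateException(String.format(
                "JVM max heap is %d MiB, need at least %d MiB; raise -Xmx",
                max >> 20, requiredBytes >> 20));
        }
    }

    public static void main(String[] args) {
        // Hypothetical threshold: 7 GiB, assumed adequate for a run started with -Xmx8g.
        long required = 7L << 30;
        long max = maxHeapBytes();
        System.out.println("max heap (MiB): " + (max >> 20));
        System.out.println("sufficient: " + isSufficient(max, required));
    }
}
```

An indexer could call `HeapCheck.requireHeap(...)` as the first statement of its `main` so that a misconfigured heap is reported immediately.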

stewhdcs commented 11 years ago

Was that successful? A custom MergePolicy for the IndexWriter might be required?

isoboroff commented 11 years ago

-Xmx8G was successful. The final index is 48GB for 276M statuses, taking about 19-20 hours on my old Mac Pro (4 processors, 32GB RAM). It would be nice if IndexStatuses were a little more robust under low-memory conditions, but that might be hard to catch. At any rate, I'm going to close this issue and add a wishlist issue for memory handling in IndexStatuses.
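For a rough sense of scale, the figures in this thread can be turned into per-status numbers. This is a back-of-the-envelope sketch only: it assumes "48GB" means 48 × 10^9 bytes, takes 19.5 hours as the midpoint of the reported run time, and uses the status count from the log above. The class `IndexStats` and its methods are hypothetical.

```java
public final class IndexStats {
    private IndexStats() {}

    /** Average index size per status, in bytes. */
    public static double bytesPerStatus(double indexBytes, long statuses) {
        return indexBytes / statuses;
    }

    /** Average indexing throughput, in statuses per second. */
    public static double statusesPerSecond(long statuses, double hours) {
        return statuses / (hours * 3600.0);
    }

    public static void main(String[] args) {
        long statuses = 276_485_008L;   // "Total of 276485008 statuses added"
        double indexBytes = 48e9;       // assumption: 48GB read as decimal gigabytes
        double hours = 19.5;            // assumption: midpoint of "19-20 hours"

        System.out.printf("bytes per status: %.1f%n", bytesPerStatus(indexBytes, statuses));
        System.out.printf("statuses per second: %.0f%n", statusesPerSecond(statuses, hours));
    }
}
```

Under those assumptions the index costs on the order of 170 to 180 bytes per status, and the run averaged a little under 4,000 statuses per second.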