mikemccand / stargazers-migration-test

Testing Lucene's Jira -> GitHub issues migration

ArrayIndexOutOfBoundsException in TermsHashPerField.writeByte during indexing [LUCENE-8118] #119

Open mikemccand opened 6 years ago

mikemccand commented 6 years ago

Indexing a large collection of about 20 million paragraph-sized documents results in an ArrayIndexOutOfBoundsException in org.apache.lucene.index.TermsHashPerField.writeByte (full stack trace below).

The bug is possibly related to the issues described here and in SOLR-10936 – but I am not using Solr; I am using Lucene Core directly.

The issue can be reproduced using the code from the GitHub trec-car-tools-example project; paragraphCorpus.cbor is contained in this archive.

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -65536
    at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:198)
    at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:224)
    at org.apache.lucene.index.FreqProxTermsWriterPerField.addTerm(FreqProxTermsWriterPerField.java:159)
    at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:185)
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:786)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:392)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:281)
    at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:451)
    at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1532)
    at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1508)
    at edu.unh.cs.TrecCarBuildLuceneIndex.main(TrecCarBuildLuceneIndex.java:55)


Legacy Jira details

LUCENE-8118 by Laura Dietz on Jan 05 2018, updated Jul 12 2021

Environment:

Debian/Stretch
java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)

Attachments: LUCENE-8118_test.patch

mikemccand commented 6 years ago

Looking at your code it seems that there is only one commit at the end, and your collection is big. What if you try to commit every, let's say, 50k docs?

[Legacy Jira: Diego Ceccarelli on Jan 05 2018]
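
For illustration, a minimal sketch of the workaround Diego suggests, assuming an indexing loop similar to the one in trec-car-tools-example (the helper name, commit interval, and iterable are illustrative, not taken from the original code):

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    // Hypothetical helper: add documents one at a time and commit periodically,
    // instead of issuing a single addDocuments() call over the whole corpus.
    static void indexWithPeriodicCommits(IndexWriter writer,
                                         Iterable<Document> docs) throws IOException {
      final int commitInterval = 50_000;  // Diego's suggested interval; tune as needed
      long count = 0;
      for (Document doc : docs) {
        writer.addDocument(doc);          // one document per call, no atomic block
        if (++count % commitInterval == 0) {
          writer.commit();                // flush buffered state to the index
        }
      }
      writer.commit();                    // final commit for the remaining documents
    }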

mikemccand commented 6 years ago

Yes, that works - Thanks, Diego!

I think I would have been helped by an exception message that indicates "Buffer full, call index.commit!"

[Legacy Jira: Laura Dietz on Jan 05 2018]

mikemccand commented 6 years ago

Issuing unnecessary commits is just masking the issue: you shouldn't see this exception.

[Legacy Jira: Robert Muir (@rmuir) on Jan 05 2018]

mikemccand commented 6 years ago

I agree, that was just a workaround for @laura-dietz :)

[Legacy Jira: Diego Ceccarelli on Jan 05 2018]

mikemccand commented 6 years ago

Robert, that would be even better!

It is difficult to guess what the right interval for issuing commits is. I understand that some hand tuning might be necessary to get the highest performance under given resource constraints. If the issue is a buffer that is filling up, it would be helpful to have some form of emergency auto-commit.

[Legacy Jira: Laura Dietz on Jan 05 2018]

mikemccand commented 6 years ago

It is nothing like that, it is simply a bug.

[Legacy Jira: Robert Muir (@rmuir) on Jan 05 2018]

mikemccand commented 6 years ago

Well, I understand the bug, but I'm not sure what the fix is.

The indexing code implements Iterable etc. to pull in the docs and makes one single call to addDocuments().

This is supposed to be an "atomic add" of multiple documents at once, which gives certain guarantees: it is needed for nested documents and features like that, so the document IDs will be aligned in a particular way.

In your case it's too much data; IndexWriter isn't going to be able to do 200M docs in one operation like this.

[Legacy Jira: Robert Muir (@rmuir) on Jan 05 2018]
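
To illustrate the distinction Robert draws, here is a hedged sketch of the two call patterns (the documents and field names are made up): addDocuments() writes one contiguous, atomic block, which is what nested-document (block join) search relies on, while independent documents go through separate addDocument() calls so the writer can flush between them.

    import java.io.IOException;
    import java.util.Arrays;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;

    // Sketch contrasting the two APIs; "writer" is an open IndexWriter and the
    // documents are invented for illustration.
    static void blockVersusIndependentAdds(IndexWriter writer,
                                           Iterable<Document> corpus) throws IOException {
      // Block ("atomic") add: children followed by their parent, indexed as one
      // contiguous block -- the guarantee that block-join / nested-document search needs.
      Document child1 = new Document();
      child1.add(new TextField("body", "first paragraph", Field.Store.NO));
      Document child2 = new Document();
      child2.add(new TextField("body", "second paragraph", Field.Store.NO));
      Document parent = new Document();
      parent.add(new StringField("docType", "parent", Field.Store.NO));
      writer.addDocuments(Arrays.asList(child1, child2, parent));

      // Independent documents: one addDocument() call each, so IndexWriter is free
      // to flush to disk whenever its RAM buffer fills up.
      for (Document doc : corpus) {
        writer.addDocument(doc);
      }
    }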

mikemccand commented 6 years ago

Whatever we decide to do, we can be sure that AIOOBE is not the right answer :)

[Legacy Jira: Robert Muir (@rmuir) on Jan 05 2018]

mikemccand commented 6 years ago

I think my mistake was to abuse addDocuments(iterator).

I switched to addDocument(doc) with a commit every so often (see the master branch).

[Legacy Jira: Laura Dietz on Jan 05 2018]

mikemccand commented 6 years ago

Yeah, but we still need to fix the case where someone passes too many documents for addDocuments to succeed: it needs to be better than AIOOBE.

[Legacy Jira: Robert Muir (@rmuir) on Jan 05 2018]

mikemccand commented 6 years ago

Here's a really bad test, but it works (takes about 2 minutes).

lucene/core$ ant test -Dtestcase=TestIndexWriter -Dtestmethod=testAddDocumentsMassive -Dtests.heapsize=4G

  [junit4] <JUnit4> says مرحبا! Master seed: 1655BF16A8843A6A
   [junit4] Executing 1 suite with 1 JVM.
   [junit4] 
   [junit4] Started J0 PID(22813@localhost).
   [junit4] Suite: org.apache.lucene.index.TestIndexWriter
   [junit4] HEARTBEAT J0 PID(22813@localhost): 2018-01-05T11:27:48, stalled for 71.2s at: TestIndexWriter.testAddDocumentsMassive
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestIndexWriter -Dtests.method=testAddDocumentsMassive -Dtests.seed=1655BF16A8843A6A -Dtests.locale=fr-FR -Dtests.timezone=Asia/Oral -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] ERROR    121s | TestIndexWriter.testAddDocumentsMassive <<<
   [junit4]    > Throwable #1: java.lang.ArrayIndexOutOfBoundsException: -65536
   [junit4]    >    at __randomizedtesting.SeedInfo.seed([1655BF16A8843A6A:2B0B86082D338FEA]:0)
   [junit4]    >    at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:198)
   [junit4]    >    at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:224)
   [junit4]    >    at org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:80)
   [junit4]    >    at org.apache.lucene.index.FreqProxTermsWriterPerField.addTerm(FreqProxTermsWriterPerField.java:171)
   [junit4]    >    at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:185)
   [junit4]    >    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:786)
   [junit4]    >    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
   [junit4]    >    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:392)
   [junit4]    >    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:281)
   [junit4]    >    at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:452)
   [junit4]    >    at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1530)
   [junit4]    >    at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1506)
   [junit4]    >    at org.apache.lucene.index.TestIndexWriter.testAddDocumentsMassive(TestIndexWriter.java:2994)
   [junit4]    >    at java.lang.Thread.run(Thread.java:745)
   [junit4]   2> NOTE: leaving temporary files on disk at: /home/rmuir/workspace/lucene-solr/lucene/build/core/test/J0/temp/lucene.index.TestIndexWriter_1655BF16A8843A6A-001
   [junit4]   2> NOTE: test params are: codec=Asserting(Lucene70), sim=Asserting(org.apache.lucene.search.similarities.AssertingSimilarity@2a213e8b), locale=fr-FR, timezone=Asia/Oral
   [junit4]   2> NOTE: Linux 4.4.0-104-generic amd64/Oracle Corporation 1.8.0_45 (64-bit)/cpus=8,threads=1,free=543468648,total=2733637632
   [junit4]   2> NOTE: All tests run in this JVM: [TestIndexWriter]
   [junit4] Completed [1/1 (1!)] in 121.55s, 1 test, 1 error <<< FAILURES!
   [junit4] 
   [junit4] 
   [junit4] Tests with failures [seed: 1655BF16A8843A6A]:
   [junit4]   - org.apache.lucene.index.TestIndexWriter.testAddDocumentsMassive
   [junit4] 
   [junit4] 
   [junit4] JVM J0:     0.38 ..   122.56 =   122.18s
   [junit4] Execution time total: 2 minutes 2 seconds
   [junit4] Tests summary: 1 suite, 1 test, 1 error

BUILD FAILED
/home/rmuir/workspace/lucene-solr/lucene/common-build.xml:1512: The following error occurred while executing this line:
/home/rmuir/workspace/lucene-solr/lucene/common-build.xml:1038: There were test failures: 1 suite, 1 test, 1 error [seed: 1655BF16A8843A6A]

Total time: 2 minutes 5 seconds

[Legacy Jira: Robert Muir (@rmuir) on Jan 05 2018]
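
The attached LUCENE-8118_test.patch is not reproduced in this thread; the following is only a conjectural sketch of what such a test could look like, assuming it lives in TestIndexWriter (a LuceneTestCase subclass, so newDirectory(), newIndexWriterConfig(), and random() come from the test framework), with made-up field names and document counts:

    // Conjectural sketch only -- not the actual attached patch.
    public void testAddDocumentsMassive() throws Exception {
      try (Directory dir = newDirectory();
           IndexWriter w = new IndexWriter(dir, newIndexWriterConfig())) {
        // An Iterable that lazily generates far more documents than one atomic
        // block can hold, so addDocuments() runs out of addressable buffer space.
        Iterable<Document> docs = () -> new Iterator<Document>() {
          int i = 0;
          @Override public boolean hasNext() { return i < 200_000_000; }
          @Override public Document next() {
            Document doc = new Document();
            doc.add(new TextField("field", "some text " + random().nextInt() + " " + (i++),
                                  Field.Store.NO));
            return doc;
          }
        };
        w.addDocuments(docs);   // expected to blow up long before the iterator is exhausted
      }
    }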

mikemccand commented 6 years ago

The test had to work hard to hit AIOOBE instead of OOM.

I think most users who do something like this will hit OOM, which is just as confusing and bad. It may technically be a different problem, but due to the names of the methods and the APIs, I think it's easy for someone to hit it too. It seems like add/updateDocuments need some sanity checks...

[Legacy Jira: Robert Muir (@rmuir) on Jan 05 2018]

mikemccand commented 6 years ago

Note that committing only once at the end is entirely normal and often exactly the right choice.

It's hard to know how to fix this – we could add a best-effort check: if the RAM usage of that one in-memory segment (DWPT) exceeds the hard limit (IWC.setRAMPerThreadHardLimitMB), we throw a better exception?

[Legacy Jira: Michael McCandless (@mikemccand) on Jan 06 2018]
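
For context, a sketch of the existing configuration knobs Mike refers to; the best-effort check itself would be new behavior and is not shown here (the values are examples and openWriter is a made-up helper):

    import java.io.IOException;
    import java.nio.file.Path;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    // IndexWriterConfig exposes both the flush trigger and the per-thread (DWPT)
    // hard limit Mike mentions.
    static IndexWriter openWriter(Path indexPath) throws IOException {
      IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
      iwc.setRAMBufferSizeMB(256.0);          // flush a new segment once ~256 MB is buffered
      iwc.setRAMPerThreadHardLimitMB(1945);   // per-DWPT hard limit (must stay below 2048 MB)
      return new IndexWriter(FSDirectory.open(indexPath), iwc);
    }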

mikemccand commented 6 years ago

Well, I think a simple limit can work. For this API, e.g. a simple counter that throws an exception if the iterator has over 100k docs.

[Legacy Jira: Robert Muir (@rmuir) on Jan 06 2018]

mikemccand commented 6 years ago

Well, I think a simple limit can work. For this API, e.g. a simple counter that throws an exception if the iterator has over 100k docs.

+1

[Legacy Jira: Michael McCandless (@mikemccand) on Jan 06 2018]
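
A hedged sketch of what such a sanity check could look like, written here as a caller-side wrapper rather than a change inside IndexWriter (addDocumentsChecked and maxDocsPerBlock are hypothetical names, not Lucene APIs):

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    // Hypothetical caller-side guard: counts the documents in the block and refuses
    // to pass an unreasonably large iterable to the atomic addDocuments() call.
    // Assumes the Iterable can be iterated more than once.
    static void addDocumentsChecked(IndexWriter writer,
                                    Iterable<Document> block,
                                    int maxDocsPerBlock) throws IOException {
      int count = 0;
      for (Document ignored : block) {
        if (++count > maxDocsPerBlock) {
          throw new IllegalArgumentException(
              "block contains more than " + maxDocsPerBlock
                  + " documents; use individual addDocument() calls instead");
        }
      }
      writer.addDocuments(block);   // still one atomic block for the (bounded) group
    }

Robert's suggestion is to build an equivalent bound into add/updateDocuments itself; 100k is just his example threshold.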

mikemccand commented 6 years ago

OOMs are complicated in general because once you hit one, there's a very real risk that you won't be able to recover anyway (even constructing a new exception message typically requires memory allocation and this just goes on and on in a vicious cycle). I remember thinking about it a lot in the early days of randomizedrunner, but without any constructive conclusions. I tried preallocating stuff in advance (not possible in all cases) and workarounds like keeping a memory buffer that is made reclaimable on OOM (so that there's some memory available before we hit the next one)... these are hacks more than solutions and they don't always work anyway (as in when you have background heap-competing threads...).

I like Java, but it starts to show its wrinkles. :(

[Legacy Jira: Dawid Weiss (@dweiss) on Jan 06 2018]

mikemccand commented 6 years ago

Dawid, Michael, my computer has plenty of RAM, which is why I never see an OOM exception and always get the AIOOBE.

[Legacy Jira: Laura Dietz on Jan 06 2018]

mikemccand commented 6 years ago

Dawid, it is not complicated in this case. It is trivial to fix.

Again, to explain:

There are a couple of problems here. The first is the method's name: addDocuments is not the plural form of addDocument, it's something totally different altogether. It needs to be addDocumentsAtomic or addDocumentsBlock or something else, anything else. It's also missing bounds checks, which is why you see the AIOOBE; those need to be added.

[Legacy Jira: Robert Muir (@rmuir) on Jan 06 2018]

mikemccand commented 6 years ago

+1

[Legacy Jira: Laura Dietz on Jan 06 2018]

mikemccand commented 3 years ago

Here's another stack trace, slightly different from the one in the original post. My crash happens not with addDocuments, but with addDocument. The maximum heap size is 16 GB, and the Lucene version is 8.5.2.

Note: I got this stacktrace from a user, and have no access to the data being indexed, so my testing ability is very limited.

org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
    at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:681)
    at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:695)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1591)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1213)
    at com.docfetcherpro.model.TreeFolderWrapper$.addDoc(TreeModelWrapper.scala:189)
    at com.docfetcherpro.model.TreeNodeWrapper.addDoc(TreeModelWrapper.scala:533)
    at com.docfetcherpro.model.TreeNodeWrapper.update(TreeModelWrapper.scala:315)
    at com.docfetcherpro.model.TreeUpdate$.updateNodePair(TreeUpdate.scala:333)
    at com.docfetcherpro.model.TreeUpdate$.$anonfun$update$6(TreeUpdate.scala:137)
    at com.docfetcherpro.model.TreeUpdate$.$anonfun$update$6$adapted(TreeUpdate.scala:133)
    at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:985)
    at scala.collection.immutable.List.foreach(List.scala:431)
    at scala.collection.generic.TraversableForwarder.foreach(TraversableForwarder.scala:38)
    at scala.collection.generic.TraversableForwarder.foreach$(TraversableForwarder.scala:38)
    at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:47)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:984)
    at com.docfetcherpro.model.TreeUpdate$.update(TreeUpdate.scala:133)
    at com.docfetcherpro.model.IndexActor.index1(IndexActor.scala:127)
    at com.docfetcherpro.model.IndexActor.$anonfun$index$1(IndexActor.scala:18)
    at com.docfetcherpro.util.MethodActor$$anon$3.run(MethodActor.scala:86)
    at com.docfetcherpro.util.MethodActor.com$docfetcherpro$util$MethodActor$$threadLoop(MethodActor.scala:185)
    at com.docfetcherpro.util.MethodActor$$anon$2.run(MethodActor.scala:67)
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -65536 out of bounds for length 71428
    at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:198)
    at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:221)
    at org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:80)
    at org.apache.lucene.index.FreqProxTermsWriterPerField.newTerm(FreqProxTermsWriterPerField.java:121)
    at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:178)
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:862)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:442)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:406)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:250)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:495)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1594)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1213)
    at com.docfetcherpro.model.TreeFolderWrapper$.addDoc(TreeModelWrapper.scala:189)
    at com.docfetcherpro.model.TreeNodeWrapper.addDoc(TreeModelWrapper.scala:533)
    at com.docfetcherpro.model.TreeNodeWrapper.update(TreeModelWrapper.scala:310)
    ... 15 more

[Legacy Jira: Nam-Quang Tran on Jun 06 2021]

mikemccand commented 3 years ago

Update on my previous post: After some back and forth with the reporting user, it turned out that the crash was caused by some specific files named "delete.pdf". Unfortunately, the user deleted these files before I could get my hands on them. But sharing them may have been problematic anyway, as they were some old bank statements. In any case, it seems that feeding the contents of some bad PDF files to Lucene can cause AIOOBEs.

[Legacy Jira: Nam-Quang Tran on Jun 08 2021]

mikemccand commented 3 years ago

Now I had another user report to me the crash I posted above. This time I was able to pinpoint the files that were causing the crash. They were just large binary files (> 2 GB). So, I think the crash can now be reproduced relatively easily: Set a high maximum heap size (16 GB in my case), then shove a giant binary file into Lucene.

[Legacy Jira: Nam-Quang Tran on Jul 12 2021]
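
A speculative reproduction sketch based on the description above (the token count, field setup, and the assumption of a roughly 16 GB heap are guesses; the idea is only that a single enormous document can push the in-memory postings buffers past their 2 GB addressing range before any flush can happen):

    import java.io.IOException;
    import java.io.Reader;
    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class GiantDocRepro {

      // Streams a huge number of unique tokens without materializing a giant String.
      static final class SyntheticTokenReader extends Reader {
        private final long maxTokens;
        private long tokenId = 0;
        private String pending = "";
        private int pos = 0;

        SyntheticTokenReader(long maxTokens) { this.maxTokens = maxTokens; }

        @Override
        public int read(char[] cbuf, int off, int len) {
          if (len == 0) return 0;
          if (pos == pending.length()) {
            if (tokenId >= maxTokens) return -1;   // end of stream
            pending = "tok" + (tokenId++) + " ";   // each token is unique
            pos = 0;
          }
          int n = Math.min(len, pending.length() - pos);
          pending.getChars(pos, pos + n, cbuf, off);
          pos += n;
          return n;
        }

        @Override
        public void close() {}
      }

      public static void main(String[] args) throws IOException {
        // Run with a large heap (e.g. -Xmx16g, as in the report) so the JVM does
        // not hit OOM before the buffers overflow.
        IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer =
                 new IndexWriter(FSDirectory.open(Paths.get("giant-doc-index")), iwc)) {
          Document doc = new Document();
          // One field with hundreds of millions of unique tokens; the count is a
          // guess and may need to be raised until the failure appears.
          doc.add(new TextField("content", new SyntheticTokenReader(500_000_000L)));
          writer.addDocument(doc);   // expected to fail with the AIOOBE before returning
        }
      }
    }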