mikemccand / stargazers-migration-test

Testing Lucene's Jira -> GitHub issues migration

Indexing fails with "too many tokens for field" when using custom term frequencies [LUCENE-8947] #944

Closed mikemccand closed 3 years ago

mikemccand commented 5 years ago

We are using custom term frequencies (LUCENE-7854) to index per-token scoring signals; however, for one document that had many tokens, each carrying fairly large (~998,000) scoring signals, we hit this exception:

2019-08-05T21:32:37,048 [ERROR] (LuceneIndexing-3-thread-3) com.amazon.lucene.index.IndexGCRDocument: Failed to index doc: java.lang.IllegalArgumentException: too many tokens for field "foobar"
at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:825)
at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:394)
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:297)
at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:450)
at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1291)
at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1264)

This is happening in this code in DefaultIndexingChain.java:

  try {
    invertState.length = Math.addExact(invertState.length, invertState.termFreqAttribute.getTermFrequency());
  } catch (ArithmeticException ae) {
    throw new IllegalArgumentException("too many tokens for field \"" + field.name() + "\"");
  }

This is where Lucene accumulates the total length (number of tokens) for the field.  But does total length really make sense if you are using custom term frequencies to hold arbitrary scoring signals?  Or maybe it does, if the user is using this as simple boosting, but maybe we should allow this length to be a long?
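A minimal, self-contained sketch (illustrative, not Lucene source) of how that accumulation overflows: each token's custom term frequency is added into an int field length via Math.addExact, so a few thousand tokens carrying ~998,000 each are enough to trip the exception. The class and field name here are made up for the example.

```java
// Sketch of the per-field length accumulation shown above (not Lucene code).
public class FieldLengthOverflow {

    // Sum per-token term frequencies the way the indexing chain does.
    static int accumulate(int[] termFrequencies) {
        int length = 0;
        for (int tf : termFrequencies) {
            try {
                // Throws ArithmeticException once the sum exceeds Integer.MAX_VALUE.
                length = Math.addExact(length, tf);
            } catch (ArithmeticException ae) {
                throw new IllegalArgumentException("too many tokens for field \"foobar\"");
            }
        }
        return length;
    }

    public static void main(String[] args) {
        // 2,000 tokens at 998,000 each: 1,996,000,000 still fits in an int.
        int[] fits = new int[2000];
        java.util.Arrays.fill(fits, 998_000);
        System.out.println(accumulate(fits));

        // 2,200 tokens at 998,000 each would sum to 2,195,600,000,
        // past Integer.MAX_VALUE (2,147,483,647).
        int[] overflows = new int[2200];
        java.util.Arrays.fill(overflows, 998_000);
        try {
            accumulate(overflows);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

So the limit is not on token count alone; it is the sum of the custom frequencies within one field of one document that must stay under Integer.MAX_VALUE.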


Legacy Jira details

LUCENE-8947 by Michael McCandless (@mikemccand) on Aug 06 2019, resolved Jan 16 2021 Pull requests: https://github.com/apache/lucene-solr/pull/2080

mikemccand commented 5 years ago

Changing it to a long might be challenging for norms, since the current encoding relies on the fact that the length is an integer. Are you using norms? I guess not? Maybe we could skip computing the field length when norms are disabled?
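A sketch of what that suggestion could look like, assuming the inverting code can see whether the field omits norms. The method and parameter names are hypothetical, not the actual patch:

```java
// Hypothetical sketch of the suggestion above: only accumulate the field
// length when norms are enabled, since the length feeds only the norms
// encoding; with norms disabled, the overflow can no longer occur.
public class NormsAwareLength {

    static int addTermFrequency(boolean omitNorms, int length, int termFreq) {
        if (omitNorms) {
            // Length is never consumed downstream: skip accumulation entirely.
            return length;
        }
        try {
            return Math.addExact(length, termFreq);
        } catch (ArithmeticException ae) {
            throw new IllegalArgumentException("too many tokens for field");
        }
    }

    public static void main(String[] args) {
        // With norms omitted, even a near-overflow length is left untouched.
        System.out.println(addTermFrequency(true, Integer.MAX_VALUE - 1, 998_000));
        // With norms enabled, the same call overflows.
        try {
            addTermFrequency(false, Integer.MAX_VALUE - 1, 998_000);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```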

[Legacy Jira: Adrien Grand (@jpountz) on Aug 14 2019]

mikemccand commented 5 years ago

Indeed we disable norms ... that’s a good idea to skip length accumulation when norms are disabled.  I’ll give that a shot.

[Legacy Jira: Michael McCandless (@mikemccand) on Aug 14 2019]

mikemccand commented 4 years ago

I opened a PR to fix this issue: https://github.com/apache/lucene-solr/pull/2080.

[Legacy Jira: Duan Li on Nov 13 2020]

mikemccand commented 4 years ago

Thanks @dxl360, I'll look!

[Legacy Jira: Michael McCandless (@mikemccand) on Nov 13 2020]

mikemccand commented 3 years ago

It turns out we could not find a safe way to fix this, so users must not write so many, or such large, custom term frequencies that their sum overflows a Java int within a single field of a single document.
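Given that resolution, a caller-side guard is the practical workaround: validate the per-field sum before handing the document to IndexWriter. A hedged sketch, where the class and method names are illustrative rather than Lucene API:

```java
import java.util.Collections;
import java.util.List;

// Hypothetical client-side guard reflecting the resolution above: the sum of
// custom term frequencies for one field of one document must fit in an int.
public class TermFreqGuard {

    static void checkFieldLength(String fieldName, List<Integer> termFrequencies) {
        long sum = 0; // accumulate in a long so the check itself cannot overflow
        for (int tf : termFrequencies) {
            sum += tf;
        }
        if (sum > Integer.MAX_VALUE) {
            throw new IllegalArgumentException(
                "sum of custom term frequencies (" + sum
                    + ") overflows a Java int for field \"" + fieldName + "\"");
        }
    }

    public static void main(String[] args) {
        // Two tokens at 998,000 each: well under the limit, passes silently.
        checkFieldLength("foobar", List.of(998_000, 998_000));
        // 2,200 tokens at 998,000 each: the sum exceeds Integer.MAX_VALUE.
        try {
            checkFieldLength("foobar", Collections.nCopies(2200, 998_000));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Running this guard before indexing turns the mid-indexing IllegalArgumentException into a predictable pre-flight failure that the application can handle.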

[Legacy Jira: Michael McCandless (@mikemccand) on Jan 16 2021]