Changing it to a long might be challenging for norms, since the current encoding relies on the fact that the length is an integer. Are you using norms, I guess not? Maybe we could skip computing the field length when norms are disabled?
[Legacy Jira: Adrien Grand (@jpountz) on Aug 14 2019]
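For context on the sizes involved: a `long` would have ample headroom for these per-field sums; the difficulty is the norms encoding, which assumes the length fits in an `int`. A quick plain-Java sanity check (not Lucene code; the ~998,000 figure is the signal size from the original report):

```java
// Plain-Java arithmetic check, not Lucene code: summing per-token custom
// term frequencies of ~998,000 blows past Integer.MAX_VALUE after a couple
// thousand tokens, while a long sum has enormous headroom.
public class LengthHeadroom {
    public static void main(String[] args) {
        long sum = 0;
        for (int i = 0; i < 1_000_000; i++) { // one million tokens...
            sum += 998_000;                   // ...each carrying a large custom frequency
        }
        System.out.println(sum);                     // 998000000000
        System.out.println(sum > Integer.MAX_VALUE); // true: an int sum would have wrapped
    }
}
```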
Indeed we disable norms ... that’s a good idea to skip length accumulation when norms are disabled. I’ll give that a shot.
[Legacy Jira: Michael McCandless (@mikemccand) on Aug 14 2019]
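A minimal sketch of that idea (hypothetical names, not Lucene's actual `DefaultIndexingChain` code): guard the per-field length accumulation on whether norms are enabled, so fields indexed with norms omitted can never overflow the sum.

```java
// Hypothetical sketch of the proposed fix, not Lucene's actual code:
// only accumulate the per-field length when norms are enabled.
public class FieldLengthAccumulator {
    private final boolean omitNorms; // mirrors FieldType#omitNorms (assumption)
    private int length;              // per-field token length, an int as in Lucene

    public FieldLengthAccumulator(boolean omitNorms) {
        this.omitNorms = omitNorms;
    }

    /** Called once per token with its (possibly custom) term frequency. */
    public void addToken(int termFreq) {
        if (!omitNorms) {
            // Math.addExact fails loudly on int overflow instead of wrapping silently.
            length = Math.addExact(length, termFreq);
        }
    }

    public int length() {
        return length;
    }

    public static void main(String[] args) {
        FieldLengthAccumulator normsOff = new FieldLengthAccumulator(true);
        for (int i = 0; i < 10_000; i++) {
            normsOff.addToken(998_000); // would overflow an int sum after ~2,151 tokens
        }
        System.out.println(normsOff.length()); // 0: length skipped when norms are off
    }
}
```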
I opened a PR to fix this issue: https://github.com/apache/lucene-solr/pull/2080.
[Legacy Jira: Duan Li on Nov 13 2020]
Thanks @dxl360, I'll look!
[Legacy Jira: Michael McCandless (@mikemccand) on Nov 13 2020]
It turns out we cannot find a safe way to fix this, so users must not write so many, or such large, custom term frequencies that their sum overflows a Java int in a single field / document.
[Legacy Jira: Michael McCandless (@mikemccand) on Jan 16 2021]
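To make the limit concrete, here is a self-contained demo (plain Java, no Lucene) of how quickly frequencies of ~998,000 overflow an int sum within a single field of a single document:

```java
// Plain-Java demo of the documented limit: the sum of custom term
// frequencies within one field of one document must fit in a Java int.
public class TermFreqOverflow {
    /** Counts how many tokens of the given frequency fit before the int sum overflows. */
    public static int tokensBeforeOverflow(int termFreq) {
        int sum = 0;
        int tokens = 0;
        while (true) {
            try {
                sum = Math.addExact(sum, termFreq); // throws ArithmeticException on overflow
                tokens++;
            } catch (ArithmeticException e) {
                return tokens;
            }
        }
    }

    public static void main(String[] args) {
        // Integer.MAX_VALUE / 998_000 == 2151, so the 2,152nd token overflows.
        System.out.println(tokensBeforeOverflow(998_000)); // 2151
    }
}
```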
We are using custom term frequencies (LUCENE-7854) to index per-token scoring signals; however, for one document that had many tokens, and those tokens had fairly large (~998,000) scoring signals, we hit this exception:
This is happening in this code in `DefaultIndexingChain.java`, where Lucene is accumulating the total length (number of tokens) for the field. But total length doesn't really make sense if you are using custom term frequencies to hold arbitrary scoring signals? Or maybe it does make sense, if the user is using this as simple boosting, but maybe we should allow this length to be a `long`?

Legacy Jira details
LUCENE-8947 by Michael McCandless (@mikemccand) on Aug 06 2019, resolved Jan 16 2021. Pull request: https://github.com/apache/lucene-solr/pull/2080