Open mikemccand opened 6 years ago
I like this idea a lot. It's progress over perfection, and it would simplify the accounting in IW dramatically (on the other hand, I think it's nice to have this accounting for assertion purposes, i.e. just to make sure we have correct counts)!
[Legacy Jira: Simon Willnauer (@s1monw) on May 18 2018]
I have thought about this, and I am personally against the idea: we won't be able to merge segments that large, which creates a really big trap.
[Legacy Jira: Robert Muir (@rmuir) on May 18 2018]
Also, I think the IW accounting needs to stay. Considering we can reasonably merge segments of ~1B docs, I think it makes sense to up the limit to 16B or so, but any higher gets into trappy territory. I strongly feel it can't be "unlimited" as long as a single segment is limited.
But I'm concerned about whether this small increase is worth the complexity cost, both for users and for the code: it certainly won't make things any simpler. I can also see people complaining about what seems like an "arbitrary" limit in the code, even though it's no more arbitrary than 2B. But we could try it out and see what it looks like?
[Legacy Jira: Robert Muir (@rmuir) on May 18 2018]
> we could try it out and see what it looks like?
+1 I'd be curious to know how much of a rabbit hole this change would end up being.
[Legacy Jira: Adrien Grand (@jpountz) on May 18 2018]
Part of the rabbit hole would be the number of segments. TMP (TieredMergePolicy) has a default max merged segment size of 5 GB, for instance. We could certainly up that or create a new merge policy for indexes with lots of docs...
On a separate note, I've seen instances of terabyte-scale indexes on disk. Allowing that to grow by a factor of 8 would be another part of the rabbit hole.
That said, I'm not against the idea at all. I'm pretty sure operational issues would pop out, but that's progress...
[Legacy Jira: Erick Erickson (@ErickErickson) on Feb 14 2020]
I would like to start discussing removing the limit of ~2B documents that we have for indices, while still enforcing it at the segment level for practical reasons.
Postings, stored fields, and all other codec APIs would keep working on integers to represent doc ids. Only top-level doc ids and numbers of documents would need to move to a long. I say "only" because we now mostly consume indices per-segment, but there are still a number of places where we identify documents by their top-level doc ID, like IndexReader#document, top-docs collectors, etc.
Legacy Jira details
LUCENE-8321 by Adrien Grand (@jpountz) on May 18 2018, updated Feb 14 2020
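To make the proposal in the description concrete, here is a minimal sketch of how a top-level doc ID held in a long could be resolved to a segment ordinal plus an int segment-local doc ID, so that per-segment codec APIs keep working on ints. This is an illustration only, not Lucene code: the class `TopLevelDocIds`, its methods, and the `SegmentDoc` holder are hypothetical names introduced here, and the only assumption is that each segment's doc count still fits in an int.

```java
/** Hypothetical sketch, not part of Lucene: long top-level doc IDs over int-limited segments. */
public final class TopLevelDocIds {

  /** Segment ordinal plus the int doc ID local to that segment. */
  public static final class SegmentDoc {
    public final int segment;
    public final int localDocId;

    SegmentDoc(int segment, int localDocId) {
      this.segment = segment;
      this.localDocId = localDocId;
    }
  }

  /**
   * Prefix sums of per-segment doc counts. bases[i] is the first top-level doc ID
   * of segment i, and the last entry is the total doc count of the index.
   */
  public static long[] docBases(int[] segmentMaxDocs) {
    long[] bases = new long[segmentMaxDocs.length + 1];
    for (int i = 0; i < segmentMaxDocs.length; i++) {
      bases[i + 1] = bases[i] + segmentMaxDocs[i]; // long arithmetic, cannot overflow an int
    }
    return bases;
  }

  /** Resolves a long top-level doc ID to (segment, int local doc ID). */
  public static SegmentDoc resolve(long[] docBases, long topLevelDocId) {
    long total = docBases[docBases.length - 1];
    if (topLevelDocId < 0 || topLevelDocId >= total) {
      throw new IndexOutOfBoundsException("doc ID " + topLevelDocId + " not in [0, " + total + ")");
    }
    // Binary search for the largest segment ordinal whose base is <= topLevelDocId.
    int lo = 0;
    int hi = docBases.length - 2; // last segment ordinal
    while (lo < hi) {
      int mid = (lo + hi + 1) >>> 1;
      if (docBases[mid] <= topLevelDocId) {
        lo = mid;
      } else {
        hi = mid - 1;
      }
    }
    // The difference is smaller than the segment's int doc count, so the cast is safe.
    return new SegmentDoc(lo, (int) (topLevelDocId - docBases[lo]));
  }
}
```

With two segments of 2,000,000,000 docs each, `resolve(bases, 3_000_000_000L)` lands in segment 1 at local doc ID 1,000,000,000: the top-level ID no longer fits in an int, but the per-segment ID still does.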