Allow composite readers to have more than 2B documents [LUCENE-8321]

Open mikemccand opened 6 years ago

mikemccand commented 6 years ago

I would like to start discussing removing the limit of ~2B documents that we have for indices, while still enforcing it at the segment level for practical reasons.

Postings, stored fields, and all other codec APIs would keep using integers to represent doc IDs. Only top-level doc IDs and document counts would need to move to a long. I say "only" because we now mostly consume indices per-segment, but there are still a number of places where we identify documents by their top-level doc ID, such as IndexReader#document, top-docs collectors, etc.

Legacy Jira details

LUCENE-8321 by Adrien Grand (@jpountz) on May 18 2018, updated Feb 14 2020
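
To make the proposed split concrete, here is a minimal sketch of how a long top-level doc ID could be resolved to a segment plus an int per-segment doc ID. It is not Lucene code: the class `LongDocIdResolver` and its methods are hypothetical names invented for illustration, and it assumes segments are laid out back to back, each with its own int-sized document count.

```java
import java.util.Arrays;

/** Hypothetical sketch, not a Lucene API: maps long top-level doc IDs to (segment, int doc ID). */
final class LongDocIdResolver {
    // docBases[i] is the top-level ID of the first document in segment i.
    private final long[] docBases;
    private final long totalDocs;

    LongDocIdResolver(int[] segmentMaxDocs) {
        docBases = new long[segmentMaxDocs.length];
        long base = 0; // accumulated as a long so the sum may exceed Integer.MAX_VALUE
        for (int i = 0; i < segmentMaxDocs.length; i++) {
            docBases[i] = base;
            base += segmentMaxDocs[i];
        }
        totalDocs = base;
    }

    long totalDocs() {
        return totalDocs;
    }

    /** Returns the index of the segment that contains the given top-level doc ID. */
    int segmentIndex(long globalDocId) {
        if (globalDocId < 0 || globalDocId >= totalDocs) {
            throw new IndexOutOfBoundsException("doc ID out of range: " + globalDocId);
        }
        int idx = Arrays.binarySearch(docBases, globalDocId);
        if (idx < 0) {
            // Not an exact base: the insertion point minus one is the owning segment.
            return -idx - 2;
        }
        // Exact match: empty segments share a base with their successor, so move
        // to the last segment with this base, which is the one holding the doc.
        while (idx + 1 < docBases.length && docBases[idx + 1] == globalDocId) {
            idx++;
        }
        return idx;
    }

    /** Returns the per-segment doc ID, which always fits in an int. */
    int localDocId(long globalDocId) {
        return (int) (globalDocId - docBases[segmentIndex(globalDocId)]);
    }
}
```

With three segments of 2,000,000,000, 2,000,000,000 and 5 documents, the total is 4,000,000,005, and top-level ID 4,000,000,004 resolves to segment 2 with local doc ID 4. Everything below the top level stays on ints.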

mikemccand commented 6 years ago

I like this idea a lot. It's progress over perfection, and it would simplify the accounting in IW dramatically. On the other hand, I think it's nice to have this accounting for assertion purposes, i.e., just to make sure we have correct counts.

[Legacy Jira: Simon Willnauer (@s1monw) on May 18 2018]

mikemccand commented 6 years ago

I have thought about this, and I am personally against the idea because we won't be able to merge segments that large, which creates a really big trap.

[Legacy Jira: Robert Muir (@rmuir) on May 18 2018]

mikemccand commented 6 years ago

Also, I think the IW accounting needs to stay. Considering we can reasonably merge segments of ~1B docs, I think it makes sense to up the limit to 16B or so, but any higher gets into trappy territory. I strongly feel it can't be "unlimited" as long as a single segment is limited.

But I'm concerned about whether this small increase is worth the complexity cost, both for users and for the code: it certainly won't make things any simpler. Also, I can see people complaining about what seems like an "arbitrary" limit in the code, even though it's no more arbitrary than 2B. But we could try it out and see what it looks like?

[Legacy Jira: Robert Muir (@rmuir) on May 18 2018]
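
As a rough illustration of the two limits discussed above, the following sketch keeps writer-style accounting with a long counter capped at an index-level limit, and separately checks whether a set of segments could still be merged into one int-limited segment. The class `DocCountBudget`, its methods, and both constants are hypothetical: the 16B figure is the "16B or so" floated in the comment, and the per-segment cap is an assumed value just under `Integer.MAX_VALUE`, not a value read from Lucene.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch, not a Lucene API: index-level doc accounting with a long cap. */
final class DocCountBudget {
    // Assumed per-segment cap, just under Integer.MAX_VALUE (illustrative value).
    static final int MAX_DOCS_PER_SEGMENT = Integer.MAX_VALUE - 128;
    // Assumed index-level cap, taken from the "16B or so" suggestion in the thread.
    static final long MAX_DOCS_PER_INDEX = 16_000_000_000L;

    private final AtomicLong pending = new AtomicLong();

    /** Reserves room for numDocs documents, or throws if the index cap would be exceeded. */
    void reserve(long numDocs) {
        if (numDocs < 0) {
            throw new IllegalArgumentException("numDocs must be >= 0");
        }
        long after = pending.addAndGet(numDocs);
        if (after > MAX_DOCS_PER_INDEX) {
            pending.addAndGet(-numDocs); // roll back the failed reservation
            throw new IllegalStateException("index would exceed " + MAX_DOCS_PER_INDEX + " docs");
        }
    }

    /** Returns the number of documents currently reserved. */
    long pending() {
        return pending.get();
    }

    /** Returns true if the given segments could be merged into one segment. */
    static boolean canMerge(long... segmentDocCounts) {
        long sum = 0;
        for (long count : segmentDocCounts) {
            sum += count;
        }
        return sum <= MAX_DOCS_PER_SEGMENT;
    }
}
```

Under these assumptions, two segments of 1B docs can still be merged, but a 1.5B segment and a 1B segment cannot, which is the merge trap raised above: the index-level counter may allow documents that no single merged segment could ever hold.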

mikemccand commented 6 years ago

> we could try it out and see what it looks like?

+1 I'd be curious to know how much of a rabbit hole this change would end up being.

[Legacy Jira: Adrien Grand (@jpountz) on May 18 2018]

mikemccand commented 4 years ago

Part of the rabbit hole would be the number of segments. TMP (TieredMergePolicy) has a default maximum merged segment size of 5 GB, for instance. We could certainly up that or create a new merge policy for indexes with lots of docs...

On a separate note, I've seen instances of terabyte-scale indexes on disk. Allowing those to grow by a factor of 8 would be another part of the rabbit hole.

That said, I'm not against the idea at all. I'm pretty sure operational issues would pop out, but that's progress...

[Legacy Jira: Erick Erickson (@ErickErickson) on Feb 14 2020]