mikemccand / stargazers-migration-test

Testing Lucene's Jira -> GitHub issues migration

CheckIndex: pre-exorcise document id salvage [LUCENE-8961] #958

Open mikemccand opened 5 years ago

mikemccand commented 5 years ago

The CheckIndex tool supports the exorcising of corrupt segments from an index.

This ticket proposes adding an extra option that could be run first to salvage, where possible, the document ids of the segment(s) about to be exorcised. Those documents could then be re-ingested to repair the data loss caused by the exorcising.


Legacy Jira details

LUCENE-8961 by Christine Poerschke (@cpoerschke) on Sep 02 2019, updated Oct 01 2019 Attachments: LUCENE-8961.patch (versions: 2)

mikemccand commented 5 years ago

Attached outline work-in-progress patch:

[Legacy Jira: Christine Poerschke (@cpoerschke) on Sep 02 2019]

mikemccand commented 5 years ago

This feels too unsafe to me for CheckIndex. For instance, what if idField is the corrupt field? You could end up with missing ids or the wrong ids. I'm fine with adding more information to the CheckIndex status to make it easier to build this kind of hack on top of CheckIndex, but I'd like to keep CheckIndex rock solid.

[Legacy Jira: Adrien Grand (@jpountz) on Sep 02 2019]

mikemccand commented 5 years ago

Thanks @jpountz for your input.

The latest attached patch facilitates potential salvaging of terms by making the CheckIndex class extensible so that developers' own subclasses could:

It seems a rather awkward change to me, though, and if out-of-the-box CheckIndex will not support id salvaging, then a stand-alone tool just for that purpose might be a cleaner solution. Either way, I won't have the bandwidth to pursue this further in the near future, so I am just sharing things 'as is' in case they help others in the meantime.

[Legacy Jira: Christine Poerschke (@cpoerschke) on Sep 03 2019]

mikemccand commented 5 years ago

Agreed it is awkward. When I said "on top of CheckIndex", I was rather thinking of running CheckIndex programmatically and then looking at the return value to understand what segments might need salvaging. A separate stand-alone tool sounds good to me too.

[Legacy Jira: Adrien Grand (@jpountz) on Sep 09 2019]
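
The "run the check programmatically, then inspect the return value" idea from the comment above could be sketched roughly as follows. This is a minimal illustration of the decision logic only, and it deliberately avoids the Lucene API so that it stays self-contained: `SegmentReport`, `IndexReport`, `Plan` and the `idFieldReadable` flag are all hypothetical stand-ins for whatever per-segment and whole-index status a real check would return, not existing CheckIndex fields. It also reflects the concern raised earlier in the thread: a corrupt segment whose id field is itself unreadable is reported separately rather than salvaged.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these types are Lucene classes.
class SalvagePlanner {

    /** Stand-in for a per-segment check result. */
    static final class SegmentReport {
        final String name;
        final int maxDoc;
        final boolean idFieldReadable; // assumed flag: did the id field itself pass its checks?
        final Throwable error;         // null when the segment passed every check

        SegmentReport(String name, int maxDoc, boolean idFieldReadable, Throwable error) {
            this.name = name;
            this.maxDoc = maxDoc;
            this.idFieldReadable = idFieldReadable;
            this.error = error;
        }
    }

    /** Stand-in for the whole-index check result. */
    static final class IndexReport {
        final boolean clean;
        final List<SegmentReport> segments;

        IndexReport(boolean clean, List<SegmentReport> segments) {
            this.clean = clean;
            this.segments = segments;
        }
    }

    /** Which corrupt segments can have their ids salvaged, and which cannot. */
    static final class Plan {
        final List<String> salvageable = new ArrayList<>();
        final List<String> unsalvageable = new ArrayList<>();
        int docsAtRisk; // total maxDoc over all corrupt segments
    }

    static Plan plan(IndexReport report) {
        Plan plan = new Plan();
        if (report.clean) {
            return plan; // nothing would be exorcised, so nothing to salvage
        }
        for (SegmentReport seg : report.segments) {
            if (seg.error == null) {
                continue; // healthy segment, exorcising leaves it alone
            }
            plan.docsAtRisk += seg.maxDoc;
            if (seg.idFieldReadable) {
                plan.salvageable.add(seg.name);
            } else {
                // Ids read from this segment could be missing or wrong.
                plan.unsalvageable.add(seg.name);
            }
        }
        return plan;
    }

    public static void main(String[] args) {
        IndexReport report = new IndexReport(false, Arrays.asList(
            new SegmentReport("_0", 100, true, null),
            new SegmentReport("_1", 50, true, new RuntimeException("stored fields corrupt")),
            new SegmentReport("_2", 75, false, new RuntimeException("terms corrupt"))));
        Plan result = plan(report);
        // prints: salvageable=[_1], unsalvageable=[_2], docsAtRisk=125
        System.out.println("salvageable=" + result.salvageable
            + ", unsalvageable=" + result.unsalvageable
            + ", docsAtRisk=" + result.docsAtRisk);
    }
}
```

A stand-alone salvage tool would presumably fill these stand-in reports from the real check result, read ids only from the segments listed as salvageable, and leave the exorcising step itself untouched.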

mikemccand commented 5 years ago

... When I said "on top of CheckIndex", I was rather thinking of running CheckIndex programmatically and then looking at the return value to understand what segments might need salvaging. ...

Ah, thanks for clarifying that!

Okay, let me take this opportunity to jot down some code pointers for whenever this is picked up again in the future:

[Legacy Jira: Christine Poerschke (@cpoerschke) on Oct 01 2019]