Open mikemccand opened 5 years ago
Attached outline work-in-progress patch:
-skipCheckIntegrity option would allow the tool to proceed past the initial integrity checks (which would fail e.g. due to footer checksum failure)
-idField F option would identify the field from which terms are to be salvaged
[Legacy Jira: Christine Poerschke (@cpoerschke) on Sep 02 2019]
This feels too unsafe to me for CheckIndex. For instance, what if idField is the corrupt field? You could end up with missing ids or the wrong ids. I'm fine with adding more information to the CheckIndex status in order to make it easier to do these kinds of hacks on top of CheckIndex, but I'd like to keep CheckIndex something that is rock solid.
[Legacy Jira: Adrien Grand (@jpountz) on Sep 02 2019]
Thanks @jpountz for your input.
The latest attached patch facilitates potential salvaging of terms by making the CheckIndex class extensible so that developers' own derived classes could:
It seems to me to be a rather awkward change though, and if out-of-the-box CheckIndex would not support id salvaging then a stand-alone tool just for that purpose might be a cleaner solution? Either way, I won't have bandwidth to pursue this further in the near future, i.e. just sharing things 'as is' in case it might help others in the meantime.
[Legacy Jira: Christine Poerschke (@cpoerschke) on Sep 03 2019]
Agreed it is awkward. When I said "on top of CheckIndex", I was rather thinking of running CheckIndex programmatically and then looking at the return value to understand what segments might need salvaging. A separate stand-alone tool sounds good to me too.
[Legacy Jira: Adrien Grand (@jpountz) on Sep 09 2019]
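To illustrate the "run the check programmatically, then inspect the result" idea, here is a minimal sketch of the inspection step only. It makes no Lucene calls: `SegmentReport`, `SalvagePlanner`, and their methods are hypothetical stand-ins invented for this example, and in practice the reports would have to be populated from whatever status CheckIndex returns. Which status fields reliably mark a segment as failed is an assumption left open here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a per-segment check result; not a Lucene class. */
final class SegmentReport {
    final String name;     // segment name, e.g. "_0"
    final int maxDoc;      // number of documents in the segment
    final boolean passed;  // assumed flag: true if every check on the segment succeeded

    SegmentReport(String name, int maxDoc, boolean passed) {
        this.name = name;
        this.maxDoc = maxDoc;
        this.passed = passed;
    }
}

/** Hypothetical helper that decides which segments a salvage tool should visit. */
final class SalvagePlanner {
    /** Returns the names of the segments that failed their checks, in report order. */
    static List<String> segmentsNeedingSalvage(List<SegmentReport> reports) {
        List<String> failed = new ArrayList<>();
        for (SegmentReport report : reports) {
            if (!report.passed) {
                failed.add(report.name);
            }
        }
        return failed;
    }

    /** Sums maxDoc over failed segments: an upper bound on documents lost by exorcising. */
    static int docsAtRisk(List<SegmentReport> reports) {
        int total = 0;
        for (SegmentReport report : reports) {
            if (!report.passed) {
                total += report.maxDoc;
            }
        }
        return total;
    }
}
```

A stand-alone salvage tool could then attempt id recovery only on the segments returned by `segmentsNeedingSalvage`, leaving CheckIndex itself unchanged.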
... When I said "on top of CheckIndex", I was rather thinking of running CheckIndex programmatically and then looking at the return value to understand what segments might need salvaging. ...
Ah, thanks for clarifying that!
Okay, let me take this opportunity then to jot down some code pointers for when this is returned to in the future:
CheckIndex
can be read-only as well as read-write).
[Legacy Jira: Christine Poerschke (@cpoerschke) on Oct 01 2019]
The CheckIndex tool supports the exorcising of corrupt segments from an index.
This ticket proposes to add an extra option which could first be used to potentially salvage the document ids of the segment(s) about to be exorcised. Re-ingestion for those documents could then be arranged so as to repair the data damage caused by the exorcising.
Legacy Jira details
LUCENE-8961 by Christine Poerschke (@cpoerschke) on Sep 02 2019, updated Oct 01 2019
Attachments: LUCENE-8961.patch (versions: 2)