This ticket outlines what we would like to get done as part of the Scoop MVP in terms of making it easier for non-developers to pull, fix, and reingest batches.
I see a small number of high-level "things" we can probably make happen without violating the NDNP spec or making massive changes to ONI.
Each option could have its own dedicated UI instead of trying to force in a general-purpose "editor" feature.
Stuff we have to keep in mind for all situations
The implementation, no matter which situation we see, will differ depending on batch status:
live means we still have the issues' files on disk, and nothing has been archived. This is the easiest scenario.
live_archived means we still have files on disk, but the batch has been archived. Because the files are still available, we can fairly easily regenerate batches or allow re-curation of issues, but we'll have to at least tell the user that the archive needs a fix, since NCA can't directly touch the dark archive.
live_done is the toughest: we don't have files on disk anymore. We'll have to pull files from the live batch, which means we won't have TIFFs. This could complicate things a lot. We might have to require the user to copy the archived batch back down first.
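To make the three statuses concrete, here is a rough sketch of how an implementation might branch on them. The names `BatchStatus`, `FixPlan`, and `plan_for` are made up for illustration and are not existing NCA code. It also assumes the on-disk copies still include TIFFs, and that a live_done batch needs the same archive note as live_archived.

```python
from dataclasses import dataclass
from enum import Enum


class BatchStatus(Enum):
    LIVE = "live"
    LIVE_ARCHIVED = "live_archived"
    LIVE_DONE = "live_done"


@dataclass(frozen=True)
class FixPlan:
    """Hypothetical summary of what a fix has to work with."""
    file_source: str          # where the issue files would come from
    tiffs_available: bool     # False when only the live batch is left
    archive_note_needed: bool # user must fix the dark archive themselves


def plan_for(status: BatchStatus) -> FixPlan:
    if status is BatchStatus.LIVE:
        # Files on disk, nothing archived: the easy case.
        return FixPlan("disk", True, False)
    if status is BatchStatus.LIVE_ARCHIVED:
        # Files on disk, but the archive copy is now out of date.
        return FixPlan("disk", True, True)
    if status is BatchStatus.LIVE_DONE:
        # No local files; pulling from the live batch means no TIFFs.
        # Assumption: the archive note applies here as well.
        return FixPlan("live batch", False, True)
    raise ValueError(f"unsupported batch status: {status!r}")
```

Whatever shape this ends up taking, keeping the status-specific differences in one place would let each situation's UI ask "what can I do with this batch?" rather than re-checking the status itself.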
We'll need to present batch information in all cases. For MVP, we only support batches that were generated by NCA, as processing non-NCA batches is a bigger task. If there's time, we could create a batch reader of some kind, but that's unlikely.
Question: how to handle dark archives? Just make a note that people will have to fix that themselves?
Situations
Issue Removal
Some number of issues need to be removed from a batch, but most are fine. Maybe issues have higher-quality replacements or maybe they need metadata re-entered.
Show a list of issues and let the user pick which ones shouldn't be in the batch. Similar to the existing batch QC view, but for a live issue.
Using the live files somehow, generate a new batch with just the remaining files. Same name as the live batch, but one version higher.
Show the user how to purge the old batch and load the new one.
Pull the bad issues into NCA for re-curation or just outright deletion. Similar, again, to the batch QC page.
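The "same name, one version higher" step above could look something like the sketch below. It assumes batch names end in a `_verNN` suffix (e.g., `batch_oru_example_ver01`, which is a made-up name), and `next_batch_name` and `split_issues` are hypothetical helpers, not existing NCA functions.

```python
import re

# Assumption: the batch name ends with "_ver" plus a zero-padded number.
_VERSION_SUFFIX = re.compile(r"^(?P<base>.+_ver)(?P<num>\d+)$")


def next_batch_name(name: str) -> str:
    """Return the same batch name with its version bumped by one."""
    m = _VERSION_SUFFIX.match(name)
    if m is None:
        raise ValueError(f"no version suffix found in {name!r}")
    num = m.group("num")
    # Keep the original zero padding: ver01 -> ver02, ver09 -> ver10.
    return f"{m.group('base')}{int(num) + 1:0{len(num)}d}"


def split_issues(all_issues, rejected):
    """Split issue keys into (kept, pulled), preserving batch order."""
    rejected = set(rejected)
    kept = [i for i in all_issues if i not in rejected]
    pulled = [i for i in all_issues if i in rejected]
    return kept, pulled
```

The kept list would feed the regenerated batch, and the pulled list would go back into NCA for re-curation or deletion.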
Bulk Edit
There is some kind of "search and replace" operation we need to run. It might span multiple batches. There are likely a variety of filters, not just a simple replace of every value matching some search.
Some examples:
All issues with LCCN A after publication date B need their LCCN changed to C
All issues with LCCN A need their MARC Org Code changed to B
For these cases, it makes a lot more sense to generate a batch patch than to pull issues and stuff them back into NCA, especially the second case, given how MOCs work in NCA.
A batch patch will probably be something we need to standardize in some way. We'll probably want a general-purpose script that reads some kind of list of filters and directives, then finds and fixes batches appropriately. We'll need to document how to apply these on a reingest of data, and make it clear users with archived batches will need to preserve the patches with exactly the same amount of care they preserve their batches.
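As a starting point for discussion, a "list of filters and directives" might be modeled roughly like this. Everything here is hypothetical: the `Issue` fields, the `PatchRule` shape, and the placeholder LCCNs and org codes are invented for the sketch, and a real patch format would need to be read from a file and applied to actual batch XML.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Issue:
    lccn: str
    published: date
    marc_org_code: str


@dataclass(frozen=True)
class PatchRule:
    # Filters: which issues the rule applies to.
    match_lccn: str
    published_after: Optional[date] = None
    # Directives: what to change on matching issues.
    set_lccn: Optional[str] = None
    set_marc_org_code: Optional[str] = None

    def matches(self, issue: Issue) -> bool:
        if issue.lccn != self.match_lccn:
            return False
        if self.published_after is not None:
            # "After" is treated as strictly after the given date.
            return issue.published > self.published_after
        return True

    def apply(self, issue: Issue) -> Issue:
        if not self.matches(issue):
            return issue
        changes = {}
        if self.set_lccn is not None:
            changes["lccn"] = self.set_lccn
        if self.set_marc_org_code is not None:
            changes["marc_org_code"] = self.set_marc_org_code
        return replace(issue, **changes)


def apply_patch(rules, issues):
    """Apply rules in order; later rules see earlier rules' changes."""
    patched = []
    for issue in issues:
        for rule in rules:
            issue = rule.apply(issue)
        patched.append(issue)
    return patched
```

The two examples above would then be one rule each: a rule with `match_lccn`, `published_after`, and `set_lccn` for the first, and a rule with `match_lccn` and `set_marc_org_code` for the second. Because rules run in order, rule ordering would become part of what has to be preserved alongside the patch.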
Delete Batch
An entire batch needs to be pulled and all issues just need to get back into NCA for some reason. Maybe it's a small batch that shouldn't have gone out yet (embargo rules were bad) or maybe the issues all need bulk edits, but in a way that just doesn't work well with whatever "batch patch" we come up with.
This scenario is the most time-consuming for users and would need some warnings. All issues would go back into NCA. They could keep their metadata, or be destroyed, but they're all basically treated as if they never were in a batch. They will have to be rebatched the same way any other issues are.