mjordan / islandora_workbench

A command-line tool for managing content in an Islandora 2 repository
MIT License
26 stars 38 forks source link

Allow `--check` option to report more than one error at a time #268

Open McFateM opened 3 years ago

McFateM commented 3 years ago

I did a brief search here but didn't see any open or closed issues like this one so... Would it be possible to have the --check option issue messages as it does now, but continue past the first encountered error so that more potential problems can be reported with each run?

When running ICG tests of I8 with Workbench I frequently found that a particular input Google Sheet geared for one I8 instance was incompatible with another instance due to multiple columns of metadata. If it were possible to continue checking beyond the first error it would save time when checking data before ingest.

mjordan commented 3 years ago

Thanks Mark, it would be possible to do this but will require touching the code in quite a few places. It's a good idea.

McFateM commented 3 years ago

I figured it could get dicey so I'd like to suggest an alternate approach, maybe without so many code changes...

In my own implementation of IMI I've tweaked the code to treat any row or column with a hashtag (#) in the first character as a "comment", it gets read but excluded from processing. This allows me to experiment with CSV row and column headings without having to duplicate sheets or remove information entirely.

It also opens the door for iterative processing of data IF the process uses an "intermediate" or temporary copy of the CSV data, AND has the ability to modify that intermediate copy. The notion is that a first pass might find an error like an unrecognized field, in my tests I have one named marks_workbench_notes. The idea is that the process reports the error just as it does now, but writes something like # Error 01: marks_workbench_notes in place of the original column heading, then resets and runs again, treating the errant column as a "comment" on subsequent passes. Rinse and repeat until no new errors are reported in the loop.

mjordan commented 2 years ago

Dropping a note here from a convo in slack: it would be useful to provide a "strict" mode (the default, which would require --check to run before committing) and a non-strict mode which would preserve the current functionality. This strict mode may or may not be the same as the feature requested in the issue, but it's related.

mjordan commented 2 years ago

Setting exit_on_first_missing_file_during_check to false will tell Workbench to check for the existence of all "file" column values before exiting.

mjordan commented 2 years ago

Alternative to "strict", provide a max_allowed_errors option that would let user determine when --check should exit.

mjordan commented 1 year ago

Related issue - #457.

mjordan commented 1 year ago

Some prep work for this issue. In 94f0a09 I've added comments ("rows_with_missing_files") to (the 27) places where --check should use the approach I used here.

Also added strict_check config option, with true default to preserve existing behavior. If users want to let --check accumulate errors, this should be set to false in the configuration file. We can flip the default when all the applicable checks have been converted to use the "rows_with_missing_files" approach.

Looking at this now, I don't think adding a "max_allowed_errors" option as suggested above will be useful.

mjordan commented 1 year ago

Note to @mjordan: When second check that uses strict_check is in place, add an entry to the top of https://mjordan.github.io/islandora_workbench_docs/configuration/#miscellaneous-settings.

mjordan commented 1 year ago

Other notes to @mjordan:

mjordan commented 1 year ago

Now that Workbench uses an SQLite database, we can use it to accumulate errors/warnings and present them to the user after --check completes running.

noahwsmith commented 1 year ago

I've proposed a simpler implementation to achieve the same outcome over in #620 - just skipping the sys.exit calls and relying on log entries.