mjordan / islandora_workbench

A command-line tool for managing content in an Islandora 2 repository
MIT License
24 stars 39 forks source link

Workbench --check is not flagging non-EDTF dates in all cases and allows them to be ingested #387

Open dara2 opened 2 years ago

dara2 commented 2 years ago

Workbench --check is not flagging non-EDTF dates and is ingesting the data into EDTF fields (which then display incorrectly on the front-end). An example from our CSV for "Date Created" is 7/1/51 and it goes into the Date Created field as 7/1/51 and is displayed on the front-end as 7 to 1. This isn't happening in all cases, oddly. One row of the CSV was flagged as not having an EDTF-formatted date (it flagged 8/1/70 as non-EDTF). We are running Drupal 9.3.0, and Workbench was updated within the past 2 weeks.

mjordan commented 2 years ago

--check is catching both 7/1/51 and 8/1/70 for me, although it errors out on the first one. Once you correct that data, and rerun --check, it errors out on the second.

The check for EDTF validity only happens during --check. Would you prefer that Workbench checks also when it is about to populate the field, and if the date value is not valid, skips populating the Drupal field and logs that it skipped that value? If so, we can do that.

dara2 commented 2 years ago

Ah, so perhaps the --check errored out on that first date, and the person running it didn't rerun the check after fixing the first issue? That makes sense. I think then it's just human error, since --check is flagging them. I could go either way on whether Workbench shouldn't ingest the date if it's not valid.

mjordan commented 2 years ago

That sounds like what happened, but Workbench should minimize or even eliminate the impact of human error - that's why --check exists in the first place.

Resolving #268 would make Workbench not error out after the first --check failure. If that was in place, the user in the case above would have seen that there were more errors in the CSV and probably would not have proceeded to ingest. However, implementing that feature for the entirety of all possible errors found by --check is a very substantial amount of labour. I think it will be much quicker, and also much safer in terms of not allowing invalid data to get added to Drupal, option would be to do what I describe above: validate CSV values not only during --check but also during ingest/update and skip populating and log values that don't validate. All of the validation code used during --check can be reused. Let me look at this option over the weekend to see what's involved.

mjordan commented 2 years ago

I'll add, I don't mean that validating CSV data outside of --check implies that running --check would become redundant or unnecessary. Even with runtime validation in place, it would still be better to find and fix bad CSV data before committing data to Drupal than to unknowingly let Workbench create 1000 nodes and not populate a field because of simple error in the CSV data for that field.