uga-libraries / hub-monitoring

Scripts for summarizing and validating content on the Digital Production Hub, the UGA Libraries' centralized storage for digital objects that are not suitable for our digital preservation system.
Creative Commons Attribution Share Alike 4.0 International

Differences in Preservation Log Columns #63

Closed by amhanson9 2 months ago

amhanson9 commented 5 months ago

Before automatically updating preservation_log.txt, verify that it has the expected columns. Some legacy logs use different column names and are missing the collection and accession number needed for the new row in the log. The legacy columns are Date, Electronic Media Identifier, Action, Staff.

The script gets the collection and accession number from the first two columns of the last row in preservation_log.txt, so if those columns aren't what is expected, it puts the wrong information in the new row. And if the new row has a different number of columns than the rest of the log, it will cause a ParserError if the log is ever read into pandas again.
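A minimal sketch of the header check, to show what I mean. The expected column names, the tab delimiter, and the sample rows are all assumptions for illustration (this issue only establishes that the first two columns hold the collection and accession number), and `read_header`, `header_matches`, and `write_log` are hypothetical helpers, not functions from the script.

```python
import csv
import os
import tempfile

# Hypothetical expected header: the real column names may differ.
EXPECTED_COLUMNS = ["Collection", "Accession", "Date", "Media Identifier", "Action", "Staff"]


def read_header(log_path, delimiter="\t"):
    """Return the first row of the log as a list of names ([] if the file is empty)."""
    with open(log_path, newline="", encoding="utf-8") as log:
        return next(csv.reader(log, delimiter=delimiter), [])


def header_matches(log_path, expected=EXPECTED_COLUMNS):
    """Return True only if the header row is exactly the expected column list."""
    return read_header(log_path) == expected


def write_log(rows):
    """Demo helper: write rows to a temporary tab-delimited file and return its path."""
    handle, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(handle, "w", newline="", encoding="utf-8") as log:
        csv.writer(log, delimiter="\t").writerows(rows)
    return path


# Made-up sample logs: one in the assumed current layout, one in the legacy layout.
current_log = write_log([
    EXPECTED_COLUMNS,
    ["coll1", "2020-01-er", "2024-01-05", "disk1", "Validated", "AH"],
])
legacy_log = write_log([
    ["Date", "Electronic Media Identifier", "Action", "Staff"],
    ["2019-03-02", "disk1", "Copied", "AH"],
])

print(header_matches(current_log))  # True
print(header_matches(legacy_log))   # False
```

With the legacy layout, the first two columns of the last row would be a date and a media identifier, which is how the wrong values end up in the collection and accession columns.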

amhanson9 commented 5 months ago

@emkaser, my current solution when the header row does not have the expected values is to print an error (like we do if there is no preservation log) and not update the log. You'd need to run validation again after fixing the log.

If you plan to fix the log, I could still add the validation data to the log (maybe with the standard header row, so it is clear what the data is) in addition to printing the error, so you'd have the validation information when you reformat the log.
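Here is a rough sketch of the two options side by side, not the actual script. The expected header, the tab delimiter, the sample rows, and the names `update_log`, `append_on_mismatch`, and `make_log` are all hypothetical. It also assumes the existing log ends with a newline.

```python
import csv
import os
import tempfile

# Hypothetical expected header: the real column names may differ.
EXPECTED_COLUMNS = ["Collection", "Accession", "Date", "Media Identifier", "Action", "Staff"]


def update_log(log_path, new_row, expected=EXPECTED_COLUMNS, append_on_mismatch=False):
    """Append new_row when the header matches the expected columns.

    On a mismatch, print an error and skip the update, unless
    append_on_mismatch is True, in which case the standard header row
    is appended first so the new data is labeled.
    Returns True only when the header matched and the row was appended.
    """
    with open(log_path, newline="", encoding="utf-8") as log:
        header = next(csv.reader(log, delimiter="\t"), [])

    if header == expected:
        rows = [new_row]
    else:
        print(f"Unexpected preservation log columns in {log_path}: {header}")
        if not append_on_mismatch:
            return False
        rows = [expected, new_row]

    # Assumes the existing log ends with a newline.
    with open(log_path, "a", newline="", encoding="utf-8") as log:
        csv.writer(log, delimiter="\t").writerows(rows)
    return header == expected


def make_log(rows):
    """Demo helper: write rows to a temporary tab-delimited file and return its path."""
    handle, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(handle, "w", newline="", encoding="utf-8") as log:
        csv.writer(log, delimiter="\t").writerows(rows)
    return path


# Made-up legacy log and new validation row.
legacy_log = make_log([
    ["Date", "Electronic Media Identifier", "Action", "Staff"],
    ["2019-03-02", "disk1", "Copied", "AH"],
])
new_row = ["coll1", "2020-01-er", "2024-01-05", "disk1", "Validated", "AH"]

# Option 1: print the error and leave the log untouched.
skipped = update_log(legacy_log, new_row)
# Option 2: print the error, then append the standard header row and the data.
forced = update_log(legacy_log, new_row, append_on_mismatch=True)
```

With the second option the log holds rows of two different widths until it is reformatted, so it would still need fixing before being read into pandas.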