pod4lib / aggregator

POD Aggregator, f.k.a. the POD Data Lake
https://pod.stanford.edu
Apache License 2.0

[Spike] expose invalid MARC/MARCXML records for analysis #172

Open cbeer opened 4 years ago

cbeer commented 4 years ago

Propose a way to flag, request (and list?) invalid/malformed/etc. MARC21/MARCXML records in a dump, to replace the manual approach of downloading the whole file and trying to match up some records with a Honeybadger report.

This is potentially tricky because... if we blow up trying to read the record... we can't read the record to extract it in the first place?

birkin commented 3 years ago

@cbeer -- i had to do something like this long ago (in python), processing marc21 files. I'll see if I can find the code in case the logic is useful. But as I remember it, I couldn't use the normal (pymarc?) syntax of reading the file via looping through each record -- specifically because it would blow up when coming across an invalid marc-record. What worked, IIRC, was using seek to grab the expected record, flowing the text into a marc-object. I don't think I was validating any piece of the marc internally -- I was just checking that the grabbed text could be perceived as valid marc. The nice thing about that was that while looping through, I could create a list of invalid marc-records.
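
The seek-based idea above can be sketched roughly as follows. This is a minimal illustration, not the code referred to in this thread: the names `scan_marc21`, `structural_error`, and `read_through_terminator` are hypothetical, it uses only the Python standard library, and the structural check is an assumed stand-in for handing each blob to a real parser (pymarc, ruby-marc) inside a per-record exception handler. It relies on two facts of the MARC 21 transmission format: the first 5 bytes of the leader hold the record length as ASCII digits, and every record ends with the byte 0x1D.

```python
import io

RECORD_TERMINATOR = b"\x1d"  # MARC 21 record terminator
LEADER_LENGTH = 24           # the leader is always 24 bytes


def structural_error(blob, declared_length):
    """Return a short description of the first problem found, or None."""
    if len(blob) < declared_length:
        return "truncated record"
    if len(blob) <= LEADER_LENGTH:
        return "record shorter than a leader"
    if not blob.endswith(RECORD_TERMINATOR):
        return "missing record terminator"
    if RECORD_TERMINATOR in blob[:-1]:
        return "record terminator before declared end"
    base = blob[12:17]  # leader bytes 12-16: base address of data
    if not base.isdigit() or not LEADER_LENGTH <= int(base) <= len(blob):
        return "bad base address"
    return None


def read_through_terminator(stream, chunk_size=4096):
    """Read up to and including the next record terminator (or to EOF)."""
    start = stream.tell()
    collected = bytearray()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return bytes(collected)
        index = chunk.find(RECORD_TERMINATOR)
        if index != -1:
            collected += chunk[: index + 1]
            stream.seek(start + len(collected))  # leave the rest unread
            return bytes(collected)
        collected += chunk


def scan_marc21(stream):
    """Split a seekable binary stream into good and bad record blobs."""
    good, bad = [], []
    while True:
        offset = stream.tell()
        head = stream.read(5)  # leader bytes 0-4: record length
        if not head:
            break
        if len(head) == 5 and head.isdigit():
            stream.seek(offset)
            blob = stream.read(int(head))
            error = structural_error(blob, int(head))
        else:
            error = "unreadable record length"
        if error is None:
            good.append((offset, blob))
        else:
            # The declared length cannot be trusted, so resynchronise on
            # the next record terminator and keep the raw bytes for review.
            stream.seek(offset)
            bad.append((offset, error, read_through_terminator(stream)))
    return good, bad
```

Each entry in `bad` carries the byte offset, a reason, and the raw bytes, so a broken record can be listed or pulled back out of the dump without a parser ever having to succeed on it.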

birkin commented 3 years ago

@cbeer -- in case this is useful...

...but that doesn't allow for record-level exception handling, which I needed cuz of some invalid marc records. So...

There may be better ways of addressing this, but this worked.

cbeer commented 3 years ago

Thanks -- I think our current approach with ruby-marc for MARC21 files is similar (read the first 5 bytes of the leader, then the rest of the record, and process each record-blob separately). Have you had to deal with something similar for MARCXML?
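
The MARCXML question is left open in this thread. One possible starting point, offered only as a sketch under stated assumptions and not as anything proposed above, is to stream the file with the standard library's `xml.etree.ElementTree.iterparse` and check each `record` element as it closes. The function names `scan_marcxml` and `record_problem` and the specific checks (a 24-character leader, at least one field) are hypothetical choices for illustration. The sketch also shows the limit of the approach: a record that is well-formed XML but bad MARC can be flagged individually, whereas a well-formedness error stops the stream, so only its position can be reported.

```python
import io
import xml.etree.ElementTree as ET

MARC_NS = "{http://www.loc.gov/MARC21/slim}"  # MARCXML namespace


def record_problem(record):
    """Return a short description of the first problem found, or None."""
    leader = record.find(MARC_NS + "leader")
    if leader is None:
        return "missing leader"
    if len(leader.text or "") != 24:
        return "leader is not 24 characters"
    has_control = record.find(MARC_NS + "controlfield") is not None
    has_data = record.find(MARC_NS + "datafield") is not None
    if not (has_control or has_data):
        return "no fields"
    return None


def scan_marcxml(stream):
    """Return (records_checked, invalid, parse_error) for a MARCXML stream.

    invalid is a list of (1-based record position, problem) pairs.
    parse_error is None unless the XML itself stopped being well-formed.
    """
    checked = 0
    invalid = []
    parse_error = None
    try:
        for _event, element in ET.iterparse(stream, events=("end",)):
            if element.tag != MARC_NS + "record":
                continue
            checked += 1
            problem = record_problem(element)
            if problem is not None:
                invalid.append((checked, problem))
            element.clear()  # drop the children to keep memory flat
    except ET.ParseError as exc:
        # Nothing after this point can be read by the same parser run.
        parse_error = str(exc)
    return checked, invalid, parse_error
```

Records seen before a `ParseError` are still counted and checked, which at least narrows down where in the dump the damage begins.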