Open cbeer opened 4 years ago
@cbeer -- I had to do something like this long ago (in Python), processing MARC21 files. I'll see if I can find the code in case the logic is useful. As I remember it, I couldn't use the normal (pymarc?) pattern of looping through each record in the file, specifically because it would blow up on an invalid MARC record. What worked, IIRC, was using seek to grab the expected record and loading that text into a MARC object. I don't think I was validating anything inside the record; I was just checking that the grabbed text could be parsed as valid MARC. The nice thing was that, while looping through, I could build a list of the invalid MARC records.
@cbeer -- in case this is useful...
(after file-open)
reader = MARCReader(fh)
for record in reader:
...but that doesn't allow for record-level exception handling, which I needed because of some invalid MARC records. So...
I get a record, storing the seek-pointer location to an attribute along the way.
When encountering an invalid marc record, the handler uses seek to grab/log the errant part of the record, and then resets the pointer so the next iteration works properly.
There may be better ways of addressing this, but this worked.
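A rough sketch of the pointer-tracking idea described above, not the original code (which isn't reproduced in this thread). It assumes only two facts about MARC21 framing: the first 5 bytes of the leader give the record length, and each record ends with the 0x1D record terminator. The names `scan_marc21` and `read_until_terminator` are made up for illustration, and the check covers framing only, not the record's contents.

```python
import io

RECORD_TERMINATOR = b"\x1d"  # MARC21 end-of-record byte


def read_until_terminator(fh):
    """Read up to and including the next record terminator (or EOF)."""
    buf = bytearray()
    while True:
        byte = fh.read(1)
        if not byte:
            break
        buf += byte
        if byte == RECORD_TERMINATOR:
            break
    return bytes(buf)


def scan_marc21(fh):
    """Split a binary MARC21 stream into well-framed and badly-framed records.

    Returns (good, bad); both are lists of (byte_offset, raw_bytes).
    """
    good, bad = [], []
    while True:
        start = fh.tell()  # remember where this record begins
        head = fh.read(5)  # declared record length, as 5 ASCII digits
        if not head:
            break
        length = int(head) if len(head) == 5 and head.isdigit() else 0
        blob = head + fh.read(max(length - 5, 0))
        framed_ok = (
            length > 0
            and len(blob) == length
            and blob.endswith(RECORD_TERMINATOR)
            and blob.count(RECORD_TERMINATOR) == 1
        )
        if framed_ok:
            good.append((start, blob))
            continue
        # Invalid framing: rewind, log everything up to the next terminator,
        # and leave the pointer just past it so the next iteration lines up.
        fh.seek(start)
        bad.append((start, read_until_terminator(fh)))
    return good, bad
```

Each entry in `good` could then be handed to a MARC parser inside its own try/except, and the offsets in `bad` give a list of invalid records that can be pulled out of the file later.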
Thanks -- I think our current approach with ruby-marc for MARC21 files is along the same lines (read the first 5 bytes of the leader, then the rest of the record, and process each record-blob separately). Have you had to deal with something similar for MARCXML?
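The MARCXML question isn't settled in this thread, but one possible analogue of the record-blob approach is to cut the text into `<record>...</record>` chunks first and parse each chunk on its own, so a malformed record fails in isolation. The sketch below is only that: `scan_marcxml` is a hypothetical name, it assumes unprefixed `<record>` elements (a `marc:` prefix would need its namespace declaration carried into each chunk), and a record missing its closing tag would swallow the following record.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: records are unprefixed <record> elements that do not nest.
RECORD_RE = re.compile(r"<record\b.*?</record>", re.DOTALL)


def scan_marcxml(text):
    """Parse each <record> chunk separately.

    Returns (good, bad): good holds (char_offset, Element) pairs,
    bad holds (char_offset, error_message, raw_chunk) triples.
    """
    good, bad = [], []
    for match in RECORD_RE.finditer(text):
        chunk = match.group(0)
        try:
            good.append((match.start(), ET.fromstring(chunk)))
        except ET.ParseError as err:
            # Keep the raw chunk so the bad record can still be reported.
            bad.append((match.start(), str(err), chunk))
    return good, bad
```

Because the raw chunk is kept alongside the parse error, a record that cannot be parsed can still be extracted and listed.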
Propose a way to flag, request (and list?) invalid/malformed/etc. MARC21/MARCXML records in a dump, to replace the manual approach of downloading the whole file and trying to match up some records with a Honeybadger report.
This is potentially tricky because... if we blow up trying to read the record... we can't read the record to extract it in the first place?