[Spike] expose invalid MARC/MARCXML records for analysis

pod4lib / aggregator

POD Aggregator, f.k.a. the POD Data Lake

https://pod.stanford.edu

Apache License 2.0


[Spike] expose invalid MARC/MARCXML records for analysis #172

Open cbeer opened 4 years ago

cbeer commented 4 years ago

Propose a way to flag, request (and list?) invalid/malformed/etc. MARC21/MARCXML records in a dump, to replace the manual approach of downloading the whole file and trying to match up some records with a Honeybadger report.

This is potentially tricky because... if we blow up trying to read the record... we can't read the record to extract it in the first place?

birkin commented 3 years ago

@cbeer -- i had to do something like this long ago (in python), processing marc21 files. I'll see if I can find the code in case the logic is useful. But as I remember it, I couldn't use the normal (pymarc?) syntax of reading the file via looping through each record -- specifically because it would blow up when coming across an invalid marc-record. What worked, IIRC, was using seek to grab the expected record, flowing the text into a marc-object. I don't think I was validating any piece of the marc internally -- I was just checking that the grabbed text could be perceived as valid marc. The nice thing about that was that while looping through, I could create a list of invalid marc-records.

birkin commented 3 years ago

@cbeer -- in case this is useful...

the manager, for context, showing iterating through the marc-file. Note that the normal way of iterating is...
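```
from pymarc import MARCReader

with open("records.mrc", "rb") as fh:  # hypothetical file name
    reader = MARCReader(fh)
    for record in reader:
        pass  # work with each record here
```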

...but that doesn't allow for record-level exception handling, which I needed cuz of some invalid marc records. So...

I get a record, storing the seek-pointer location to an attribute along the way.
When encountering an invalid marc record, the handler uses seek to grab/log the errant part of the record, and then resets the pointer so the next iteration works properly.

There may be better ways of addressing this, but this worked.
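
For anyone skimming the thread, here is a minimal sketch of the tell/seek idea described above. It is not the original code: it uses only the standard library, and `check_blob`, `read_until_terminator`, and `scan` are hypothetical names. `check_blob` is a stand-in for handing the bytes to a real MARC parser. The sketch assumes MARC21's 5-digit length prefix and the `0x1D` end-of-record byte.

```python
import io

RECORD_TERMINATOR = b"\x1d"  # MARC21 end-of-record byte


def check_blob(blob):
    """Hypothetical stand-in for a real parser; raises ValueError if the blob looks wrong."""
    if len(blob) < 24:
        raise ValueError("record shorter than a 24-byte leader")
    if not blob.endswith(RECORD_TERMINATOR):
        raise ValueError("missing record terminator")


def read_until_terminator(fh):
    """Read byte by byte up to and including the next terminator (or EOF)."""
    chunk = bytearray()
    while (b := fh.read(1)):
        chunk += b
        if b == RECORD_TERMINATOR:
            break
    return bytes(chunk)


def scan(fh, parse=check_blob):
    """Return (good_count, bad), where bad lists (offset, raw_bytes, error) tuples."""
    good, bad = 0, []
    while True:
        start = fh.tell()  # remember where this record begins
        head = fh.read(5)
        if not head:
            break  # clean end of file
        try:
            if not head.isdigit():
                raise ValueError("non-numeric record length")
            blob = head + fh.read(int(head) - 5)
            parse(blob)
            good += 1
        except ValueError as err:
            # Rewind to the start of the bad record, log its bytes, and leave
            # the pointer just past its terminator so the next pass lines up.
            fh.seek(start)
            bad.append((start, read_until_terminator(fh), str(err)))
    return good, bad
```

The recorded offsets make it possible to pull the same bytes out of the dump again later, without re-reading the whole file.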

cbeer commented 3 years ago

Thanks -- I think that's similar to our current approach with ruby-marc for MARC21 files (read the first 5 bytes of the leader to get the record length, then read the rest of the record, and process each record-blob separately). Have you had to deal with something similar for MARCXML?
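
For comparison, the MARC21 record-blob approach mentioned above might look roughly like this. It is a Python sketch rather than the actual ruby-marc code, and `each_raw_record`, `triage`, and `base_address` are hypothetical names. It trusts the 5-digit length prefix to split the file, so a failure while processing one blob cannot affect the others. `base_address` is only an example of per-record processing (leader positions 12-16 hold the base address of data).

```python
import io


def each_raw_record(fh):
    """Yield raw MARC21 record blobs, trusting the 5-digit length prefix."""
    while True:
        head = fh.read(5)
        if len(head) < 5:
            return  # end of file (or a truncated trailing fragment)
        if not head.isdigit() or int(head) < 5:
            # Without a usable length there is no way to find the next record.
            raise ValueError("unreadable record length: %r" % head)
        yield head + fh.read(int(head) - 5)


def base_address(blob):
    """Example processing step: read the base address of data from the leader."""
    field = blob[12:17]
    if not field.isdigit():
        raise ValueError("non-numeric base address in leader")
    return int(field)


def triage(fh, process):
    """Process each blob on its own, collecting failures instead of stopping."""
    ok, failed = [], []
    for index, blob in enumerate(each_raw_record(fh)):
        try:
            ok.append(process(blob))
        except Exception as err:
            failed.append((index, blob, repr(err)))
    return ok, failed
```

The `failed` list keeps the raw bytes of each bad record alongside its position in the file, which is the sort of thing that could be exposed for analysis.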