pod4lib / aggregator

POD Aggregator, f.k.a. the POD Data Lake
https://pod.stanford.edu
Apache License 2.0

Johns Hopkins: MARC files contain non-printing control characters not allowed in XML #829

Open corylown opened 2 years ago

corylown commented 2 years ago

The normalized MARC XML files that POD generates from the MARC21 files submitted by Johns Hopkins contain non-printing control characters that are not valid in XML. The visible symptom in the UI is that the records count is displayed as ??? for the XML version of the normalized files. Parsing these files with standard MARC tooling such as MarcEdit or ruby-marc raises errors because the files are not valid XML. I'm not sure what we can do about this in POD, since the problem is with the files submitted to POD. The records will still be available to downstream consumers, but those consumers are likely to run into problems using the files because they are not valid.
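To illustrate the failure mode, here is a minimal Python sketch (not POD's code, which is a separate codebase) showing that a conforming XML parser rejects text containing a raw \x07. The `is_well_formed` helper and the sample subfield strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise (hypothetical helper)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Illustrative subfield content, not the actual JHU record.
clean = '<subfield code="a">Part one -- Part two</subfield>'
dirty = '<subfield code="a">Part one\x07 -- Part two\x07</subfield>'

print(is_well_formed(clean))
print(is_well_formed(dirty))
```

A single such character anywhere in a file is enough to make the whole file unparseable, which would be consistent with the count showing as ??? rather than a partial number.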

It's possible we could filter out non-printing characters during the normalized file writing process. Will need to investigate.
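A rough sketch of what such a filter could look like, written in Python for illustration only; it is not how POD writes its normalized files. It removes every character outside the XML 1.0 `Char` range from a string. The name `strip_xml_illegal` and the sample value are assumptions.

```python
import re
import xml.etree.ElementTree as ET

# Matches any character NOT allowed by the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_XML_ILLEGAL = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_xml_illegal(text, replacement=""):
    """Remove (or replace) characters that cannot appear in XML 1.0 (hypothetical helper)."""
    return _XML_ILLEGAL.sub(replacement, text)


# Illustrative value with two \x07 characters, like the 505$a described in this issue.
value = "Part one\x07 -- Part two\x07"
cleaned = strip_xml_illegal(value)

# The cleaned value can now be embedded in an element and parsed back.
element = ET.fromstring('<subfield code="a">{}</subfield>'.format(cleaned))
print(repr(element.text))
```

A filter like this would need to run on individual field and subfield values after the MARC21 record has been parsed, not on the raw binary file, because MARC21 itself uses the control characters \x1D, \x1E, and \x1F as record, field, and subfield delimiters.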

Example record:

JHU 001/bib number: 9638894 contains two instances of \x07 in the MARC 505$a

cc @bobpersing

JohnMarkOckerbloom commented 2 years ago

It'd be best, if possible, for Johns Hopkins to fix this on their end. (I'm not even sure they know it's happening.) Has @bobpersing talked about this with them?

bobpersing commented 2 years ago

I've emailed Jing at Johns Hopkins.