internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Create unit test to detect marc unicode encoding issues #8798

Open cdrini opened 7 months ago

cdrini commented 7 months ago

Here is a recent import from IA into OL:

This is the long-standing issue (#135) of mysterious characters like ©♭ appearing in Open Library records!

The purpose of this issue is to create a unit test for the smallest possible piece that is breaking. Most likely, that is the piece that takes in the MARC record. With that test in place, this error should never resurface!

Stakeholders

@hornc

hornc commented 7 months ago

@cdrini The source of this particular character issue is that the source record has an incorrect encoding flag in the MARC binary. It claims to be MARC-8 encoded, but the data is UTF-8 encoded... treating a UTF-8 é as if it were MARC-8 produces ©♭
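To make the mechanism concrete, here is a minimal sketch that reproduces the garbling. `MARC8_EXCERPT` and `decode_as_marc8_excerpt` are hypothetical names for illustration, not Open Library code, and the table holds only the two MARC-8 code points needed for this example rather than the full character set.

```python
# Tiny excerpt of the MARC-8 extended Latin table (assumption: only the two
# bytes relevant to this example are included; a real decoder needs the full table).
MARC8_EXCERPT = {
    0xC3: "\u00a9",  # copyright mark
    0xA9: "\u266d",  # musical flat sign
}


def decode_as_marc8_excerpt(data: bytes) -> str:
    """Decode bytes one at a time, the way a MARC-8 reader sees them."""
    out = []
    for byte in data:
        if byte < 0x80:
            out.append(chr(byte))  # ASCII range passes through unchanged
        else:
            # Unknown high bytes become U+FFFD instead of guessing.
            out.append(MARC8_EXCERPT.get(byte, "\ufffd"))
    return "".join(out)


utf8_bytes = "é".encode("utf-8")  # two bytes: 0xC3 0xA9
mojibake = decode_as_marc8_excerpt(utf8_bytes)
print(mojibake)  # ©♭
```

The single character é becomes two unrelated symbols because each UTF-8 byte is looked up separately in the MARC-8 table.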

For some reason, the MARC XML shows the correct UTF-8 encoding in the leader and content.

I think OL is doing the correct operations with bad data. Looking into why archive.org got incorrect data for the MARC binary (but not MARC XML) would be useful. It seems all actions on this item are recent.

hornc commented 7 months ago

A similar item scanned around the same time has accents displayed correctly: https://openlibrary.org/books/OL50976370M/Suppl%C3%A9ment_de_l'Abreg%C3%A9_de_toute_la_m%C3%A9decine_pratique_ou_tome_VI_de_cet_ouvrage_..._premiere_partie

cdrini commented 7 months ago

Here are some other recent ones:

I'm not sure what the pattern is that caused these to regress, but can we perhaps sniff the file and look for certain characters? Or use the marc xml instead of the binary?
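A rough sketch of what such sniffing could look like, assuming the record is available as raw bytes. The function names are hypothetical and this is not how Open Library currently behaves. It compares the encoding declared in leader position 9 (blank for MARC-8, `a` for Unicode) with what the bytes after the leader look like. It relies on the fact that MARC-8 places combining diacritics before the base letter, so typical accented MARC-8 text is not valid UTF-8.

```python
def declared_encoding(record: bytes) -> str:
    """Read the character coding scheme from leader position 9."""
    flag = record[9:10]
    if flag == b"a":
        return "utf-8"
    if flag == b" ":
        return "marc8"
    return "unknown"


def sniff_encoding(record: bytes) -> str:
    """Guess the encoding from the bytes that follow the 24-byte leader."""
    body = record[24:]
    if all(byte < 0x80 for byte in body):
        return "ascii"  # no high bytes, so both encodings agree
    try:
        body.decode("utf-8")  # strict decoding: any malformed sequence raises
    except UnicodeDecodeError:
        return "marc8"  # high bytes that are not well-formed UTF-8
    return "utf-8"


def is_mislabelled(record: bytes) -> bool:
    """True when the leader claims MARC-8 but the data decodes as UTF-8."""
    return declared_encoding(record) == "marc8" and sniff_encoding(record) == "utf-8"
```

This is only a heuristic: a genuine MARC-8 record could, in rare cases, contain high bytes that happen to form valid UTF-8, so a flagged record is a strong hint rather than proof.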

tfmorris commented 7 months ago

The file that imported correctly (https://openlibrary.org/books/OL50976370M) doesn't have a binary MRC file, just a MARC XML file. I'm not sure what conditions cause that in the processing pipeline.

The additional 5 files identified by @cdrini follow the same pattern as identified by @hornc for the first example.

> For some reason, the MARC XML shows the correct UTF-8 encoding in the leader and content.

Where are these files being sourced/derived from? Clearly something in the pipeline is broken. Interestingly, the two MARC files are the oldest files in the directory https://archive.org/download/b30530921_0001

Having said that, MARCedit displays all the "broken" binary MARC files without any problem, so it must have some heuristic to override the encoding flag.

LeadSongDog commented 6 months ago

Further, the same is appearing in author names, such as: https://openlibrary.org/search?q=author%3A©+AND+ia%3A*&mode=everything or simply https://openlibrary.org/search/authors?q

tfmorris commented 6 months ago

> Having said that, MARCedit displays all the "broken" binary MARC files without any problem, so it must have some heuristic to override the encoding flag.

I confirmed with the author of MARCedit that he uses a heuristic for encoding detection because the MARC encoding flag isn't reliable.
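For the unit test this issue asks for, a small fixture that reproduces the mislabelled record may be enough. The sketch below builds a minimal binary MARC (ISO 2709) record with a single 245 field, with the leader flag and the data encoding set independently. `build_marc_record` is a hypothetical helper, not existing Open Library code, and which parser function the bytes are fed to is left open.

```python
FIELD_TERMINATOR = b"\x1e"
RECORD_TERMINATOR = b"\x1d"
SUBFIELD_DELIMITER = b"\x1f"


def build_marc_record(title: str, encoding_flag: str = " ",
                      data_encoding: str = "utf-8") -> bytes:
    """Build a one-field binary MARC record.

    encoding_flag goes into leader position 9 (" " claims MARC-8, "a" claims
    Unicode); data_encoding controls how the title is actually encoded.
    """
    # 245 field: indicators "10", then subfield $a with the title.
    field = (b"10" + SUBFIELD_DELIMITER + b"a"
             + title.encode(data_encoding) + FIELD_TERMINATOR)
    # Directory entry: tag (3) + field length (4) + start offset (5).
    directory = b"245" + f"{len(field):04d}".encode("ascii") + b"00000"
    directory += FIELD_TERMINATOR
    base_address = 24 + len(directory)
    record_length = base_address + len(field) + len(RECORD_TERMINATOR)
    leader = (f"{record_length:05d}nam {encoding_flag}22"
              f"{base_address:05d}   4500").encode("ascii")
    return leader + directory + field + RECORD_TERMINATOR


# Leader claims MARC-8 (blank flag) while the title bytes are UTF-8.
mislabelled = build_marc_record("Abregé", encoding_flag=" ")
```

A test could pass `mislabelled` to the MARC binary reader and assert that the parsed title contains é rather than ©♭; the same builder with `encoding_flag="a"` gives a correctly labelled control case.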