internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

One bulk MARC record has off-by-one errors in its index, causing IA and OL import problems #2877

Open hornc opened 4 years ago

hornc commented 4 years ago

This is a bit of a complicated one. @judec brought the symptoms to my attention from the archive.org side.

Here is a recent example that shows all aspects of this issue (viewing all of the logs will generally require admin access, sorry):

https://openlibrary.org/admin/imports showed this recent error (screenshot omitted).

Checking the archive.org record, https://archive.org/details/isbn_9780834200692_e0j5, we see that it has an id of the form isbn_<isbn> but does NOT have an isbn metadata field populated (this is bad for archive.org metadata). It also has a corrupt MARCXML: https://archive.org/download/isbn_9780834200692_e0j5/isbn_9780834200692_e0j5_archive_marc.xml

(all the <!-- No separator at end of field length=40 --> in the XML are put there by yaz-marcdump and indicate that the binary MARC directory is corrupt)

This corrupt source binary MARC is why the OL import failed.
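As a rough illustration of what "the directory is corrupt" means here, the sketch below walks the directory of one binary MARC record and checks that every entry points at a slice ending in the field terminator (0x1E). It assumes the standard 24-byte leader and 12-byte directory entries (3-byte tag, 4-digit length, 5-digit start); the function name `directory_problems` is made up for this example and is not part of yaz or Open Library.

```python
FIELD_TERMINATOR = 0x1E
LEADER_LEN = 24
ENTRY_LEN = 12  # tag (3) + field length (4) + start offset (5)


def directory_problems(record: bytes) -> list:
    """Return a message for every directory entry that does not line up
    with a field terminator in the data area (hypothetical helper)."""
    base = int(record[12:17])  # leader bytes 12-16: base address of data
    problems = []
    # The directory runs from byte 24 up to the terminator just before `base`.
    for pos in range(LEADER_LEN, base - 1, ENTRY_LEN):
        entry = record[pos:pos + ENTRY_LEN]
        tag = entry[:3].decode("ascii")
        length = int(entry[3:7])
        start = int(entry[7:12])
        field = record[base + start:base + start + length]
        if not field or field[-1] != FIELD_TERMINATOR:
            problems.append(f"{tag}: no terminator at end of field (length={length})")
        elif FIELD_TERMINATOR in field[:-1]:
            problems.append(f"{tag}: terminator inside field (length={length})")
    return problems
```

An empty list means the directory and the data area agree; on a record like the one above, most entries would be expected to be reported.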

Result:

The root cause seems to be that all of these items match records in one bulk MARC item: https://archive.org/details/marc_upei. That item appears to have a consistent directory off-by-one error throughout the many thousands of records it contains.

I have a script from ages ago that can fix off-by-one errors in a single binary MARC record, and I recently discovered code in OL that does the same thing: https://github.com/internetarchive/openlibrary/blob/970b31b69e36651a57a59ba8421d8420f1742236/openlibrary/catalog/marc/fast_parse.py#L194-L202
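For readers unfamiliar with this kind of repair, here is one possible approach, sketched under stated assumptions. It is not the code from fast_parse.py or from my script: it simply recomputes every directory entry from where the field terminators actually are. It assumes the directory lists fields in the same order as the data area, that 0x1E appears only as a field terminator, and that every field fits in the 4-digit length slot. The name `rebuild_directory` is hypothetical.

```python
FT = b"\x1e"  # field terminator
RT = b"\x1d"  # record terminator
LEADER_LEN = 24
ENTRY_LEN = 12


def rebuild_directory(record: bytes) -> bytes:
    """Recompute directory lengths and offsets from the actual field
    terminators, keeping tags, data and total record length unchanged."""
    if record[-1:] != RT:
        raise ValueError("record does not end with a record terminator")
    base = int(record[12:17])
    directory = record[LEADER_LEN:base - 1]
    tags = [directory[i:i + 3] for i in range(0, len(directory), ENTRY_LEN)]
    data = record[base:-1]
    fields = data.split(FT)[:-1]  # drop the empty piece after the last terminator
    if len(fields) != len(tags):
        raise ValueError("directory entry count does not match field count")
    entries = []
    offset = 0
    for tag, field in zip(tags, fields):
        length = len(field) + 1  # directory length includes the terminator
        entries.append(b"%b%04d%05d" % (tag, length, offset))
        offset += length
    fixed = record[:LEADER_LEN] + b"".join(entries) + FT + data + RT
    if len(fixed) != len(record):
        raise ValueError("repair would change the record length")
    return fixed
```

Because only the digits inside the directory change, the repaired record occupies exactly the same number of bytes as the original.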

Proposal & Constraints

The last point is important because if the record start and end points don't change, the repaired MARC collection can be a drop in replacement, and all references to individual records will still be accurate.

If the record start and ends change, then the whole MARC will need to be deleted and a new collection re-generated or re-acquired.
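A quick way to check the drop-in property would be to compare record boundaries before and after the repair. The sketch below assumes each record's total length is stored in the first five bytes of its leader and that records are simply concatenated in the file; `record_spans` and `same_boundaries` are hypothetical names.

```python
def record_spans(collection: bytes) -> list:
    """Return (offset, length) for every record in a concatenated MARC file."""
    spans = []
    pos = 0
    while pos < len(collection):
        length = int(collection[pos:pos + 5])  # leader bytes 0-4: record length
        if length < 24 or pos + length > len(collection):
            raise ValueError(f"implausible record length {length} at offset {pos}")
        spans.append((pos, length))
        pos += length
    return spans


def same_boundaries(original: bytes, repaired: bytes) -> bool:
    """True if every record starts and ends at the same byte in both files."""
    return record_spans(original) == record_spans(repaired)
```

If `same_boundaries` is true for the original and repaired files, references that identify a record by its start and end points should remain valid.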

Another approach: simply delete all references to the corrupt MARC as source records from OL data. OL source_record information is how these MARCs are being passed to archive.org.

Repairing off-by-one errors in MARC is what allowed us to import these corrupt records in the first place, but other systems fall over on the same data: Pymarc and yaz-marcdump both fail to parse these records (this relates to the comments on #2865 about using Pymarc).

Related files

Stakeholders

pabloab commented 4 years ago

I got those "No separator at end of field" errors while trying to generate an MRC file from a TXT file with fields bigger than ISO 2709 allows. Truncating those lines solved the problem. On the Koha IRC channel someone mentioned this may also happen if there are stray \r (carriage return) characters. HTH.
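Both of those conditions could be checked before writing a record. A minimal sketch, assuming the usual 4-digit length slot in the directory (so a field, including its terminator, can be at most 9999 bytes); `field_warnings` is a made-up helper name.

```python
MAX_FIELD_LEN = 9999  # largest value that fits the 4-digit directory length


def field_warnings(tag: str, value: bytes) -> list:
    """Flag field values that are likely to produce a broken directory."""
    warnings = []
    if len(value) + 1 > MAX_FIELD_LEN:  # +1 for the field terminator
        warnings.append(f"{tag}: field too long ({len(value) + 1} bytes)")
    if b"\r" in value:
        warnings.append(f"{tag}: contains a carriage return")
    return warnings
```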

hornc commented 3 years ago

https://github.com/hornc/marcia/blob/master/fixindex.py has code to "fix" the indexes of a single binary MARC record. Modifying that to work with a MARC collection will hopefully resolve the issues.

Example, where it currently only fixes the first record:

yaz-marcdump <(head -c1027 fix-attempt.mrc )
01027cam a2200241   4500
008 700115s1969    inu           00100 eng  
035    $a (Sirsi) AAA-0001
090    $a HN373.M213 $i 37348000000018
100 10 $a Maier, Hans, $d 1931-
240 10 $a Revolution und Kirche. $l English.
245 10 $a Revolution and church : $b the early history of Christian democracy, 1789-1901 / $c Translated by Emily M. Schossberger.
260 0  $a Notre Dame : $b University of Notre Dame Press, $c 1969.
300    $a xiv, 326 p. ; $c 24 cm.
440  0 $a Studies in Christian democracy ; $n v4.
500    $a Translation of Revolution und Kirche.
504    $a Bibliography: p. 298-314.
650  0 $a Christian democracy $z Europe $x History $y 18th century.
650  0 $a Christian democracy $z Europe $x History $y 19th century.
650  0 $a Church and state $x Catholic Church $x History $y 18th century.
650  0 $a Church and state $x Catholic Church $x History $y 19th century.
948    $a 02/08/1991 $b 09/12/2001
949    $a HN373.M213 $w LC $m UPEI $z NOITEM
901    $a 1 $b System $c 1

compared to original:

yaz-marcdump <(head -c1027 marc-for-openlibrary-bigset.mrc )
01027cam a2200241   4500
008 700115s1969    inu           00100 eng 
(No separator at end of field length=40)
035   $a (Sirsi) AAA-00
(No separator at end of field length=20)
090 1 $   $a HN373.M213 $i 37348000000
(No separator at end of field length=30)
100 18
(Separator but not at end of field length=23)
240 31
(Separator but not at end of field length=36)
245 is $. 
(Separator but not at end of field length=120)
260 rg $r .
(Separator but not at end of field length=56)
300 c1 $6 9.
(Separator but not at end of field length=25)
440 c2 $  cm.
(Separator but not at end of field length=41)
500 y  $n v4.
(Separator but not at end of field length=41)
504 d  $i rche.
(Separator but not at end of field length=29)
650 .  $9 8-314.
(Separator but not at end of field length=55)
650 th $c entury.
(Separator but not at end of field length=55)
650 9t $  century.
(Separator but not at end of field length=61)
650 18 $h  century.
(Separator but not at end of field length=61)
948 y1 $t h century.
(Separator but not at end of field length=26)
949 99 $b 09/12/2001
(Separator but not at end of field length=32)
901 LC $m UPEI $z NOITE
(No separator at end of field length=16)

EDIT: command to fix and convert in one line:

  yaz-marcdump <(fixindex.py marc-for-openlibrary-bigset.mrc | head -c1027 ) | cat
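To extend a single-record fix to a whole collection, one option is a wrapper that splits the file on the record lengths in the leaders and applies the fix to each record in turn. This is only a sketch, not what fixindex.py currently does: `fix_collection` is a hypothetical name, and `fix_record` stands for any function that takes one record's bytes and returns the repaired bytes.

```python
def fix_collection(collection: bytes, fix_record) -> bytes:
    """Apply `fix_record` to every record in a concatenated MARC file,
    refusing any fix that would move record boundaries."""
    out = bytearray()
    pos = 0
    while pos < len(collection):
        length = int(collection[pos:pos + 5])  # leader bytes 0-4: record length
        if length < 24 or pos + length > len(collection):
            raise ValueError(f"implausible record length {length} at offset {pos}")
        fixed = fix_record(collection[pos:pos + length])
        if len(fixed) != length:
            raise ValueError(f"fix changed the length of the record at offset {pos}")
        out += fixed
        pos += length
    return bytes(out)
```

Rejecting length changes keeps every record at its original offset, so the output could replace the original file without invalidating existing references to individual records.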