Open hornc opened 4 years ago
I got those "No separator at end of field" errors while trying to generate an MRC from a TXT with fields bigger than what ISO 2709 allows. Truncating those lines solved the problem. On Koha IRC someone mentioned this may also happen if there are any stray \r carriage return characters. HTH.
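For context, the size limit comes from the ISO 2709 directory entry, which stores each field's length in four digits, so a field (including its terminator) cannot exceed 9999 bytes. A minimal sketch of a pre-conversion check follows; the function name `field_problems` is hypothetical, and the carriage-return test reflects only the IRC suggestion above, not a confirmed cause.

```python
MAX_FIELD_LENGTH = 9999  # largest value a 4-digit directory length can hold
FIELD_TERMINATOR = b"\x1e"


def field_problems(field_data: bytes) -> list[str]:
    """List reasons why `field_data` (without its terminator) may break a record."""
    problems = []
    # The directory length counts the field terminator as part of the field.
    if len(field_data) + len(FIELD_TERMINATOR) > MAX_FIELD_LENGTH:
        problems.append("too long for a 4-digit directory length")
    # Reported on IRC as another possible trigger (unconfirmed).
    if b"\r" in field_data:
        problems.append("contains a carriage return")
    return problems
```

Running this over each field before building the record would flag the lines that needed truncating.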
https://github.com/hornc/marcia/blob/master/fixindex.py has code to "fix" the indexes of a single binary MARC record. Modifying that to work with a MARC collection will hopefully resolve the issues.
Example, where it currently only works on fixing the first record:
yaz-marcdump <(head -c1027 fix-attempt.mrc )
01027cam a2200241 4500
008 700115s1969 inu 00100 eng
035 $a (Sirsi) AAA-0001
090 $a HN373.M213 $i 37348000000018
100 10 $a Maier, Hans, $d 1931-
240 10 $a Revolution und Kirche. $l English.
245 10 $a Revolution and church : $b the early history of Christian democracy, 1789-1901 / $c Translated by Emily M. Schossberger.
260 0 $a Notre Dame : $b University of Notre Dame Press, $c 1969.
300 $a xiv, 326 p. ; $c 24 cm.
440 0 $a Studies in Christian democracy ; $n v4.
500 $a Translation of Revolution und Kirche.
504 $a Bibliography: p. 298-314.
650 0 $a Christian democracy $z Europe $x History $y 18th century.
650 0 $a Christian democracy $z Europe $x History $y 19th century.
650 0 $a Church and state $x Catholic Church $x History $y 18th century.
650 0 $a Church and state $x Catholic Church $x History $y 19th century.
948 $a 02/08/1991 $b 09/12/2001
949 $a HN373.M213 $w LC $m UPEI $z NOITEM
901 $a 1 $b System $c 1
compared to original:
yaz-marcdump <(head -c1027 marc-for-openlibrary-bigset.mrc )
01027cam a2200241 4500
008 700115s1969 inu 00100 eng
(No separator at end of field length=40)
035 $a (Sirsi) AAA-00
(No separator at end of field length=20)
090 1 $ $a HN373.M213 $i 37348000000
(No separator at end of field length=30)
100 18
(Separator but not at end of field length=23)
240 31
(Separator but not at end of field length=36)
245 is $.
(Separator but not at end of field length=120)
260 rg $r .
(Separator but not at end of field length=56)
300 c1 $6 9.
(Separator but not at end of field length=25)
440 c2 $ cm.
(Separator but not at end of field length=41)
500 y $n v4.
(Separator but not at end of field length=41)
504 d $i rche.
(Separator but not at end of field length=29)
650 . $9 8-314.
(Separator but not at end of field length=55)
650 th $c entury.
(Separator but not at end of field length=55)
650 9t $ century.
(Separator but not at end of field length=61)
650 18 $h century.
(Separator but not at end of field length=61)
948 y1 $t h century.
(Separator but not at end of field length=26)
949 99 $b 09/12/2001
(Separator but not at end of field length=32)
901 LC $m UPEI $z NOITE
(No separator at end of field length=16)
EDIT: command to fix and convert in one line:
yaz-marcdump <(fixindex.py marc-for-openlibrary-bigset.mrc | head -c1027 ) | cat
This is a bit of a complicated one; @judec brought the symptoms to my attention from the archive.org side.
A recent example that shows all the aspects of this issue (will generally require admin access to view all of the logs -- sorry):
From https://openlibrary.org/admin/imports there was this recent error:
Checking the archive.org record, https://archive.org/details/isbn_9780834200692_e0j5, we see that it has an id of isbn_<isbn> but does NOT have an isbn metadata field populated (this is bad for archive.org metadata). It also shows a corrupt MARCXML: https://archive.org/download/isbn_9780834200692_e0j5/isbn_9780834200692_e0j5_archive_marc.xml (all the <!-- No separator at end of field length=40 --> comments in the XML are put there by yaz-marcdump and indicate that the binary MARC directory is corrupt).
This corrupt source binary MARC is why the OL import failed.
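One way to flag derived MARCXML files like this is to count the diagnostic comments yaz-marcdump leaves behind. This is only a sketch: it assumes both message texts quoted in this issue appear in the XML in the same comment form, and `count_directory_warnings` is a hypothetical helper name.

```python
import re

# Diagnostic texts taken from the yaz-marcdump output quoted in this issue.
DIAGNOSTIC = re.compile(
    r"<!--\s*(No separator at end of field|Separator but not at end of field)"
    r"\s+length=\d+\s*-->"
)


def count_directory_warnings(marcxml: str) -> int:
    """Count yaz-marcdump directory diagnostics embedded as XML comments."""
    return sum(1 for _ in DIAGNOSTIC.finditer(marcxml))
```

A non-zero count suggests the source binary record had a directory that did not match its field data.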
Result:
The root cause of this kind of issue seems to be that all of these items match records in one bulk MARC item: https://archive.org/details/marc_upei. It looks like it has a consistent directory off-by-one issue throughout the many thousands of records in there.
I have a script from ages ago that can fix off-by-one errors in a single binary MARC record, and I recently discovered code in OL that does the same thing: https://github.com/internetarchive/openlibrary/blob/970b31b69e36651a57a59ba8421d8420f1742236/openlibrary/catalog/marc/fast_parse.py#L194-L202
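To illustrate the general idea (this is not the code in fixindex.py or fast_parse.py), here is a minimal sketch that recomputes every directory length and offset from the field terminators actually present in the data. It assumes the leader's base address is correct, the field terminators are intact, and the directory has one entry per field in the same order. The name `rebuild_directory` is hypothetical.

```python
FIELD_TERMINATOR = b"\x1e"
RECORD_TERMINATOR = b"\x1d"
LEADER_LEN = 24
ENTRY_LEN = 12  # 3-byte tag + 4-digit length + 5-digit start offset


def rebuild_directory(record: bytes) -> bytes:
    """Return `record` with directory lengths and offsets recomputed from its data."""
    base = int(record[12:17])  # leader bytes 12-16: base address of data
    directory = record[LEADER_LEN:base - 1]  # byte base-1 terminates the directory
    if len(directory) % ENTRY_LEN:
        raise ValueError("directory is not a whole number of 12-byte entries")
    tags = [directory[i:i + 3] for i in range(0, len(directory), ENTRY_LEN)]

    data = record[base:]
    if data.endswith(RECORD_TERMINATOR):
        data = data[:-1]
    if not data.endswith(FIELD_TERMINATOR):
        raise ValueError("data area does not end with a field terminator")
    # Drop the empty piece that split() leaves after the last terminator.
    fields = data.split(FIELD_TERMINATOR)[:-1]
    if len(fields) != len(tags):
        raise ValueError("directory entry count does not match field count")

    entries = []
    offset = 0
    for tag, field in zip(tags, fields):
        length = len(field) + 1  # directory length includes the field terminator
        entries.append(b"%s%04d%05d" % (tag, length, offset))
        offset += length

    fixed = record[:LEADER_LEN] + b"".join(entries) + record[base - 1:]
    assert len(fixed) == len(record)  # only directory digits changed
    return fixed
```

Because directory entries are fixed-width, rewriting them leaves the record's total length untouched.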
Proposal & Constraints
marc-for-openlibrary-bigset_original_with_offset_errors.mrc)
The last point is important because if the record start and end points don't change, the repaired MARC collection can be a drop-in replacement, and all references to individual records will still be accurate.
If the record start and end points change, then the whole MARC collection will need to be deleted and a new collection re-generated or re-acquired.
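The drop-in constraint can be checked mechanically. The sketch below walks both files using the 5-digit record length at the start of each leader and compares the resulting (start, length) pairs. It assumes the leader lengths are trustworthy in both files, which the head -c1027 example above suggests for the first record at least; the function names are hypothetical.

```python
def record_spans(collection: bytes) -> list[tuple[int, int]]:
    """Return (start, length) for each record, trusting the leader's 5-digit length."""
    spans = []
    pos = 0
    while pos < len(collection):
        length = int(collection[pos:pos + 5])
        if length < 24:  # shorter than a leader: stop rather than loop forever
            raise ValueError(f"implausible record length at byte {pos}")
        spans.append((pos, length))
        pos += length
    return spans


def is_drop_in_replacement(original: bytes, repaired: bytes) -> bool:
    """True if both collections have identical size and record boundaries."""
    return (
        len(original) == len(repaired)
        and record_spans(original) == record_spans(repaired)
    )
```

If this returns False for a repaired file, existing references to individual records can no longer be assumed to point at the right bytes.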
Another approach
Simply delete all references to the corrupt MARC as source records from OL data. The OL source_record information is where these MARCs are being passed to archive.org.
Repairing off-by-one errors in MARC is what allowed us to import these corrupt records in the first place, but other systems are falling over on the same data. Pymarc and yaz-marcdump fail to parse these. (Relates to comments on #2865 re. using Pymarc.)
Related files
Stakeholders