Closed LouisStAmour closed 5 years ago
This feels like an out-of-order bug. Near the start of a very large (2.4-million-record) MARC file, the latest trunk copy of marcmap was working flawlessly:
But later on I was trying to find MARC records near an offset that was giving me grief when trying to validate with https://github.com/pkiraly/metadata-qa-marc. While trying to troubleshoot a NullPointerException with that tool, all I knew was that I wanted to dump just the next MARC record after u2407795. So I went to find it:
Okay, looks good, I thought. So I went to dump the records:
And I'm totally confused, because u2407795.marc ended up being u2407808, u2407796.marc was u2407809, and u2407798.marc was u240781.
Time to try and figure this out using other tools ;-)
So for whatever reason, the ~/go/bin/marcdump tool has valid offsets and valid MARC record numbers, but the record numbers don't match the offsets it found!
Well, since we know the offsets are correct even if the names aren't, let's use the offsets! ;-)
And now some more record dumping, this time "u2407769" (actually u2407795), "u2407772" (actually u2407796) and "u2407774" (actually u2407798):
Mission accomplished! =D
But that's one heck of a workaround ...
Still, very happy to have these tools! I've been playing with them all day.
Thanks for this detailed bug report.
But that's one heck of a workaround ...
Yes, it's just a bug.
So for whatever reason, the ~/go/bin/marcdump tool has valid offsets and valid marc record numbers, but the record numbers don't match the offsets it found!
OK, I have a hypothesis (some concurrency issue), which will take me a bit to test thoroughly.
Thanks! Glad to help. I agree, re: concurrency. I was going to try reducing the concurrency by using command-line args but the option wasn’t there for marcmap the way it was for marctojson. It just occurred to me I could have recompiled. I’ll try that in a bit.
I've an idea -- the source code here uses yaz and awk, and as it happens, the fallback (full parsing) fails early because some of the MARC records contain up to two "001" fields. I never noticed because the JSON output always shows just one.
$ ~/go/bin/marcmap endecabibs.marc
2018/02/11 21:30:53 invalid 001 field count (2) for u1033898
The above command was modified from the original to force printing the first 001 field, and to force fallback = true.
$ ~/go/bin/marcmap endecabibs.marc | grep 'u1033898'
u1033898 19460042 1033
^C
$ dd skip=19460042 count=1033 if=endecabibs.marc of=u1033898.marc bs=1
1033+0 records in
1033+0 records out
1033 bytes transferred in 0.015943 secs (64793 bytes/sec)
We can see some of the differences between yaz-marcdump and marcdump here:
$ ~/go/bin/marcdump u1033898.marc
001 u1033898
001 5
003 SIRSI
005 20040220111428.0
008 040120s2003 gw a 000 0 eng
010 [ ] [(a) 2004296092]
015 [ ] [(a) GBA3-V5181]
016 [7 ] [(a) 968430341], [(2) GyFmDB]
020 [ ] [(a) 379132876X]
035 [ ] [(a) (Sirsi) 9416693]
035 [ ] [(a) 9416693]
040 [ ] [(a) UKM], [(c) UKM], [(d) OHX], [(d) CPL], [(d) DLC]
042 [ ] [(a) lccopycat]
050 [00] [(a) NB553.S32], [(b) A4 2003]
090 [ ] [(a) 730.92 S11 S11.3]
100 [1 ] [(a) Saint-Phalle, Niki de,], [(d) 1930-2002]
245 [10] [(a) Niki de Saint Phalle :], [(b) my art, my dreams /], [(c) edited by Carla Schulz-Hoffmann ; preface by Pierre Restany ; with contributions by Pierre Descargues ... [et al.].]
246 [3 ] [(a) My art, my dreams]
260 [ ] [(a) Munich ;], [(a) New York :], [(b) Prestel,], [(c) c2003.]
300 [ ] [(a) 159 p. :], [(b) ill. (some col.) ;], [(c) 28 cm.]
500 [ ] [(a) Errata slip inserted.]
600 [10] [(a) Saint-Phalle, Niki de,], [(d) 1930-2002], [(v) Catalogs.]
700 [1 ] [(a) Schulz-Hoffmann, Carla.]
700 [1 ] [(a) Descargues, Pierre.]
596 [ ] [(a) 110]
$ yaz-marcdump u1033898.marc
01033cam a2200325 a 4500
001 u1033898
001 5
003 SIRSI
005 20040220111428.0
008 040120s2003 gw a 000 0 eng
010 $a 2004296092
015 $a GBA3-V5181
016 7 $a 968430341 $2 GyFmDB
020 $a 379132876X
035 $a (Sirsi) 9416693
035 $a 9416693
040 $a UKM $c UKM $d OHX $d CPL $d DLC
042 $a lccopycat
050 00 $a NB553.S32 $b A4 2003
090 $a 730.92 S11 S11.3
100 1 $a Saint-Phalle, Niki de, $d 1930-2002
245 10 $a Niki de Saint Phalle : $b my art, my dreams / $c edited by Carla Schulz-Hoffmann ; preface by Pierre Restany ; with contributions by Pierre Descargues ... [et al.].
246 3 $a My art, my dreams
260 $a Munich ; $a New York : $b Prestel, $c c2003.
300 $a 159 p. : $b ill. (some col.) ; $c 28 cm.
500 $a Errata slip inserted.
600 10 $a Saint-Phalle, Niki de, $d 1930-2002 $v Catalogs.
700 1 $a Schulz-Hoffmann, Carla.
700 1 $a Descargues, Pierre.
596 $a 110
I can easily see how maybe 30 of these might have accumulated by that point, pushing the identifiers out of step with the offsets:
$ ~/go/bin/marcmap endecabibs.marc | grep 'u1033898' -C 3
u1033889 19457689 811
u1033892 19458500 725
u1033897 19459225 817
u1033898 19460042 1033
5 19461075 561
u1033899 19461636 944
u1033900 19462580 528
^C
The "5" should have been u1033899. Presumably marcmap collects one identifier per 001 field but one offset/length pair per record, so each extra 001 shifts every subsequent identifier by one.
I've attached the broken MARC record here, and two others, for testing: broken.marc.gz
$ ~/go/bin/marcmap broken.marc
u1033898 0 1033
5 1033 561
u1033899 1594 944
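To make that hypothesis concrete: if one identifier is collected per 001 field while offsets and lengths are collected per record, zipping the two lists reproduces the broken map above. A Go sketch of that failure mode (an illustration only, not the actual marcmap code; it assumes well-formed records):

package main

import (
	"fmt"
	"log"
	"os"
	"strconv"
)

func main() {
	data, err := os.ReadFile("broken.marc")
	if err != nil {
		log.Fatal(err)
	}

	type span struct{ offset, length int }
	var spans []span // one entry per record
	var ids []string // one entry per 001 field -- the mismatch

	for pos := 0; pos+24 <= len(data); {
		reclen, err := strconv.Atoi(string(data[pos : pos+5])) // leader 00-04: record length
		if err != nil {
			log.Fatal(err)
		}
		spans = append(spans, span{pos, reclen})

		base, _ := strconv.Atoi(string(data[pos+12 : pos+17])) // leader 12-16: base address of data
		// directory: 12-byte entries (tag, length, start), ending at the 0x1e terminator
		for d := pos + 24; data[d] != 0x1e; d += 12 {
			if string(data[d:d+3]) == "001" {
				flen, _ := strconv.Atoi(string(data[d+3 : d+7]))
				fstart, _ := strconv.Atoi(string(data[d+7 : d+12]))
				// strip the field's trailing 0x1e terminator
				ids = append(ids, string(data[pos+base+fstart:pos+base+fstart+flen-1]))
			}
		}
		pos += reclen
	}

	// zipping ids against spans: one extra 001 shifts every later identifier
	for i, s := range spans {
		if i < len(ids) {
			fmt.Println(ids[i], s.offset, s.length)
		}
	}
}

Run against the attached broken.marc, this should print the same three misaligned lines as above.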
@LouisStAmour, excellent catch.
Theoretically, all fields, *except 001 (Control Number) and 005 (Date and Time of Latest Transaction)*, and subfields may be repeated. -- http://www.loc.gov/marc/specifications/specrecstruc.html#repeat
Emphasis mine. Real-world vs theory.
As a workaround I added a -safe flag to marcmap, which forces fallback to slower (but safer) record parsing to find the identifiers. If multiple identifiers are present, only the first one is picked (at the moment).
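For illustration, the "first one wins" rule might look like this in Go (a sketch of the behavior, not the actual marctools implementation):

package main

import (
	"fmt"
	"log"
	"os"
	"strconv"
)

// firstIdentifier walks a raw MARC record's directory and returns the value
// of the first 001 field; if several 001 fields are present, only the first
// one is used -- the -safe behavior described above.
func firstIdentifier(rec []byte) (string, bool) {
	base, err := strconv.Atoi(string(rec[12:17])) // leader 12-16: base address of data
	if err != nil {
		return "", false
	}
	for d := 24; rec[d] != 0x1e; d += 12 { // 12-byte directory entries
		if string(rec[d:d+3]) == "001" {
			flen, _ := strconv.Atoi(string(rec[d+3 : d+7]))
			fstart, _ := strconv.Atoi(string(rec[d+7 : d+12]))
			return string(rec[base+fstart : base+fstart+flen-1]), true // drop 0x1e terminator
		}
	}
	return "", false
}

func main() {
	data, err := os.ReadFile("fixtures/issue-5.mrc")
	if err != nil {
		log.Fatal(err)
	}
	reclen, _ := strconv.Atoi(string(data[0:5])) // first record's length, from its leader
	if id, ok := firstIdentifier(data[:reclen]); ok {
		fmt.Println(id) // prints u1033898, although the record carries two 001 fields
	}
}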
Example:
$ marcmap fixtures/issue-5.mrc
u1033898 0 1033
5 1033 561
u1033899 1594 944
$ marcmap -safe fixtures/issue-5.mrc
u1033898 0 1033
u1033899 1033 561
u1033900 1594 944
$ marcmap -h
Usage: marcmap [OPTIONS] MARCFILE
-cpuprofile string
write cpu profile to file
-o string
output to sqlite3 file
-safe
use slower, but safer methods to extract record identifiers
-v prints current program version
If that's OK for you and fixes the problem, I'll release a minor marctools update.
Closing this for now. If this issue remains with the latest version, please open another issue. Thanks!