ubleipzig / marctools

Various MARC command line utilities.
GNU General Public License v3.0

Multiple 001 fields cause marcmap IDs to silently mismatch marcmap index values #5

Closed LouisStAmour closed 5 years ago

LouisStAmour commented 6 years ago

This feels like an out-of-order bug, but near the start of a very large (2.4 million record) MARC file, the latest trunk copy of marcmap was working flawlessly:

$ ~/go/bin/marcmap endecabibs.marc | grep 'u1017436'
u1017436    8762603 580
^C
$ dd skip=8762603 count=580 if=endecabibs.marc of=u1017436.marc bs=1
580+0 records in
580+0 records out
580 bytes transferred in 0.004539 secs (127781 bytes/sec)
$ ~/go/bin/marcdump u1017436.marc 
001 u1017436
003 SIRSI
008                                 0  chi  
010 [  ] [(o) 027382]
035 [  ] [(a) (Sirsi) 9842135]
090 [  ] [(a) INTERNATIONAL HE]
100 [1 ] [(6) 880-01], [(a) He, Yunshi]
245 [10] [(6) 880-02], [(a) HOCC live in unity 2006], [(h) [sound recording] :], [(b) we stand as one.]
246 [30] [(a) We stand as one]
300 [  ] [(a) 2 sound discs +], [(e) 1 DVD.]
546 [  ] [(a) In Cantonese.]
902 [  ] [(a) CD]
880 [1 ] [(6) 100-01/], [(1) ], [(a) 何韻詩]
880 [10] [(6) 245-02/], [(1) ], [(a) HOCC live in unity 2006: we stand as(粵)]
596 [  ] [(a) 30 31 42]
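
The dd call above copies `count` bytes starting at byte `skip`. For anyone without dd at hand, here is a minimal Python sketch of the same extraction. `read_slice` and `looks_like_marc_record` are hypothetical helper names, not part of marctools. The sanity check assumes only that a MARC leader starts with the record length as five ASCII digits and that a record ends with the record terminator byte 0x1D.

```python
def read_slice(path, offset, length):
    """Return `length` bytes starting at byte `offset` (like dd bs=1 skip=... count=...)."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(length)
    if len(data) != length:
        raise ValueError(f"wanted {length} bytes at offset {offset}, got {len(data)}")
    return data


def looks_like_marc_record(data):
    """Cheap plausibility check: the leader's declared length matches the slice."""
    return (
        len(data) >= 24                 # a leader is 24 bytes long
        and data[:5].isdigit()          # leader bytes 0-4: record length
        and int(data[:5]) == len(data)
        and data.endswith(b"\x1d")      # record terminator
    )
```

If an offset/length pair from marcmap is right, the slice should pass `looks_like_marc_record`. The check says nothing about whether the ID reported for that slice is the right one.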

But later on, I was trying to find MARC records near an offset that was giving me grief when trying to validate (using https://github.com/pkiraly/metadata-qa-marc). While troubleshooting a NullPointerException in that tool, all I knew was that I wanted to dump just the next MARC record after u2407795. So I went to find it:

$ ~/go/bin/marcmap endecabibs.marc | grep 'u2407795' -C 3
u2407782    998027560   1678
u2407784    998029238   1886
u2407788    998031124   1952
u2407795    998033076   1789
u2407796    998034865   1587
u2407798    998036452   1004
u2407799    998037456   1805

Okay, looks good I thought. So I went to dump the records:

$ dd skip=998033076 count=1789 if=endecabibs.marc of=bad-u2407795.marc bs=1 && dd skip=998034865 count=1587 if=endecabibs.marc of=bad-u2407796.marc bs=1 && dd skip=998036452 count=1004 if=endecabibs.marc of=bad-u2407798.marc bs=1
$ ~/go/bin/marcdump bad-u2407795.marc
001 u2407808
003 SIRSI
005 20080503      .0
007 sd*fungnnmnned
008 070829s2007    oncnnn j            eng d
028 [  ] [(a) TB 17,355]
040 [  ] [(a) CaOTBNL], [(b) eng]
043 [  ] [(a) a-cc---]
055 [01] [(a) DS778.7]
055 [ 3] [(a) DS778.7], [(b) Z46 2007]
082 [0 ] [(a) 951.05609], [(2) 22]
090 [  ] [(a) 951.05609 ZHA ZHA]
100 [1 ] [(a) Zhang, Ange.]
245 [10] [(a) Red land, Yellow River], [(h) [sound recording] :], [(b) a story from the Cultural Revolution /], [(c) by Ange Zhang.]
260 [  ] [(a) Toronto :], [(b) CNIB,], [(c) 2007.]
300 [  ] [(a) 1 sound disc (1 hrs., 15 min.) :], [(b) digital ;], [(c) 12 cm.]
500 [  ] [(a) Some descriptions of violence.]
500 [  ] [(a) Female reader.]
506 [  ] [(a) Restricted to PRINT DISABLED Patrons.]
511 [0 ] [(a) Read by Inez Somerville.]
520 [  ] [(a) In 1966, Zhang was a teen in Beijing when Mao Zedong began the Cultural Revolution. Though he was the son of a "bad guy" (a famous writer), he became swept up in the revolution, until the violence and his father's arrest made him question its goals. In 1968 was sent to a small village to learn how to farm, where he discovered his true calling - art.  Grades 2-4 and older readers. 2004.]
534 [  ] [(c) Toronto : Groundwood Books, 2004.]
538 [  ] [(a) Digital audio book in DAISY format.]
538 [  ] [(a) Requires DAISY talking book player.]
538 [  ] [(a) Recording standard: Daisy 2.02 standard.]
538 [  ] [(a) MP3 compression.]
538 [  ] [(a) Compression rate: 32.]
600 [10] [(a) Zhang, Ange], [(x) Childhood and youth], [(v) Juvenile literature.]
650 [ 0] [(a) DAISY talking books]
651 [ 0] [(a) China], [(x) History], [(y) Cultural Revolution, 1966-1976], [(v) Personal narratives], [(v) Juvenile literature.]
700 [1 ] [(a) Somerville, Inez.]
902 [  ] [(a) DAISY Talking Book]
596 [  ] [(a) 50]
$ ~/go/bin/marcdump bad-u2407796.marc
001 u2407809
003 SIRSI
005 20080503      .0
007 sd*fungnnmnned
008 070829s2007    oncnnn j            eng d
028 [  ] [(a) TB 17,356]
040 [  ] [(a) CaOTBNL], [(b) eng]
043 [  ] [(a) n-cn---]
055 [02] [(a) FC26*]
055 [ 3] [(a) FC26 W6], [(b) M33 2007]
082 [  ] [(a) j920.72/0971], [(2) 22]
090 [  ] [(a) 920.72097 MACL]
100 [1 ] [(a) MacLeod, Elizabeth.]
245 [14] [(a) The kids book of great Canadian women], [(h) [sound recording] /], [(c) by Elizabeth MacLeod.]
260 [  ] [(a) Toronto :], [(b) CNIB,], [(c) 2007.]
300 [  ] [(a) 1 sound disc (3 hrs., 52 min.) :], [(b) digital ;], [(c) 12 cm..]
500 [  ] [(a) Female reader.]
506 [  ] [(a) Restricted to PRINT DISABLED Patrons.]
511 [0 ] [(a) Read by Angela Willson.]
520 [  ] [(a) From artists and inventors to astronauts and engineers, Canadian women have played an extraordinary role in the development of Canada. Meet more than 130 women and read about their amazing feats in exploration, science, the arts, politics and many other fields. Some made their mark hundreds of years ago, while others are changing Canada today. Grades 4-7. 2006.]
534 [  ] [(c) Toronto : Kids Can Press, 2006.]
538 [  ] [(a) Digital audio book in DAISY format.]
538 [  ] [(a) Requires DAISY talking book player.]
538 [  ] [(a) Recording standard: Daisy 2.02 standard.]
538 [  ] [(a) MP3 compression.]
538 [  ] [(a) Compression rate: 32.]
650 [ 0] [(a) DAISY talking books]
650 [ 0] [(a) Women], [(z) Canada], [(v) Biography], [(v) Juvenile literature.]
700 [1 ] [(a) Willson, Angela.]
902 [  ] [(a) DAISY Talking Book]
596 [  ] [(a) 50]

And I'm totally confused, because bad-u2407795.marc ended up being u2407808, bad-u2407796.marc was u2407809, and bad-u2407798.marc was u240781.

Time to try and figure this out using other tools ;-)

$ brew install grep
$ ggrep -baron u2407795 endecabibs.marc
1:998018305:u2407795
$ ggrep -baron u2407796 endecabibs.marc
1:998019182:u2407796
$ ggrep -baron u2407797 endecabibs.marc
$ ggrep -baron u2407798 endecabibs.marc
1:998020114:u2407798
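
The same byte-offset search can be done without GNU grep. Below is a small Python sketch; `find_offsets` is a hypothetical name, and the chunk size is an arbitrary choice. It reads the file in chunks and keeps a short tail between reads so that a match straddling a chunk boundary is not missed.

```python
def find_offsets(path, needle, chunk_size=1 << 20):
    """Yield the byte offset of every occurrence of `needle` (bytes) in the file."""
    if not needle:
        raise ValueError("needle must not be empty")
    overlap = len(needle) - 1           # longest partial match that can span two chunks
    with open(path, "rb") as f:
        base = 0                        # file offset of buf[0]
        buf = b""
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf += chunk
            pos = buf.find(needle)
            while pos != -1:
                yield base + pos
                pos = buf.find(needle, pos + 1)
            # Keep only the tail that could still begin a match.
            keep = buf[-overlap:] if overlap else b""
            base += len(buf) - len(keep)
            buf = keep
```

For example, `list(find_offsets("endecabibs.marc", b"u2407795"))` should report the same offset as the ggrep call above, provided the ID occurs only once in the file.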

So for whatever reason, the ~/go/bin/marcmap tool has valid offsets and valid MARC record numbers, but the record numbers don't match the offsets it found!

Well, since we know the offsets are correct even if the names aren't, let's use the offsets! ;-)

$ ~/go/bin/marcmap endecabibs.marc | egrep '\t998018' -C 3
u2407761    998014495   1698
u2407763    998016193   747
u2407767    998016940   1160
u2407769    998018100   805
u2407772    998018905   860
u2407774    998019765   1629
u2407775    998021394   1759
u2407779    998023153   784

And now some more record dumping, this time "u2407769" (actually u2407795), "u2407772" (actually u2407796) and "u2407774" (actually u2407798):

$ dd skip=998018100 count=805 if=endecabibs.marc of=u2407795.marc bs=1 && dd skip=998018905 count=860 if=endecabibs.marc of=u2407796.marc bs=1 && dd skip=998019765 count=1629 if=endecabibs.marc of=u2407798.marc bs=1
805+0 records in
805+0 records out
805 bytes transferred in 0.007088 secs (113573 bytes/sec)
860+0 records in
860+0 records out
860 bytes transferred in 0.010046 secs (85606 bytes/sec)
1629+0 records in
1629+0 records out
1629 bytes transferred in 0.015619 secs (104296 bytes/sec)

Mission accomplished! =D

$ ~/go/bin/marcdump u2407795.marc
001 u2407795
003 SIRSI
008 080401n               j      000 0 chi u
090 [  ] [(a) 372.7044 ZHU]
100 [1 ] [(6) 880-01], [(a) Zhu, Huilan.]
245 [  ] [(6) 880-02], [(a) Jie ti shu xue :], [(b) 5 shui di er jie /], [(c) zhuo ze: Zhu Huilan (Han).]
440 [ 0] [(6) 880-03], [(a) You er yuan/xue qian ban shi yong de shu xue shu]
546 [  ] [(a) Text in simplified Chinese characters.]
650 [ 0] [(a) Mathematics], [(x) Study and teaching (Early childhood)]
650 [ 0] [(a) Art in mathematics education], [(v) Juvenile literature.]
650 [ 4] [(a) Chinese books (Simplified characters)], [(v) Juvenile literature.]
596 [  ] [(a) 66]
880 [  ] [(6) 100-01], [(a) 朱慧兰.]
880 [  ] [(6) 245-02], [(a) 阶梯数学 :], [(b) 5岁第2阶 /], [(c) 作者: 朱慧兰(韩).]
880 [  ] [(6) 440-03], [(a) 幼儿园/学前班适用的数学书]
$ ~/go/bin/marcdump u2407796.marc
001 u2407796
003 SIRSI
005 20080331162830.0
008 070524s2007    cc a          001 0 chi  
010 [  ] [(a)   2007929507]
090 [  ] [(a) 004.16 IPH]
245 [00] [(6) 880-01], [(a) iPhone the Bible wan jia sheng jing.]
246 [30] [(6) 880-02], [(a) Wan jia sheng jing]
246 [31] [(6) 880-03], [(a) iPhone the bible]
546 [  ] [(a) Text in traditional Chinese characters.]
630 [00] [(a) iPhoto (Computer file)]
650 [ 0] [(a) iPhone (Smartphone)]
650 [ 0] [(a) Cell phones.]
650 [ 0] [(a) Digital music players.]
650 [ 0] [(a) Pocket computers.]
650 [ 4] [(a) Chinese books (Traditional characters)]
596 [  ] [(a) 4]
880 [  ] [(6) 245-01/$1], [(a) iphone the Bible 玩家聖經.]
880 [  ] [(6) 246-02/$1], [(a) 玩家聖經]
880 [30] [(6) 246-03], [(6) 880-03], [(a) iphone wan jia sheng jing]
880 [  ] [(6) 246-03/$1], [(a) iphonewan 玩家聖經]

But that's one heck of a workaround ...

Still, very happy to have these tools! I've been playing with them all day.

miku commented 6 years ago

Thanks for this detailed bug report.

> But that's one heck of a workaround ...

Yes, it's just a bug.

> So for whatever reason, the ~/go/bin/marcmap tool has valid offsets and valid MARC record numbers, but the record numbers don't match the offsets it found!

Ok, I have a hypothesis (some concurrency issue) - which will take me a bit to test thoroughly.

LouisStAmour commented 6 years ago

Thanks! Glad to help. I agree, re: concurrency. I was going to try reducing the concurrency by using command-line args but the option wasn’t there for marcmap the way it was for marctojson. It just occurred to me I could have recompiled. I’ll try that in a bit.

LouisStAmour commented 6 years ago

I've an idea -- the source code here uses yaz and awk, and as it happens, the fallback (full parsing) fails early because some of the MARC records contain up to two "001" fields. I never noticed because the JSON output always shows one.

$ ~/go/bin/marcmap endecabibs.marc
2018/02/11 21:30:53 invalid 001 field count (2) for u1033898

The above command was modified from the original to force printing the first 001 field, and to force fallback = true.

$ ~/go/bin/marcmap endecabibs.marc | grep 'u1033898'
u1033898    19460042    1033
^C

$ dd skip=19460042 count=1033 if=endecabibs.marc of=u1033898.marc bs=1
1033+0 records in
1033+0 records out
1033 bytes transferred in 0.015943 secs (64793 bytes/sec)

We can see some of the differences here between yaz and marcdump:

$ ~/go/bin/marcdump u1033898.marc 
001 u1033898
001 5
003 SIRSI
005 20040220111428.0
008 040120s2003    gw a          000 0 eng  
010 [  ] [(a)   2004296092]
015 [  ] [(a) GBA3-V5181]
016 [7 ] [(a) 968430341], [(2) GyFmDB]
020 [  ] [(a) 379132876X]
035 [  ] [(a) (Sirsi)  9416693]
035 [  ] [(a) 9416693]
040 [  ] [(a) UKM], [(c) UKM], [(d) OHX], [(d) CPL], [(d) DLC]
042 [  ] [(a) lccopycat]
050 [00] [(a) NB553.S32], [(b) A4 2003]
090 [  ] [(a) 730.92 S11 S11.3]
100 [1 ] [(a) Saint-Phalle, Niki de,], [(d) 1930-2002]
245 [10] [(a) Niki de Saint Phalle :], [(b) my art, my dreams /], [(c) edited by Carla Schulz-Hoffmann ; preface by Pierre Restany ; with contributions by Pierre Descargues ... [et al.].]
246 [3 ] [(a) My art, my dreams]
260 [  ] [(a) Munich ;], [(a) New York :], [(b) Prestel,], [(c) c2003.]
300 [  ] [(a) 159 p. :], [(b) ill. (some col.) ;], [(c) 28 cm.]
500 [  ] [(a) Errata slip inserted.]
600 [10] [(a) Saint-Phalle, Niki de,], [(d) 1930-2002], [(v) Catalogs.]
700 [1 ] [(a) Schulz-Hoffmann, Carla.]
700 [1 ] [(a) Descargues, Pierre.]
596 [  ] [(a) 110]
$ yaz-marcdump u1033898.marc 
01033cam a2200325 a 4500
001 u1033898
001 5
003 SIRSI
005 20040220111428.0
008 040120s2003    gw a          000 0 eng  
010    $a   2004296092
015    $a GBA3-V5181
016 7  $a 968430341 $2 GyFmDB
020    $a 379132876X
035    $a (Sirsi)  9416693
035    $a 9416693
040    $a UKM $c UKM $d OHX $d CPL $d DLC
042    $a lccopycat
050 00 $a NB553.S32 $b A4 2003
090    $a 730.92 S11 S11.3
100 1  $a Saint-Phalle, Niki de, $d 1930-2002
245 10 $a Niki de Saint Phalle : $b my art, my dreams / $c edited by Carla Schulz-Hoffmann ; preface by Pierre Restany ; with contributions by Pierre Descargues ... [et al.].
246 3  $a My art, my dreams
260    $a Munich ; $a New York : $b Prestel, $c c2003.
300    $a 159 p. : $b ill. (some col.) ; $c 28 cm.
500    $a Errata slip inserted.
600 10 $a Saint-Phalle, Niki de, $d 1930-2002 $v Catalogs.
700 1  $a Schulz-Hoffmann, Carla.
700 1  $a Descargues, Pierre.
596    $a 110

I can easily see how maybe 30 of these might have accumulated, pushing the offsets off:

$ ~/go/bin/marcmap endecabibs.marc | grep 'u1033898' -C 3
u1033889    19457689    811
u1033892    19458500    725
u1033897    19459225    817
u1033898    19460042    1033
5   19461075    561
u1033899    19461636    944
u1033900    19462580    528
^C

The "5" should have been u1033899.
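
One way the shift could arise, sketched in Python. This assumes (a guess, not confirmed from the marcmap source) that identifiers and record spans are collected as two separate streams and matched by position. The data is taken from the broken.marc output in this thread; `pair_positionally` and `pair_per_record` are hypothetical names.

```python
# (offset, length) of the three records in broken.marc.
spans = [(0, 1033), (1033, 561), (1594, 944)]

# Every 001 value in file order; the first record carries two of them.
all_001_values = ["u1033898", "5", "u1033899", "u1033900"]

# The same values grouped by the record they belong to.
ids_per_record = [["u1033898", "5"], ["u1033899"], ["u1033900"]]


def pair_positionally(ids, spans):
    """Pair the n-th identifier with the n-th span (the assumed fast path)."""
    return [(rec_id, off, length) for rec_id, (off, length) in zip(ids, spans)]


def pair_per_record(ids_per_record, spans):
    """Pair each span with the first 001 found inside its own record."""
    return [(ids[0], off, length) for ids, (off, length) in zip(ids_per_record, spans)]


for row in pair_positionally(all_001_values, spans):
    print(*row, sep="\t")
```

Under that assumption, every extra 001 pushes all later identifiers one span further down, which would explain why the mismatch grows with the number of affected records earlier in the file.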

I've attached the broken MARC record here, and two others, for testing: broken.marc.gz

$ ~/go/bin/marcmap broken.marc 
u1033898    0   1033
5   1033    561
u1033899    1594    944

miku commented 6 years ago

@LouisStAmour, excellent catch.

Theoretically, all fields, except 001 (Control Number) and 005 (Date and Time of Latest Transaction), and subfields may be repeated. -- http://www.loc.gov/marc/specifications/specrecstruc.html#repeat

Emphasis mine. Real-world vs theory.

As a workaround I added a -safe flag to marcmap which forces fallback to slower (but safer) record parsing to find the identifiers. If multiple identifiers are present, only the first one is picked (at the moment).

Example:

$ marcmap fixtures/issue-5.mrc
u1033898    0   1033
5   1033    561
u1033899    1594    944

$ marcmap -safe fixtures/issue-5.mrc
u1033898    0   1033
u1033899    1033    561
u1033900    1594    944
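
For illustration, a rough Python sketch of per-record parsing where the first 001 wins. This is not the actual marcmap code; `iter_records`, `control_numbers` and `safe_map` are hypothetical names, and the whole input is held in memory for brevity. It relies only on the MARC record structure: record length in leader bytes 0-4, base address of data in leader bytes 12-16, and 12-byte directory entries (tag, field length, start position).

```python
FIELD_END = b"\x1e"  # MARC field terminator


def iter_records(data):
    """Yield (offset, record_bytes), stepping by the length declared in each leader."""
    offset = 0
    while offset < len(data):
        length = int(data[offset:offset + 5])
        if length < 24:
            raise ValueError(f"implausible record length {length} at offset {offset}")
        yield offset, data[offset:offset + length]
        offset += length


def control_numbers(record):
    """Return the values of all 001 fields listed in the record's directory."""
    base = int(record[12:17])           # base address of data
    directory = record[24:base - 1]     # directory without its terminator
    values = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag, flen, start = entry[:3], int(entry[3:7]), int(entry[7:12])
        if tag == b"001":
            field = record[base + start:base + start + flen]
            values.append(field.rstrip(FIELD_END).decode("utf-8", "replace"))
    return values


def safe_map(data):
    """Return (id, offset, length) rows, using the first 001 of each record."""
    rows = []
    for offset, record in iter_records(data):
        ids = control_numbers(record)
        rows.append((ids[0] if ids else "", offset, len(record)))
    return rows
```

Because the identifier is read from inside the same byte range that defines the offset and length, an extra 001 cannot shift later rows. A real tool would stream the file instead of loading it whole.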

$ marcmap -h
Usage: marcmap [OPTIONS] MARCFILE
  -cpuprofile string
        write cpu profile to file
  -o string
        output to sqlite3 file
  -safe
        use slower, but safer methods to extract record identifiers
  -v    prints current program version

If that's ok for you and fixes the problem, I would release a minor marctools update.

miku commented 5 years ago

Closing this for now. If this issue remains with the latest version, please open another issue. Thanks!