pkiraly / qa-catalogue

QA catalogue – a metadata quality assessment tool for library catalogue records (MARC, PICA)
GNU General Public License v3.0

java.lang.NullPointerException at SubfieldDefinition.java:79 #46

Closed LouisStAmour closed 4 years ago

LouisStAmour commented 6 years ago

Hi, trying this tool out on some Toronto Public Library MARC21 records. Managed to get through a million records before I hit a snag:

NullPointerException at https://github.com/pkiraly/metadata-qa-marc/blob/9e3ed1fd7b7303369d598f634cd5fcab6c4cc151/src/main/java/de/gwdg/metadataqa/marc/definition/SubfieldDefinition.java#L79

I've attached the offending MARC record here: u2407796.marc.gz

$ java -cp $HOME/.m2/repository/de/gwdg/metadataqa/metadata-qa-marc/0.2-SNAPSHOT/metadata-qa-marc-0.2-SNAPSHOT-jar-with-dependencies.jar de.gwdg.metadataqa.marc.cli.Validator -m MARC21 u2407796.marc

Feb 11, 2018 7:25:17 PM de.gwdg.metadataqa.marc.cli.Validator beforeIteration
INFO: marcVersion: MARC21, MARC21
limit: -1
offset: -1
MARC files: u2407796.marc
id: null
defaultRecordType: null
fixAlephseq: false
marcxml: false
lineSeparated: false
summary: false
fileName: validation-report.txt
format: TEXT

Feb 11, 2018 7:25:17 PM de.gwdg.metadataqa.marc.cli.Validator beforeIteration
INFO: output: validation-report.txt
Feb 11, 2018 7:25:17 PM de.gwdg.metadataqa.marc.cli.RecordIterator start
INFO: marcVersion: MARC21, MARC21
Feb 11, 2018 7:25:17 PM de.gwdg.metadataqa.marc.cli.RecordIterator start
INFO: processing: u2407796.marc
[main] INFO org.reflections.Reflections - Reflections took 300 ms to scan 1 urls, producing 1 keys and 283 values 
Feb 11, 2018 7:25:18 PM de.gwdg.metadataqa.marc.cli.RecordIterator start
SEVERE: java.lang.NullPointerException
java.lang.NullPointerException
    at de.gwdg.metadataqa.marc.definition.SubfieldDefinition.getPath(SubfieldDefinition.java:79)
    at de.gwdg.metadataqa.marc.DataField.validate(DataField.java:393)
    at de.gwdg.metadataqa.marc.MarcRecord.validate(MarcRecord.java:284)
    at de.gwdg.metadataqa.marc.cli.Validator.processRecord(Validator.java:117)
    at de.gwdg.metadataqa.marc.cli.RecordIterator.start(RecordIterator.java:86)
    at de.gwdg.metadataqa.marc.cli.Validator.main(Validator.java:51)

$ ~/go/bin/marcdump u2407796.marc

001 u2407796
003 SIRSI
005 20080331162830.0
008 070524s2007    cc a          001 0 chi  
010 [  ] [(a)   2007929507]
090 [  ] [(a) 004.16 IPH]
245 [00] [(6) 880-01], [(a) iPhone the Bible wan jia sheng jing.]
246 [30] [(6) 880-02], [(a) Wan jia sheng jing]
246 [31] [(6) 880-03], [(a) iPhone the bible]
546 [  ] [(a) Text in traditional Chinese characters.]
630 [00] [(a) iPhoto (Computer file)]
650 [ 0] [(a) iPhone (Smartphone)]
650 [ 0] [(a) Cell phones.]
650 [ 0] [(a) Digital music players.]
650 [ 0] [(a) Pocket computers.]
650 [ 4] [(a) Chinese books (Traditional characters)]
596 [  ] [(a) 4]
880 [  ] [(6) 245-01/$1], [(a) iphone the Bible 玩家聖經.]
880 [  ] [(6) 246-02/$1], [(a) 玩家聖經]
880 [30] [(6) 246-03], [(6) 880-03], [(a) iphone wan jia sheng jing]
880 [  ] [(6) 246-03/$1], [(a) iphonewan 玩家聖經]

pkiraly commented 6 years ago

Dear Louis,

Thanks for the bug report. I'll investigate it in the next few days and let you know what I find. Are these records from this data source: https://opendata.tplcs.ca/, or do they come directly from your local catalog?

Best, Péter

pkiraly commented 6 years ago

This is a bit tricky. The problem is that subfield $6 of field 880 is not repeatable. 880 is a reference field, which means that it should be handled differently from other fields.

880 [  ] [(6) 245-01/$1], [(a) iphone the Bible 玩家聖經.]

should first be transformed to

245 [  ] [(a) iphone the Bible 玩家聖經.]

and then we can analyze it. But in the case of

880 [30] [(6) 246-03], [(6) 880-03], [(a) iphone wan jia sheng jing]

there are two $6 subfields, and for the second one 880 would have to be transformed to 880, which is problematic.

I am still thinking about how to solve this problem...
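
To make the transformation described above concrete, here is a minimal sketch of parsing a $6 linkage value into its parts. It assumes the standard MARC 21 layout of $6 (linking tag, hyphen, two-digit occurrence number, optional script code after a slash). The class `Linkage` and its members are hypothetical names for illustration only and are not taken from the qa-catalogue code base.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of qa-catalogue): parses a MARC 21 $6 linkage value. */
final class Linkage {
    // Assumed layout: <linking tag>-<occurrence number>[/<script code>[/<orientation code>]]
    private static final Pattern FORMAT =
        Pattern.compile("^(\\d{3})-(\\d{2})(?:/([^/]+)(?:/([^/]+))?)?$");

    final String linkingTag;  // the tag the 880 field should be treated as, e.g. "245"
    final String occurrence;  // the occurrence number, e.g. "01"
    final String script;      // the script identification code, e.g. "$1"; null if absent

    private Linkage(String linkingTag, String occurrence, String script) {
        this.linkingTag = linkingTag;
        this.occurrence = occurrence;
        this.script = script;
    }

    /** Parses a $6 value such as "245-01/$1"; rejects values that do not fit the layout. */
    static Linkage parse(String value) {
        Matcher m = FORMAT.matcher(value);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a valid $6 value: " + value);
        }
        return new Linkage(m.group(1), m.group(2), m.group(3));
    }

    /** True if the value points back at 880 itself, which cannot be resolved to a real field. */
    boolean isSelfReference() {
        return "880".equals(linkingTag);
    }
}
```

With this sketch, `Linkage.parse("245-01/$1").linkingTag` gives the tag under which the 880 content could be validated, while a value such as `880-03` is flagged by `isSelfReference()`.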

pkiraly commented 6 years ago

Dear @LouisStAmour ,

I fixed the issue. The tool now skips checking the field content and reports an error message like this:

880$6   ambiguous linkage   There are multiple $6   https://www.loc.gov/marc/bibliographic/bd880.html (2 times)
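
As an illustration of this kind of check, the sketch below counts the $6 subfields of a field and returns an error line instead of resolving the linkage when there is more than one. It is not the actual fix: the class `LinkageCheck`, its methods, and the representation of subfields as `{code, value}` pairs are assumptions made for the example, and the message text only imitates the report line above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch (not the qa-catalogue implementation) of an ambiguous-linkage check. */
final class LinkageCheck {

    /** Collects the values of all $6 subfields; each subfield is a {code, value} pair. */
    static List<String> linkages(List<String[]> subfields) {
        List<String> found = new ArrayList<>();
        for (String[] subfield : subfields) {
            if ("6".equals(subfield[0])) {
                found.add(subfield[1]);
            }
        }
        return found;
    }

    /**
     * Returns an error line if the field has more than one $6, otherwise null.
     * A caller would skip content validation of the field when an error is returned.
     */
    static String check(String tag, List<String[]> subfields) {
        List<String> found = linkages(subfields);
        if (found.size() > 1) {
            return tag + "$6\tambiguous linkage\tThere are multiple $6";
        }
        return null;
    }
}
```

For the problematic field in the record above, the subfield list contains two $6 entries (`246-03` and `880-03`), so `check("880", subfields)` would return an error line rather than null.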

I also found that the file has almost 200 unparsable MARC records, which also caused some problems. I improved the error handling on that side as well.

Please check if it works for you, and please let me know the result.

Best, Péter