pkiraly / qa-catalogue

QA catalogue – a metadata quality assessment tool for library catalogue records (MARC, PICA)
GNU General Public License v3.0

Crash with NullPointerException while validating an MRC file #356

Open pabloab opened 8 months ago

pabloab commented 8 months ago

I recently found this project after searching for regex patterns for each MARC 21 subfield. I am a little overwhelmed by all its features. I started by trying to get a report on a set of 102964 records from a MARC file exported from Koha (v22.05).

It keeps processing for a couple of seconds and then starts sending the contents of all the records to stdout. Then it crashes with a NullPointerException.

$ ./validate --summary --marcFormat ISO --schemaType MARC21 --defaultEncoding UTF-8 koha-2023-11-10.mrc

Nov 13, 2023 5:59:47 PM de.gwdg.metadataqa.marc.cli.ValidatorCli beforeIteration
INFO: schemaType: MARC21
marcVersion: MARC21, MARC21
marcFormat: ISO, Binary (ISO 2709)
dataSource: FILE, from file
limit: -1
offset: -1
MARC files: koha-2023-11-10.mrc
id: null
defaultRecordType: null
fixAlephseq: false
fixAlma: false
alephseq: false
marcxml: false
lineSeparated: false
outputDir: .
trimId: false
ignorableFields: 
allowableRecords: 
ignorableRecords: 
defaultEncoding: UTF-8
alephseqLineType: null
details: true
summary: true
detailsFileName: validation-report.txt
summaryFileName: null
format: simple text
emptyLargeCollectors: false

Nov 13, 2023 5:59:47 PM de.gwdg.metadataqa.marc.cli.ValidatorCli beforeIteration
INFO: details output: ./validation-report.txt
Nov 13, 2023 5:59:47 PM de.gwdg.metadataqa.marc.cli.utils.RecordIterator start
INFO: marcVersion: MARC21, MARC21
Nov 13, 2023 5:59:47 PM de.gwdg.metadataqa.marc.cli.utils.RecordIterator processFile
INFO: processing: koha-2023-11-10.mrc
[main] INFO org.reflections.Reflections - Reflections took 119 ms to scan 1 urls, producing 3 keys and 445 values
Nov 13, 2023 6:00:02 PM de.gwdg.metadataqa.marc.cli.utils.RecordIterator processContent
SEVERE: No record number at 89353, last known ID: PAPER-28569

[....]

999   $c102477$d102477
952   $00$10$2udc$40$50$6504482_S6999$73$9307558$aBC$bBC$cDEP$d2023-11-10$eDiego Lisandro Sonzogni Mazzaro$i91451$l0$o504.4(82) S6999$p91451$r2023-11-10$w2023-11-10$yDEP$�$�$�$�flex$�$�DO$�$�

Nov 13, 2023 5:40:28 PM de.gwdg.metadataqa.marc.cli.utils.RecordIterator processFile
INFO: Finished processing file. Processed 102,125 records.
Nov 13, 2023 5:40:28 PM de.gwdg.metadataqa.marc.cli.ValidatorCli afterIteration
INFO: printCounter
Nov 13, 2023 5:40:28 PM de.gwdg.metadataqa.marc.cli.ValidatorCli afterIteration
INFO: printSummary
Nov 13, 2023 5:40:28 PM de.gwdg.metadataqa.marc.cli.ValidatorCli afterIteration
INFO: printCategoryCounts
Nov 13, 2023 5:40:28 PM de.gwdg.metadataqa.marc.cli.ValidatorCli afterIteration
INFO: printTypeCounts
Nov 13, 2023 5:40:28 PM de.gwdg.metadataqa.marc.cli.ValidatorCli afterIteration
INFO: printTotalCounts
Nov 13, 2023 5:40:28 PM de.gwdg.metadataqa.marc.cli.ValidatorCli afterIteration
INFO: printCollector
Exception in thread "main" java.lang.NullPointerException: file
        at java.base/java.util.Objects.requireNonNull(Objects.java:246)
        at org.apache.commons.io.FileUtils.openOutputStream(FileUtils.java:2444)
        at org.apache.commons.io.FileUtils.writeStringToFile(FileUtils.java:3540)
        at de.gwdg.metadataqa.marc.cli.ValidatorCli.printToFile(ValidatorCli.java:465)
        at de.gwdg.metadataqa.marc.cli.ValidatorCli.print(ValidatorCli.java:459)
        at de.gwdg.metadataqa.marc.cli.ValidatorCli.printCollectorEntry(ValidatorCli.java:445)
        at de.gwdg.metadataqa.marc.cli.ValidatorCli.printCollector(ValidatorCli.java:312)
        at de.gwdg.metadataqa.marc.cli.ValidatorCli.afterIteration(ValidatorCli.java:294)
        at de.gwdg.metadataqa.marc.cli.utils.RecordIterator.start(RecordIterator.java:91)
        at de.gwdg.metadataqa.marc.cli.ValidatorCli.main(ValidatorCli.java:107)

It seems it doesn't consider that a subfield code could be a Unicode character such as a Greek letter (alpha, beta, gamma...):

    <subfield code="o">504.4(82) S6999</subfield>
    <subfield code="p">91451</subfield>
    <subfield code="r">2023-11-10</subfield>
    <subfield code="w">2023-11-10</subfield>
    <subfield code="y">DEP</subfield>
    <subfield code="&#x3B4;">flex</subfield>
    <subfield code="&#x3C3;">DO</subfield>
  </datafield>
</record>
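For reference, MARC 21 restricts subfield codes to lowercase ASCII letters and digits, so codes such as U+03B4 (delta) and U+03C3 (sigma) fall outside the standard range. A minimal Java sketch of such a check follows; the class and method names are hypothetical and are not taken from QA catalogue:

```java
import java.util.List;

// Hypothetical helper, not part of QA catalogue.
class SubfieldCodeCheck {

    /** Returns true if the code is exactly one ASCII lowercase letter or digit. */
    static boolean isStandardCode(String code) {
        if (code == null || code.length() != 1) {
            return false;
        }
        char c = code.charAt(0);
        return (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
    }

    public static void main(String[] args) {
        // Codes from the record fragment above, including delta and sigma.
        for (String code : List.of("o", "p", "y", "\u03B4", "\u03C3")) {
            System.out.printf("U+%04X standard=%b%n", (int) code.charAt(0), isStandardCode(code));
        }
    }
}
```

A check like this only reports the non-standard codes; whether a validator should reject them or just warn is a separate decision.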

pkiraly commented 8 months ago

Dear @pabloab,

thanks for giving QA catalogue a try. Which version of the software do you use? Is it a release, or did you build it from the source code? (I guess it is a released one.) Is this file downloadable from somewhere, or could you upload some records? If you do not want to make it available in the issue, you can send it to me by email: kirunews x gmail. So far I have not worked with records that have Greek characters as subfield codes.

I guess the problem is caused by this line:

FileUtils.writeStringToFile(file, content, Charset.defaultCharset(), true)

Do you know what the default character set on your machine is? I think we should use UTF-8 instead.

And out of curiosity: does UBA stand for Universidad de Buenos Aires?
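A minimal sketch of what writing with an explicit UTF-8 charset could look like, using only java.nio instead of Commons IO. The class and method names are hypothetical and this is not QA catalogue's actual code. It also rejects a null target up front, since the stack trace above fails in Objects.requireNonNull with the message "file":

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of QA catalogue.
class Utf8ReportWriter {

    /** Appends content to the target file as UTF-8, creating the file if it is missing. */
    static void append(Path target, String content) throws IOException {
        if (target == null) {
            // Fail with a descriptive message instead of a bare NullPointerException.
            throw new IllegalArgumentException("No output file configured");
        }
        Files.writeString(target, content, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        Path report = Files.createTempFile("validation-report", ".txt");
        // Subfield codes outside ASCII survive because the charset is explicit.
        append(report, "952 $\u03B4 flex\n");
        append(report, "952 $\u03C3 DO\n");
        System.out.print(Files.readString(report, StandardCharsets.UTF_8));
    }
}
```

With an explicit charset the output no longer depends on the platform default, which matters on machines where the default is not UTF-8.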

pabloab commented 8 months ago

I'm using v0.6.0, installed via wget/unzip. The locale is en_US.UTF-8.

I exported a new mrc with just one record, and got the same error:

$ ./validate --summary --marcFormat ISO --schemaType MARC21 --defaultEncoding UTF-8 /tmp/koha3bis.mrc

Nov 13, 2023 7:36:38 PM de.gwdg.metadataqa.marc.cli.ValidatorCli beforeIteration
INFO: schemaType: MARC21
marcVersion: MARC21, MARC21
marcFormat: ISO, Binary (ISO 2709)
dataSource: FILE, from file
limit: -1
offset: -1
MARC files: /tmp/koha3bis.mrc
id: null
defaultRecordType: null
fixAlephseq: false
fixAlma: false
alephseq: false
marcxml: false
lineSeparated: false
outputDir: .
trimId: false
ignorableFields: 
allowableRecords: 
ignorableRecords: 
defaultEncoding: UTF-8
alephseqLineType: null
details: true
summary: true
detailsFileName: validation-report.txt
summaryFileName: null
format: simple text
emptyLargeCollectors: false

Nov 13, 2023 7:36:38 PM de.gwdg.metadataqa.marc.cli.ValidatorCli beforeIteration
INFO: details output: ./validation-report.txt
Nov 13, 2023 7:36:38 PM de.gwdg.metadataqa.marc.cli.utils.RecordIterator start
INFO: marcVersion: MARC21, MARC21
Nov 13, 2023 7:36:38 PM de.gwdg.metadataqa.marc.cli.utils.RecordIterator processFile
INFO: processing: koha3bis.mrc
[main] INFO org.reflections.Reflections - Reflections took 146 ms to scan 1 urls, producing 3 keys and 445 values
Nov 13, 2023 7:36:38 PM de.gwdg.metadataqa.marc.cli.utils.RecordIterator processFile
INFO: Finished processing file. Processed 1 records.
Nov 13, 2023 7:36:38 PM de.gwdg.metadataqa.marc.cli.ValidatorCli afterIteration
INFO: printCounter
Nov 13, 2023 7:36:38 PM de.gwdg.metadataqa.marc.cli.ValidatorCli afterIteration
INFO: printSummary
Nov 13, 2023 7:36:38 PM de.gwdg.metadataqa.marc.cli.ValidatorCli afterIteration
INFO: printCategoryCounts
Nov 13, 2023 7:36:38 PM de.gwdg.metadataqa.marc.cli.ValidatorCli afterIteration
INFO: printTypeCounts
Nov 13, 2023 7:36:38 PM de.gwdg.metadataqa.marc.cli.ValidatorCli afterIteration
INFO: printTotalCounts
Nov 13, 2023 7:36:38 PM de.gwdg.metadataqa.marc.cli.ValidatorCli afterIteration
INFO: printCollector
Exception in thread "main" java.lang.NullPointerException: file
    at java.base/java.util.Objects.requireNonNull(Objects.java:246)
    at org.apache.commons.io.FileUtils.openOutputStream(FileUtils.java:2444)
    at org.apache.commons.io.FileUtils.writeStringToFile(FileUtils.java:3540)
    at de.gwdg.metadataqa.marc.cli.ValidatorCli.printToFile(ValidatorCli.java:465)
    at de.gwdg.metadataqa.marc.cli.ValidatorCli.print(ValidatorCli.java:459)
    at de.gwdg.metadataqa.marc.cli.ValidatorCli.printCollectorEntry(ValidatorCli.java:445)
    at de.gwdg.metadataqa.marc.cli.ValidatorCli.printCollector(ValidatorCli.java:312)
    at de.gwdg.metadataqa.marc.cli.ValidatorCli.afterIteration(ValidatorCli.java:294)
    at de.gwdg.metadataqa.marc.cli.utils.RecordIterator.start(RecordIterator.java:91)
    at de.gwdg.metadataqa.marc.cli.ValidatorCli.main(ValidatorCli.java:107)
$ cat validation-report.txt

"id","MarcPath","categoryId","typeId","type","message","url","instances","records"
2,931,3,9,undefined field,931,,1,1
7,999,3,9,undefined field,999,,1,1
1,691,3,9,undefined field,691,,1,1
5,976,3,9,undefined field,976,,1,1
6,997,3,9,undefined field,997,,1,1
4,962,3,9,undefined field,962,,1,1
3,942,3,9,undefined field,942,,1,1
$ yaz-marcdump  /tmp/koha3bis.mrc

01222cam a22004217a 4500
001 BIBLO-1
005 20230517170929.0
008 000201m19291951nyua|d|f |||| 00| 0|spa|d
044    $a xxu
080    $a 535 $b W759
100 1  $a Winchell, Alexander Newton $4 aut $e autor
245 10 $a Elements of optical mineralogy : $b an introduction to microscopic petrography
250    $a 4th. ed.
260    $a New York, NY : $b Wiley, $c 1929-1951
300    $a 3 v. : $b il., diagrs., tablas (algunas col.)
541    $c DO $a Dr. Ruben Cucchi $n V2E8
562    $e 3V1, 8V2, 3V3
653 10 $a MINERALOGIA
653 10 $a CRISTALOGRAFIA
653 10 $a MINERALES
653 10 $a MINERALES ISOTROPOS
653 10 $a MINERALOGIA OPTICA
653 10 $a OXIDOS
653 10 $a CARBONATOS
653 10 $a MINERALES OPACOS
653 10 $a MINERALES ANISOTROPOS
653 10 $a MINERALES BIRREFRIGENTES
653 10 $a NITRATOS
653 10 $a SORATOS
653 10 $a SULFATOS
653 10 $a FOSFATOS
691  7 $2 fcen-at $a geologia
931    $a PALEO $b PALEONTOLOGIA
942    $2 udc $n 0
962    $a info:eu-repo/semantics/book $a info:ar-repo/semantics/libro $b info:eu-repo/semantics/publishedVersion
976    $a AEX
997    $a MONOGRAF
999    $c 1 $d 1

Yes, it stands for Universidad de Buenos Aires. Glad you know about us :smile:

pkiraly commented 8 months ago

@pabloab Thanks! I tested it. It does throw an exception with the 0.6.0 release, but this was fixed in 0.7.0, and it also works well with the current development version. So my suggestion is to use 0.7.0 or, if you would like to keep up to date with the latest features, the current source code.

My knowledge about Universidad de Buenos Aires is quite limited, but I know that one of my favorite authors, Jorge Luis Borges was a professor of English at your university before he was appointed as a director of the national library. The teaching activities (such as a seminar about the Saxon language) and teaching subjects (the thoughts of his favorite English writers) appeared in his writings here and there. But it is a good time to learn more about the university itself!

pabloab commented 8 months ago

I first tried to install v0.7.0 by changing the 6 to a 7 in the wget line. Now, after a closer look, I notice that the URL points to another repo, metadata-qa-marc, which has v0.6.0 but not v0.7.0 (in turn, the older version is not present in qa-catalogue).

I tried with v0.7.0 and indeed it doesn't crash. I had other issues that maybe I could file separately:


I also really like Borges (I recently revisited an interview). I was lucky enough to be a professor for some years at that same campus, Puán, which now has its own film (from what I see in the trailer it captures the academic interns quite well).

We also have a copy of H. P. Lovecraft's Necronomicon. Of course, I made sure it has its MARC record :wink:

pkiraly commented 8 months ago

These are a number of different things:

  1. wget: my mistake, I am fixing it.

  2. "type" errors

    no type has been detected. Leader: '01276ca  a22002773a 4500'.

The problem here is that, in order to process the control fields (mainly 008), we need to determine the type of the record from Leader/06 (Type of record) and Leader/07 (Bibliographic level). Only certain combinations of these two characters are valid, and "a " (the one in this record) is not among them. You can add an extra flag to all analyses, --defaultRecordType BOOKS, which sets the default record type if the above error happens.
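To illustrate the lookup, here is a simplified Java sketch, not QA catalogue's actual implementation. It covers only the MARC 21 combinations for books and continuing resources; the class, enum, and method names are hypothetical, and the fallback parameter only mimics the idea behind --defaultRecordType:

```java
import java.util.Optional;

// Hypothetical sketch, not part of QA catalogue; covers two record types only.
class LeaderTypeSketch {

    enum RecordType { BOOKS, CONTINUING_RESOURCES }

    /** Detects the record type from Leader/06 and Leader/07, or returns empty if unknown. */
    static Optional<RecordType> detect(String leader) {
        if (leader == null || leader.length() < 8) {
            return Optional.empty();
        }
        char type = leader.charAt(6);   // Leader/06: Type of record
        char level = leader.charAt(7);  // Leader/07: Bibliographic level
        if ((type == 'a' || type == 't') && "acdm".indexOf(level) >= 0) {
            return Optional.of(RecordType.BOOKS);
        }
        if (type == 'a' && "bis".indexOf(level) >= 0) {
            return Optional.of(RecordType.CONTINUING_RESOURCES);
        }
        return Optional.empty();
    }

    /** Falls back to the given default when no type can be detected. */
    static RecordType detectOrDefault(String leader, RecordType fallback) {
        return detect(leader).orElse(fallback);
    }

    public static void main(String[] args) {
        String leader = "01276ca  a22002773a 4500"; // the leader from the error message
        System.out.println(detect(leader).isPresent());
        System.out.println(detectOrDefault(leader, RecordType.BOOKS));
    }
}
```

For the leader in the error message, Leader/06 is "a" and Leader/07 is a space, so detection fails and only the fallback yields a type.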

  3. logging: we use java.util.logging.Logger, which could be configured to separate different messages. I am thinking about that. In the common-script file, which I mostly use, stderr and stdout are intentionally redirected to the same place - for me it is easier to check everything in one place, but you are right, there might be different expectations.

  4. "A feature request would be to add a space around subfield codes, like the default line mode MARC output format of yaz-marcdump." Could you put an example output? In which file does it happen?
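On the logging point, one possible way to separate messages with java.util.logging is to attach two handlers with level filters, so that warnings and errors go to one stream and everything else to another. This is only a sketch under that assumption; the class and method names are hypothetical and QA catalogue does not necessarily do it this way:

```java
import java.io.OutputStream;
import java.util.logging.Filter;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;
import java.util.logging.StreamHandler;

// Hypothetical sketch, not part of QA catalogue.
class SplitLogging {

    /** Creates a handler that writes matching records to the stream and flushes each one. */
    static StreamHandler handler(OutputStream out, Filter filter) {
        StreamHandler h = new StreamHandler(out, new SimpleFormatter()) {
            @Override
            public synchronized void publish(LogRecord record) {
                super.publish(record);
                flush();
            }
        };
        h.setLevel(Level.ALL); // let the filter decide, not the handler level
        h.setFilter(filter);
        return h;
    }

    /** Routes records below WARNING to one stream and WARNING or above to another. */
    static Logger configure(String name, OutputStream info, OutputStream errors) {
        Logger logger = Logger.getLogger(name);
        logger.setUseParentHandlers(false); // avoid duplicates from the root console handler
        for (Handler existing : logger.getHandlers()) {
            logger.removeHandler(existing);
        }
        int warning = Level.WARNING.intValue();
        logger.addHandler(handler(info, r -> r.getLevel().intValue() < warning));
        logger.addHandler(handler(errors, r -> r.getLevel().intValue() >= warning));
        return logger;
    }

    public static void main(String[] args) {
        Logger logger = configure("demo", System.out, System.err);
        logger.info("Finished processing file.");
        logger.severe("No record number at 89353");
    }
}
```

Redirecting both streams to the same place in a script would still give the combined view, while users who want them apart could redirect stdout and stderr separately.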

Borges: many thanks! I was not aware of that interview. I like another one from the same period a lot: https://www.youtube.com/watch?v=bNxzQSheCkc. It was done in English for a US TV show. Borges said interesting things, for example that Latin America had not produced literature that would be interesting for the rest of the world - this was some years before Márquez's Nobel Prize and the big success of other Latin American writers (Llosa, Cortázar etc.). Does Borges have a sculpture or some other memorial at Puán? The film seems interesting - the situation is quite typical in the academic world.

pabloab commented 8 months ago

AFAIK there is no Borges statue at Puán. No one doubts his talent as a writer, but his political opinions (which he himself says shouldn't be taken into account) are at the opposite extreme from those of the vast majority, especially there.