metafacture / metafacture-core

Core package of the Metafacture tool suite for metadata processing.
https://metafacture.org
Apache License 2.0

Incompatible StreamReceiver output by marc modules due to inconsistent leader handling #454

Closed TobiasNx closed 2 months ago

TobiasNx commented 2 years ago

While the documentation of encode-marc21 states that it is compatible with the output of handle-marc-xml and decode-marc21, this is not the case, because decode-marc21, handle-marc-xml, encode-marc21 and encode-marcxml handle the leader inconsistently.

E.g., we cannot transform marc21 -> marcxml or the other way around. Even marc21 -> marc21 is not so easy. See here. This produces the same error as when processing marc-xml.

Functional review: @TobiasNx Code review: @blackwinter


Behaviour of Flux-Modules:

decode-marc21 splits the leader into sub-fields named after the function of each position: See here

---
leader:
  status: "p"
  type: "a"
  bibliographicLevel: "m"
  typeOfControl: " "
  characterCodingScheme: "a"
  encodingLevel: " "
  catalogingForm: "c"
  multipartLevel: " "
"001": "946638705"
"003": "DE-101"
"005": "20070429135622.0"
"007": "tu"
"008": "960123s2004    gw |||||r|||| 00||||eng  "
"015  ":
  a: "05,A03,2104"
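
The positional split shown in this output can be sketched as follows. This is a minimal illustration, not the code of Marc21Decoder: the class name LeaderSplitter is hypothetical, the sub-field names are taken from the output above, and the positions follow the MARC 21 leader layout (05 to 09 and 17 to 19).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Metafacture.
final class LeaderSplitter {

    // Splits a 24-character MARC 21 leader into the eight named sub-fields
    // that appear in the decode-marc21 output quoted in this issue.
    static Map<String, String> split(final String leader) {
        if (leader == null || leader.length() != 24) {
            throw new IllegalArgumentException("leader must be exactly 24 characters");
        }
        final Map<String, String> fields = new LinkedHashMap<>();
        fields.put("status", leader.substring(5, 6));
        fields.put("type", leader.substring(6, 7));
        fields.put("bibliographicLevel", leader.substring(7, 8));
        fields.put("typeOfControl", leader.substring(8, 9));
        fields.put("characterCodingScheme", leader.substring(9, 10));
        fields.put("encodingLevel", leader.substring(17, 18));
        fields.put("catalogingForm", leader.substring(18, 19));
        fields.put("multipartLevel", leader.substring(19, 20));
        return fields;
    }

    public static void main(final String[] args) {
        // Leader of record 946638705 as quoted in this issue.
        split("02602pam a2200529 c 4500")
            .forEach((name, value) -> System.out.println(name + ": \"" + value + "\""));
    }
}
```

The remaining positions (record length, indicator and subfield code counts, base address, entry map) are not part of the sub-field output, which is why the split form cannot simply be concatenated back into a complete leader.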

With the option emitleaderaswhole="true", the leader is emitted as a top-level entity containing a single sub-field that holds the whole leader: See here

---
leader:
  leader: "02602pam a2200529 c 4500"
"001": "946638705"
"003": "DE-101"
"005": "20070429135622.0"
"007": "tu"
"008": "960123s2004    gw |||||r|||| 00||||eng  "
"015  ":
  a: "05,A03,2104"
  z: "96,N47,0454"
  "2": "dnb"
"0167 ":

handle-marc-xml keeps the leader as a single top-level field: See here

---
type: "Bibliographic"
leader: "00000naa a2200000uc 4500"
"001": "1106253078"
"003": "DE-101"
"005": "20171202230117.0"
"007": "cr||||||||||||"
"008": "160712s2016    gw |||||o|||| 00||||eng  "
"0167 ":
  "2": "DE-101"
  a: "1106253078"
"022  ":

encode-marcxml can handle the result of decode-marc21(emitleaderaswhole="true"), but not the default output, in which the leader is split into multiple fields. In that case each sub-field is written as a separate leader element.

Then the result looks like this:

    <marc:record>
        <marc:leader>p</marc:leader>
        <marc:leader>a</marc:leader>
        <marc:leader>m</marc:leader>
        <marc:leader> </marc:leader>
        <marc:leader>a</marc:leader>
        <marc:leader> </marc:leader>
        <marc:leader>c</marc:leader>
        <marc:leader> </marc:leader>

It seems that nothing checks that a record contains only one leader.
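
One conceivable way to end up with a single leader element would be to join the sub-fields back into one 24-character string before writing it. The sketch below only illustrates that idea; LeaderJoiner is a hypothetical class, and the placeholder values for the positions that the sub-fields do not carry (record length, counts, base address, entry map) are assumptions, not what encode-marcxml does.

```java
import java.util.Map;

// Hypothetical helper, not part of Metafacture.
final class LeaderJoiner {

    // Builds one 24-character leader from the eight single-character sub-fields.
    // Positions without a sub-field are filled with placeholders (assumption).
    static String join(final Map<String, String> fields) {
        final StringBuilder leader = new StringBuilder(24);
        leader.append("00000");                               // 00-04 record length (placeholder)
        leader.append(single(fields, "status"));               // 05
        leader.append(single(fields, "type"));                 // 06
        leader.append(single(fields, "bibliographicLevel"));   // 07
        leader.append(single(fields, "typeOfControl"));        // 08
        leader.append(single(fields, "characterCodingScheme")); // 09
        leader.append("22");                                   // 10-11 indicator and subfield code count
        leader.append("00000");                               // 12-16 base address (placeholder)
        leader.append(single(fields, "encodingLevel"));        // 17
        leader.append(single(fields, "catalogingForm"));       // 18
        leader.append(single(fields, "multipartLevel"));       // 19
        leader.append("4500");                                 // 20-23 entry map
        return leader.toString();
    }

    // Rejects missing or multi-character values instead of silently writing them.
    private static char single(final Map<String, String> fields, final String name) {
        final String value = fields.get(name);
        if (value == null || value.length() != 1) {
            throw new IllegalArgumentException("expected exactly one character for " + name);
        }
        return value.charAt(0);
    }

    public static void main(final String[] args) {
        System.out.println(join(Map.of(
            "status", "p", "type", "a", "bibliographicLevel", "m", "typeOfControl", " ",
            "characterCodingScheme", "a", "encodingLevel", " ", "catalogingForm", "c",
            "multipartLevel", " ")));
    }
}
```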


encode-marc21 cannot handle data from handle-marc-xml: see

Error is:

org.metafacture.framework.FormatException: invalid tag format for reference field
    at org.metafacture.biblio.iso2709.RecordBuilder.checkValidReferenceFieldTag (RecordBuilder.java:260)
        org.metafacture.biblio.iso2709.RecordBuilder.appendReferenceField (RecordBuilder.java:244)
        org.metafacture.biblio.iso2709.RecordBuilder.appendReferenceField (RecordBuilder.java:224)
        org.metafacture.biblio.marc21.Marc21Encoder.processTopLevelLiteral (Marc21Encoder.java:254)
        org.metafacture.biblio.marc21.Marc21Encoder.literal (Marc21Encoder.java:186)
        org.metafacture.biblio.marc21.MarcXmlHandler.endElement (MarcXmlHandler.java:135)

Nor can it handle data from decode-marc21(emitleaderaswhole="true"): see

The error is:

org.metafacture.framework.FormatException: literal must only contain a single character:leader
    at org.metafacture.biblio.marc21.Marc21Encoder.processLiteralInLeader (Marc21Encoder.java:195)
        org.metafacture.biblio.marc21.Marc21Encoder.literal (Marc21Encoder.java:183)
        org.metafacture.biblio.marc21.Marc21Decoder.emitLeader (Marc21Decoder.java:254)
        org.metafacture.biblio.marc21.Marc21Decoder.process (Marc21Decoder.java:221)
        org.metafacture.biblio.marc21.Marc21Decoder.process (Marc21Decoder.java:136)

So besides the inconsistencies, it is difficult to transform marc21 -> marcxml or the other way around. Even marc21 -> marc21 is not so easy. See here. This produces the same error as when processing marc-xml.

TobiasNx commented 2 years ago

I would suggest the following changes:

blackwinter commented 2 years ago

Just a minor observation:

add the option emitleaderasentity="true"

Wouldn't it make more sense to use the same option emitleaderaswhole (with default true)?

TobiasNx commented 2 years ago

Just a minor observation:

add the option emitleaderasentity="true"

Wouldn't it make more sense to use the same option emitleaderaswhole (with default true)?

Or like that.

TobiasNx commented 2 years ago

@dr0i it would be nice if the handle-marc-xml module supported the emitleaderaswhole= option soon. It would help to make the almaFix more readable, especially the handling of leader fields for the facets, and there would be less fuss with variables: https://github.com/hbz/lobid-resources/blob/4172bfef38c45e422cff14cfac56c6d81e7b8b67/src/main/resources/alma/alma.fix#L1-L11

TobiasNx commented 1 year ago

I found this again. We cannot just transform marc21 -> marcxml or the other way around marcxml -> marc21 due to the inconsistent leader handling. We additionally need to transform the data with a fix. But even this does not work:

https://metafacture.org/playground/?flux=%22https%3A//raw.githubusercontent.com/metafacture/metafacture-core/master/metafacture-runner/src/main/dist/examples/read/marc21/10.marc21%22%0A%7C+open-http%0A%7C+as-lines%0A%7C+decode-marc21%28emitleaderaswhole%3D%22true%22%29%0A%7C+fix%28transformationFile%29%0A%7C+encode-marc21%0A%7C+print%0A%3B&transformation=move_field%28%22leader.leader%22%2C%22@leader%22%29%0Amove_field%28%22@leader%22%2C%22leader%22%29

TobiasNx commented 1 year ago

Also, the documentation is wrong here: https://github.com/metafacture/metafacture-core/blob/2cec78959d2c84ba6e408402680413098d9010eb/metafacture-biblio/src/main/java/org/metafacture/biblio/marc21/Marc21Encoder.java#L56-L57

TobiasNx commented 3 months ago

@dr0i as we discussed with I.W., a marc21 -> marcxml transformation is needed.

TobiasNx commented 2 months ago

Found two workarounds:

decode-marc21(emitLeaderAsWhole="true") -> encode-marc21: See here.

handle-marcXml -> encode-marc21: See here.

dr0i commented 2 months ago

I will try to condense the issues. I will give the scenarios references (a, b, c ...) so we can easily refer to them:

a) marc21 -> marc21 works (just do | decode-marc21(emitLeaderAsWhole="false"))
b) marc21 -> marcxml works (just do | decode-marc21(emitleaderaswhole="true"))
c) handle-marcxml -> encode-marc21 doesn't work

For c) we have to think about a solution: the Marc21Encoder expects (in method processLiteralInLeader) that a leader consists of single literals, each holding exactly one character (i.e. a leader entity with many values). A leader therefore cannot be passed as one String. See https://github.com/metafacture/metafacture-core/commit/6d04d6976c98eb7173c773b2f4ddca3b7e0037d3 for the commit that introduced this and for the motivation behind it (which I don't understand: we can see that removing the parsing/producing of the leader as one String causes problems).

We could solve c) by:
ca) "would be nice if the handle-marc-xml-module would support the emitleaderaswhole= option soon". We would allow emitleaderaswhole=false, which would emit the leader positions as single-character literals, or
cb) encode-marc21 would be able (again) to cope with a single leader String.

I think cb) would be best because, as a side effect, we wouldn't need to specify emitleaderaswhole=false in a), as encode-marc21 would also cope with emitleaderaswhole=true.
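
To make cb) more concrete, here is a minimal sketch of a leader receiver that accepts both shapes: one 24-character literal or several single-character literals. LeaderAccumulator, its method names and the placeholder template are all hypothetical; this is not the Marc21Encoder code, only an illustration of the idea under the assumption that the encoder keeps computing the length-related positions itself.

```java
import java.util.Map;

// Hypothetical sketch of option cb), not the actual Marc21Encoder.
final class LeaderAccumulator {

    private static final int LEADER_LENGTH = 24;

    // Leader positions that are taken from the input; names as emitted by decode-marc21.
    private static final Map<String, Integer> POSITIONS = Map.of(
        "status", 5, "type", 6, "bibliographicLevel", 7, "typeOfControl", 8,
        "characterCodingScheme", 9, "encodingLevel", 17, "catalogingForm", 18,
        "multipartLevel", 19);

    // Template with placeholders for the positions computed from the record (assumption).
    private final char[] leader = "00000     2200000   4500".toCharArray();

    void literal(final String name, final String value) {
        if (value.length() == LEADER_LENGTH) {
            // Whole leader: copy only the positions that are not computed.
            for (final int position : POSITIONS.values()) {
                leader[position] = value.charAt(position);
            }
        } else if (value.length() == 1 && POSITIONS.containsKey(name)) {
            // Single-character literal: set the position that belongs to the name.
            leader[POSITIONS.get(name)] = value.charAt(0);
        } else {
            throw new IllegalArgumentException("unsupported leader literal: " + name);
        }
    }

    String build() {
        return new String(leader);
    }

    public static void main(final String[] args) {
        final LeaderAccumulator whole = new LeaderAccumulator();
        whole.literal("leader", "02602pam a2200529 c 4500");

        final LeaderAccumulator parts = new LeaderAccumulator();
        parts.literal("status", "p");
        parts.literal("type", "a");
        parts.literal("bibliographicLevel", "m");
        parts.literal("catalogingForm", "c");
        parts.literal("characterCodingScheme", "a");

        System.out.println(whole.build());
        System.out.println(parts.build());
    }
}
```

With the leader quoted earlier in this issue, both input shapes yield the same 24-character result, which is the property that would make scenarios a) and c) behave alike.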

TobiasNx commented 2 months ago

I think we are touching on the reasons for the change in leader handling here: #524. Changing records when transforming marc21 -> marc21 (XML and binary) also requires changes to the leader, since parts of the leader are generated from the number of characters, indicators, elements and subfields. Otherwise the leader and the record are not valid.
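
As an illustration of why parts of the leader depend on the record content, the sketch below computes the two length-related leader values of an ISO 2709 record: the base address of data (leader positions 12-16) and the record length (positions 00-04). LeaderLengths is a hypothetical helper, and it assumes the fields are already serialized (indicators, subfield delimiters and data, without the field terminator) and encoded as UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Metafacture.
final class LeaderLengths {

    private static final int LEADER_LENGTH = 24;
    private static final int DIRECTORY_ENTRY_LENGTH = 12; // tag (3) + field length (4) + start (5)

    // Leader, one directory entry per field, and the field terminator closing the directory.
    static int baseAddress(final int fieldCount) {
        return LEADER_LENGTH + DIRECTORY_ENTRY_LENGTH * fieldCount + 1;
    }

    // Base address plus every field with its terminator, plus the record terminator.
    static int recordLength(final List<String> serializedFields) {
        int length = baseAddress(serializedFields.size());
        for (final String field : serializedFields) {
            length += field.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return length + 1;
    }

    // Both values are written into the leader as five zero-padded digits.
    static String fiveDigits(final int value) {
        return String.format(Locale.ROOT, "%05d", value);
    }

    public static void main(final String[] args) {
        // Two control fields from the record quoted in this issue: 001 and 003.
        final List<String> fields = List.of("946638705", "DE-101");
        System.out.println("base address:  " + fiveDigits(baseAddress(fields.size())));
        System.out.println("record length: " + fiveDigits(recordLength(fields)));
    }
}
```

Adding, removing or changing any field shifts both values, so a leader copied unchanged from the input would no longer describe the transformed record.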

dr0i commented 2 months ago

Note: went with cb) as fix.