metafacture / metafacture-core

Core package of the Metafacture tool suite for metadata processing.
https://metafacture.org
Apache License 2.0
69 stars 34 forks source link

`encode-marcxml` should combine split leader fields of `decode-marc21` to create valid marc xml #527

Closed TobiasNx closed 1 week ago

TobiasNx commented 2 months ago

encode-marcxml creates invalid marc data when processing decoded-marc21 with separated leader elements.

It creates invalid marc xml data. See the leader spec here in the schema: https://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd

encode-marcxml should behave in the same way as encode-marc21, if the data is seperated the encoder should combine them.

Status quo: https://test.metafacture.org/playground/?flux=%22https%3A//raw.githubusercontent.com/metafacture/metafacture-core/master/metafacture-runner/src/main/dist/examples/read/marc21/10.marc21%22%0A%7C+open-http%0A%7C+as-lines%0A%7C+decode-marc21%0A%7C+encode-marcxml%0A%7C+print%0A%3B

decode-marc21(emitLeaderAsWhole="false")` -> encode-marcxml => the separated elements are kept:

        <marc:leader>p</marc:leader>
        <marc:leader>a</marc:leader>
        <marc:leader>m</marc:leader>
        <marc:leader> </marc:leader>
        <marc:leader>a</marc:leader>
        <marc:leader> </marc:leader>
        <marc:leader>c</marc:leader>
        <marc:leader> </marc:leader>

leader should be combined.

Originally posted by @TobiasNx in https://github.com/metafacture/metafacture-core/issues/526#issuecomment-2068705357

dr0i commented 2 months ago

Asking @blackwinter @fsteeg @hagbeck @maipet @TobiasNx : this would change the default behaviour - how should we procede:

a) just fix it and mention the fix as we normally do b) note also in https://github.com/metafacture/metafacture-core/blob/master/metafacture-biblio/src/main/java/org/metafacture/biblio/marc21/MarcXmlEncoder.java the change of the behaviour. Somewhere else? c) should we also provide a possibility for backward compatibility, i.e. allowing the creation of the invalid MARC XML? d) do we have to increase the major release version (to 7.0.0) ?

TobiasNx commented 2 months ago

In my opinion this should similar be handled as https://github.com/metafacture/metafacture-core/pull/401. It is a bug, It should not be backwards compatible.

dr0i commented 2 months ago

decode-marc21(emitLeaderAsWhole="false") (where this prarameter set to false is the default) outputs not all necessary elements. We could: a) enhance decode-marc21 to output all the subfields that are needed to compose the leader in encode-marxml. Then enhance encode-marxml to parse these subfields. b) make decode-marc21(emitleaderaswhole="true") as default. And bail out in encode-marxml when leader is encountered as separated subfields with message like e.g. "Please provide one $marcleader</marc:leader>"

I would really like to go with b) which saves us (and the code) much extra work in comparison to a). It's true then that still decode-marc21(emitLeaderAsWhole="false") would not work (well - it would not generate a wrong MARC XML but no MARC XML at all) but as this would no longer be the default I think it's very unlikely that someone explicitly sets the parameter (decode-marc21(emitLeaderAsWhole="false")) - and if so, the Exception message would be clear enough. WDYT?

TobiasNx commented 2 months ago

In my opinion decode-marc21 should not be changed at all. I thought you could reuse something from encode-marc21 for encode-marcxml, because encode-marc21 can handle the "missing" elements. (Probably with some JAVA stuff here)

All leader info should already be available either since they are a) static, b) can be generated by the encoder (as #524) or c) provided by the decode-marc21.

LeaderPos Name Variant
00-04 Record length b) generated by encoder as in #524
05 record status c) leader.status
06 record type c) leader.type
07 bibliographic level c) leader.bibliographicLevel
08 type of control c) leader.typeOfControl
09 character coding scheme c) leader.characterCodingScheme
10 Indicator count a) always 2
11 Subfield code count a) always 2
12-16 Base address of data b) generated by encoder as in #524
17 encoding level c) leader.encodingLevel
18 descriptive cataloging form c) leader.catalogingForm
19 multipart resource record level c) leader.multipartLevel
20 Length of the length-of-field portion a) always 4
21 Length of the starting-character-position portion a) always 5
22 Length of the implementation-defined portion a) always 0
23 Undefined a) always 0

encode-marc21 can create the leader even with the missing leader positions since the missing elements are either generated (see #524) or static see https://github.com/metafacture/metafacture-core/blob/bd719d2ee87834461fd8594a7a1b69db49658e02/metafacture-biblio/src/main/java/org/metafacture/biblio/marc21/Marc21Constants.java#L31-37

dr0i commented 2 months ago

I know this all. My proposal is, if it's true that we always need decode-marc21 to pipe to encode-marcxml we could

go with b) which saves us (and the code) much extra work in comparison to a).

TobiasNx commented 2 months ago

Talked to @dr0i and @maipet the easy solution is instead of generating the Leader Pos 00-04 and Pos 12-16 is just to set them to 0 if the leader is provided incomplete and in separated elements: like this: leader = 0000[leader.status][leader.type][leader.bibliographicLevel][leader.typeOfControl][leader.characterCodingScheme]2200000[leader.encodingLevel][leader.catalogingForm][leader.multipartLevel]4500

That also echos voices from the literature:

„The construction of the element in a MARCXML record has some minor nuances. Typically, a MARCXML element takes, as content, the leader string unchanged from the way it would appear in a traditional MARC record serialization for magnetic tape, including any spaces. However, implementers should recognize, especially when creating MARCXML records from scratch rather than transforming existing MARC records, that a few character positions of the element are ambiguous in a MARCXML serialization. In traditional MARC, the first five characters of the leader string (character positions 0 to 4) are meant to hold the length of the record. They can be used for this purpose in MARCXML, but XML-applications processing MARCXML records do not use this information; they have other (more reliable) options to determine the „length of an XML-serialized MARC record or to move on to the next element in a MARCXML instance. If transformed from a traditional MARC record, these first five characters often are left as is (i.e., representing the length of the record when it was serialized according to the tape-based traditional MARC record format). In other instances, these first five characters of the leader may be set to all zeros in the MARCXML record (the authors’ personal preference). Similar logic is applied to leader character positions 12 to 16, which are designed to hold the base address for a traditionally serialized MARC record on tape. Finally, leader character positions 20 to 23 make up the entry map for the MARC record directory. Since directory entries are not relevant to the MARCXML serialization of the record, these character positions are either left as they would appear in the traditional MARC serialization (always “4500”) or left blank (i.e., four spaces).“

from: XML for Catalogers and Metadata Librarians by Timothy W. Cole and Myung-Ja Han, https://dl.acm.org/doi/10.5555/2531760

and examples from DNB, ALMA and others: https://d-nb.info/1165051362 https://lobid.org/marcxml/990045392320206443 https://wiki.deutsche-digitale-bibliothek.de/pages/viewpage.action?pageId=78513719

dr0i commented 2 months ago

We discussed offline: we have to properly parse the leader, e.g. because the order of the leader's subfields cannot be ensured when e.g. using a fix to build the leader. The easiest way ensuring a valid MARC21 as XML is to reuse the eloborate mechanisms encode-marc21 provides. A wrapper could be implemented as abbrevation for doing this . So, we would make this:

| decode-marc21 | encode-marc21 | decode-marc21(emitleaderaswhole="true") | encode-marcxml

to this:

| decode-marc21 | encode-marcxml

i.e. encode-marcxml does | encode-marc21 | decode-marc21(emitleaderaswhole="true") | encode-marcxml. Never mind performance. We can tackle that if someone complains - and maybe that's not even an issue.

TobiasNx commented 1 week ago

It seems that the initial usecase is not solved. The leader seems to come out wrong: https://test.metafacture.org/playground/?flux=%22https%3A//raw.githubusercontent.com/metafacture/metafacture-core/master/metafacture-runner/src/main/dist/examples/read/marc21/10.marc21%22%0A%7C+open-http%0A%7C+as-lines%0A%7C+decode-marc21%0A%7C+encode-marcxml%0A%7C+print%0A%3B

        <marc:leader>pam a c </marc:leader>
        <marc:leader>pam a c pam a c </marc:leader>
        <marc:leader>pam a c pam a c pam a c </marc:leader>
        <marc:leader>pam a c pam a c pam a c pam a c </marc:leader>
dr0i commented 1 week ago

@blackwinter can you have a look? Unfortunately I've already released this buggy version. note to self: don't forget to always review functional

TobiasNx commented 1 week ago

Also the changes introduced a new bug: https://test.metafacture.org/playground/?flux=%22https%3A//raw.githubusercontent.com/metafacture/metafacture-core/master/metafacture-runner/src/main/dist/examples/read/marc21/10.marc21%22%0A%7C+open-http%0A%7C+as-lines%0A%7C+decode-marc21%28emitLeaderAsWhole%3D%22true%22%29%0A%7C+encode-marcxml%0A%7C+print%0A%3B (this worked before)

        <marc:leader>02602pam a2200529 c 4500</marc:leader>
        <marc:leader>02602pam a2200529 c 450001387pam a2200349 c 4500</marc:leader>
        <marc:leader>02602pam a2200529 c 450001387pam a2200349 c 450001266pam a2200361 c 4500</marc:leader>
        ...
TobiasNx commented 1 week ago

handle-marcxml -> encode-marcxml still works :) https://metafacture.org/playground/?flux=%22https%3A//raw.githubusercontent.com/jorol/processing-marc/master/files/loc.mrc.xml%22%0A%7C+open-http%28accept%3D%22application/xml%22%29%0A%7C+decode-xml%0A%7C+handle-marcxml%0A%7C+encode-marcxml%0A%7C+print%0A%3B

blackwinter commented 1 week ago

The leader seems to come out wrong

Also the changes introduced a new bug

These are actually the same bug. It's fixed in 32beffb.

blackwinter commented 1 week ago

note to self: don't forget to always review functional

Absolutely!

TobiasNx commented 1 week ago

I tested it locally the leader is not repeated and concatenated anymore which is great!!

It seems that the initial usecase is not solved. The leader seems to come out wrong: test.metafacture.org/playground/?flux=%22https%3A//raw.githubusercontent.com/metafacture/metafacture-core/master/metafacture-runner/src/main/dist/examples/read/marc21/10.marc21%22%0A%7C+open-http%0A%7C+as-lines%0A%7C+decode-marc21%0A%7C+encode-marcxml%0A%7C+print%0A%3B

      <marc:leader>pam a c </marc:leader>
      <marc:leader>pam a c pam a c </marc:leader>
      <marc:leader>pam a c pam a c pam a c </marc:leader>
      <marc:leader>pam a c pam a c pam a c pam a c </marc:leader>

now

                <marc:leader>nam a c </marc:leader>
                <marc:leader>nam a c </marc:leader>
                <marc:leader>nam a c </marc:leader>

But this still outputs a wrong leader because it misses the counted elements: https://github.com/metafacture/metafacture-core/issues/524

It should look something like: <marc:leader>02602pam a2200529 c 4500</marc:leader> It should have 24 characters.

blackwinter commented 1 week ago

But this still outputs a wrong leader because it misses the counted elements:

You haven't set ensureCorrectMarc21Xml="true".

TobiasNx commented 1 week ago

Great: with this option this looks good!! <marc:leader>01269nam a2200373 c 4500</marc:leader>

TobiasNx commented 1 week ago

+1