Closed TobiasNx closed 1 week ago
Asking @blackwinter @fsteeg @hagbeck @maipet @TobiasNx : this would change the default behaviour - how should we procede:
a) just fix it and mention the fix as we normally do
b) note also in https://github.com/metafacture/metafacture-core/blob/master/metafacture-biblio/src/main/java/org/metafacture/biblio/marc21/MarcXmlEncoder.java the change of the behaviour. Somewhere else?
c) should we also provide a possibility for backward compatibility, i.e. allowing the creation of the invalid MARC XML?
d) do we have to increase the major release version (to 7.0.0
) ?
In my opinion this should similar be handled as https://github.com/metafacture/metafacture-core/pull/401. It is a bug, It should not be backwards compatible.
decode-marc21(emitLeaderAsWhole="false")
(where this prarameter set to false
is the default) outputs not all necessary elements. We could:
a) enhance decode-marc21
to output all the subfields that are needed to compose the leader
in encode-marxml
.
Then enhance encode-marxml
to parse these subfields.
b) make decode-marc21(emitleaderaswhole="true")
as default. And bail out in encode-marxml
when leader
is encountered as separated subfields with message like e.g. "Please provide one
I would really like to go with b) which saves us (and the code) much extra work in comparison to a).
It's true then that still decode-marc21(emitLeaderAsWhole="false")
would not work (well - it would not generate a wrong MARC XML but no MARC XML at all) but as this would no longer be the default I think it's very unlikely that someone explicitly sets the parameter (decode-marc21(emitLeaderAsWhole="false")
) - and if so, the Exception message would be clear enough.
WDYT?
In my opinion decode-marc21
should not be changed at all.
I thought you could reuse something from encode-marc21
for encode-marcxml
, because encode-marc21
can handle the "missing" elements. (Probably with some JAVA stuff here)
All leader info should already be available either since they are a) static, b) can be generated by the encoder (as #524) or c) provided by the decode-marc21
.
LeaderPos | Name | Variant |
---|---|---|
00-04 | Record length | b) generated by encoder as in #524 |
05 | record status | c) leader.status |
06 | record type | c) leader.type |
07 | bibliographic level | c) leader.bibliographicLevel |
08 | type of control | c) leader.typeOfControl |
09 | character coding scheme | c) leader.characterCodingScheme |
10 | Indicator count | a) always 2 |
11 | Subfield code count | a) always 2 |
12-16 | Base address of data | b) generated by encoder as in #524 |
17 | encoding level | c) leader.encodingLevel |
18 | descriptive cataloging form | c) leader.catalogingForm |
19 | multipart resource record level | c) leader.multipartLevel |
20 | Length of the length-of-field portion | a) always 4 |
21 | Length of the starting-character-position portion | a) always 5 |
22 | Length of the implementation-defined portion | a) always 0 |
23 | Undefined | a) always 0 |
encode-marc21
can create the leader even with the missing leader positions since the missing elements are either generated (see #524) or static see
https://github.com/metafacture/metafacture-core/blob/bd719d2ee87834461fd8594a7a1b69db49658e02/metafacture-biblio/src/main/java/org/metafacture/biblio/marc21/Marc21Constants.java#L31-37
I know this all. My proposal is, if it's true that we always need decode-marc21
to pipe to encode-marcxml
we could
go with b) which saves us (and the code) much extra work in comparison to a).
Talked to @dr0i and @maipet the easy solution is instead of generating the Leader Pos 00-04 and Pos 12-16 is just to set them to 0 if the leader is provided incomplete and in separated elements: like this: leader = 0000[leader.status][leader.type][leader.bibliographicLevel][leader.typeOfControl][leader.characterCodingScheme]2200000[leader.encodingLevel][leader.catalogingForm][leader.multipartLevel]4500
That also echos voices from the literature:
„The construction of the
element in a MARCXML record has some minor nuances. Typically, a MARCXML element takes, as content, the leader string unchanged from the way it would appear in a traditional MARC record serialization for magnetic tape, including any spaces. However, implementers should recognize, especially when creating MARCXML records from scratch rather than transforming existing MARC records, that a few character positions of the element are ambiguous in a MARCXML serialization. In traditional MARC, the first five characters of the leader string (character positions 0 to 4) are meant to hold the length of the record. They can be used for this purpose in MARCXML, but XML-applications processing MARCXML records do not use this information; they have other (more reliable) options to determine the „length of an XML-serialized MARC record or to move on to the next element in a MARCXML instance. If transformed from a traditional MARC record, these first five characters often are left as is (i.e., representing the length of the record when it was serialized according to the tape-based traditional MARC record format). In other instances, these first five characters of the leader may be set to all zeros in the MARCXML record (the authors’ personal preference). Similar logic is applied to leader character positions 12 to 16, which are designed to hold the base address for a traditionally serialized MARC record on tape. Finally, leader character positions 20 to 23 make up the entry map for the MARC record directory. Since directory entries are not relevant to the MARCXML serialization of the record, these character positions are either left as they would appear in the traditional MARC serialization (always “4500”) or left blank (i.e., four spaces).“
from: XML for Catalogers and Metadata Librarians by Timothy W. Cole and Myung-Ja Han, https://dl.acm.org/doi/10.5555/2531760
and examples from DNB, ALMA and others: https://d-nb.info/1165051362 https://lobid.org/marcxml/990045392320206443 https://wiki.deutsche-digitale-bibliothek.de/pages/viewpage.action?pageId=78513719
We discussed offline: we have to properly parse the leader, e.g. because the order of the leader
's subfields cannot be ensured when e.g. using a fix
to build the leader
. The easiest way ensuring a valid MARC21
as XML
is to reuse the eloborate mechanisms encode-marc21
provides. A wrapper could be implemented as abbrevation for doing this .
So, we would make this:
| decode-marc21 | encode-marc21 | decode-marc21(emitleaderaswhole="true") | encode-marcxml
to this:
| decode-marc21 | encode-marcxml
i.e. encode-marcxml
does | encode-marc21 | decode-marc21(emitleaderaswhole="true") | encode-marcxml
.
Never mind performance. We can tackle that if someone complains - and maybe that's not even an issue.
It seems that the initial usecase is not solved. The leader seems to come out wrong: https://test.metafacture.org/playground/?flux=%22https%3A//raw.githubusercontent.com/metafacture/metafacture-core/master/metafacture-runner/src/main/dist/examples/read/marc21/10.marc21%22%0A%7C+open-http%0A%7C+as-lines%0A%7C+decode-marc21%0A%7C+encode-marcxml%0A%7C+print%0A%3B
<marc:leader>pam a c </marc:leader>
<marc:leader>pam a c pam a c </marc:leader>
<marc:leader>pam a c pam a c pam a c </marc:leader>
<marc:leader>pam a c pam a c pam a c pam a c </marc:leader>
@blackwinter can you have a look? Unfortunately I've already released this buggy version. note to self: don't forget to always review functional
Also the changes introduced a new bug: https://test.metafacture.org/playground/?flux=%22https%3A//raw.githubusercontent.com/metafacture/metafacture-core/master/metafacture-runner/src/main/dist/examples/read/marc21/10.marc21%22%0A%7C+open-http%0A%7C+as-lines%0A%7C+decode-marc21%28emitLeaderAsWhole%3D%22true%22%29%0A%7C+encode-marcxml%0A%7C+print%0A%3B (this worked before)
<marc:leader>02602pam a2200529 c 4500</marc:leader>
<marc:leader>02602pam a2200529 c 450001387pam a2200349 c 4500</marc:leader>
<marc:leader>02602pam a2200529 c 450001387pam a2200349 c 450001266pam a2200361 c 4500</marc:leader>
...
handle-marcxml -> encode-marcxml still works :) https://metafacture.org/playground/?flux=%22https%3A//raw.githubusercontent.com/jorol/processing-marc/master/files/loc.mrc.xml%22%0A%7C+open-http%28accept%3D%22application/xml%22%29%0A%7C+decode-xml%0A%7C+handle-marcxml%0A%7C+encode-marcxml%0A%7C+print%0A%3B
The leader seems to come out wrong
Also the changes introduced a new bug
These are actually the same bug. It's fixed in 32beffb.
note to self: don't forget to always review functional
Absolutely!
I tested it locally the leader is not repeated and concatenated anymore which is great!!
It seems that the initial usecase is not solved. The leader seems to come out wrong: test.metafacture.org/playground/?flux=%22https%3A//raw.githubusercontent.com/metafacture/metafacture-core/master/metafacture-runner/src/main/dist/examples/read/marc21/10.marc21%22%0A%7C+open-http%0A%7C+as-lines%0A%7C+decode-marc21%0A%7C+encode-marcxml%0A%7C+print%0A%3B
<marc:leader>pam a c </marc:leader> <marc:leader>pam a c pam a c </marc:leader> <marc:leader>pam a c pam a c pam a c </marc:leader> <marc:leader>pam a c pam a c pam a c pam a c </marc:leader>
now
<marc:leader>nam a c </marc:leader>
<marc:leader>nam a c </marc:leader>
<marc:leader>nam a c </marc:leader>
But this still outputs a wrong leader because it misses the counted elements: https://github.com/metafacture/metafacture-core/issues/524
It should look something like: <marc:leader>02602pam a2200529 c 4500</marc:leader>
It should have 24 characters.
But this still outputs a wrong leader because it misses the counted elements:
You haven't set ensureCorrectMarc21Xml="true"
.
Great: with this option this looks good!! <marc:leader>01269nam a2200373 c 4500</marc:leader>
+1
encode-marcxml creates invalid marc data when processing
decoded-marc21
with separated leader elements.It creates invalid marc xml data. See the leader spec here in the schema: https://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd
encode-marcxml
should behave in the same way asencode-marc21
, if the data is seperated the encoder should combine them.Status quo: https://test.metafacture.org/playground/?flux=%22https%3A//raw.githubusercontent.com/metafacture/metafacture-core/master/metafacture-runner/src/main/dist/examples/read/marc21/10.marc21%22%0A%7C+open-http%0A%7C+as-lines%0A%7C+decode-marc21%0A%7C+encode-marcxml%0A%7C+print%0A%3B
decode-marc21(emitLeaderAsWhole="false")` -> encode-marcxml => the separated elements are kept:
leader should be combined.
Originally posted by @TobiasNx in https://github.com/metafacture/metafacture-core/issues/526#issuecomment-2068705357