metafacture / metafacture-core

Core package of the Metafacture tool suite for metadata processing.
https://metafacture.org
Apache License 2.0

Allow USMARC character encoding in Marc21Decoder #452

Closed TobiasNx closed 1 year ago

TobiasNx commented 2 years ago

This does not work since decode-marc21 needs UTF-8.

At the moment we can't solve this problem, so we would need an additional module that could transform strings from one encoding to another.

The idea would be something like: stringEncodingSwitcher(in="ASCII" out="UTF-8")
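
A general encoding switcher of the kind proposed here could look roughly like the following Java sketch. The class name `StringEncodingSwitcher` and its `transcode` method are hypothetical and not part of Metafacture; only the JDK's `Charset` API is used. It would only cover encodings the JDK already knows, and as far as I know that does not include MARC-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not an existing Metafacture module.
class StringEncodingSwitcher {

    // Decodes the bytes using the "in" charset and re-encodes them using "out".
    static byte[] transcode(final byte[] input, final Charset in, final Charset out) {
        return new String(input, in).getBytes(out);
    }

    public static void main(final String[] args) {
        // "ä" is one byte (0xE4) in ISO-8859-1 but two bytes in UTF-8.
        final byte[] latin1 = {(byte) 0xE4};
        final byte[] utf8 = transcode(latin1, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(utf8.length);
    }
}
```

Pure ASCII input passes through unchanged, because ASCII is a subset of UTF-8.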

dr0i commented 2 years ago

As @blackwinter noted, it would not help to convert the data into a new character encoding; the problem is rather that the 'characterCodingScheme' (Pos. 09) in the data is not set. The Marc21Decoder checks whether this is set to "a", but in USMARC this is empty. After removing this check from the Marc21Decoder, I got the following output:

leader: status: "c" type: "a" bibliographicLevel: "m" typeOfControl: " " characterCodingScheme: " " [...]
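
For illustration, the check on leader position 09 can be sketched in Java as below. This is not the actual Marc21Decoder code: the class `LeaderCheck`, its method names, and the sample leaders are assumptions made up for this example. It contrasts a strict check (only "a") with a lenient one that also lets a blank through.

```java
// Hypothetical sketch; names and sample data are not taken from Metafacture.
class LeaderCheck {

    static final int LEADER_LENGTH = 24;
    static final int CHARACTER_CODING_SCHEME_POS = 9;

    // Returns the character at leader position 09 of a raw record.
    static char codingScheme(final String record) {
        if (record.length() < LEADER_LENGTH) {
            throw new IllegalArgumentException("record is shorter than the 24-character leader");
        }
        return record.charAt(CHARACTER_CODING_SCHEME_POS);
    }

    // Strict variant: only UCS/Unicode ("a") is accepted.
    static boolean isUnicode(final String record) {
        return codingScheme(record) == 'a';
    }

    // Lenient variant: a blank coding scheme is let through as well.
    static boolean isAcceptedLeniently(final String record) {
        final char scheme = codingScheme(record);
        return scheme == 'a' || scheme == ' ';
    }

    public static void main(final String[] args) {
        final String blankScheme = "00714cam  2200205 a 4500"; // made-up leader, position 09 blank
        System.out.println(isUnicode(blankScheme));
        System.out.println(isAcceptedLeniently(blankScheme));
    }
}
```

The lenient variant only decides whether a record is let through; it does not convert any MARC-8 characters the record may contain.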

So we could:

What shall we do?

dr0i commented 2 years ago

For the MARC-8 character encoding see also https://en.wikipedia.org/wiki/MARC-8. I would vote to first go with the simplest character encoding: USMARC. Also, no option is needed. WDYT?

blackwinter commented 2 years ago

USMARC is not a character set, it's the precursor to MARC 21:

MARC 21 is a result of the combination of the United States and Canadian MARC formats (USMARC and CAN/MARC).

We should probably investigate why Marc21Decoder only supports the Character coding scheme UCS/Unicode (current module initially introduced by 3b24df5, while the UTF-8 check was already present in MarcDecoder - although optional) and what needs to be done in order to add support for MARC-8.

dr0i commented 2 years ago

Just a guess: as MARC-8 does not come as an out-of-the-box library AND there was (still is, besides the USMARC (which is just usascii, no?)) no demand for it, it was ignored by implementers. If we really wanted to support it fully, we may want to predate e.g. https://github.com/xbib/marc.

blackwinter commented 2 years ago

USMARC (which is just usascii, no?)

No, see e.g. here if you're curious ;)

[ETA: But why would we concern ourselves with USMARC anyway? We're talking about the Marc21Decoder, aren't we?]

we may want to predate

"predate"?

e.g. https://github.com/xbib/marc

I have no idea if this would be suitable (and sufficiently compatible).

blackwinter commented 2 years ago

there was [...] no demand for it

Which raises the question of whether there's actual demand now, after (almost exactly) 6 years. Was this request based on a concrete use case or was it just for completeness' sake?

TobiasNx commented 2 years ago

The initial issue was about a character encoding module; the example was a USMARC case. There was no demand for USMARC other than the concrete example, which I picked up from a Catmandu test. I thought it was a general encoding problem. I therefore suggested a general module for character encoding.

@blackwinter hinted in the chat that it is a decode-marc21 problem, and @dr0i changed the issue to USMARC support.

For me this is not urgent.

dr0i commented 2 years ago

"predate"?

Uh, I meant "depredate"

I have no idea if this would be suitable (and sufficiently compatible).

That's what I mean by "depredate": copy 'n' paste code, not reusing the whole thing. But you are right, it would mean some work.

But as @TobiasNx said, it's about reusing Catmandu's tests. My impression is that it could be enough to

allow an empty characterCodingScheme (Pos 09)

and we could at least decode these records. That would not enable handling MARC-8 character sets (completely) but it would be a low-hanging fruit to start with (and, maybe, enough for all time, because there may be no "real" demand).

(BTW, although this issue may be of rather academic interest, I appreciate the excursion.)

blackwinter commented 2 years ago

It wasn't clear to me that this issue referred to being able to run Catmandu tests. That's why it's usually beneficial to state one's intention instead of assuming what the solution should be ;)

That would not enable handling MARC-8 character sets (completely) but it would be a low-hanging fruit to start with

So we would accept MARC-8 without actually supporting it? What would the outcome be? (*) Would it satisfy @TobiasNx's original goal?

(*) It's easy to test: Just modify the input to pretend it was UCS/Unicode. That could also be a generic workaround in this case: match(pattern="\\A(.{9}) ", replacement="$1a")
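
To see what that pattern does, here is a small Java sketch that applies the same regular expression to a record string. The class `CodingSchemePatch` and the sample leader are made up for illustration; only `java.util.regex` is used. The pattern captures the first nine characters and, if the tenth is a blank, replaces it with "a".

```java
import java.util.regex.Pattern;

// Hypothetical demo of the workaround's regex; not part of Metafacture.
class CodingSchemePatch {

    // Same regex as in the workaround: nine arbitrary characters at the start, then a blank.
    private static final Pattern BLANK_SCHEME = Pattern.compile("\\A(.{9}) ");

    // Sets leader position 09 to "a" if it is blank; other records are returned unchanged.
    static String pretendUnicode(final String record) {
        return BLANK_SCHEME.matcher(record).replaceFirst("$1a");
    }

    public static void main(final String[] args) {
        final String record = "00714cam  2200205 a 4500"; // made-up leader, position 09 blank
        System.out.println(pretendUnicode(record));
    }
}
```

As with the Flux version, this only relabels the record. Any non-ASCII MARC-8 bytes in the data would still be interpreted as if they were UTF-8.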

TobiasNx commented 1 year ago

@dr0i and @blackwinter the suggested workaround seems to work. Thanks.