metafacture / metafacture-documentation

The central place for documentation about metafacture
http://metafacture.github.io/metafacture-documentation/
Apache License 2.0

Encoding problems in MF-in-5-min playground example #29

Closed acka47 closed 5 months ago

acka47 commented 10 months ago

I just went through https://github.com/metafacture/metafacture-documentation/blob/master/MF-in-5-min.md. In the MARC example, there are problems when I run it in the MF playground:

image

Is there a way to fix this in the flux or is it a Playground problem?

TobiasNx commented 10 months ago

https://raw.githubusercontent.com/metafacture/metafacture-core/master/metafacture-runner/src/main/dist/examples/read/marc21/10.marc21

image

Not sure how to fix this with Metafacture. The encoding problem seems to be present in the source data already. Perhaps @blackwinter or @dr0i can help here.

dr0i commented 10 months ago

Not sure, but it strikes me as odd that this is exactly what the input already shows. So either:

a) the input needs special treatment in the first place. The file command shows:

$ file metafacture-runner/src/main/dist/examples/read/marc21/10.marc21
metafacture-runner/src/main/dist/examples/read/marc21/10.marc21: MARC21 Bibliographic

(but the umlaut is not displayed properly in my UTF-8 terminal), or

b) the input file is not stored correctly (broken characters instead of UTF-8 or extended ASCII), which raises the question of which character set MARC21 should use.

TobiasNx commented 5 months ago

I found some examples without encoding problems:

https://raw.githubusercontent.com/metafacture/metafacture-tutorial/main/data/sample.marc21

https://metafacture.org/playground/?flux=%22https%3A//raw.githubusercontent.com/metafacture/metafacture-tutorial/main/data/sample.marc21%22%0A%7C+open-http%0A%7C+as-lines%0A%7C+decode-marc21%0A%7C+fix%28transformationFile%29%0A%7C+encode-csv%0A%7C+print%0A%3B&transformation=set_array%28%22title%22%29%0Acopy_field%28%22245%3F%3F.%3F%22%2C%22title.%24append%22%29%0Ajoin_field%28%22title%22%29%0Acopy_field%28%22001%22%2C%22id%22%29%0Aretain%28%22title%22%2C+%22id%22%29

@acka47 should I use these instead?

Phu2 commented 5 months ago

The input file seems to be MARC-8 encoded. From the spec:

In a MARC-8-encoded MARC 21 record, Leader character position 9 (Character coding scheme) must contain a space character (20(hex)).

Conversion from MARC-8 to Unicode can be done with tools like yaz-marcdump or MarcEdit.
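
The Leader/09 rule quoted above can be checked before a file is fed into a workflow. The following is a minimal sketch, not part of Metafacture: the function name `coding_scheme` and the two sample leaders are made up for illustration, and it assumes the record is available as raw bytes starting with the 24-byte leader. It relies only on the MARC 21 convention that position 9 is a space for MARC-8 and `a` for UCS/Unicode.

```python
LEADER_LENGTH = 24  # every MARC 21 record starts with a 24-byte leader


def coding_scheme(record: bytes) -> str:
    """Report the character coding scheme declared in Leader/09.

    Hypothetical helper for illustration: it only reads the flag and
    does not verify that the record body really uses that encoding.
    """
    if len(record) < LEADER_LENGTH:
        raise ValueError("record is shorter than a MARC 21 leader")
    flag = record[9:10]  # Leader character position 9 (0-based)
    if flag == b" ":
        return "MARC-8"
    if flag == b"a":
        return "UCS/Unicode"
    return "unknown"


# Made-up leaders that differ only in position 9.
marc8_leader = b"00714cam  2200205 a 4500"
unicode_leader = b"00714cam a2200205 a 4500"

print(coding_scheme(marc8_leader))
print(coding_scheme(unicode_leader))
```

Note that the flag only states what the record claims about itself. A record can carry a space in position 9 and still contain bytes in another encoding, so a check like this is a first hint rather than proof.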

TobiasNx commented 5 months ago

The example is quite old. At least MF does not complain that it is not UTF-8. Other MARC-8 examples, like the one in PyMarc, throw errors when transformed with MF and need a workaround.