Closed gregorbg closed 1 year ago
As we definitely won't have the time to implement this with Aleph/MAB-based data before the migration to Alma, we need to know how this looks in MARC21. To easily do this by myself, #1316 needs to be resolved.
In MARC it looks like:
...
<datafield tag="880" ind1="1" ind2=" ">
<subfield code="6">100-02/Jpan</subfield>
<subfield code="a">黒田, 清揚</subfield>
</datafield>
<datafield tag="245" ind1="1" ind2="0">
<subfield code="6">880-01</subfield>
<subfield code="a">Sora kara mita nihon</subfield>
<subfield code="c">Kuroda, Kiyoaki ; Sasaki, Atsurô kyôcho</subfield>
</datafield>
<datafield tag="880" ind1="1" ind2="0">
<subfield code="6">245-01/Jpan</subfield>
<subfield code="a">空から見た日本</subfield>
<subfield code="9">F:331</subfield>
</datafield>
<datafield tag="880" ind1="1" ind2="0">
<subfield code="6">245-01/Jpan</subfield>
<subfield code="c">黒田清揚 ;佐々木敦朗 共著</subfield>
<subfield code="9">F:359</subfield>
</datafield>
...
I investigated some more, and the ALMA MARC linking of transliterated / romanized content in the "main fields" (100, 245, etc.) to their foreign script "parallel fields" (all occurrences of field 880) is very... challenging :sweat_smile:
As you can see from the above example, (almost?) all MARC datafields can have subfield code 6 to specify a link to field 880. These links are enumerated, and the part after the dash (for example, 880-01) is an auto-increment -- it is not related to ind1 and ind2. To find the matching Japanese script, you have to search for the matching occurrence of datafield 880 that has subfield 6 set to the same "auto increment" as the source field that you're interested in. The part after the slash then tells you which foreign language script you're dealing with.
But that's not all! As it seems from other examples, one 880 datafield can contain multiple subfields with corresponding foreign scripts. The good news is that these are matched by subfield codes that are identical to the original "main field". But essentially, in order to figure out foreign script transliterations, you have to look up and iterate over 880 subfield 6 values, parse them, and apply some logic to find out which bit of information you eventually have to retrieve.
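To make the "parse them" part concrete, here is a small sketch of how such a subfield 6 value could be split up. The function name and the tuple shape are my own invention, not anything from lobid or Limetrans:

```python
# Hypothetical helper: split MARC subfield 6 linkage values such as
# "880-01" (as found in a main field) or "245-01/Jpan" (as found in an
# 880 field) into their parts. The names here are illustrative only.

def parse_linkage(value):
    """Return (tag, occurrence, script) for a subfield 6 value.

    The script part is None when no "/Script" suffix is present.
    """
    link, _, script = value.partition("/")    # "245-01/Jpan" -> ("245-01", "Jpan")
    tag, _, occurrence = link.partition("-")  # "245-01" -> ("245", "01")
    return tag, occurrence, script or None

print(parse_linkage("880-01"))       # ('880', '01', None)
print(parse_linkage("245-01/Jpan"))  # ('245', '01', 'Jpan')
```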
Let's say you're interested in finding out the Japanese script for the title of the book linked above. The title as per the romanised transcription in datafield 245 is Sora kara mita nihon. The steps to figure this out are as follows:
1. Look up the value of subfield 6 in the datafield that contains the original information -- in this case datafield 245.
2. Parse that value, 880-01 -- we need to look for number 01, so the relevant parallel field should be labeled as 245-01/language.
3. Iterate over all occurrences of datafield 880 until you find the one where subfield 6 matches the specification from step (2) above -- conveniently enough for this example, it appears right below the original in this specific XML export.
4. (Optional) Within that field, find the subfield that contains the original information you're interested in -- in this case, we're looking for transliterations of datafield 245 subfield a, so we care about the first occurrence of datafield 880 in the code snippet, as it also contains a subfield with code a.
5. The result: 空から見た日本
<datafield tag="245" ind1="1" ind2="0">
<subfield code="6">880-01</subfield>
<subfield code="a">Sora kara mita nihon</subfield>
<subfield code="c">Kuroda, Kiyoaki ; Sasaki, Atsurô kyôcho</subfield>
</datafield>
<datafield tag="880" ind1="1" ind2="0">
<subfield code="6">245-01/Jpan</subfield>
<subfield code="a">空から見た日本</subfield>
<subfield code="9">F:331</subfield>
</datafield>
<datafield tag="880" ind1="1" ind2="0">
<subfield code="6">245-01/Jpan</subfield>
<subfield code="c">黒田清揚 ;佐々木敦朗 共著</subfield>
<subfield code="9">F:359</subfield>
</datafield>
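The steps above can be sketched in a few lines of Python against exactly this kind of XML. This is a minimal illustration only -- the function and variable names are mine, and it ignores repeated main fields, which a real implementation would have to handle:

```python
# Sketch of the 5-step lookup described above, using only the standard
# library. Assumes a record element containing <datafield>/<subfield>
# structures like the snippet in this issue.
import xml.etree.ElementTree as ET

RECORD = """<record>
<datafield tag="245" ind1="1" ind2="0">
  <subfield code="6">880-01</subfield>
  <subfield code="a">Sora kara mita nihon</subfield>
</datafield>
<datafield tag="880" ind1="1" ind2="0">
  <subfield code="6">245-01/Jpan</subfield>
  <subfield code="a">空から見た日本</subfield>
</datafield>
</record>"""

def find_parallel(root, tag, code):
    """Return the original-script value for (tag, subfield code), or None."""
    # Step 1: read subfield 6 of the main field, e.g. "880-01".
    main = root.find(f"datafield[@tag='{tag}']")
    link6 = main.findtext("subfield[@code='6']")
    # Step 2: the part after the dash is the occurrence number.
    occurrence = link6.split("-")[1]
    wanted = f"{tag}-{occurrence}"                 # e.g. "245-01"
    # Step 3: scan all 880 fields for a matching subfield 6.
    for field in root.iterfind("datafield[@tag='880']"):
        six = field.findtext("subfield[@code='6']", "")
        if six.split("/")[0] == wanted:
            # Step 4: pick the subfield with the same code as the original.
            value = field.findtext(f"subfield[@code='{code}']")
            if value is not None:
                return value                       # Step 5: done.
    return None

root = ET.fromstring(RECORD)
print(find_parallel(root, "245", "a"))  # 空から見た日本
```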
Ironically, transliterations for several subfields of one datafield may or may not be grouped together as one 880 "block". (I assume this has to do with the migration transformations from Aleph MAB2 data, where 245a and 245c are originally different datafields, but 264a+b+c are not.) This is also why step (4) above is marked as "optional".
As an example of such a grouped block, take the publication details in datafield 264:

1. Look up the value of subfield 6 in datafield 264 -- result 880-05.
2. Parse 880-05 (this also makes it pretty clear that those are not ind1 and ind2, as there is no 880 with ind2 == 5) -- we need to look for 880 with subfield 6 set to 264-05.
3. Iterate over the occurrences of 880 that contain a subfield 6 which looks like 264-05/language.
4. Match the subfields one by one: subfield a content Ôsaka matches 大阪, subfield b content Hoikusha matches 保育社, and subfield c content Shôwa 41-nen 8-gatsu 1-nichi matches 昭和41年8月1日.
<datafield tag="264" ind1=" " ind2="1">
<subfield code="6">880-05</subfield>
<subfield code="a">Ôsaka</subfield>
<subfield code="b">Hoikusha</subfield>
<subfield code="c">Shôwa 41-nen 8-gatsu 1-nichi</subfield>
</datafield>
<datafield tag="880" ind1=" " ind2="1">
<subfield code="6">264-05/Jpan</subfield>
<subfield code="a">大阪</subfield>
<subfield code="b">保育社</subfield>
<subfield code="c">昭和41年8月1日</subfield>
<subfield code="9">F:419</subfield>
</datafield>
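For such a grouped block, one conceivable approach is to collect all subfields of the matching 880 field (skipping the control subfields 6 and 9) into a map keyed by subfield code. Again just a sketch with made-up names, not an existing API:

```python
# Read all original-script subfields out of one grouped 880 block, like
# the 264/880 pair above. Subfields 6 (linkage) and 9 (local) are control
# data and are skipped.
import xml.etree.ElementTree as ET

FIELD_880 = ET.fromstring("""<datafield tag="880" ind1=" " ind2="1">
  <subfield code="6">264-05/Jpan</subfield>
  <subfield code="a">大阪</subfield>
  <subfield code="b">保育社</subfield>
  <subfield code="c">昭和41年8月1日</subfield>
  <subfield code="9">F:419</subfield>
</datafield>""")

def parallel_values(field):
    """Map subfield code -> original-script value, skipping codes 6 and 9."""
    return {sf.get("code"): sf.text
            for sf in field.iterfind("subfield")
            if sf.get("code") not in ("6", "9")}

print(parallel_values(FIELD_880))
# {'a': '大阪', 'b': '保育社', 'c': '昭和41年8月1日'}
```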
I haven't found any example, but it is conceivable that one resource contains multiple 880 transliterations of the same "main field" in different scripts (for example, think India with its rich culture of regional languages). As the specification string in subfield 6 can only hold one language tag (like 245-01/Jpan for Japanese), I'd imagine that the different scripts are split into different 880 datafields. So the looping and filtering becomes more complicated as per the first part of my example above. A hypothetical example could look like this:
<datafield tag="100" ind1="1" ind2=" ">
<subfield code="6">880-02</subfield>
<subfield code="a">Tolstoi, Lew Nikolajewitsch</subfield>
<subfield code="4">aut</subfield>
</datafield>
<datafield tag="880" ind1="1" ind2=" ">
<subfield code="6">100-02/Jpan</subfield>
<subfield code="a">トルストイ, レフ・ニコラエヴィチ</subfield>
</datafield>
<datafield tag="880" ind1="1" ind2=" ">
<subfield code="6">100-02/Cyrl</subfield>
<subfield code="a">Толстой, Лев Николаевич</subfield>
</datafield>
However, I admit that this is an artificially constructed example, and I am not at all sure how one would deal with these cases in real-life scenarios.
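If this hypothetical multi-script case does occur, the lookup would presumably gain a script parameter and filter on the part after the slash. A sketch against the constructed Tolstoi record above (the names are mine):

```python
# Script-aware variant of the 880 lookup, for the hypothetical case where
# one main field has parallel fields in several scripts.
import xml.etree.ElementTree as ET

RECORD = ET.fromstring("""<record>
<datafield tag="100" ind1="1" ind2=" ">
  <subfield code="6">880-02</subfield>
  <subfield code="a">Tolstoi, Lew Nikolajewitsch</subfield>
</datafield>
<datafield tag="880" ind1="1" ind2=" ">
  <subfield code="6">100-02/Jpan</subfield>
  <subfield code="a">トルストイ, レフ・ニコラエヴィチ</subfield>
</datafield>
<datafield tag="880" ind1="1" ind2=" ">
  <subfield code="6">100-02/Cyrl</subfield>
  <subfield code="a">Толстой, Лев Николаевич</subfield>
</datafield>
</record>""")

def find_parallel_script(root, tag, code, script):
    """Like the plain 880 lookup, but restricted to one script tag."""
    main = root.find(f"datafield[@tag='{tag}']")
    occurrence = main.findtext("subfield[@code='6']").split("-")[1]
    wanted = f"{tag}-{occurrence}/{script}"        # e.g. "100-02/Cyrl"
    for field in root.iterfind("datafield[@tag='880']"):
        if field.findtext("subfield[@code='6']", "") == wanted:
            return field.findtext(f"subfield[@code='{code}']")
    return None

print(find_parallel_script(RECORD, "100", "a", "Cyrl"))
# Толстой, Лев Николаевич
```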
It seems that this feature is absolutely non-trivial, not only because of the convoluted way the data is represented in ALMA MARC, but also with regard to how to make it available to the user in the Lobid APIs. As such, I'd like to point out that this is not a crucial feature in our use case (so it's no "make or break" situation for our library to use Lobid-Resources in general), but it would still be very nice to have this feature available at some point.
I am open to further discussion and exchange of ideas on how to best integrate this feature, but I absolutely understand if you want to assign a low priority to the implementation. In the meantime, I have implemented a basic matching algorithm for our use cases over at https://github.com/gregorbg/RetroALePH/blob/master/src/main/kotlin/de/uzk/oas/japan/util/BibUtils.kt
P.S. Sorry for the wall of text 😇
Just as a note, here's how we modeled MARC 880 (Alternate Graphic Representation) in Limetrans. We expressly did not match them with their associated fields, though.
That implementation seems like a good first step, in terms of displaying that Alternate Graphic Representations are available. However, it is hard to benefit from this feature if the resulting JSON doesn't tell me what is being transliterated / what the Alternate Graphic Representation refers to.
Sure, there are programs out there that can transliterate Japanese characters to Roman transcription automatically. But in reality, there are different rules for transcribing the same word, and even when sticking with one consistent set of rules there may be subtle differences, like using ô vs. ō for elongated vowel sounds. So in practice, it is very hard to figure out which Japanese script belongs to which "original field" based on the string literal alone.
This is a feature request. I would like to be able to retrieve information from any book or bibliographical resource that is written in a foreign language in the original script / alphabet.
Context: When cataloguing Japanese books, we usually enter the Japanese details in a romanized transcript; for example, さむらい becomes samurai. At the same time, we also fill so-called "parallel fields" in Aleph, which contain the same information in the original script. The "parallel fields" are derived by replacing the first digit N of a MAB2 field number by the N-th letter of the Latin alphabet (where 0=Z), like so:
- 100 becomes A00 (A is letter number 1 of the Latin alphabet)
- 331 becomes C31 (C is letter number 3)

I know that other linguistic scripts can be handled the same way (think Arabic, Korean, Sanskrit, etc.), and in fact all of the "parallel fields" need to contain sub-field 6 (the red thing in Aleph; not sure whether "sub-field" is the correct terminology), which specifies a four-letter code like Jpan for Japanese script.

You can see an example at http://lobid.org/hbz01/HT020582812. We generally provide "parallel fields" with original script information wherever possible; in >95% of the cases this affects A00, A04, ..., B00, B04, ..., C31, C59, D03, D19, and perhaps Z89 or D51 and D55. But there are even more fields which in theory support parallel values, like 370 for alternative titles etc.

I currently don't have a preference for whether to provide one JSON per script or all scripts in one big JSON. Just being able to retrieve the information would be great :partying_face:
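The derivation rule described above (first digit N replaced by the N-th letter, with 0 mapped to Z) is mechanical enough to sketch as a one-liner; the function name here is made up for illustration:

```python
# Derive the Aleph/MAB2 "parallel field" name from a three-digit field
# number, per the rule described in this issue: replace the first digit N
# by the N-th letter of the Latin alphabet, where 0 = Z.

def parallel_field(mab_field):
    """E.g. '331' -> 'C31', '089' -> 'Z89'."""
    first, rest = mab_field[0], mab_field[1:]
    letter = "Z" if first == "0" else chr(ord("A") + int(first) - 1)
    return letter + rest

print(parallel_field("100"))  # A00
print(parallel_field("331"))  # C31
print(parallel_field("089"))  # Z89
```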