hbz / lobid-resources

Transformation, web frontend, and API for the hbz catalog as LOD
http://lobid.org/resources
Eclipse Public License 2.0

Support for multi-language alphabets/symbols in all applicable fields #1325

Closed gregorbg closed 1 year ago

gregorbg commented 2 years ago

This is a feature request. I would like to be able to retrieve information from any book or bibliographical resource that is written in a foreign language in the original script / alphabet.

Context: When cataloguing Japanese books, we usually enter the Japanese details in a romanized transcription; for example, さむらい becomes samurai. At the same time, we also fill in so-called "parallel fields" in Aleph that contain the same information in the original script.

The "parallel fields" are derived by replacing the first digit N of a MAB2 field number with the N-th letter of the Latin alphabet (where 0 = Z); for example, MAB2 field 331 becomes C31.

I know that other scripts can be implemented the same way (think Arabic, Korean, Sanskrit, etc.), and in fact all of the "parallel fields" need to contain sub-field 6 (the red thing in Aleph; not sure whether "sub-field" is the correct terminology), which specifies a four-letter code like Jpan for Japanese script.

You can see an example at http://lobid.org/hbz01/HT020582812. We generally provide "parallel fields" with original script information wherever possible, in >95% of the cases this affects A00, A04, ..., B00, B04, ..., C31, C59, D03, D19 and perhaps Z89 or D51 and D55. But there are even more fields which do in theory support parallel values, like 370 for alternative titles etc.

I currently don't have a preference for whether to provide one JSON per script or all scripts in one big JSON. Just being able to retrieve the information would be great :partying_face:

acka47 commented 2 years ago

As we definitely won't have the time to implement this with Aleph/MAB-based data before the migration to Alma, we need to know how this looks in MARC21. To easily do this by myself, #1316 needs to be resolved.

dr0i commented 2 years ago

In MARC it looks like:

...
<datafield tag="880" ind1="1" ind2=" ">
  <subfield code="6">100-02/Jpan</subfield>
  <subfield code="a">黒田, 清揚</subfield>
</datafield>
<datafield tag="245" ind1="1" ind2="0">
  <subfield code="6">880-01</subfield>
  <subfield code="a">Sora kara mita nihon</subfield>
  <subfield code="c">Kuroda, Kiyoaki ; Sasaki, Atsurô kyôcho</subfield>
</datafield>
<datafield tag="880" ind1="1" ind2="0">
  <subfield code="6">245-01/Jpan</subfield>
  <subfield code="a">空から見た日本</subfield>
  <subfield code="9">F:331</subfield>
</datafield>
<datafield tag="880" ind1="1" ind2="0">
  <subfield code="6">245-01/Jpan</subfield>
  <subfield code="c">黒田清揚 ;佐々木敦朗 共著</subfield>
  <subfield code="9">F:359</subfield>
</datafield>
...
gregorbg commented 1 year ago

I investigated some more, and the Alma MARC linking of transliterated / romanized content in the "main fields" (100, 245, etc.) to their foreign-script "parallel fields" (all occurrences of field 880) is very... challenging :sweat_smile:

As you can see from the above example, (almost?) all MARC datafields can have subfield code 6 to specify a link to field 880. These links are enumerated, and the part after the dash (for example, the 01 in 880-01) is an auto-increment -- it is not related to ind1 and ind2. To find the matching Japanese script, you have to search for the occurrence of datafield 880 that has subfield 6 set to the same "auto increment" as the source field you're interested in. The part after the slash then tells you which foreign-language script you're dealing with.

But that's not all! As it seems from other examples, one 880 datafield can contain multiple subfields with corresponding foreign scripts. The good news is that these are matched by subfield codes that are identical to the original "main field". But essentially, in order to figure out foreign script transliterations, you have to look up and iterate over 880 subfield 6 values, parse them, and apply some logic to find out which bit of information you eventually have to retrieve.

Example

Let's say you're interested in finding out the Japanese script for the title of the book linked above. The title as per the romanised transcription in datafield 245 is Sora kara mita nihon. The steps to figure this out are as follows:

  1. Look up subfield 6 in the datafield that contains the original information -- in this case datafield 245.
  2. Parse the enumeration in 880-01 -- we need to look for number 01, so the relevant parallel field should be labeled as 245-01/language.
  3. Loop through all datafields with tag 880 until you find the one where subfield 6 matches the specification from step (2) above -- conveniently enough for this example, it appears right below the original in this specific XML export.
  4. (*This is not necessary in all cases, see below.) If there are multiple matches, filter them again for the relevant subfield that contains the original information you're interested in -- in this case, we're looking for transliterations of datafield 245 subfield a, so we care about the first occurrence of datafield 880 in the code snippet, as it also contains a subfield with code a.
  5. Retrieve the information from the matching subfield per step (4) above -- this is the end result that we're interested in: 空から見た日本
<datafield tag="245" ind1="1" ind2="0">
  <subfield code="6">880-01</subfield>
  <subfield code="a">Sora kara mita nihon</subfield>
  <subfield code="c">Kuroda, Kiyoaki ; Sasaki, Atsurô kyôcho</subfield>
</datafield>
<datafield tag="880" ind1="1" ind2="0">
  <subfield code="6">245-01/Jpan</subfield>
  <subfield code="a">空から見た日本</subfield>
  <subfield code="9">F:331</subfield>
</datafield>
<datafield tag="880" ind1="1" ind2="0">
  <subfield code="6">245-01/Jpan</subfield>
  <subfield code="c">黒田清揚 ;佐々木敦朗 共著</subfield>
  <subfield code="9">F:359</subfield>
</datafield>
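The five lookup steps above can be sketched in a few lines of code. This is purely illustrative (not lobid or Alma code); `alternate_script` is an invented helper name, operating on MARCXML shaped like the snippet above:

```python
# Hypothetical sketch of the 880 lookup steps; not an existing API.
import xml.etree.ElementTree as ET

RECORD = ET.fromstring("""
<record>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="6">880-01</subfield>
    <subfield code="a">Sora kara mita nihon</subfield>
  </datafield>
  <datafield tag="880" ind1="1" ind2="0">
    <subfield code="6">245-01/Jpan</subfield>
    <subfield code="a">空から見た日本</subfield>
  </datafield>
  <datafield tag="880" ind1="1" ind2="0">
    <subfield code="6">245-01/Jpan</subfield>
    <subfield code="c">黒田清揚 ;佐々木敦朗 共著</subfield>
  </datafield>
</record>
""")

def alternate_script(record, tag, code):
    """Resolve the 880 value for subfield `code` of datafield `tag`."""
    for field in record.iterfind(f"datafield[@tag='{tag}']"):
        # Step 1: the subfield-6 link in the main field, e.g. "880-01".
        link = field.findtext("subfield[@code='6']")
        if not link or not link.startswith("880-"):
            continue
        # Step 2: the "auto increment" part after the dash, e.g. "01".
        seq = link.split("-", 1)[1]
        # Step 3: scan all 880 fields for the back-link "245-01/...".
        for f880 in record.iterfind("datafield[@tag='880']"):
            backlink = f880.findtext("subfield[@code='6']") or ""
            if not backlink.startswith(f"{tag}-{seq}"):
                continue
            # Steps 4+5: only the matching 880 block that actually
            # carries the wanted subfield code is relevant.
            value = f880.findtext(f"subfield[@code='{code}']")
            if value is not None:
                return value
    return None

print(alternate_script(RECORD, "245", "a"))  # 空から見た日本
```

Note that the same call with subfield c has to skip the first matching 880 block and pick the second one, which is exactly the "optional" step (4).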

Ironically, transliterations for several subfields of one datafield may or may not be grouped together as one 880 "block". (I assume this has to do with the migration transformations from Aleph MAB2 data, where 245a and 245c were originally different datafields but 264a+b+c were not.) This is also why step (4) above is marked as "optional".

  1. Look up subfield 6 in datafield 264 -- the result is 880-05.
  2. Parse the enumeration in 880-05 (this also makes it pretty clear that these are not ind1 and ind2, as there is no 880 with ind2 == 5) -- we need to look for an 880 with subfield 6 set to 264-05.
  3. Loop through all datafields with tag 880 that contain a subfield 6 which looks like 264-05/language.
  4. In this case, there is only one match, and it contains all transliterations grouped together nicely -- subfield a content Ôsaka matches 大阪, subfield b content Hoikusha matches 保育社, and subfield c content Shôwa 41-nen 8-gatsu 1-nichi matches 昭和41年8月1日.
<datafield tag="264" ind1=" " ind2="1">
  <subfield code="6">880-05</subfield>
  <subfield code="a">Ôsaka</subfield>
  <subfield code="b">Hoikusha</subfield>
  <subfield code="c">Shôwa 41-nen 8-gatsu 1-nichi</subfield>
</datafield>
<datafield tag="880" ind1=" " ind2="1">
  <subfield code="6">264-05/Jpan</subfield>
  <subfield code="a">大阪</subfield>
  <subfield code="b">保育社</subfield>
  <subfield code="c">昭和41年8月1日</subfield>
  <subfield code="9">F:419</subfield>
</datafield>
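In this grouped case, a single lookup yields all the pairings at once. Again a hypothetical sketch (`paired_subfields` is my own name, not an existing API), against the 264 record above:

```python
# Hypothetical sketch: pairing grouped 880 subfields with the main field.
import xml.etree.ElementTree as ET

RECORD = ET.fromstring("""
<record>
  <datafield tag="264" ind1=" " ind2="1">
    <subfield code="6">880-05</subfield>
    <subfield code="a">Ôsaka</subfield>
    <subfield code="b">Hoikusha</subfield>
    <subfield code="c">Shôwa 41-nen 8-gatsu 1-nichi</subfield>
  </datafield>
  <datafield tag="880" ind1=" " ind2="1">
    <subfield code="6">264-05/Jpan</subfield>
    <subfield code="a">大阪</subfield>
    <subfield code="b">保育社</subfield>
    <subfield code="c">昭和41年8月1日</subfield>
    <subfield code="9">F:419</subfield>
  </datafield>
</record>
""")

def paired_subfields(record, tag):
    """Pair each romanized subfield of `tag` with its 880 counterpart."""
    main = record.find(f"datafield[@tag='{tag}']")
    # The part after the dash in the subfield-6 link, e.g. "05".
    seq = main.findtext("subfield[@code='6']").split("-", 1)[1]
    pairs = {}
    for f880 in record.iterfind("datafield[@tag='880']"):
        backlink = f880.findtext("subfield[@code='6']") or ""
        if not backlink.startswith(f"{tag}-{seq}"):
            continue
        for sub in f880.iterfind("subfield"):
            code = sub.get("code")
            if code in ("6", "9"):  # skip linkage ($6) and local ($9) subfields
                continue
            pairs[code] = (main.findtext(f"subfield[@code='{code}']"), sub.text)
    return pairs

print(paired_subfields(RECORD, "264")["a"])  # ('Ôsaka', '大阪')
```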

Multiple foreign scripts

I haven't found any example, but it is conceivable that one resource contains multiple 880 transliterations of the same "main field" in different scripts (think, for example, of India with its rich culture of regional languages). As the specification string in subfield 6 can only hold one script tag (like 245-01/Jpan for Japanese), I'd imagine that the different scripts are split into different 880 datafields, so the looping and filtering becomes more complicated, as per the first part of my example above. A hypothetical example could look like this:

<datafield tag="100" ind1="1" ind2=" ">
  <subfield code="6">880-02</subfield>
  <subfield code="a">Tolstoi, Lew Nikolajewitsch</subfield>
  <subfield code="4">aut</subfield>
</datafield>
<datafield tag="880" ind1="1" ind2=" ">
  <subfield code="6">100-02/Jpan</subfield>
  <subfield code="a">トルストイ, レフ・ニコラエヴィチ</subfield>
</datafield>
<datafield tag="880" ind1="1" ind2=" ">
  <subfield code="6">100-02/Cyrl</subfield>
  <subfield code="a">Толстой, Лев Николаевич</subfield>
</datafield>
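Under that assumption, resolving per-script values would mean parsing the part after the slash in the 880 back-link. A sketch against the hypothetical Tolstoi record above (`scripts_for` is an invented helper, and the record itself is constructed, not real data):

```python
# Hypothetical sketch: collecting all alternate scripts for one subfield.
import xml.etree.ElementTree as ET

RECORD = ET.fromstring("""
<record>
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="6">880-02</subfield>
    <subfield code="a">Tolstoi, Lew Nikolajewitsch</subfield>
  </datafield>
  <datafield tag="880" ind1="1" ind2=" ">
    <subfield code="6">100-02/Jpan</subfield>
    <subfield code="a">トルストイ, レフ・ニコラエヴィチ</subfield>
  </datafield>
  <datafield tag="880" ind1="1" ind2=" ">
    <subfield code="6">100-02/Cyrl</subfield>
    <subfield code="a">Толстой, Лев Николаевич</subfield>
  </datafield>
</record>
""")

def scripts_for(record, tag, code):
    """Collect alternate-script values for one subfield, keyed by script tag."""
    main = record.find(f"datafield[@tag='{tag}']")
    # The part after the dash in the subfield-6 link, e.g. "02".
    seq = main.findtext("subfield[@code='6']").split("-", 1)[1]
    results = {}
    for f880 in record.iterfind("datafield[@tag='880']"):
        backlink = f880.findtext("subfield[@code='6']") or ""
        if not backlink.startswith(f"{tag}-{seq}/"):
            continue
        script = backlink.split("/", 1)[1]  # e.g. "Jpan", "Cyrl"
        value = f880.findtext(f"subfield[@code='{code}']")
        if value is not None:
            results[script] = value
    return results

print(sorted(scripts_for(RECORD, "100", "a")))  # ['Cyrl', 'Jpan']
```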

However, I admit that this is an artificially constructed example, and I am not at all sure how one would deal with these cases in real-life scenarios.

Conclusion

It seems that this feature is absolutely not trivial, not only because of the convoluted way the data is represented in Alma MARC, but also with regard to how to make it available to the user in the lobid APIs. As such, I'd like to point out that this is not a crucial feature in our use case (it's no "make or break" situation for our library using lobid-resources in general), but it would still be very nice to have at some point.

I am open to further discussion and exchange of ideas on how to best integrate this feature, but I absolutely understand if you want to assign a low priority to the implementation. In the meantime, I have implemented a basic matching algorithm for our use cases over at https://github.com/gregorbg/RetroALePH/blob/master/src/main/kotlin/de/uzk/oas/japan/util/BibUtils.kt

P.S. Sorry for the wall of text 😇

blackwinter commented 1 year ago

Just as a note, here's how we modeled MARC 880 (Alternate Graphic Representation) in Limetrans. We expressly did not match them with their associated fields, though.

gregorbg commented 1 year ago

That implementation seems like a good first step in terms of displaying that Alternate Graphic Representations are available. However, it is hard to benefit from this feature if the resulting JSON doesn't tell me what is being transliterated, i.e. what the Alternate Graphic Representation refers to.

Sure, there are programs out there that can transliterate Japanese characters to Roman transcription automatically. But in reality, there are different rules for transcribing the same word, and even when sticking with one consistent set of rules there may be subtle differences, like using ô vs ō for elongated vowel sounds. So in practice, it is very hard to figure out which Japanese script belongs to which "original field" based on the string literal alone.