uwlib-cams / MARC2RDA

mapping between MARC21 and RDA-RDF
Creative Commons Zero v1.0 Universal

$6 #344

Open CECSpecialistI opened 2 years ago

CECSpecialistI commented 2 years ago

Numeric and Control Subfields

szapoun commented 2 years ago

$6 links a field to the same information recorded in a different script. It can be used in any field to point to an 880 occurrence that carries the alternate-script version.

The $6 consists of three parts:

  1. The linking part, which identifies the exact 880 occurrence that contains the other script. It holds the tag of the associated field (linking tag) and a two-digit occurrence number that pairs the two fields, e.g., 100 1# $6880-01
  2. The script identification part, a code that follows the linking part and identifies the alternate script used in the field.
  3. The orientation code, which indicates that the script is right-to-left. Left-to-right is the default in MARC21 and is not coded explicitly, so in practice the code r is used to indicate right-to-left.

The format is:

XXX $6 [linking tag]-[occurrence number]

880 $6 [linking tag]-[occurrence number]/[script identification code]/[field orientation code]

Example: 100 1# $6880-01 $aname in latin script

880 ## $6100-01 $aname in another script

alternatively

880 ## $6100-01/(3/r $aname in another script

Here code (3 means that the script is Arabic and code r means that the orientation is right-to-left.

An 880 may also stand alone, not paired with any other field. In this case the linking part in $6 is XXX-00.
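As an illustration of the three-part structure described above, here is a minimal Python sketch that splits a $6 value into its parts. The names `Linkage` and `parse_linkage` are hypothetical, and the sketch assumes numeric three-digit tags and two-digit occurrence numbers; it is not taken from the MARC2RDA transform.

```python
import re
from typing import NamedTuple, Optional

# Assumed shape of a $6 value: linking tag, "-", occurrence number, then an
# optional "/script code" part and an optional "/orientation code" part.
_LINKAGE = re.compile(
    r"^(?P<tag>\d{3})-(?P<occ>\d{2})"
    r"(?:/(?P<script>[^/]*)(?:/(?P<orient>.+))?)?$"
)


class Linkage(NamedTuple):
    """Hypothetical container for the parts of a $6 value."""
    linking_tag: str
    occurrence: str
    script_code: Optional[str]
    orientation: Optional[str]

    @property
    def is_unpaired(self) -> bool:
        # Occurrence number 00 means the 880 has no associated regular field.
        return self.occurrence == "00"


def parse_linkage(value: str) -> Linkage:
    """Split a $6 value such as '100-01/(3/r' into its parts."""
    match = _LINKAGE.match(value.strip())
    if match is None:
        raise ValueError(f"not a recognisable $6 value: {value!r}")
    # An empty script part (as in '100-01//r') is reported as None.
    script = match.group("script") or None
    return Linkage(match.group("tag"), match.group("occ"),
                   script, match.group("orient"))


if __name__ == "__main__":
    for raw in ("880-01", "100-01/(3/r", "520-06/$1", "245-00"):
        print(raw, parse_linkage(raw))
```

Keeping the script and orientation parts optional lets the same function handle the short form used on the regular field and the longer form used on the 880.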

Examples from BL

Serial in Greek - Νομισματικά Χρονικά

Serial in Russian - Lingvoritoricheskai︠a︡ paradigma

Book in Russian - Dei︠a︡nii︠a︡ Andrei︠a︡ i Matfii︠a︡ v gorode li︠u︡doedov

szapoun commented 2 years ago

Trying to map $6

The 880 should be mapped according to its associated field.

Using the following example 100 1# $6880-01 $aname in latin script

880 ## $6100-01 $aname in another script

Both 100 and its associated 880 will be mapped the same way, e.g., work - has author - 100$a // work - has author - 880$a

This is the simplest solution. But if the 880 $6 has a "script identification code", the information about the script is lost.

I am not sure whether an XML script tag exists. What I have found is the xml:lang language-script combination, which the guidelines say to use only when the script differs from the one already associated with the language; see the IANA language subtag registry and check the Suppress-Script value for each language.

Another thing I am not sure of is whether xml:lang language-script codes can be used with RDA/RDF properties, e.g.

Aristotle Aristotelēs Αριστοτέλης

Maybe the BIBFRAME approach would be of interest. Please check $6 in this [downloadable file from the BIBFRAME conversions](https://www.loc.gov/bibframe/mtbf/ConvSpec-NumericSubfields-v1.7.docx)

GordonDunsire commented 2 years ago

XML language codes are built-in to RDF. These xml serializations are fine; in other serializations we have:

ex:Work1 rdawd:P10061 "Aristotle"@en .

GordonDunsire commented 2 years ago

In RDA, these strings are treated as appellations: a nomen string of a nomen, with each string represented as a distinct nomen.

RDA allows a simple direct relationship to the nomen string:

ex:Person1 rdaad:P50111 [has name of person] "Aristotle"@en .

Alternatively, for VIAF-level authority control, RDA offers direct relationship to the nomen instance:

ex:Person1 rdaao:P50111 [has name of person] ex:Nomen1 .
ex:Nomen1 rdand:P80068 [has nomen string] "Aristotle"@en .

For names and titles (appellations) in transliterations or 'translations', the simple approach results in a cluster of appellation strings:

ex:Person1 rdaad:P50111 "Aristotle"@en .
ex:Person1 rdaad:P50111 "Aristotelēs"@el-latn .
ex:Person1 rdaad:P50111 "Αριστοτέλης"@el-grek .

The more complex Nomen approach results in a chain/cluster hybrid:

ex:Person1 rdaao:P50111 ex:Nomen3 .
ex:Nomen3 rdand:P80068 "Αριστοτέλης"@el-grek .
ex:Nomen3 rdano:P80060 [has derivation] ex:Nomen2 .
ex:Nomen2 rdand:P80068 "Aristotelēs"@el-latn .
ex:Nomen2 rdano:P80060 ex:Nomen1 .
ex:Nomen1 rdand:P80068 "Aristotle"@en .

[There are many variations; this one starts in the original script, derives a transliteration, and then derives a translation, but the components can be swapped around in the sequence.]

The more complex method is more useful in the context of translations and transliterations if Person1 uses only one form of name. This is quite rare: "Gordon Dunsire" and "Gordon J. Dunsire" and "G. Dunsire" are typical forms used in scholarly publishing. Aristotle is not a typical case :-)

But what is the typical case for recording a transliteration in MARC 21 legacy records?

pan-zhuo commented 2 years ago

The associated field may not contain the corresponding romanized form of 880, especially for 520.

https://lccn.loc.gov/2021421243

520 ## |6 880-06 |a Detailed summary in vernacular field only.

880 ## |6 520-06/$1 |a "学者的人间情怀"是陈平原的代表作,论及"学术史""走出'五四'""左图右史""述学文体","演说现场","报刊研究"等重要话题,也都点到为止,好在大都日后在专业著作中有所展开.最重要的是,反映了他当时"压在纸背的心情".

pan-zhuo commented 2 years ago

For right-to-left scripts (880 $6 /r), it seems that we can use the i18n namespace to set a base direction.

GordonDunsire commented 2 years ago

Some additional thoughts ...

  1. For non-RDF representations in RDA, the script and language of a nomen can be explicitly assigned:

ex:Nomen3 rdand:P80066 "el" . ex:Nomen3 rdand:P80070 "grek" .

  2. An alternative nomen clustering structure uses "is equivalent to":

ex:Person1 rdaao:P50111 ex:Nomen3 .
ex:Nomen3 rdano:P80113 ex:Nomen2 .
ex:Nomen3 rdano:P80113 ex:Nomen1 .

This structure might be a better option for a transform: treat the contents of 880 as an equivalent nomen of whatever is in the primary tag. This preserves the idea that the 880 is derived from the primary value (as a transliteration) but is a softer approach than using a chain of nomens as indicated above. The downside of both approaches is that nomen IRIs have to be minted, but this can be minimised to the IRI of the primary nomen:

ex:Person1 rdaao:P50111 ex:Nomen3 .
ex:Nomen3 rdand:P80068 "Αριστοτέλης"@el-grek .
ex:Nomen3 rdand:P80113 "Aristotelēs"@el-latn .
ex:Nomen3 rdand:P80113 "Aristotle"@en .

The 'simple' cluster model does not preserve a distinction between the primary appellation and the transliteration.
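For illustration, the minimised pattern above (one minted nomen IRI carrying the primary string plus "is equivalent to" strings) could be generated along these lines. This is only a sketch: the helper names `literal` and `nomen_cluster` are hypothetical, the property codes are copied from the statements above, and the output is prefixed-name statements rather than a complete Turtle document.

```python
from typing import List, Optional, Tuple

# (string, language tag or None); the tag is optional because it may not be known.
TaggedString = Tuple[str, Optional[str]]


def literal(text: str, lang: Optional[str] = None) -> str:
    """Format a plain or language-tagged literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"' + (f"@{lang}" if lang else "")


def nomen_cluster(entity: str, relation: str, nomen: str,
                  primary: TaggedString,
                  equivalents: List[TaggedString]) -> List[str]:
    """Build statements for one nomen with its primary and equivalent strings."""
    lines = [
        f"{entity} {relation} {nomen} .",
        # rdand:P80068 = has nomen string (code as used in this thread)
        f"{nomen} rdand:P80068 {literal(*primary)} .",
    ]
    for equivalent in equivalents:
        # rdand:P80113 = is equivalent to, datatype form (code as used in this thread)
        lines.append(f"{nomen} rdand:P80113 {literal(*equivalent)} .")
    return lines


if __name__ == "__main__":
    for line in nomen_cluster(
        "ex:Person1", "rdaao:P50111", "ex:Nomen3",
        ("Αριστοτέλης", "el-grek"),
        [("Aristotelēs", "el-latn"), ("Aristotle", "en")],
    ):
        print(line)
```

Only one IRI (`ex:Nomen3`) is minted here; the equivalent appellations stay as literals, which is the point of the minimised model.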

CECSpecialistI commented 2 years ago

I just wrote out a rough draft of what might go into the documentation on $6. I didn't create a visual diagram because the question of how and how reliably we can determine which are primary appellations and which are derivations/transliterations is still a bit beyond me. We haven't answered Gordon's question above, either: "But what is the typical case for recording a transliteration in MARC 21 legacy records?" How well can we do here? @szapoun @GordonDunsire

pan-zhuo commented 2 years ago

It gets complicated when you consider that 100 holds an access point for a person.

100 1# $6 880-01 $a Tolstoy, Leo, $c graf, $d 1828-1910, $e author.

880 1# $6 100-01/(N $a Толстой, Лев, $c граф, $d 1828-1910, $e author.

I think it's safe to say:

ex:Person1 rdaao:P50411 [has authorized access point for person] ex:Nomen1 .
ex:Nomen1 rdand:P80068 "Tolstoy, Leo, graf, 1828-1910"@en .

Two questions:

  1. Can you say there's another nomen that is 'equivalent to' an authorized access point?

ex:Nomen1 rdand:P80113 "Толстой, Лев, граф, 1828-1910"@ru .

If not, then maybe a variant access point

ex:Person1 rdaad:P50412 [has variant access point for person] "Толстой, Лев, граф, 1828-1910"@ru .

  2. If we map to an access point, do we still keep the 'name of person' or not?

ex:Person1 rdaao:P50111 [has name of person] ex:Nomen2 .
ex:Nomen2 rdand:P80068 "Толстой, Лев"@ru .
ex:Nomen2 rdand:P80113 "Tolstoy, Leo"@en .

It seems to me that "Толстой, Лев" is a structured description (the inverted form), and RDA only allows an unstructured description, identifier, or IRI for 'name of person'.

CECSpecialistI commented 2 years ago

The 100 is not "name of person", but "access point for person". Since we don't know whether the access point is controlled or uncontrolled here, it's not safe to assume it's the authorized access point. I think it should look like this:

ex:Person1 rdaao:P50377 [has access point for person] ex:Nomen2 .
ex:Nomen2 rdand:P80068 "Толстой, Лев, граф, 1828-1910"@ru .
ex:Nomen2 rdand:P80060 "Tolstoy, Leo, graf, 1828-1910"@en .

CECSpecialistI commented 2 years ago

I will update the draft documentation today to reflect our meeting discussion yesterday.

gerontakos commented 2 years ago

Wait are we using hasDerivation or isEquivalentTo?

gerontakos commented 2 years ago

Also I don't see how we can accurately produce language tags using the bibliographic data (unless the primary field allows an explicit statement representing the language of the value AND it is used in the bib data for that purpose). In Zhuo's example above, for example, with the 100/880: I don't see a representation of the language of the value; I only see a general script identification code. Am I missing something?

CECSpecialistI commented 2 years ago

We should use hasDerivation here in my opinion, right? isEquivalentTo is broader, but still correct. You're right that the language tags can't be derived from the MARC data; they should just be script IDs.

gerontakos commented 2 years ago

I don't know of script IDs that can be represented without a language representation; scripts, if I remember correctly, are subtags only. That's according to my memory of BCP47 (I'm being lazy and not looking at it).

CECSpecialistI commented 2 years ago

oh, interesting. i don't think we can accurately determine language for these without humans, can we?

gerontakos commented 2 years ago

That's what I'm thinking

pan-zhuo commented 2 years ago

I'm unsure how to determine which nomen is derived from which if 880 is a combination of non-latin and latin data? I assume a cataloger would construct the authorized access point first, but the base form of the AAP is derived from the non-latin data.

700 1# $6 880-07 $a Xu, Hong $c (Librarian), $e author.

880 1# $6 700-07/$1 $a 徐鸿 $c (Librarian), $e author.

gerontakos commented 2 years ago

That suggests use of isEquivalentTo may be more reliable than hasDerivation, no?

CECSpecialistI commented 2 years ago

regular field is derived from 880

gerontakos commented 2 years ago

always?

CECSpecialistI commented 2 years ago

yes i believe so, we discussed at meeting week before last

GordonDunsire commented 2 years ago

If there is doubt, "is equivalent to" should be used to link the original nomen (object) to the transliterated nomen string (datatype). The examples in Bibliographic Appendix D suggest that there is no absolute differentiation between "agency script" and "bibliographic script".

Ideally, we would want to use the bibliographic script for the nomen object and the agency script for the equivalent nomen string (datatype), but this cannot be distinguished without human intervention.

I think we should assume that the regular field (not 880) is the source of the nomen object. It is what is intended by the cataloguing agency, irrespective of agency or bibliographic script preferences.

This implies that the 880 field is the source of the equivalent nomen strings.

nomenRegular <"has nomen string"> "SES for 880 subfields" .

Issues:

1) The existence of an 880 field with non-zero occurrence number indicates the need to mint a nomen IRI for the regular field, to link to the equivalent 880 nomen string.

Do we want to mint a nomen IRI for all appropriate "regular" fields that do not have a corresponding 880?

This would be counter to the avoidance of minting nomen IRIs unless there is a need to talk about (describe) the nomen.

For example, do we mint a nomen IRI for 264 $a (appellation of place) and 264 $b (appellation of corporate body)?

If not, the process must start with the 880 field.

If there is an 880 occurrence number, find the regular field, mint a nomen IRI with the regular field's nomen string, and relate it to the 880 field's nomen string with "is equivalent to".

If there is more than one 880 field with the same linkage tag/occurrence number, add the second and subsequent 880 nomen strings to the IRI minted for the first 880.

If there is a "00" 880 occurrence number, mint a nomen IRI with the 880's nomen string, assigning it to the entity indicated by the linkage tag.

In all cases, a nomen IRI is only minted if the regular field and subfields contain nomen string values. Some regular field and subfields contain manifestation statement, note, and other unstructured string values that are not appellations.

If the 880 linkage tag is for a field that is not an appellation, generate an additional set of statements using the mappings for the regular field and the content of the 880 field.

That is, a transliteration of an annotation (note, etc.) should be treated as a separate annotation with no extrinsic relationship to the original. Otherwise, the relationship can only be preserved by reifying at least the original statement, if not both.
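The 880-first process described above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the project's transform: `Field`, `APPELLATION_TAGS`, and `plan_880` are hypothetical names, the set of appellation tags is a placeholder, each field is reduced to a single string, and an 880 whose regular field cannot be found is simply skipped (the thread does not say what to do in that case).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Field:
    tag: str         # field tag, e.g. "100" or "880"
    link_tag: str    # linking tag from $6 ("" if the field has no $6)
    occurrence: str  # occurrence number from $6 ("" if the field has no $6)
    text: str        # string built from the field's subfields


# Placeholder: tags whose values are treated as appellations (nomen strings).
APPELLATION_TAGS = {"100", "700"}


def plan_880(fields: List[Field]) -> Tuple[List[dict], List[Tuple[str, str]]]:
    """Return (nomens to mint, extra statements to generate), starting from the 880s."""
    regular: Dict[Tuple[str, str], Field] = {
        (f.tag, f.occurrence): f for f in fields if f.tag != "880" and f.occurrence
    }
    nomens: List[dict] = []
    by_key: Dict[Tuple[str, str], dict] = {}
    extra: List[Tuple[str, str]] = []

    for f in fields:
        if f.tag != "880":
            continue
        key = (f.link_tag, f.occurrence)
        if f.link_tag not in APPELLATION_TAGS:
            # Not an appellation: map the 880 content again with the
            # regular field's mapping, as a separate statement.
            extra.append((f.link_tag, f.text))
        elif f.occurrence == "00":
            # Unpaired 880: the nomen is minted from the 880 string itself.
            nomens.append({"tag": f.link_tag, "string": f.text, "equivalents": []})
        elif key in by_key:
            # Second or later 880 for the same linkage: reuse the minted nomen.
            by_key[key]["equivalents"].append(f.text)
        elif key in regular:
            # First 880 for this linkage: mint a nomen from the regular field
            # and attach the 880 string with "is equivalent to".
            entry = {"tag": f.link_tag, "string": regular[key].text,
                     "equivalents": [f.text]}
            by_key[key] = entry
            nomens.append(entry)
        # Otherwise the regular field is missing; this sketch skips the 880.
    return nomens, extra
```

Because the loop is driven by the 880 fields, regular fields without a corresponding 880 never trigger the minting of a nomen IRI, which matches the aim of avoiding unnecessary nomens.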

2) Including the script in an xml language tag will be difficult. It is necessary to add the script code to a language code, so there is a problem when only the script code is recorded. The script is Cyrillic, but is the language Russian or Serbian? This requires automatic detection of languages, or it may be possible to use the language codes in 008 and 041. If there is a code for 'predominant language' (008 35-37) or 'language of text' (041 $a) then this can be mapped to an xml language tag with an appended script code.

If the language cannot be determined, the script cannot be recorded in an xml language attribute. Instead, we are back to minting nomens for the 880 fields so that we can use the RDA element "has script of nomen".
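A rough Python sketch of the language-plus-script idea follows. The two lookup tables are deliberately tiny, illustrative subsets (a real transform would need the full MARC language code list and a decision on codes such as $1, which covers Chinese, Japanese, and Korean together), and `language_tag` is a hypothetical name. It also assumes that the record-level language from 008/35-37 or the first 041 $a applies to the field in question, which will not always hold.

```python
from typing import List, Optional

# Illustrative subsets only; not complete mappings.
MARC_LANG_TO_BCP47 = {"rus": "ru", "srp": "sr", "gre": "el", "ara": "ar"}
MARC8_SCRIPT_TO_ISO15924 = {
    "(3": "Arab",  # Arabic
    "(B": "Latn",  # Latin
    "(N": "Cyrl",  # Cyrillic
    "(S": "Grek",  # Greek
    "(2": "Hebr",  # Hebrew
}


def language_tag(field_008: str, subfields_041a: List[str],
                 script_code: Optional[str]) -> Optional[str]:
    """Build a language tag with an appended script subtag, or None if no language is known."""
    # 008 positions 35-37 hold the language code (zero-based slice 35:38).
    marc_lang = field_008[35:38].strip() if len(field_008) >= 38 else ""
    if marc_lang not in MARC_LANG_TO_BCP47 and subfields_041a:
        # Fall back to the first 041 $a when 008 gives nothing usable.
        marc_lang = subfields_041a[0]
    lang = MARC_LANG_TO_BCP47.get(marc_lang)
    if lang is None:
        # No language: the script alone cannot go into a language tag.
        return None
    script = MARC8_SCRIPT_TO_ISO15924.get(script_code or "")
    return f"{lang}-{script}" if script else lang
```

Returning `None` when no language is found corresponds to the fallback case, where the script would instead have to be recorded on a minted nomen.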

GordonDunsire commented 2 years ago

Some clarification of the discussion on the semantics of "is equivalent to":

This element ignores the kind of appellation/nomen string (name/title, access point, identifier). All it means is that the two related nomens are appellations of the same entity; it says nothing about the appellations themselves and is valid if the related nomens are a name and an identifier, etc. In reification terms:

nomen1 rdano:P80113 nomen2 .
nomen1 rdf:subject entity1 .
=> nomen2 rdf:subject entity1 .

nomen1 rdand:P80113 "nomen string 2" .
nomen1 rdf:subject entity1 .
=> entity1 rdaxd:P00017 "nomen string 2" .

Conversely:

entity1 rdaxo:P00017 nomen1 .
entity1 rdaxo:P00017 nomen2 .
=> nomen1 rdano:P80113 nomen2 .

entity1 rdaxo:P00017 nomen1 .
entity1 rdaxd:P00017 "nomen string 2" .
=> nomen1 rdand:P80113 "nomen string 2" .

The basic model for transliterated nomen strings is:

entity1 rdaxo:P00017 nomen1 .
nomen1 rdand:P80068 "nomen string 1" .
nomen1 rdand:P80113 "nomen string 2" .

Nomen is a reification:

entity1 rdaxd:P00017 "nomen string 1" .
=> nomen1 rdf:subject entity1 .
=> nomen1 rdf:predicate rdaxd:P00017 .
=> nomen1 rdf:object "nomen string 1" .

CECSpecialistI commented 2 years ago

So it looks like we will need to leave out at least some script identification codes, where a string won't be associated with a Nomen. What will be done with the orientation codes? I just updated the draft documentation and tried to implement comments

> I'm unsure how to determine which nomen is derived from which if 880 is a combination of non-latin and latin data? I assume a cataloger would construct the authorized access point first, but the base form of the AAP is derived from the non-latin data.
>
> 700 1# $6 880-07 $a Xu, Hong $c (Librarian), $e author.
>
> 880 1# $6 700-07/$1 $a 徐鸿 $c (Librarian), $e author.

This is a really good example. Maybe the most accurate we can be is "hasEquivalent". This also suggests we might need to throw out or be less exact with field orientation and script identification codes, huh?

lake44me commented 2 years ago

This documentation is perhaps relevant to our discussion today: MARC21 Character Sets https://www.loc.gov/marc/specifications/speccharintro.html
Part 3, Unicode has this to say:

MARC subfield $6 (Linkage)

Subfield $6 (Linkage) is used in MARC 21 records to link alternate graphic representations of the same data, to identify the presence of specific scripts in a field, and to flag fields in which the display/print directionality of data is right-to-left (e.g., for Arabic script). The subfield $6 script identification code in MARC-8-encoded MARC 21 records identifies MARC-8 character sets, rather than scripts per se; hence the code is irrelevant in the Unicode environment because the character set is always UCS, which has no script identification code value. The script identification code should be dropped from subfield $6 when converting to Unicode from MARC-8 encoding. The Field Orientation Code, which flags a field as having right-to-left display directionality, should be used in Unicode-encoded MARC 21 records. When present, the Field Orientation code is separated from the subfield $6 tag linkage data by two solidus (slash) characters (002F(hex)).

OCLC Bibliographic Formats and Standards refers to the above document in its description of the 880 tag and in the Control Fields section on $6, but doesn't explicitly say there that some features described in that documentation are irrelevant for Unicode records. However, the tag 066 documentation does mention that "Records containing non-MARC-8 characters are expected to be output in the UTF-8 data format. Field 066 does not appear in records exported in UTF-8, and the script code does not appear in field 880 subfield ǂ6." And the Online Cataloging page, section .2.7 (Character Sets), states "OCLC implemented use of all UTF-8 Unicode defined characters in 2016." What's confusing is their mention of codes for Unicode scripts, which were used in MARC8-encoded records during the transition period between MARC8 with the original 8 alternate graphic encodings and the point of full Unicode implementation, after which libraries and their systems were expected to transition to UTF8.

If we specify that our conversion is designed for MARC21 UTF8 Unicode records we could save ourselves from having to deal with the MARC8 script coding. If people with MARC8 want to transform their records to RDA/RDF they would have to Unicodize them first. Policy?
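To make the quoted LC guidance concrete, here is a small Python sketch of how a $6 value from a MARC-8 record would look after conversion to Unicode: the script identification code is dropped and any orientation code is kept after two slashes. The function name `to_unicode_linkage` is hypothetical and the pattern assumes numeric tags; it is not part of any existing conversion tool.

```python
import re

# Assumed shape: "TAG-NN", optionally followed by "/script" and "/orientation".
_LINKAGE = re.compile(
    r"^(?P<link>\d{3}-\d{2})(?:/(?P<script>[^/]*)(?:/(?P<orient>.+))?)?$"
)


def to_unicode_linkage(value: str) -> str:
    """Drop the script identification code from a $6 value, keeping orientation."""
    match = _LINKAGE.match(value.strip())
    if match is None:
        raise ValueError(f"not a recognisable $6 value: {value!r}")
    orient = match.group("orient")
    # Two slashes separate the linkage data from the orientation code
    # once the script code is gone.
    return f"{match.group('link')}//{orient}" if orient else match.group("link")


if __name__ == "__main__":
    for raw in ("100-01/(3/r", "520-06/$1", "880-01"):
        print(raw, "->", to_unicode_linkage(raw))
```

A value that is already in the Unicode form, such as 100-01//r, passes through unchanged.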

CECSpecialistI commented 2 years ago

Decision documentation has been transferred from now-obsolete working google doc to decisions index: https://github.com/uwlib-cams/MARC2RDA/wiki/Decisions-Index#iih-6

lake44me commented 2 years ago

I realize now that we didn't finish the discussion on MARC-8 vs. Unicode. Crystal had a question why we should avoid MARC-8.

Note that only MARC-8 encoding would have a "script" indication in tag 066, and a script identification code in $6. That would be for the special MARC-8 script encoding for the limited set of languages with those codings. There is no script identifier in $6 when the record is in Unicode.

Practically speaking, I don't think we can deal with the MARC-8 encoding which for all intents and purposes is obsolete today and would require special software to "decode" for display in a web page or editing interface.

Most library systems provide processes to convert MARC8 records to UTF8 Unicode, and MarcEdit provides tools to do so.

CECSpecialistI commented 2 years ago

My knowledge of MARC-8 is limited. Since we aren't mapping the script indications anyhow, does it matter for the $6 mapping specifically? I don't see any harm in specifying somewhere that our mapping is designed for Unicode UTF8...I will put the idea on the agenda for next week's meeting and if no one is opposed that's what we'll do. Thank you for taking the time to explain this to us @lake44me !