lcnetdev / marc2bibframe2

Convert MARC records to BIBFRAME2 RDF
http://www.loc.gov/bibframe/
Creative Commons Zero v1.0 Universal

Language tags and 880 fields #70

Open kiegel opened 6 years ago

kiegel commented 6 years ago

In regard to internationalization, the logic for applying language tags needs work for parallel-script fields (880), e.g. with translations or parallel titles.

Incorrect Language Tags and Script Subtags

For example, problems crop up with OCLC #271414, an English translation of a Russian work.

<http://lib.washington.edu/ld/test/99114652250001452#Work880-45> a bf:Work ;
    rdfs:label "Евгений Онегин."@en-cyrl ;

The label is Cyrillic but in Russian, not English.

Work [ a bflc:Relationship ;
            bflc:relation [ a bflc:Relation ;
                    rdfs:label "Container of (expression)"@en-cyrl ] ;
            bf:relatedTo <http://lib.washington.edu/ld/test/99114652250001452#Work880-44> ].

The label is English but not Cyrillic. In general, it is vanishingly rare for a string to be both in the English language and in the Cyrillic script.
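A mismatch like this one can be caught mechanically. Below is a minimal Python sketch, not part of the converter: it guesses a string's dominant script from Unicode character names and compares it with the script subtag in the language tag. The names `dominant_script`, `script_subtag_matches`, and the `SCRIPT_PREFIXES` table are hypothetical, and the table covers only the scripts that appear in this issue.

```python
import unicodedata
from collections import Counter

# Hypothetical lookup: first word of the Unicode character name -> ISO 15924 code.
# Only the scripts seen in this issue are listed.
SCRIPT_PREFIXES = {
    "LATIN": "Latn",
    "CYRILLIC": "Cyrl",
    "CJK": "Hani",
    "HIRAGANA": "Hira",
    "KATAKANA": "Kana",
}


def dominant_script(text):
    """Return the most frequent known script among the letters of text, or None."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, and spaces
        prefix = unicodedata.name(ch, "").split(" ")[0]
        script = SCRIPT_PREFIXES.get(prefix)
        if script:
            counts[script] += 1
    return counts.most_common(1)[0][0] if counts else None


def script_subtag_matches(text, tag):
    """Return False only when the tag names a script that the text is not written in."""
    subtags = tag.split("-")[1:]
    # A script subtag is a four-letter alphabetic subtag.
    declared = next((s for s in subtags if len(s) == 4 and s.isalpha()), None)
    detected = dominant_script(text)
    if declared is None or detected is None:
        return True  # nothing to compare
    return declared.lower() == detected.lower()


print(script_subtag_matches("Container of (expression)", "en-cyrl"))
print(script_subtag_matches("Евгений Онегин.", "en-cyrl"))
```

The check flags the English relation label tagged as Cyrillic, but it accepts the Russian title tagged `en-cyrl`, because the script subtag is consistent there and only the language is wrong. Detecting the wrong language needs information from the record itself.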

OCLC # 793950140, a Chinese translation of a Japanese work.

<http://lib.washington.edu/ld/test/99131426860001452#Work> a bf:Text,
        bf:Work ;
    rdfs:label "Inō Kanori no Taiwan tōsa nikki. Chinese",
        "伊能嘉矩の臺湾踏柤日記. Chinese"@zh-hani .

The title in the label is Japanese, not Chinese.

OCLC # 893875561, a Latvian book with a parallel title in Russian.

[ a bf:ParallelTitle,
                bf:Title,
                bf:VariantTitle ;
            rdfs:label "Заяц и его друзья : латышские народные сказки о животных"@lv-cyrl ;
            bf:mainTitle "Заяц и его друзья"@lv-cyrl ;
            bf:subtitle "латышские народные сказки о животных"@lv-cyrl ]

The title in the label, mainTitle and subtitle is Russian, not Latvian.

Compliance with IETF RFC 5646

Use of language tags should follow the practices given in IETF RFC 5646 [1]. Concerning the script subtag, on page 12 it states “[it] SHOULD be omitted when it adds no distinguishing value to the tag or when the primary or extended language subtag's record in the subtag registry includes a 'Suppress-Script' field listing the applicable script subtag”.

For example, for OCLC # 1779370:

<http://lib.washington.edu/ld/test/99129152590001452#Agent880-32> a bf:Agent,
        bf:Jurisdiction ;
    rdfs:label "Russia. Министерство народнаго просвѣщенія."@ru-cyrl .

Russian (ru) has a Suppress-Script field listing Cyrl in the subtag registry, so the Cyrillic script subtag should be omitted.
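The Suppress-Script rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `SUPPRESS_SCRIPT` is a hand-copied excerpt of the IANA Language Subtag Registry covering a few languages from this issue, and `simplify_tag` is a hypothetical helper, not something the converter provides. It only looks for a script subtag directly after the primary language subtag.

```python
# Hand-copied excerpt of Suppress-Script values from the IANA Language Subtag
# Registry. A real implementation would load the full registry instead.
SUPPRESS_SCRIPT = {
    "en": "Latn",
    "ru": "Cyrl",
    "lv": "Latn",
    "ja": "Jpan",
}


def simplify_tag(tag):
    """Drop the script subtag when the registry suppresses it for the language."""
    subtags = tag.split("-")
    lang = subtags[0].lower()
    out = [lang]
    rest = subtags[1:]
    # A script subtag is four letters; RFC 5646 recommends title case for it.
    if rest and len(rest[0]) == 4 and rest[0].isalpha():
        script = rest[0].title()
        if SUPPRESS_SCRIPT.get(lang) != script:
            out.append(script)  # keep scripts that add distinguishing value
        rest = rest[1:]
    out.extend(rest)
    return "-".join(out)


print(simplify_tag("ru-cyrl"))
print(simplify_tag("zh-hani"))
```

Under this sketch `ru-cyrl` reduces to `ru`, while `zh-hani` keeps its script subtag because `zh` has no Suppress-Script entry. A tag such as `lv-cyrl` is also left with its script, since the suppressed script for Latvian is Latn; the rule removes redundancy but does not correct a wrong language.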

Not Good Practice

Using a language tag for numeric data in bf:part is not wrong, but it is probably not good practice.

<http://lib.washington.edu/ld/test/99129152590001452#Instance880-38> a bf:Instance ;
    bf:part "1825-29"@ru-cyrl ;
    bf:title [ a bf:Title ;
            rdfs:label "Записки"@ru-cyrl ] .
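One way to avoid tagging values like "1825-29" is to attach a language tag only when the string contains at least one letter. The Python sketch below shows the idea; `turtle_literal` is a hypothetical helper written for this example, and the letter test is an assumed heuristic rather than anything the conversion specs define.

```python
def turtle_literal(value, lang=None):
    """Serialize a string as a Turtle literal, tagging it only if it has letters."""
    has_letters = any(ch.isalpha() for ch in value)
    # Escape backslashes first, then double quotes, for a short quoted literal.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    if lang and has_letters:
        return f'"{escaped}"@{lang}'
    return f'"{escaped}"'  # purely numeric or punctuation-only: plain literal


print(turtle_literal("1825-29", "ru"))
print(turtle_literal("Записки", "ru"))
```

With this heuristic the date range is emitted as a plain literal and the title keeps its tag.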

[1] https://tools.ietf.org/html/bcp47

kirkhess commented 6 years ago

This is complicated since I think some of this is bad data vs bad conversion. We'll investigate and report back.

osma commented 6 years ago

I've also seen the converter create @ru-cyrl language tags where the -cyrl is redundant and forbidden by BCP 47. I've chosen to ignore them for now.

kirkhess commented 6 years ago

The specs are going to be updated - pretty sure the best solution is to stop adding tags based on 008+$6.

If the MARC record included the language along with the script, it would be different. That is technically possible, and we were going to look into it as well.
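To picture why tags built from 008 plus $6 go wrong, here is a rough Python sketch. It is not the converter's code, and it assumes the tag is formed by pairing the record-level language in 008/35-37 with the script identification code in the 880 field's $6. The function `naive_tag` and both lookup tables are hypothetical and contain only a few entries.

```python
# Small excerpts for illustration; the full code lists are in the MARC documentation.
MARC_LANG_TO_BCP47 = {"eng": "en", "rus": "ru", "lav": "lv", "chi": "zh"}
SCRIPT_CODE_TO_SUBTAG = {"(N": "cyrl", "$1": "hani", "(B": "latn"}


def naive_tag(lang_008, subfield_6):
    """Pair the record-level language with the field-level script (assumed logic)."""
    lang = MARC_LANG_TO_BCP47[lang_008]
    # $6 looks like "245-01/(N": linking tag, occurrence number, then script code.
    parts = subfield_6.split("/")
    script = SCRIPT_CODE_TO_SUBTAG.get(parts[1]) if len(parts) > 1 else None
    return f"{lang}-{script}" if script else lang


# An English-language record with a Cyrillic 880 field:
print(naive_tag("eng", "245-01/(N"))
```

The 008 language describes the record as a whole, while $6 describes the script of one field, so the combination says nothing reliable about the language of that field's text. Under this assumed logic an English record with a Cyrillic field yields `en-cyrl`, matching the tags reported above.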