ualbertalib / avalon

University of Alberta's Media Repository based on Avalon
Apache License 2.0

Extend language validation to include ISO 639-3 #119

Closed zschoenb closed 5 years ago

zschoenb commented 7 years ago

Descriptive summary

ISO 639-3 (https://raw.githubusercontent.com/datasets/language-codes/master/data/language-codes-full.csv) could be used to validate languages on ingest. ISO 639-3 should provide the inclusiveness necessary for many of our multilingual items.

Expected behavior

Languages such as 'mandarin', 'cantonese', or 'farsi' should validate.

Actual behavior

MARC language codes are currently used to validate languages, and that list is not very inclusive. For example, "Chinese" and "Persian" stand in for the language names above.
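
A minimal sketch of how name-based validation could work on ingest, assuming a lookup built from ISO 639-3 reference names plus a locally maintained alias table. The sample entries, the alias table, and the helper names (`normalize`, `build_lookup`, `language_code_for`) are all hypothetical and not part of Avalon; a real list would be loaded from whichever ISO 639-3 source is chosen.

```ruby
# Hypothetical sample of ISO 639-3 entries (code => reference name).
ISO_639_3_SAMPLE = {
  "cmn" => "Mandarin Chinese",
  "yue" => "Yue Chinese",
  "pes" => "Iranian Persian",
  "pdt" => "Plautdietsch"
}.freeze

# Hypothetical local aliases for names people actually type on ingest.
# Keys are already lower-cased so they match normalised input.
LOCAL_ALIASES = {
  "mandarin"  => "cmn",
  "cantonese" => "yue",
  "farsi"     => "pes"
}.freeze

# Normalise a free-text language value for comparison.
def normalize(name)
  name.to_s.strip.downcase
end

# Build a lookup from normalised name (or bare code) to ISO 639-3 code.
def build_lookup(entries, aliases)
  lookup = {}
  entries.each do |code, name|
    lookup[code] = code
    lookup[normalize(name)] = code
  end
  lookup.merge(aliases)
end

LOOKUP = build_lookup(ISO_639_3_SAMPLE, LOCAL_ALIASES).freeze

# Return the matching code, or nil when the value does not validate.
def language_code_for(value)
  LOOKUP[normalize(value)]
end
```

With this shape, `language_code_for("Farsi")` resolves to `pes`, while an unknown value returns `nil` and can be rejected on ingest.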

seanluyk commented 7 years ago

Looks like this is the authoritative version. There's also an explanation of the differences between ISO standards here FYI.

seanluyk commented 7 years ago

P.S. You won't find Cantonese in there: it's classified as Yue, the Mandarin name for Cantonese, which was apparently a controversial choice.

zschoenb commented 7 years ago

So it seems... Thanks, Sean.

zschoenb commented 7 years ago

The mappings also appear to be there.

cwant commented 7 years ago

@zschoenb Here is the URL to the current language file (with URIs):

https://github.com/ualbertalib/avalon/blob/master/config/iso639-2.yml

Are there URIs for the 639-3 language set?

zschoenb commented 7 years ago

With 639-3, you get multiple language names assigned to the same language code. What do you think the implications are for production? @cwant

zza:
  :code: zza
  :text: Dimili
  :uri: http://www-01.sil.org/iso639-3/documentation.asp?id=zza
zza:
  :code: zza
  :text: Kirdki
  :uri: http://www-01.sil.org/iso639-3/documentation.asp?id=zza
zza:
  :code: zza
  :text: Zaza
  :uri: http://www-01.sil.org/iso639-3/documentation.asp?id=zza
zza:
  :code: zza
  :text: Zazaki
  :uri: http://www-01.sil.org/iso639-3/documentation.asp?id=zza
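
One possible way around the repeated key is to keep a single entry per code and collect every name under it. A YAML mapping cannot meaningfully hold the same key four times (in Ruby's loader, as far as I know, the last entry silently wins), so the sketch below groups rows before anything is written out. It assumes the names arrive as code/name pairs; the `:names` key and the helpers `group_by_code` and `name_index` are hypothetical and not part of the current config.

```ruby
# Rows as they might come out of a 639-3 name table: [code, name].
ROWS = [
  ["zza", "Dimili"],
  ["zza", "Kirdki"],
  ["zza", "Zaza"],
  ["zza", "Zazaki"]
].freeze

# URI prefix taken from the entries quoted in this thread.
URI_BASE = "http://www-01.sil.org/iso639-3/documentation.asp?id="

# Collapse rows into one entry per code, collecting every distinct name.
def group_by_code(rows)
  rows.each_with_object({}) do |(code, name), entries|
    entry = entries[code] ||= { code: code, names: [], uri: URI_BASE + code }
    entry[:names] << name unless entry[:names].include?(name)
  end
end

# Reverse index so that any listed name resolves to its code.
def name_index(entries)
  entries.each_with_object({}) do |(code, entry), index|
    entry[:names].each { |name| index[name.downcase] = code }
  end
end

ENTRIES = group_by_code(ROWS)
NAME_INDEX = name_index(ENTRIES)
```

Here all four names resolve to `zza` through `NAME_INDEX`, and the data keeps a single entry per code.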

seanluyk commented 7 years ago

In the interest of closing this service request, I've had Tai-chun run the batch ingest for this collection, with the following language limitations noted:

- Mandarin and Cantonese items are both catalogued as Chinese
- Plautdietsch is catalogued as Low German
- Farsi is catalogued as Persian

I'd like to move this issue to post-launch, and we can fix these few items manually after ISO 639-3 is implemented. This is a lower priority for now simply because most collections won't contain items in languages missing from ISO 639-2.

zschoenb commented 7 years ago

That sounds fine. Thanks, Sean. I haven't parsed that new language code yet anyway, as per Chris's involvement in the pushmipullyu sprint.

seanluyk commented 5 years ago

Doing some cleanup