internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Recent imports are adding deprecated language codes (presumably via language name lookups, not just old codes in the import data) #9504

Open hornc opened 3 weeks ago

hornc commented 3 weeks ago

Problem

https://openlibrary.org/books/OL51818714M/Yederasiw_Mastawesha

is a recently imported item that picked up the deprecated Ethiopian language code (the metadata has since been updated). It looks like the lookup that converts a language name to a code uses a list that includes deprecated duplicates, so the resulting code may be the deprecated one (probably arbitrarily, depending on which is listed first?).

How to fix: The Name -> code lookup list should only contain current item codes.

This relates to the 'duplicates in the language drop down list' issue that I thought I saw recently but cannot find now. The dropdown and the import translation list should both contain only current language codes.

Perhaps the language code config should have a deprecated parameter, and these can be excluded as needed.
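
A minimal sketch of that idea, assuming a config where each language entry carries a `deprecated` flag. The `Language` class, `LANGUAGES` list, `build_name_to_code`, and `code_for_name` are hypothetical names for illustration, not existing Open Library code, and the sample entries are illustrative rather than taken from the real language data:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Language:
    """Hypothetical config entry; `deprecated` marks retired codes."""
    code: str
    name: str
    deprecated: bool = False


# Illustrative sample data: two entries share a name, one is deprecated.
LANGUAGES = [
    Language("eth", "Ethiopic", deprecated=True),
    Language("gez", "Ethiopic"),
    Language("eng", "English"),
]


def build_name_to_code(languages):
    """Build a name -> code lookup that skips deprecated entries."""
    lookup = {}
    for lang in languages:
        if lang.deprecated:
            continue
        # setdefault keeps the first current code if names still collide.
        lookup.setdefault(lang.name.casefold(), lang.code)
    return lookup


NAME_TO_CODE = build_name_to_code(LANGUAGES)


def code_for_name(name):
    """Return the current code for a language name, or None if unknown."""
    return NAME_TO_CODE.get(name.strip().casefold())
```

With the flag in place, the result no longer depends on list order: the deprecated entry is listed first here, yet the lookup only ever sees the current code. The same filter could feed the dropdown list.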

Relates to #9002 in that the example shows that at least BWB-sourced imports are using language lookups.

The specific code to change is: https://github.com/internetarchive/openlibrary/pull/9488/files

mekarpeles commented 3 weeks ago

@hornc can you propose a priority for this based on your use cases? Is this happening at a large scale (e.g. how many records are affected)? Is this blocking one of our systems/processes? This would help us prioritize accordingly.