internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Migrate 60K+ records from 26 obsolete language codes #8139

Closed tfmorris closed 1 month ago

tfmorris commented 11 months ago

More than 60,000 records use obsolete language codes. This makes them hard to search for and leaves the languages without translated labels. These records should be updated to the current codes in the Library of Congress MARC standard. That would make it possible to find all editions in a given language with a single search instead of two, and would provide translated labels for the languages.

The MARC importer should also have its mapping table updated to make sure no new records get imported with deprecated language codes, but I'll create a separate ticket for that.

Evidence / Screenshot (if possible)

| Old Code | Correct Code | Count | Name |
|---|---|---|---|
| scr | hrv | 27816 | Croatian |
| scc | srp | 18834 | Serbian |
| iri | gle | 8098 | Irish |
| snh | sin | 4104 | Sinhalese |
| gag | glg | 2615 | Galician |
| gae | gla | 2013 | Scottish Gaelic |
| mol | rum | 901 | Romanian |
| esp | epo | 844 | Esperanto |
| tag | tgl | 822 | Tagalog |
| far | fao | 480 | Faroese |
| tar | tat | 449 | Tatar |
| fri | fry | 395 | Frisian |
| mla | mlg | 325 | Malagasy |
| sho | sna | 285 | Shona |
| lan | oci | 262 | Occitan (post 1500) |
| sso | sot | 233 | Sotho |
| tsw | tsn | 164 | Tswana |
| eth | gez | 111 | Ethiopic |
| cam | khm | 94 | Khmer |
| gal | orm | 84 | Oromo |
| gua | grn | 65 | Guarani |
| taj | tgk | 62 | Tajik |
| swz | ssw | 59 | Swazi |
| lap | smi | 56 | Sami |
| int | ina | 29 | Interlingua (International Auxiliary Language Association) |
| sao | smo | 12 | Samoan |
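
For anyone picking this up, the translation table above can be applied mechanically. The following is only a rough sketch, not the script that was actually run: it assumes each edition stores its languages as a list of `{"key": "/languages/<code>"}` references, and `migrate_languages` is a hypothetical helper name. Loading and saving records is left out. Tagalog is mapped to `tgl`, the current MARC code.

```python
# Obsolete MARC language code -> current code, from the table in this issue.
OBSOLETE_TO_CURRENT = {
    "scr": "hrv", "scc": "srp", "iri": "gle", "snh": "sin", "gag": "glg",
    "gae": "gla", "mol": "rum", "esp": "epo", "tag": "tgl", "far": "fao",
    "tar": "tat", "fri": "fry", "mla": "mlg", "sho": "sna", "lan": "oci",
    "sso": "sot", "tsw": "tsn", "eth": "gez", "cam": "khm", "gal": "orm",
    "gua": "grn", "taj": "tgk", "swz": "ssw", "lap": "smi", "int": "ina",
    "sao": "smo",
}


def migrate_languages(edition: dict) -> bool:
    """Rewrite obsolete language keys in place; return True if anything changed.

    Assumption: languages look like [{"key": "/languages/<code>"}, ...].
    """
    changed = False
    migrated = []
    for ref in edition.get("languages", []):
        code = ref["key"].rsplit("/", 1)[-1]
        new_key = "/languages/" + OBSOLETE_TO_CURRENT.get(code, code)
        if new_key != ref["key"]:
            changed = True
        # Drop duplicates that appear when a record carried both old and new codes.
        if {"key": new_key} not in migrated:
            migrated.append({"key": new_key})
    if changed:
        edition["languages"] = migrated
    return changed
```

Returning a flag lets a batch job save only the records that were actually modified. `esk` is deliberately absent from the mapping because it has no one-to-one replacement.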

Note that there are also 217 records with the obsolete code esk for "Eskimo Languages". These will need to be recataloged with the correct individual languages because there is no current equivalent. In many cases it may be possible to infer the correct language from the assigned subjects (e.g. Inuktitut language).
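
One possible way to triage the esk records is to suggest a replacement code from the subjects and leave the final decision to a cataloger. This is a sketch under assumptions: subjects are taken to be a plain list of strings, the subject strings in `SUBJECT_TO_CODE` are illustrative and not a complete or verified list, and `suggest_languages` is a hypothetical helper.

```python
# Illustrative subset only: lowercase subject heading -> MARC language code.
SUBJECT_TO_CODE = {
    "inuktitut language": "iku",
    "inupiaq language": "ipk",
}


def suggest_languages(edition: dict) -> list[str]:
    """Return candidate language codes inferred from an edition's subjects."""
    suggestions = []
    for subject in edition.get("subjects", []):
        code = SUBJECT_TO_CODE.get(subject.strip().lower())
        if code and code not in suggestions:
            suggestions.append(code)
    return suggestions
```

An empty result means the record has no recognizable subject and needs manual review. More than one result means the subjects are ambiguous.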

Stakeholders

@mekarpeles

tfmorris commented 1 month ago

Although 60,000 is a lot in human terms, it's a trivial number for a computer, particularly when a ready-made translation table has been provided. This is probably less than an hour of work for a knowledgeable programmer.

hornc commented 1 month ago

@tfmorris I am running through this now.

hornc commented 1 month ago

I believe this is now complete. Solr seems to still list some books under the old language codes, but the records have been corrected.

tfmorris commented 1 month ago

Thanks @hornc !