internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Recent (non-MARC) imports are adding deprecated language codes (presumably via language name lookups, not just old codes in the import data) #9504

Open hornc opened 4 months ago

hornc commented 4 months ago

Problem

https://openlibrary.org/books/OL51818714M/Yederasiw_Mastawesha

is a recently imported item that picked up the deprecated Ethiopian language code (the metadata has since been updated). It looks like the lookups that convert a language name to a code use a list that still contains deprecated duplicates, so the resulting code may be the deprecated one (probably arbitrarily, depending on which entry is listed first).

How to fix: The Name -> code lookup list should only contain current item codes.

This relates to the 'duplicates in the language drop down list' issue that I thought I saw recently, but cannot find it now. The dropdown and import translation list should both only contain current language codes.

Perhaps the language code config should have a deprecated parameter, and these can be excluded as needed.
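
A minimal sketch of that idea, assuming each language entry carries a hypothetical `deprecated` flag. The `Language` class, `build_name_lookup`, and the sample entries are illustrative only and are not taken from the Open Library codebase:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Language:
    """Hypothetical stand-in for a language config entry."""
    key: str                  # e.g. "/languages/gez"
    name: str                 # display name used for name -> code lookups
    deprecated: bool = False  # assumed new flag, not an existing field


def build_name_lookup(languages):
    """Map lower-cased language name -> key, skipping deprecated entries."""
    lookup = {}
    for lang in languages:
        if lang.deprecated:
            continue
        # With deprecated duplicates filtered out, list order no longer
        # decides which code a name resolves to.
        lookup.setdefault(lang.name.lower(), lang.key)
    return lookup


# Illustrative data: the deprecated entry is deliberately listed first.
LANGUAGES = [
    Language("/languages/eth", "Ethiopic", deprecated=True),
    Language("/languages/gez", "Ethiopic"),
    Language("/languages/eng", "English"),
]

NAME_TO_KEY = build_name_lookup(LANGUAGES)
```

The same filtered list could feed both the import translation and the edit dropdown, so the two stay consistent.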

Relates to #9002, in that the example shows that at least BWB-sourced imports are using language lookups.

The specific code to change is: https://github.com/internetarchive/openlibrary/pull/9488/files

mekarpeles commented 4 months ago

@hornc can you propose a priority for this based on your use cases? Is this happening at a large scale (e.g. how many records being affected)? Is this blocking one of our systems/processes? This would help us prioritize accordingly

AbhinavKRN commented 3 months ago

@scottbarnes can you assign this issue to me?

scottbarnes commented 3 months ago

I have assigned this to you, @AbhinavKRN. Please ask any questions if you get stuck anywhere.

AbhinavKRN commented 3 months ago

Sure @scottbarnes on it.

hornc commented 3 months ago

So, I think this is a relatively low priority issue because I have a bot task that runs weekly to correct deprecated language codes to their current codes (if one exists).

To do this properly, we might want to think a bit about what is supposed to happen in the various cases.

What should happen in the following cases:

  1. an import record contains the deprecated /languages/eth code?
  2. an import record contains the deprecated /languages/esk code?

I was hoping someone would find and link the related "duplicate languages in dropdowns" issue, as that has similar requirements for extending the language code model, which I think is necessary to add this functionality.

Optional language fields we might need to add:

  - deprecated: /type/boolean
  - deprecated_note: /type/string (a human-readable description indicating why this code is deprecated and pointing to the preferred alternative, if there is one, e.g. "use a more specific code" (not automatable) or "use a different code")
  - current: /type/language (a current language to use instead, if this code is deprecated and there is an automatic preferred version)

Note: some deprecated codes may not have a clear single value for current
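
To make the two cases concrete, here is a rough sketch of how an importer might use those three fields. Everything here is an assumption for discussion: `LanguageRecord`, `resolve`, and the sample data are hypothetical, and "keep the code but flag it for review" is only one possible answer for codes without a single replacement:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LanguageRecord:
    """Hypothetical /type/language record with the proposed optional fields."""
    key: str
    deprecated: bool = False
    deprecated_note: Optional[str] = None
    current: Optional[str] = None  # key of the replacement language, if any


# Illustrative records only; real values would come from the language data.
RECORDS = {
    "/languages/eth": LanguageRecord(
        "/languages/eth",
        deprecated=True,
        deprecated_note="Use /languages/gez instead.",
        current="/languages/gez",
    ),
    "/languages/esk": LanguageRecord(
        "/languages/esk",
        deprecated=True,
        deprecated_note="Use a more specific code (not automatable).",
    ),
    "/languages/eng": LanguageRecord("/languages/eng"),
}


def resolve(key, records=RECORDS):
    """Return (key_to_store, needs_review) for an incoming language key."""
    rec = records.get(key)
    if rec is None or not rec.deprecated:
        return key, False          # current or unknown: pass through unchanged
    if rec.current is not None:
        return rec.current, False  # case 1: automatic replacement exists
    return key, True               # case 2: no single replacement, flag it


print(resolve("/languages/eth"))
print(resolve("/languages/esk"))
```

Under these assumptions the first call swaps in the replacement key, while the second keeps the deprecated key and marks it for manual review.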

I'm not completely happy with the current terminology, but I can't think of a better term at the moment. Anyone have any ideas for better naming?

hornc commented 3 months ago

I think #8145 was perhaps the issue I remember, which touches on duplicate names. Is there a clearer one?

hornc commented 3 months ago

@cdrini, having #8160 merged would bring us up to date with some of the previous language code issues that have already been raised, discussed, and addressed, so we can build on them here. Is there something blocking the merge of #8160?

scottbarnes commented 3 months ago

@hornc, I had hoped we could discuss this during the Monday ABC call, but somehow it was missed during triage. I added this to the agenda for the coming week.

cdrini commented 3 months ago

Howdy! Stumbled on this thanks to @RayBB; taking a look at #8160.

cdrini commented 3 months ago

@hornc merged! Although I will note I'm not too sure why #8160 would help with deprecated languages 🤔 But leaving that up to you!

hornc commented 2 months ago

I just found this code that already translates deprecated language codes: https://github.com/internetarchive/openlibrary/blob/447142086b90648207a558a3b0ed495acb6f168d/openlibrary/catalog/marc/parse.py#L288-L317

I had been thinking this (and the related removal of deprecated language codes from the edition edit dropdown) required an update to the /type/language model. It looks like this could be fixed in code using the existing method.

hornc commented 2 months ago

It looks like MARC imports use the hardcoded deprecated language code tables in openlibrary/openlibrary/catalog/marc/parse.py, but imports from other sources do not.

#9809 is an attempt to consolidate the deprecations into the language code type, so there should be an opportunity to consolidate the imports, and perhaps remove the special-case translations?
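
As a sketch of what a consolidated path could look like, here is a single normalization step that any importer (MARC or otherwise) could call. This is not the existing `parse.py` code; `DEPRECATED_TO_CURRENT`, `NAME_TO_CODE`, and both functions are hypothetical names, and the tables hold illustrative sample entries only:

```python
# Hypothetical shared tables; in a consolidated design these would be
# derived from the language records rather than hardcoded per importer.
DEPRECATED_TO_CURRENT = {"eth": "gez"}
NAME_TO_CODE = {"ethiopic": "gez", "english": "eng"}


def normalize_language(value):
    """Turn a language name or code into a current code where possible."""
    cleaned = value.strip().lower()
    # Names are translated to codes; anything else is treated as a code.
    code = NAME_TO_CODE.get(cleaned, cleaned)
    # Deprecated codes with a known replacement are swapped for it.
    return DEPRECATED_TO_CURRENT.get(code, code)


def normalize_record_languages(record):
    """Return a copy of an import record with normalized, de-duplicated languages."""
    codes = [normalize_language(v) for v in record.get("languages", [])]
    # dict.fromkeys removes duplicates while preserving first-seen order.
    return {**record, "languages": list(dict.fromkeys(codes))}


record = {"title": "Example", "languages": ["eth", "Ethiopic", "eng"]}
print(normalize_record_languages(record)["languages"])
```

Because every source goes through the same function, a deprecated code and a language name that map to the same current code collapse into one entry.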

scottbarnes commented 2 months ago

@AbhinavKRN, are you still interested in working on this issue? If not I will open it back for others who may wish to work on it.