internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

When importing non-MARC records, look up required /type/language code by language name #2435

Open hornc opened 4 years ago

hornc commented 4 years ago

Is your feature request related to a problem? Please describe.

In build_query(rec) in https://github.com/internetarchive/openlibrary/blob/master/openlibrary/catalog/add_book/load_book.py, languages are expected to be three-letter codes. ~ISO_639-3_language_codes~ Correction: these are MARC21 language codes ( https://www.loc.gov/marc/languages/language_code.html ), which are similar to the ISO standard but do differ from it.

There should be a facility to look up the code from the language name by querying https://openlibrary.org/languages.

Specifically https://github.com/internetarchive/openlibrary/blob/f30611af14d5acc48e19cb216bbfafac37ec4ce4/openlibrary/core/vendors.py#L167-L177

gets the language as a name rather than a code

By default, MARC records use the 3 character codes already: https://www.loc.gov/marc/bibliographic/bd041.html

It would be nice if the import system were flexible enough to support both methods and could convert one to the other using the existing language types we store.
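A minimal sketch of what accepting both forms could look like. The names SAMPLE_LANGUAGES and language_key are hypothetical and are not part of the existing import code, and the hard-coded table is only an illustrative subset; a real version would presumably load the stored /type/language records instead.

```python
# Illustrative subset of MARC21 code -> English name.
# Assumption: a real implementation would load the full set of
# /type/language records rather than hard-code them.
SAMPLE_LANGUAGES = {
    "eng": "English",
    "fre": "French",
    "ger": "German",
    "spa": "Spanish",
}

# Reverse index: case-folded language name -> MARC21 code.
_NAME_TO_CODE = {name.casefold(): code for code, name in SAMPLE_LANGUAGES.items()}


def language_key(value):
    """Return '/languages/<code>' for a MARC code or a language name.

    Returns None when the value matches neither a known code nor a known name.
    (Hypothetical helper, not an existing Open Library function.)
    """
    token = value.strip().casefold()
    if token in SAMPLE_LANGUAGES:
        # Already a known three-letter code.
        return f"/languages/{token}"
    code = _NAME_TO_CODE.get(token)
    return f"/languages/{code}" if code else None
```

With this, a vendor record carrying "French" and a MARC record carrying "fre" would both resolve to /languages/fre, and an unrecognised value returns None so the caller can decide whether to skip the field or reject the record.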

Describe the solution you'd like

Proposal & Constraints

Additional context

cdrini commented 4 years ago

Oh! I just had to do a bunch of language code nonsense for bookreader :P https://github.com/internetarchive/bookreader/pull/150

I think it needs to be ISO 639-2/B (see https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes ). Proof: https://openlibrary.org/languages/fre (the ISO 639-3 code would be fra). Although, looking at the list of MARC languages you posted, it looks like they follow their own standard 😅

tfmorris commented 4 years ago

MARC languages are definitely different. They are defined here: https://www.loc.gov/marc/languages/

Regarding the relationship between the two, they say:

RELATIONSHIP TO ISO 639-2

ISO 639-2 (Codes for the representation of names of languages-- Part 2: alpha-3 code) was based on the MARC Code List for Languages and published in 1998. In the 22 cases where the ISO 639-2 list has two alternative codes, the bibliographic code is the same as the MARC code. Language names in ISO 639-2 are not necessarily the same as those in MARC, particularly because of the practice of correlating the MARC language names with those used in Library of Congress Subject Headings. The MARC list includes references for unused forms of language names, while the ISO list has in some cases included alternative name forms, but many are lacking, since this practice of supplying alternate forms has only recently been implemented. In addition the MARC documentation includes a list of individual languages under collective codes or language groups, while the ISO list only includes the group codes themselves. The Library of Congress is maintenance agency for both lists, and the two are kept compatible in terms of code additions and deletions.

tfmorris commented 4 years ago

The edition edit form already knows how to autocomplete languages and convert them to their associated codes. Have you looked at whatever API powers that? It seems like it should be possible to reuse it.

Note also that the codes have changed over time, so we probably also need to handle historical codes that were in use at the time the catalog record was created.
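One way to handle historical codes would be a small alias table applied before the lookup. This is only a sketch: OBSOLETE_TO_CURRENT and current_marc_code are hypothetical names, and the three entries are examples of discontinued codes as I understand them, so they should be checked against the LoC code list before being relied on.

```python
# Discontinued MARC language code -> code currently used in its place.
# Assumption: illustrative entries only; verify each one against
# https://www.loc.gov/marc/languages/ before use.
OBSOLETE_TO_CURRENT = {
    "scr": "hrv",  # Croatian
    "scc": "srp",  # Serbian
    "mol": "rum",  # Moldavian, now coded as Romanian
}


def current_marc_code(code):
    """Map a possibly discontinued MARC code to its current equivalent.

    Codes that are not in the alias table are returned unchanged
    (lower-cased and stripped). Hypothetical helper.
    """
    code = code.strip().lower()
    return OBSOLETE_TO_CURRENT.get(code, code)
```

Running incoming codes through a step like this first would let older records resolve to the same language record as newer ones.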

hornc commented 4 years ago

I keep being reminded of this. It's probably linked to from the other URLs above too, but here is the list of codes: http://www.loc.gov/marc/languages/language_code.html

tfmorris commented 4 years ago

The code list on this page has both the /B and /T forms, so it is helpful for crosswalking the MARC or ISO 639-2/B fre to the ISO 639-2/T fra form, which most other code systems use. https://www.loc.gov/standards/iso639-2/php/code_list.php

Apparently the first is derived from the name of the language in English, while the second is derived from the name of the language in the language itself (i.e. French vs. français).
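A rough sketch of that crosswalk, assuming a table built from the code list linked above. B_TO_T holds only a handful of the pairs where the two forms differ, and the function names are made up for illustration.

```python
# ISO 639-2/B (bibliographic, same as MARC) -> ISO 639-2/T (terminologic).
# Assumption: partial table; only languages whose two forms differ need
# an entry, and the full list is on the LoC ISO 639-2 page.
B_TO_T = {
    "fre": "fra",  # French
    "ger": "deu",  # German
    "dut": "nld",  # Dutch
    "chi": "zho",  # Chinese
    "cze": "ces",  # Czech
    "gre": "ell",  # Greek, Modern
}

# Inverse mapping for converting incoming /T codes to the /B form.
T_TO_B = {t: b for b, t in B_TO_T.items()}


def to_terminologic(code):
    """Convert a /B code to /T; codes with a single form pass through."""
    code = code.strip().lower()
    return B_TO_T.get(code, code)


def to_bibliographic(code):
    """Convert a /T code to /B; codes with a single form pass through."""
    code = code.strip().lower()
    return T_TO_B.get(code, code)
```

Since most languages have a single three-letter form, passing unknown codes through unchanged keeps the table small; to_bibliographic("fra") gives "fre", which is the form https://openlibrary.org/languages/fre uses.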