internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Improve language nomenclature #8127

Open stopregionblocking opened 1 year ago

stopregionblocking commented 1 year ago

Many languages are described with bad names (for example, names rarely or never used by their speakers or by linguists) or with no names at all (for example, "bucket" categories covering entire language groups). This could be addressed much like bad or missing book metadata is addressed.

Describe the problem that you'd like solved

The naming/identification system for languages could be improved.

Proposal & Constraints

Proposal: make it possible for language names to be editable. As an underlying database, Ethnologue (from recent years) and Glottolog are probably much better than the Library of Congress, though certainly not perfect.

Constraints: Maybe the ability to edit names should be limited to a select group of librarians. It may or may not be beneficial to require discussion before names are changed.

Additional context

Here are some specific languages which 1) have works in the language on OpenLibrary but 2) do not exist in OL's language database & should therefore be added:

tfmorris commented 1 year ago

It would be helpful if you were specific about which language labels you think are wrong. That would allow people to check and fix them.

The vast majority of language metadata comes from library catalogers who follow a set of guidelines in their cataloging. This has expanded from solely using MARC language codes to include ISO 639-3 as well, but OpenLibrary's current internal language codes are the MARC codes, e.g. fre instead of the ISO 639-2 fra for French.
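The fre/fra split above comes from ISO 639-2 having two code lists, bibliographic (B, which MARC follows) and terminological (T), which differ for a couple of dozen languages. A minimal sketch of normalizing one to the other (the mapping table here is a hand-picked subset for illustration, not OpenLibrary data):

```python
# Hypothetical illustration: MARC language codes match the ISO 639-2/B
# (bibliographic) list, which differs from the ISO 639-2/T (terminological)
# list for roughly twenty languages. A hand-picked subset:
MARC_TO_ISO6392T = {
    "fre": "fra",  # French
    "ger": "deu",  # German
    "dut": "nld",  # Dutch
    "chi": "zho",  # Chinese
    "gre": "ell",  # Modern Greek
}

def to_iso639_2t(marc_code: str) -> str:
    """Return the ISO 639-2/T code; most codes are identical in both lists."""
    return MARC_TO_ISO6392T.get(marc_code, marc_code)

print(to_iso639_2t("fre"))  # -> fra
print(to_iso639_2t("eng"))  # -> eng (same in both lists)
```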

The macrolanguages (what you called bucket categories) are a tradeoff to keep cataloging expense and effort manageable. Certainly OpenLibrary could invest in more fine-grained categorization, but it won't get that for free from the library cataloging community the way it does the current language metadata.

I think there are a few concrete steps that can be taken to improve the current system (which could probably be broken out into separate tickets):

  1. Verify that all English, French, & German language labels match the current Library of Congress standards for MARC languages. There have been a number of updates over the years and not all of them have been applied to OpenLibrary. For example, him should now be "Western Pahari languages" in English rather than Himachali. [Note: this may be redundant if Wikidata names are used preferentially]

  2. Apply updates which were made in the 2007 edition as well as those made since then. For example, all records which are coded with scr, the previous code for Croatian, should be migrated to hrv. OpenLibrary currently has a mix of both language codes (and no translated labels for the scr code). I've created #8139 for this.
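A migration like the scr → hrv one in step 2 could be sketched roughly as follows. The record shape and helper are hypothetical, though OpenLibrary does reference languages by keys of the form `/languages/<code>`:

```python
# Hypothetical sketch of migrating deprecated MARC language codes.
# scr -> hrv (Croatian) and scc -> srp (Serbian) were both retired by
# the Library of Congress in 2008.
DEPRECATED = {"scr": "hrv", "scc": "srp"}

def migrate_languages(record: dict) -> dict:
    """Rewrite any deprecated /languages/<code> keys on an edition record."""
    migrated = []
    for lang in record.get("languages", []):
        code = lang["key"].rsplit("/", 1)[-1]
        migrated.append({"key": f"/languages/{DEPRECATED.get(code, code)}"})
    record["languages"] = migrated
    return record

edition = {"languages": [{"key": "/languages/scr"}]}
print(migrate_languages(edition))  # -> {'languages': [{'key': '/languages/hrv'}]}
```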

  3. Use Wikidata labels to fill in gaps in translations for currently supported language codes. (This is partially done, but I created #8138 to cover the missing codes.) These can be looked up via their P4801 identifiers with a SPARQL query like this:

    SELECT DISTINCT ?item ?label (LANG(?label) AS ?label_lang) WHERE {
      {
        SELECT DISTINCT ?item ?label WHERE {
          ?item p:P4801 ?statement0.
          ?statement0 ps:P4801 "languages/him".
          ?item rdfs:label ?label.
        }
        LIMIT 10
      }
    }
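The query above can be parameterized for any MARC code; a small sketch (the helper name is invented, P4801 is the Wikidata property quoted above):

```python
# Build the P4801 label-lookup query for an arbitrary MARC language code.
# P4801 values are Library of Congress vocabulary IDs like "languages/him".
def build_label_query(marc_code: str, limit: int = 10) -> str:
    return f"""\
SELECT DISTINCT ?item ?label (LANG(?label) AS ?label_lang) WHERE {{
  {{
    SELECT DISTINCT ?item ?label WHERE {{
      ?item p:P4801 ?statement0.
      ?statement0 ps:P4801 "languages/{marc_code}".
      ?item rdfs:label ?label.
    }}
    LIMIT {limit}
  }}
}}"""

print(build_label_query("him"))
```

The resulting string can be POSTed to Wikidata's public SPARQL endpoint to retrieve every available label and its language tag.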
  4. A bigger step would be to add support for ISO 639-3 codes internally and in the MARC import pipeline so that ISO 639-3 metadata outlined in the cataloging guidelines doesn't get dropped and OpenLibrarians have a wider range of codes (~8,000) to choose from.
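Step 4 would mean reading ISO 639-3 codes from the MARC 041 field during import rather than only the single 008/35-37 code. A hedged sketch, with the input shape invented for illustration (per MARC 21, a second indicator of 7 means the code source is named in subfield $2, which is how the PCC guidelines record ISO 639-3 codes):

```python
# Hypothetical sketch: pull ISO 639-3 codes out of a MARC 041 field.
# Field shape is invented for illustration; real import code would read
# indicators and subfields from the parsed MARC record.
def extract_iso639_3(field_041: dict) -> list[str]:
    """Return $a codes only when $2 declares the source as iso639-3."""
    if field_041.get("ind2") == "7" and "iso639-3" in field_041.get("2", []):
        return field_041.get("a", [])
    return []

field = {"ind2": "7", "2": ["iso639-3"], "a": ["teo"]}
print(extract_iso639_3(field))  # -> ['teo']
```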

I think the return on investment for switching to something that's not supported by the library community is likely to be pretty low. Sticking with established standards, imperfect as they are, is likely to have much higher leverage and synergy.

stopregionblocking commented 1 year ago

2 & 3 seem like straightforward steps toward improving the data; thanks for opening the relevant issues.

An example with the bucket category issue: OL2616885W has been published in (at least) 5 languages. 3 of these translations have the language given as "Nilo-Saharan (Other)", & OL therefore indicates the work as having been published in 3 languages. However, each "Nilo-Saharan (Other)" edition is in a different language (& accordingly has a different title):

OL30494270M is in Ateso (teo). OL44884319M is in Dhopadhola (adh). OL30494300M is in Lëblaŋo (laj).
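The undercounting can be seen in one line, using the codes from the editions above:

```python
# The three editions above, as coded today vs. coded specifically.
coded_today = ["ssa", "ssa", "ssa"]          # all "Nilo-Saharan (Other)"
coded_specifically = ["teo", "adh", "laj"]   # Ateso, Dhopadhola, Lëblaŋo

print(len(set(coded_today)))         # -> 1 distinct "language"
print(len(set(coded_specifically)))  # -> 3 distinct languages
```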

Trying to input these language names to update the records doesn't produce any dropdown menu suggestions, which I understood to mean the options were not available.

ISO 639-3 & Ethnologue both recognize the above languages. The PCC guidelines you linked are based on ISO 639-3, & they state that "When it is possible to identify a language within a macrolanguage and locate a matching ISO 639-3 code, prefer the specific code over the macrolanguage one." It seems clear that the LoC standards based on ISO 639-2 are central to the problem, rather than it being a question of managing cataloging expenses or effort.

tfmorris commented 1 year ago

The PCC guidelines you linked are based on ISO 639-3, & they state

This new version of the PCC Guidelines was approved less than a year ago, most likely isn't widely adopted, and, more importantly, does NOT mandate ISO 639-3 cataloging. The introduction includes the following:

In order to ensure that PCC records are usable in all systems, the guidelines that follow instruct PCC catalogers working in shared cataloging environments to encode all languages using MARC language codes and, optionally, also using ISO 639-3 language codes. [emphasis added]

There is likely very little ISO 639-3 coding in MARC records today, but OpenLibrary doesn't support it anyway. That's my item 4 (which I just numbered). It will require a bigger engineering investment than the other data cleanups that I created tickets for. The list of languages supported by OpenLibrary today is here: https://www.loc.gov/marc/languages/language_name.html

If you want to see how your examples were cataloged originally, you can look at the MARC records linked at the bottom of their edition pages. The first one was imported from a record provided by Columbia University. The Library of Congress record is here and the WorldCat record is here. In all cases the language is coded as ssa, although there is a textual note saying "In Teso".

It seems clear that the LoC standards based on ISO 639-2 are central to the problem, rather than it being a question of managing cataloging expenses or effort.

I guess we'll just have to disagree on that point, but, just to be clear, I wasn't talking about OpenLibrary's cataloging effort, which is minuscule by comparison, but about the global effort of all cataloging librarians worldwide. They balance investment in cataloging against all the other responsibilities they have and their available budgets.

stopregionblocking commented 1 year ago

In all cases the language is coded as ssa

Then OL is replicating bad cataloging practices from other libraries, as I mentioned when I initially described this in Slack. I don't think that is good, necessary, or inevitable. I'm flexible on the technical details as long as the proposals aren't just "leave it that way"; it seems to me that 4 addresses this, but as you said, it wasn't there when I replied.

tfmorris commented 1 year ago

it seems to me like 4 addresses this, but as you said it wasn't there when I replied.

Not true. It absolutely was there from my very first reply. All I did was add a number to it to make it easier to reference. Feel free to check the edit history.

stopregionblocking commented 1 year ago

Sure, let's amend that to "I missed it as being one of the practical proposals, possibly because it wasn't numbered".