internetarchive / fatcat

Perpetual Access To The Scholarly Record
https://guide.fatcat.wiki

language on releases #96

Open HughP opened 2 years ago

HughP commented 2 years ago

Greetings

Currently, the documentation states the following about the language of a release:

language (string, slug): the primary language used in this particular release of the work. Only a single language can be specified; additional languages can be stored in "extra" metadata (TODO: which field?). This field should be a valid RFC1766/ISO639 language code (two letters). AKA, a controlled vocabulary, not a free-form name of the language.
  1. For ISO 639, if you want two-letter codes, ISO 639-1 should be specified. ISO 639 has six parts; the two-letter codes make up part one.

  2. Referencing RFC 1766 is outdated. RFC 1766 was obsoleted by RFC 3066, which was obsoleted by RFC 4646 and RFC 4647; RFC 4646 was in turn obsoleted by RFC 5646. The stable way to reference this chain is to reference BCP-47.

  3. Is there a downstream technical reason to limit this field to two characters instead of supporting BCP-47?

bnewbold commented 2 years ago

For 1 and 2, do you want to send a PR with preferred language? Or I can write something.

For 3, I didn't research this decision particularly deeply. One of the goals for this field was to be able to collect metadata from multiple sources (aka, other catalogs) and have it in a consistent format, even if that results in discarding some information from some sources. Another was to be able to aggregate (analytics) and query (search filters) simply. It would probably be possible to have more general-purpose fields, and then synthesize them into, e.g., an ISO 639-1 field for querying and analytics. These were different design priorities compared to a more authoritative/complete system like Wikidata or MARC, which are flexible enough to capture as much information as possible about each individual work.
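As a rough illustration of the "general-purpose field, synthesized down to ISO 639-1" idea, here is a minimal Python sketch. It is not fatcat code: the function names and the tiny `ISO639_3_TO_1` table are hypothetical, and a real mapping would have to come from the ISO 639 registry. It also does not check that a two-letter subtag is actually an assigned code.

```python
from typing import Optional

# Hypothetical, deliberately tiny mapping from three-letter ISO 639-2/3 codes
# to two-letter ISO 639-1 codes. A real table would be much larger.
ISO639_3_TO_1 = {"eng": "en", "fra": "fr", "deu": "de", "spa": "es"}


def primary_subtag(tag: str) -> str:
    """Return the lowercased primary language subtag of a BCP 47 style tag."""
    return tag.split("-")[0].lower()


def derive_iso639_1(tag: str) -> Optional[str]:
    """Derive a two-letter code for querying, or None if there is none."""
    primary = primary_subtag(tag)
    # Two ASCII letters: treat as an ISO 639-1 code (not checked against a registry).
    if len(primary) == 2 and primary.isascii() and primary.isalpha():
        return primary
    # Otherwise fall back to the (hypothetical) three-letter lookup table.
    return ISO639_3_TO_1.get(primary)
```

With this shape, a tag such as `zh-Hant-TW` would be stored in full and reduced to `zh` for search filters, while a language with no two-letter code (for example `haw`, Hawaiian) would simply have no derived value.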

HughP commented 2 years ago

BCP-47 allows ISO 639-1 codes to be used for the languages that have one, and ISO 639-2 or ISO 639-3 codes for languages outside that scope. However, if the database field expects only two characters, then a three-character ISO 639-2 or 639-3 code will throw an error for unexpected length. So issue number 3 is important, as it has implications for the design requirements of the infrastructure.
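To make the length problem concrete, here is a small Python sketch. The validator names are hypothetical and this is not how fatcat actually validates the field; it only assumes a check that accepts exactly two lowercase letters, and contrasts it with one that also accepts three-letter primary subtags.

```python
import re

# Assumed strict rule: exactly two lowercase ASCII letters (ISO 639-1 shape).
TWO_LETTER = re.compile(r"[a-z]{2}")
# Relaxed rule: two or three lowercase ASCII letters (ISO 639-1/2/3 shapes).
TWO_OR_THREE_LETTER = re.compile(r"[a-z]{2,3}")


def is_valid_strict(code: str) -> bool:
    """Accept only two-letter codes."""
    return TWO_LETTER.fullmatch(code) is not None


def is_valid_relaxed(code: str) -> bool:
    """Accept two- or three-letter codes (shape only, no registry lookup)."""
    return TWO_OR_THREE_LETTER.fullmatch(code) is not None
```

Under the strict rule, `haw` (Hawaiian, which has no two-letter code) is rejected, while the relaxed rule accepts it. Neither check handles full BCP-47 tags with script or region subtags.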

HughP commented 2 years ago

I'm happy to video chat for further clarification.