Recognising proto-languages

parker57 commented 6 years ago

Words are given with ISO 639-3 codes (additionally, there are some ISO 639-2 codes prefixed with "p_" to indicate proto-languages).

From Wikipedia for ISO 639-3:

It [ISO 639-3] provides an enumeration of languages as complete as possible, including living and extinct, ancient and constructed, major and minor, written and unwritten. However, it does not include reconstructed languages such as Proto-Indo-European.

The etywn-relety.json contains the following proto-language references:

124 instances of 'p_sla', from ISO 639-2 Proto-Slavic
13 instances of 'p_gem', from ISO 639-2 Proto-Germanic
6 instances of 'p_ine', from ISO 639-2 Proto-Indo-European
3 instances of 'p_gmw', not in ISO 639-2 but seems to be Proto-West-Germanic (it only points to the word "iuwiz")
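For anyone wanting to reproduce a tally like this, here is a minimal sketch. It assumes the language codes have already been pulled out of the data file into a flat list; `count_proto_codes` and the sample list are hypothetical and not part of ety.

```python
from collections import Counter

PROTO_PREFIX = "p_"  # prefix the dataset uses to mark proto-languages


def count_proto_codes(codes):
    """Count how often each 'p_'-prefixed code appears in an iterable of codes."""
    return Counter(code for code in codes if code.startswith(PROTO_PREFIX))


# Hypothetical sample; the real codes would come from the relations data.
sample = ["eng", "p_sla", "rus", "p_gem", "p_sla", "p_ine"]
proto_counts = count_proto_codes(sample)
print(proto_counts.most_common())
```

Ordinary ISO 639-3 codes are ignored, so the result only contains the codes that need extra entries in the language data.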

It's probably best to just add the relevant JSON and document accordingly. For instance, 'p_sla' could be:

  {
    "name": "Proto-Slavic",
    "type": "extinct",
    "scope": "individual",
    "iso6393": "p_sla",
    "iso6392B": null,
    "iso6392T": null,
    "iso6391": null
  }

no idea what to put for scope tbh
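If we go this way, the extra records could be generated rather than written by hand. This is a rough sketch, not a settled design: `make_proto_entry` is a hypothetical helper, and reusing the "iso6393" key for a non-ISO code and defaulting "scope" to "individual" are assumptions. 'p_gmw' is left out because its name is still a guess.

```python
import json


def make_proto_entry(code, name, scope="individual"):
    """Build a language record for a reconstructed language.

    Field names mirror the Proto-Slavic example record. The "scope"
    default is a placeholder, since the right value is still unclear.
    """
    return {
        "name": name,
        "type": "extinct",
        "scope": scope,
        "iso6393": code,  # assumption: reuse this key for the 'p_' code
        "iso6392B": None,
        "iso6392T": None,
        "iso6391": None,
    }


proto_names = {
    "p_sla": "Proto-Slavic",
    "p_gem": "Proto-Germanic",
    "p_ine": "Proto-Indo-European",
}
proto_entries = [make_proto_entry(code, name) for code, name in proto_names.items()]

# json.dumps turns None into null and always emits double-quoted strings.
print(json.dumps(proto_entries, indent=2))
```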

jmsv commented 6 years ago

We should probably strip values we're not using from the language json file and just keep keys called name and iso or something, which makes scope value irrelevant
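Roughly what I have in mind, as a sketch only: `strip_language_entry` is a hypothetical helper, the input key names are taken from the example record earlier in this thread, and "iso" is just a placeholder name for the output key.

```python
def strip_language_entry(entry):
    """Reduce a full language record to a name and a single ISO code."""
    return {"name": entry["name"], "iso": entry["iso6393"]}


# Example record in the same shape as the current language data.
full_entry = {
    "name": "English",
    "type": "living",
    "scope": "individual",
    "iso6393": "eng",
    "iso6392B": "eng",
    "iso6392T": "eng",
    "iso6391": "en",
}
stripped = strip_language_entry(full_entry)
print(stripped)
```

With only these two keys left, a proto-language entry needs nothing more than its name and its 'p_' code.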

alxwrd commented 6 years ago

I think it might be useful to keep the extra information and use it to extend the Language class. E.g. Language("eng").type
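A minimal sketch of what that could look like. This is not the current Language class in ety: the `LANGUAGES` dict is hypothetical stand-in data, and the attribute names simply follow the keys in the JSON example earlier in the thread.

```python
# Hypothetical in-memory data; the real class would load the language JSON.
LANGUAGES = {
    "eng": {"name": "English", "type": "living", "scope": "individual"},
    "p_sla": {"name": "Proto-Slavic", "type": "extinct", "scope": "individual"},
}


class Language:
    """Language looked up by code, exposing the extra fields as attributes."""

    def __init__(self, iso):
        try:
            record = LANGUAGES[iso]
        except KeyError:
            raise ValueError(f"Unknown language code: {iso!r}") from None
        self.iso = iso
        self.name = record["name"]
        self.type = record["type"]
        self.scope = record["scope"]

    def __repr__(self):
        return f"Language(iso={self.iso!r}, name={self.name!r})"


print(Language("eng").type)
print(Language("p_sla"))
```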

parker57 commented 6 years ago

I don't think there is much point keeping the other ISO values, but I am a bit upset we don't have language family. That would be neat for analysis and even presentation.

This guy might have better JSON, but it tragically seems to stop short of 639-3 (the entries only carry 639-1 and 639-2 codes):

  "bo": {
    "639-1": "bo",
    "639-2": "bod",
    "639-2/B": "tib",
    "family": "Sino-Tibetan",
    "name": "Tibetan Standard, Tibetan, Central",
    "nativeName": "བོད་ཡིག",
    "wikiUrl": "https://en.wikipedia.org/wiki/Standard_Tibetan"
  },
  ...
  "ru": {
    "639-1": "ru",
    "639-2": "rus",
    "family": "Indo-European",
    "name": "Russian",
    "nativeName": "Русский",
    "wikiUrl": "https://en.wikipedia.org/wiki/Russian_language"
  },

jmsv commented 6 years ago

I'm going to strip the unused keys for now, then we can add back other keys (or switch to a different dataset) when we want to expand the language class

alxwrd commented 6 years ago

noumar/iso639 looks like a good alternative to keeping the data in this project.

jmsv commented 6 years ago

Going to close this because the issue is solved - feel free to open a new issue for changing where we source iso639 codes if that's a good idea

jmsv / ety-python

Recognising proto-languages #32