glottolog / glottolog

Collaborative data curation for Glottolog
http://glottolog.org

Errors in endangerment status from UNESCO bc iso codes #141

Closed HedvigS closed 6 years ago

HedvigS commented 6 years ago

Some languages are marked wrongly in the import from the UNESCO Atlas of the World's Languages in Danger because of a mix-up with ISO codes. (Rather ironically, for reasons related to something I brought up before in relation to ISO codes and dialects.)

There are dialects of languages that are marked as endangered when the entire language, arguably, is not. This results, for example, in French [fra] being "severely endangered" because the dialects:

  • Burgundian
  • Champenois
  • Franc-Comtois
  • Gallo
  • Guernsey French
  • Lorrain
  • Poitevin-Saintongeais

are marked as severely endangered.

I don't know how many more cases like this there are, but importing from that dataset on ISO 639-3 alone seems to be problematic. I found some others by eyeballing the data when merged with descriptive status:

  • Hungarian
  • Modern Greek
  • Armenian
  • Bulgarian
  • Croatian
  • Estonian
  • Swedish
  • Belarusian

I would suggest either matching by language name instead of ISO code or doing some manual checking of the data.
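To make the name-matching suggestion concrete, here is a minimal sketch of how entries could be flagged for manual checking when the UNESCO entry name differs from the language name behind the ISO code. Everything in it is an assumption for illustration: the record layout, the `iso_names` lookup, the function name `split_by_name_match`, and the "Examplish" entry with the local-use code `qaa` are hypothetical. Only the French dialect names and their status come from this issue.

```python
# Hypothetical records: (UNESCO entry name, ISO 639-3 code, degree of endangerment).
# The two French dialects are from this issue; "Examplish" [qaa] is made up.
unesco_entries = [
    ("Burgundian", "fra", "severely endangered"),
    ("Gallo", "fra", "severely endangered"),
    ("Examplish", "qaa", "vulnerable"),
]

# Hypothetical lookup from ISO 639-3 code to the name of the language as a whole.
iso_names = {"fra": "French", "qaa": "Examplish"}


def split_by_name_match(entries, names):
    """Separate entries whose name equals the language name from the rest.

    Entries in the second list are candidates for manual checking: they may
    describe a dialect rather than the whole language.
    """
    matched, suspect = [], []
    for name, iso, degree in entries:
        reference = names.get(iso)
        if reference is not None and name.casefold() == reference.casefold():
            matched.append((name, iso, degree))
        else:
            suspect.append((name, iso, degree))
    return matched, suspect


matched, suspect = split_by_name_match(unesco_entries, iso_names)
for name, iso, degree in suspect:
    print(f"check manually: {name} [{iso}] ({degree})")
```

Exact name comparison will also flag harmless spelling variants, so the second list is better treated as a review queue than as a list of entries to discard.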

d97hah commented 6 years ago

This is a known issue, and there is no automatable way to know when UNESCO has information on a set of dialects but not on the language as a whole. But this is all superseded now by the agglomerated endangerment added to Glottolog.

2017-11-06 6:51 GMT+01:00 Hedvig Skirgård notifications@github.com:

Some languages are marked wrongly in the import from the UNESCO Atlas of the World's Languages in Danger because of a mix-up with ISO codes. Rather ironically, for reasons related to something I brought up before in relation to ISO codes and dialects.

There are dialects of languages that are marked as endangered, when the entire language, arguably, is not. This results, for example, in French [fra] being "severely endangered" because the dialects:

  • Burgundian
  • Champenois
  • Franc-Comtois
  • Gallo
  • Guernsey French
  • Lorrain
  • Poitevin-Saintongeais

are marked as severely endangered.

I don't know how many more cases like this there are, but importing from that dataset on ISO 639-3 alone seems to be problematic. I found some others by eyeballing the data when merged with descriptive status:

  • Hungarian
  • Modern Greek
  • Armenian
  • Bulgarian
  • Croatian
  • Estonian
  • Swedish
  • Belarusian

I would suggest either matching by language name instead of ISO code or doing some manual checking of the data.


xrotwang commented 6 years ago

see https://github.com/clld/glottolog/pull/140

xflr6 commented 6 years ago

Thanks. There are related cases where dialects with different Degree_of_endangerment values already map to the same ISO code (see [5]): https://nbviewer.jupyter.org/gist/xflr6/11387d5e62022c6734687e873bb349a2

I think the current import resolves this only by the order in the XML file: https://github.com/clld/glottolog3/blob/7b4869b5079a216093531600b9cb4917cac47ba5/glottolog3/scripts/loader/unesco.py#L84-L93

Do we still want to fix these cases (i.e. do not load a value, the most/least critical one, etc.)?
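The options in that question could be sketched roughly as follows. This is not the code in the linked `unesco.py`; the function name `resolve_conflicts`, the `policy` values, and the sample entries with local-use codes `qaa` and `qab` are hypothetical, and the severity ordering is an assumption about how the UNESCO degrees rank from least to most critical.

```python
from collections import defaultdict

# Assumed ordering of UNESCO degrees, from least to most critical.
SEVERITY = [
    "vulnerable",
    "definitely endangered",
    "severely endangered",
    "critically endangered",
    "extinct",
]
RANK = {degree: rank for rank, degree in enumerate(SEVERITY)}


def resolve_conflicts(entries, policy="skip"):
    """Map each ISO code to one degree, independent of input order.

    entries: iterable of (iso_code, degree) pairs.
    policy: "skip" loads no value for codes with conflicting degrees,
            "most" keeps the most critical degree, "least" the least critical.
    """
    if policy not in {"skip", "most", "least"}:
        raise ValueError(f"unknown policy: {policy!r}")

    degrees_by_iso = defaultdict(set)
    for iso, degree in entries:
        degrees_by_iso[iso].add(degree)

    resolved = {}
    for iso, degrees in degrees_by_iso.items():
        if len(degrees) == 1:
            resolved[iso] = next(iter(degrees))
        elif policy == "most":
            resolved[iso] = max(degrees, key=RANK.__getitem__)
        elif policy == "least":
            resolved[iso] = min(degrees, key=RANK.__getitem__)
        # policy == "skip": conflicting codes are left out entirely
    return resolved


# Hypothetical entries: two dialect records disagree for "qaa", two agree for "qab".
entries = [
    ("qaa", "vulnerable"),
    ("qaa", "severely endangered"),
    ("qab", "definitely endangered"),
    ("qab", "definitely endangered"),
]
print(resolve_conflicts(entries, policy="skip"))
print(resolve_conflicts(entries, policy="most"))
```

Whichever policy is chosen, the result no longer depends on the order of records in the XML file.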

xrotwang commented 6 years ago

@xflr6 As @d97hah says, this should be taken care of, once his PR is merged.

xflr6 commented 6 years ago

Thanks (was not completely clear to me that this means we drop the unesco field in the next version).

d97hah commented 6 years ago

IMHO we should drop the unesco field in the next version (but I haven't erased it from the .ini files).
