Closed HedvigS closed 6 years ago
This is a known issue, and there is no automated way to tell when UNESCO has information on a set of dialects but not on the language as a whole. But this is all superseded now by the agglomerated endangerment added to Glottolog.
2017-11-06 6:51 GMT+01:00 Hedvig Skirgård notifications@github.com:
There are examples of languages that are marked wrongly in the import from the UNESCO Atlas of the World's Languages in Danger because of a mix-up with ISO codes. Rather ironically, the cause is related to something I brought up before about ISO codes and dialects.
There are dialects of languages that are marked as endangered, when the entire language, arguably, is not. This results, for example, in French [fra] being "severely endangered" because the dialects:
- Burgundian
- Champenois
- Franc-Comtois
- Gallo
- Guernsey French
- Lorrain
- Poitevin-Saintongeais
are marked as severely endangered.
I don't know how many more cases like this there are, but importing from that dataset on ISO 639-3 codes alone seems to be problematic. I found some others by eyeballing the data when merged with descriptive status:
- Hungarian
- Modern Greek
- Armenian
- Bulgarian
- Croatian
- Estonian
- Swedish
- Belarusian
I would suggest either matching by language name instead of ISO code, or doing some manual checking of the data.
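A rough sketch of how such a check could be semi-automated, for anyone who wants to list candidates for manual review. It flags every ISO 639-3 code for which none of the UNESCO entry names matches the language's own name, which is the French [fra] pattern above. The tuple layout, the `LANGUAGE_NAMES` lookup, the function name and the private-use code `qaa` are assumptions made for illustration; only the dialect names come from this issue, and nothing here is taken from the actual import script.

```python
from collections import defaultdict

# Hypothetical records: (UNESCO entry name, ISO 639-3 code, degree of endangerment).
# The dialect names are from this issue; the record layout is an assumption.
UNESCO_ENTRIES = [
    ("Burgundian", "fra", "severely endangered"),
    ("Gallo", "fra", "severely endangered"),
    ("Lorrain", "fra", "severely endangered"),
    ("Example Language", "qaa", "vulnerable"),  # qaa is a private-use placeholder code
]

# Hypothetical lookup from ISO code to the name of the language as a whole.
LANGUAGE_NAMES = {"fra": "French", "qaa": "Example Language"}


def flag_dialect_only_codes(entries, language_names):
    """Return the ISO codes whose UNESCO entries never carry the language's own name."""
    names_by_code = defaultdict(set)
    for name, iso, _degree in entries:
        names_by_code[iso].add(name.casefold())

    flagged = []
    for iso, names in sorted(names_by_code.items()):
        expected = language_names.get(iso)
        # No name match means the entries probably describe dialects only.
        if expected is None or expected.casefold() not in names:
            flagged.append(iso)
    return flagged


print(flag_dialect_only_codes(UNESCO_ENTRIES, LANGUAGE_NAMES))
```

Exact name matching is crude (alternative spellings would be flagged too), so the result is better treated as a list to check by hand than as a fix in itself.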
Thanks. There are related cases where dialects with different Degree_of_endangerment already map to the same ISO code (see [5]): https://nbviewer.jupyter.org/gist/xflr6/11387d5e62022c6734687e873bb349a2
I think the current import resolves this only by the order in the XML file: https://github.com/clld/glottolog3/blob/7b4869b5079a216093531600b9cb4917cac47ba5/glottolog3/scripts/loader/unesco.py#L84-L93
Do we still want to fix these cases (e.g. by loading no value at all, or by loading the most or least critical one)?
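To make the options concrete, here is a minimal sketch of the three policies (load nothing, take the most critical value, take the least critical value) for codes that receive conflicting degrees. It is not the code in `unesco.py`; the function name, the `(iso, degree)` input layout, the exact degree strings and the placeholder codes `qaa`/`qab` are assumptions for illustration.

```python
from collections import defaultdict

# Assumed degree labels, ordered from least to most critical.
SEVERITY = [
    "vulnerable",
    "definitely endangered",
    "severely endangered",
    "critically endangered",
    "extinct",
]
RANK = {label: i for i, label in enumerate(SEVERITY)}


def resolve_degrees(entries, policy="none"):
    """Map each ISO code to one degree, applying `policy` when entries conflict.

    policy: "none" (load no value), "most" or "least" (critical value wins).
    """
    if policy not in ("none", "most", "least"):
        raise ValueError(f"unknown policy: {policy!r}")

    degrees = defaultdict(set)
    for iso, degree in entries:
        degrees[iso].add(degree)

    resolved = {}
    for iso, values in degrees.items():
        if len(values) == 1:
            # All entries agree, so there is nothing to resolve.
            resolved[iso] = next(iter(values))
        elif policy == "most":
            resolved[iso] = max(values, key=RANK.__getitem__)
        elif policy == "least":
            resolved[iso] = min(values, key=RANK.__getitem__)
        else:
            resolved[iso] = None
    return resolved


# Hypothetical input: two entries disagree for qaa, one entry for qab.
ENTRIES = [
    ("qaa", "vulnerable"),
    ("qaa", "severely endangered"),
    ("qab", "definitely endangered"),
]
print(resolve_degrees(ENTRIES, policy="most"))
```

Unlike resolving by position in the XML file, each of these policies gives the same result regardless of entry order.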
@xflr6 As @d97hah says, this should be taken care of, once his PR is merged.
Thanks (was not completely clear to me that this means we drop the unesco field in the next version).
IMHO we should drop the unesco field in the next version (but I haven't erased it from the .ini files).