glottolog / glottolog-legacy

DEPRECATED. See https://github.com/clld/glottolog

name <-> identifier mismatches #38

Closed xflr6 closed 10 years ago

xflr6 commented 10 years ago

Some description='Glottolog' identifiers are out of sync with their languoid names:

import pandas as pd
# `engine` is an already-open connection to the Glottolog database (e.g. an SQLAlchemy engine)
print(pd.read_sql("""SELECT l.id, l.name, i.name AS identifier FROM language AS l
LEFT JOIN (languageidentifier AS li JOIN identifier AS i ON li.identifier_pk = i.pk
AND i.type = 'name' AND i.description = 'Glottolog') ON l.pk = li.language_pk
WHERE l.name IS DISTINCT FROM i.name ORDER BY l.id""", engine))
          id                    name                  identifier
0   anci1242           Ancient Greek     Ancient Greek (to 1453)
1   anyx1238                   Anykh                        Anyx
2   araw1276                  Lokono                      Arawak
3   avar1256                    Avar                      Avaric
4   caod1238                Tshobdun                     Caodeng
5   cent2045                   Jalaa                     Centuum
6   chab1238                  Japhug                      Chabao
7   daof1238                    Rtau                       Daofu
8   djam1254             Jaminjungan               Djamindjungan
9   djam1255               Jaminjung                 Djamindjung
10  gong1255            Ta-Ne-Omotic               Gonga-Gimojan
11  guan1266              Khroskyabs                 Guanyinqiao
12  guan1269                 Kotiria                     Guanano
13  hinu1240                   Hinuq                      Hinukh
14  iwai1245                Iwaidjic             Iwaidjan Proper
15  jeri1242                    Jeli                    Jeri Kuo
16  jiar1239                Gyalrong                     Jiarong
17  keoo1238                     Keo                        Ke'o
18  khva1239                Khwarshi                    Khvarshi
19  kjac1234  Chinese Pidgin Russian              Kjachta Pidgin
20  kryt1240                    Kryz                       Kryts
21  mala1533  Malacca-Batavia Creole  Malaccan Creole Portuguese
22  mana1288                 Manange                    Manangba
23  pira1254               Wa'ikhana                  Piratapuyo
24  rgya1239              Gyalrongic                 Rgyalrongic
25  ribu1240                     Zbu                        Ribu
26  shan1274                 Stodsde                   Shangzhai
27  yapu1240                    Yeri                     Yapunda

The names on the left (the name column) are newer. If we remove the identifiers on the right, the languoids can no longer be found under those older names (i.e. there are no other providers for these names).
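A minimal sketch of how the "no other providers" claim could be checked per mismatch. It runs against a toy in-memory SQLite database rather than the real one: the table and column names are taken from the query above, but the rows, the `'other source'` description, and the reading of "other provider" as "a name identifier with the same string but a description other than 'Glottolog', linked to the same languoid" are all assumptions made for illustration.

```python
import sqlite3

# Toy in-memory schema: only the columns used in the query above are modelled.
# All rows are made-up placeholders, not real Glottolog data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE language (pk INTEGER PRIMARY KEY, id TEXT, name TEXT);
CREATE TABLE identifier (pk INTEGER PRIMARY KEY, type TEXT, description TEXT, name TEXT);
CREATE TABLE languageidentifier (language_pk INTEGER, identifier_pk INTEGER);
INSERT INTO language VALUES (1, 'aaaa0001', 'New A'), (2, 'bbbb0002', 'New B');
INSERT INTO identifier VALUES
    (1, 'name', 'Glottolog', 'Old A'),
    (2, 'name', 'other source', 'Old A'),  -- hypothetical second provider
    (3, 'name', 'Glottolog', 'Old B');
INSERT INTO languageidentifier VALUES (1, 1), (1, 2), (2, 3);
""")

# For every Glottolog name identifier that differs from the languoid name,
# flag whether the same languoid has the same name from a non-Glottolog source.
query = """
SELECT l.id, l.name, i.name AS identifier,
       EXISTS (SELECT 1 FROM languageidentifier AS li2
               JOIN identifier AS i2 ON li2.identifier_pk = i2.pk
               WHERE li2.language_pk = l.pk
                 AND i2.type = 'name'
                 AND i2.name = i.name
                 AND i2.description != 'Glottolog') AS other_i
FROM language AS l
JOIN languageidentifier AS li ON li.language_pk = l.pk
JOIN identifier AS i ON li.identifier_pk = i.pk
 AND i.type = 'name' AND i.description = 'Glottolog'
WHERE l.name != i.name
ORDER BY l.id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Rows where `other_i` is 0 are the ones that would become unfindable under the old name if the Glottolog identifier were dropped. SQLite returns the `EXISTS` result as 0/1; on PostgreSQL the same expression would yield a boolean.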

Also, there are some orphaned identifiers (not linked to any languoid):

print(pd.read_sql("""SELECT type, description, name, lang FROM identifier AS i
WHERE NOT EXISTS (SELECT 1 FROM languageidentifier AS li
WHERE li.identifier_pk = i.pk)""", engine))
   type description                                           name lang
0  name   Glottolog                           Yimas-Araundi-Pidgin   en
1  name   Glottolog  Kentish (English of the Southeast of England)   en

xflr6 commented 10 years ago

Slight correction: only the following have no other provider for the old name:

          id                    name       identifier other_i
4   caod1238                Tshobdun          Caodeng   False
5   cent2045                   Jalaa          Centuum   False
7   daof1238                    Rtau            Daofu   False
10  gong1255            Ta-Ne-Omotic    Gonga-Gimojan   False
14  iwai1245                Iwaidjic  Iwaidjan Proper   False
19  kjac1234  Chinese Pidgin Russian   Kjachta Pidgin   False
24  rgya1239              Gyalrongic      Rgyalrongic   False
25  ribu1240                     Zbu             Ribu   False

d97hah commented 10 years ago

The names on the left are the current ones; the ones on the right should be retained as alternative names somehow. H

2014-10-25 10:30 GMT+02:00 Sebastian Bank notifications@github.com:

Slight correction, only the following have no other provider for the old name:

          id                    name       identifier other_i
4   caod1238                Tshobdun          Caodeng   False
5   cent2045                   Jalaa          Centuum   False
7   daof1238                    Rtau            Daofu   False
10  gong1255            Ta-Ne-Omotic    Gonga-Gimojan   False
14  iwai1245                Iwaidjic  Iwaidjan Proper   False
19  kjac1234  Chinese Pidgin Russian   Kjachta Pidgin   False
24  rgya1239              Gyalrongic      Rgyalrongic   False
25  ribu1240                     Zbu             Ribu   False


xflr6 commented 10 years ago

TIL the database already has multiple description='Glottolog' identifiers per languoid:

>>> print(pd.read_sql("""SELECT l.id, array_agg(i.name ORDER BY i.updated DESC) AS names
FROM language AS l
JOIN languageidentifier AS li ON li.language_pk = l.pk
JOIN identifier AS i ON li.identifier_pk = i.pk
AND i.type = 'name' AND i.description = 'Glottolog'
GROUP BY l.pk HAVING count(*) > 1 ORDER BY l.id""", engine))
          id                                              names
0   anci1242           [Ancient Greek, Ancient Greek (to 1453)]
1   anyx1238                                      [Anykh, Anyx]
2   araw1276                                   [Lokono, Arawak]
3   avar1256                                     [Avar, Avaric]
4   caod1238                                [Tshobdun, Caodeng]
5   cent2045                                   [Jalaa, Centuum]
6   chab1238                                   [Japhug, Chabao]
7   daof1238                                      [Rtau, Daofu]
8   djam1254                       [Jaminjungan, Djamindjungan]
9   djam1255                           [Jaminjung, Djamindjung]
10  gong1255                      [Ta-Ne-Omotic, Gonga-Gimojan]
11  guan1266                          [Khroskyabs, Guanyinqiao]
12  guan1269                                 [Kotiria, Guanano]
13  hinu1240                                    [Hinuq, Hinukh]
14  iwai1245                        [Iwaidjic, Iwaidjan Proper]
15  jeri1242                                   [Jeli, Jeri Kuo]
16  jiar1239                                [Gyalrong, Jiarong]
17  keoo1238                                        [Keo, Ke'o]
18  khva1239                               [Khwarshi, Khvarshi]
19  kjac1234           [Chinese Pidgin Russian, Kjachta Pidgin]
20  kryt1240                                      [Kryz, Kryts]
21  mala1533  [Malacca-Batavia Creole, Malaccan Creole Portu...
22  mana1288                                [Manange, Manangba]
23  pira1254                            [Wa'ikhana, Piratapuyo]
24  rgya1239                          [Gyalrongic, Rgyalrongic]
25  ribu1240                                        [Zbu, Ribu]
26  shan1274                               [Stodsde, Shangzhai]
27  yapu1240                                    [Yeri, Yapunda]

So unless one of the alternative names is to be deleted, this can be closed. However, the alternative ones are not shown on the languoid page: I will open an issue for that.