pelias-deprecated / quattroshapes

(DEPRECATED) Pelias import pipeline for Quattroshapes
https://github.com/pelias/whosonfirst

investigate geonames association #19

Closed sevko closed 8 years ago

sevko commented 9 years ago

Some Quattroshapes records appear to have GeoNames IDs, which might imply that there are duplicate records across the datasets. Investigate the correlation, and if it's 1:1, add a flag to skip over records with non-null IDs.
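
A minimal sketch of what such a skip flag could look like, assuming each Quattroshapes record is a plain dict whose GeoNames ID lives under `qs_gn_id` (the field name used later in this thread). `filter_records` and `skip_geonames_linked` are hypothetical names for illustration, not part of any Pelias importer:

```python
from typing import Iterable, Iterator


def filter_records(records: Iterable[dict],
                   skip_geonames_linked: bool = False) -> Iterator[dict]:
    """Yield records, optionally dropping those that carry a GeoNames ID."""
    for record in records:
        gn_id = record.get("qs_gn_id")
        # Assumption: None, 0 and "" all mean "no GeoNames ID".
        has_gn_id = gn_id not in (None, 0, "")
        if skip_geonames_linked and has_gn_id:
            continue
        yield record


# The second record is made up for illustration; it is not real data.
records = [
    {"name": "Kop Van Zuid", "qs_gn_id": 7873874},
    {"name": "Example Without ID", "qs_gn_id": None},
]
kept = list(filter_records(records, skip_geonames_linked=True))
```

With the flag off, every record passes through unchanged, so users who import only Quattroshapes would be unaffected by default.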

sevko commented 9 years ago

The correlation isn't great. Here's an example of some records from qs_neighborhoods (the Quattroshapes neighborhood-level shapefile) and geonames matched on their qs_gn_ids:

+-----------+---------------------------------------------------------+----------------------------+----------------------+
| geonameid |                      quattroshapes                      |           geonames         |       distance       |
+-----------+---------------------------------------------------------+----------------------------+----------------------+
|   2753359 | Oudijsselmonde                                          | IJsselmonde                |   0.0119018984505134 |
|   2753380 | Ijkenberg                                               | IJkenberg                  |   0.0148742070442114 |
|   2753949 | Hoogerbrugge                                            | Hogebrug                   |    0.275712802073019 |
|   2754074 | Hilligersberg                                           | Hillegersberg              |   0.0115435583299558 |
|   2754236 | Het Meer                                                | Het Moer                   |    0.340442238578578 |
|   2754241 | 't Loo                                                  | Het Loo                    |   0.0133275700395486 |
|   2755114 | 't Loo                                                  | Groot Loo                  |    0.959396388857872 |
|   2755925 | Fijenoord                                               | Feijenoord                 |     0.01938306096269 |
|   2756168 | Angelsloo                                               | Elsloo                     |    0.725160813126098 |
|   2756544 | Drumt                                                   | Drumpt                     |   0.0307749767440195 |
|   2756888 | Diemerbrug                                              | Diemen                     |   0.0250188673778957 |
|   2757375 | De Vaan                                                 | De Laan                    |    0.432559144224079 |
|   2757442 | D' Ekker                                                | D’ Ekker                   |   0.0452435555601514 |
|   2757479 | Deileroord                                              | Deijleroord                |   0.0271360341415246 |
|   2757479 | Reijeroord                                              | Deijleroord                |    0.318006068210897 |
|   2757623 | De Gorzen                                               | De Goorn                   |    0.917858113572033 |
|   2757966 | Schaardijk                                              | Chaamdijk                  |    0.508751050098395 |
|   2758198 | Bruggemors                                              | Bruggenmors                |   0.0269058614048561 |
|   2758201 | Brügger                                                 | Bruggen                    |     0.89482682466229 |
|   2758318 | Bijdorp                                                 | Brijdorpe                  |    0.541341190721896 |
|   2758390 | Bredius Kwartier                                        | Brediuskwartier            |   0.0169650326590277 |
|   2758676 | Bomen-En Bloemen Buurt                                  | Bomen- en Bloemen Buurt    |  0.00234175058980621 |
|   2758880 | Blaartem                                                | Blaarthem                  |  0.00580763383983014 |
|   3230181 | Vruchtenbuurt                                           | Vluchtenburg               |    0.131293981443509 |
|   6544759 | Zeijerveen                                              | Zeyerveen                  |   0.0138922225502451 |
|   6544866 | 't Heike                                                | Het Heike                  |     0.23543413071669 |
|   6544892 | Bomenbuurt                                              | Molenbuurt                 |    0.894138402543485 |
|   6941548 | Leyenburg                                               | Ypenburg                   |    0.110842873179306 |
|   7873874 | Kop Van Zuid                                            | Kop van Zuid               |   0.0128472700270584 |
|   3132939 | Våland                                                  | Våland                     |  0.00534706813464556 |
|   3133372 | Ullevål Haveby                                          | Ullevål Hageby             |   0.0220150573555189 |
|   3134109 | Tøyen                                                   | Tøyen                      |   0.0212454387740947 |
|   3135491 | Smistad                                                 | Sumstad                    |    0.835847859327129 |
|   3137099 | Stavne                                                  | Staven                     |    0.576223820787905 |
|   3138418 | Skullerud                                               | Skulerud                   |    0.717088906216955 |
|   3139368 | Skansen                                                 | Skansemyren                |   0.0173846993169306 |
|   3140003 | Samdalen                                                | Sedalen                    |    0.494333811857231 |
|   3141670 | Røa                                                     | Røa                        |   0.0208637999444873 |
|   3143604 | Selsbakk                                                | Olsbakk                    |    0.754767378874433 |
|   3143766 | Økern                                                   | Økern                      |  0.00891066052687025 |
|   3143983 | Nyhamn                                                  | Nyhavn                     |  0.00848874665735069 |
|   3148504 | Lambertsæter                                            | Lambertsæter               |  0.00800685943951798 |
|   3148504 | Lambertseter                                            | Lambertsæter               |   0.0222990570827133 |
|   3149522 | Kolstad                                                 | Konstad                    |    0.271314014605761 |
|   3151570 | Høybråten                                               | Høybråten                  |   0.0295516132284727 |
|   3153050 | Tellevik                                                | Hellevik                   |    0.793919902476773 |
|   3159585 | Bygdøy                                                  | Bygdøy                     |    0.027651014208552 |
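
The thread does not say how the distance column was computed, so the following is not a reconstruction of it. It is only a sketch of one way to check how closely two names sharing a GeoNames ID agree, using a normalized string similarity from Python's standard library. `normalize` and `name_similarity` are hypothetical helper names:

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Casefold, strip combining accents, and drop non-alphanumeric characters."""
    decomposed = unicodedata.normalize("NFKD", name.casefold())
    return "".join(ch for ch in decomposed if ch.isalnum())


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means the normalized names are identical."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


# (Quattroshapes name, GeoNames name) pairs taken from the table above.
pairs = [
    ("Kop Van Zuid", "Kop van Zuid"),
    ("Bredius Kwartier", "Brediuskwartier"),
    ("'t Loo", "Groot Loo"),
    ("Bomenbuurt", "Molenbuurt"),
]
for qs_name, gn_name in pairs:
    print(f"{qs_name!r} vs {gn_name!r}: {name_similarity(qs_name, gn_name):.2f}")
```

Pairs that differ only in case, spacing, or punctuation score 1.0 under this normalization, while pairs such as 't Loo and Groot Loo score lower and would need a manual look.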

Here's the geonames ID coverage in Quattroshapes:

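+---------------+---------------+
| layer         | number of IDs |
+---------------+---------------+
| admin1        |          3172 |
| localities    |        137039 |
| neighborhoods |         49906 |
+---------------+---------------+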

All other layers (admin0, admin2, and localadmin) have 0. If we want deduplication, we should probably just use the address-deduplicator.
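
For reference, a per-layer count like this could be produced along the following lines. This is a sketch under assumptions: records are dicts, the ID field is called `qs_gn_id`, and `geonames_id_coverage` is a hypothetical helper, not how the numbers above were actually obtained:

```python
from typing import Dict, Iterable


def geonames_id_coverage(layers: Dict[str, Iterable[dict]]) -> Dict[str, int]:
    """Count records per layer whose `qs_gn_id` is set to a non-empty, non-zero value."""
    return {
        layer: sum(
            1 for record in records
            if record.get("qs_gn_id") not in (None, 0, "")
        )
        for layer, records in layers.items()
    }


# Toy input for illustration only; not the real Quattroshapes layers.
layers = {
    "admin0": [{"qs_gn_id": None}],
    "neighborhoods": [{"qs_gn_id": 2753359}, {"qs_gn_id": 0}, {"qs_gn_id": 2753380}],
}
coverage = geonames_id_coverage(layers)
```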

hkrishna commented 9 years ago

Ideally, this should happen in the address-deduplicator. However, since this is geoname_id-specific, I could see it living in quattroshapes-pipeline. When you encounter a 1:1 correlation, you could add a skip flag to the record. If there is only a minor distance between the two records sharing a gn_id, we could instead update the existing geonames record with an alternate name (name.alt) in addition to the existing name.default. This would become possible once https://github.com/pelias/dbclient/issues/9 is resolved.
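
One possible reading of this proposal, sketched under assumptions: the geonames document is a dict shaped like `{"name": {"default": ...}}` (inferred from the `name.default` and `name.alt` fields mentioned above), `reconcile` is a hypothetical function, and `max_distance` is an arbitrary placeholder threshold that does not come from the thread:

```python
def reconcile(qs_name: str, gn_doc: dict, distance: float,
              max_distance: float = 0.05) -> str:
    """Return the action taken: 'skip', 'alt_name', or 'import'."""
    if distance > max_distance:
        # Too far apart to treat as the same place: import the record as-is.
        return "import"
    names = gn_doc.setdefault("name", {})
    if qs_name == names.get("default"):
        # Exact duplicate: nothing new to add, skip the Quattroshapes record.
        return "skip"
    # Close by but spelled differently: keep both names on the geonames document.
    names["alt"] = qs_name
    return "alt_name"


# Names and distances taken from the table earlier in the thread.
gn_doc = {"name": {"default": "Hillegersberg"}}
action = reconcile("Hilligersberg", gn_doc, 0.0115435583299558)
far_action = reconcile("'t Loo", {"name": {"default": "Groot Loo"}}, 0.959396388857872)
```

Here the first call attaches the Quattroshapes spelling as an alternate name, while the second pair is far enough apart that the record would be imported on its own.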

dianashk commented 9 years ago

We should not make assumptions about the datasets being imported. Some users might only import QS, in which case we'd need additional setup flags that the user has to understand and set. Relying on the deduper post-import is the cleanest solution and allows other modules to follow the Single Responsibility Principle.

sevko commented 9 years ago

Right, the intent was to provide importer flags for both geonames and Quattroshapes. It might be better to support it than not, and have advanced users (i.e. us, heh) take advantage of it. The number of ostensible duplicates is pretty significant.

nvkelso commented 9 years ago

Once the Quattroshapes gazetteer file is linked up with the polygons (so that the qs_id values are filled in), this will solve 90%+ of this problem.

hkrishna commented 9 years ago

@nvkelso sounds good :+1: