Closed sevko closed 8 years ago
The correlation isn't great. Here's an example of some records from qs_neighborhoods (the Quattroshapes neighborhood-level shapefile) and geonames matched on their qs_gn_ids:
+-----------+---------------------------------------------------------+----------------------------+----------------------+
| geonameid | quattroshapes | geonames | distance |
+-----------+---------------------------------------------------------+----------------------------+----------------------+
| 2753359 | Oudijsselmonde | IJsselmonde | 0.0119018984505134 |
| 2753380 | Ijkenberg | IJkenberg | 0.0148742070442114 |
| 2753949 | Hoogerbrugge | Hogebrug | 0.275712802073019 |
| 2754074 | Hilligersberg | Hillegersberg | 0.0115435583299558 |
| 2754236 | Het Meer | Het Moer | 0.340442238578578 |
| 2754241 | 't Loo | Het Loo | 0.0133275700395486 |
| 2755114 | 't Loo | Groot Loo | 0.959396388857872 |
| 2755925 | Fijenoord | Feijenoord | 0.01938306096269 |
| 2756168 | Angelsloo | Elsloo | 0.725160813126098 |
| 2756544 | Drumt | Drumpt | 0.0307749767440195 |
| 2756888 | Diemerbrug | Diemen | 0.0250188673778957 |
| 2757375 | De Vaan | De Laan | 0.432559144224079 |
| 2757442 | D' Ekker | D’ Ekker | 0.0452435555601514 |
| 2757479 | Deileroord | Deijleroord | 0.0271360341415246 |
| 2757479 | Reijeroord | Deijleroord | 0.318006068210897 |
| 2757623 | De Gorzen | De Goorn | 0.917858113572033 |
| 2757966 | Schaardijk | Chaamdijk | 0.508751050098395 |
| 2758198 | Bruggemors | Bruggenmors | 0.0269058614048561 |
| 2758201 | Brügger | Bruggen | 0.89482682466229 |
| 2758318 | Bijdorp | Brijdorpe | 0.541341190721896 |
| 2758390 | Bredius Kwartier | Brediuskwartier | 0.0169650326590277 |
| 2758676 | Bomen-En Bloemen Buurt | Bomen- en Bloemen Buurt | 0.00234175058980621 |
| 2758880 | Blaartem | Blaarthem | 0.00580763383983014 |
| 3230181 | Vruchtenbuurt | Vluchtenburg | 0.131293981443509 |
| 6544759 | Zeijerveen | Zeyerveen | 0.0138922225502451 |
| 6544866 | 't Heike | Het Heike | 0.23543413071669 |
| 6544892 | Bomenbuurt | Molenbuurt | 0.894138402543485 |
| 6941548 | Leyenburg | Ypenburg | 0.110842873179306 |
| 7873874 | Kop Van Zuid | Kop van Zuid | 0.0128472700270584 |
| 3132939 | VÃ¥land | Våland | 0.00534706813464556 |
| 3133372 | UllevÃ¥l Haveby | Ullevål Hageby | 0.0220150573555189 |
| 3134109 | Tøyen | Tøyen | 0.0212454387740947 |
| 3135491 | Smistad | Sumstad | 0.835847859327129 |
| 3137099 | Stavne | Staven | 0.576223820787905 |
| 3138418 | Skullerud | Skulerud | 0.717088906216955 |
| 3139368 | Skansen | Skansemyren | 0.0173846993169306 |
| 3140003 | Samdalen | Sedalen | 0.494333811857231 |
| 3141670 | Røa | Røa | 0.0208637999444873 |
| 3143604 | Selsbakk | Olsbakk | 0.754767378874433 |
| 3143766 | Ã\u0098kern | Økern | 0.00891066052687025 |
| 3143983 | Nyhamn | Nyhavn | 0.00848874665735069 |
| 3148504 | Lambertsæter | Lambertsæter | 0.00800685943951798 |
| 3148504 | Lambertseter | Lambertsæter | 0.0222990570827133 |
| 3149522 | Kolstad | Konstad | 0.271314014605761 |
| 3151570 | HøybrÃ¥ten | Høybråten | 0.0295516132284727 |
| 3153050 | Tellevik | Hellevik | 0.793919902476773 |
| 3159585 | Bygdøy | Bygdøy | 0.027651014208552 |
Here's the geonames ID coverage in Quattroshapes:
layer | number of IDs |
---|---|
admin1 | 3172 |
localities | 137039 |
neighborhoods | 49906 |
All other layers (admin0, admin2, and localadmin) have 0. If we want deduplication, we should probably just use the address-deduplicator.
Ideally, this should happen in the address-deduplicator. However, since this is geoname_id specific, I could see it living in quattroshapes-pipeline. When you encounter a 1:1 correlation, you could add a skip flag to the record. However, if there is only a minor distance between the two gn_ids, we could update the existing geonames record with an alternate name (name.alt) in addition to the existing name.default. This would be possible once https://github.com/pelias/dbclient/issues/9 is resolved.
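A minimal sketch of that idea, in Python for illustration only (not the actual quattroshapes-pipeline code). The `Record` class, the `reconcile` helper, and the `MAX_MATCH_DISTANCE` cutoff are all hypothetical; the sketch assumes the distance is in the same units as the table above and that a small distance means both records describe the same place:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical cutoff, in the same units as the "distance" column above.
MAX_MATCH_DISTANCE = 0.05


@dataclass
class Record:
    """Simplified stand-in for an imported record (assumed shape)."""
    name: dict                   # e.g. {"default": "Hillegersberg"}
    gn_id: Optional[int] = None  # GeoNames ID carried by the record, if any
    skip: bool = False           # set when the importer should drop the record


def reconcile(qs: Record, gn: Record, distance: float) -> None:
    """Flag a Quattroshapes record as a duplicate of its GeoNames match."""
    if qs.gn_id is None or distance > MAX_MATCH_DISTANCE:
        # No ID, or the two points are too far apart: keep both records.
        return
    qs.skip = True
    qs_name = qs.name["default"]
    if qs_name.casefold() != gn.name["default"].casefold():
        # Names differ beyond letter case: keep the Quattroshapes
        # spelling as an alternate name on the GeoNames record.
        gn.name["alt"] = qs_name
```

With the rows above, 2754074 (Hilligersberg / Hillegersberg, distance 0.0115) would be skipped and contribute an alternate name, while 2755114 ('t Loo / Groot Loo, distance 0.959) would be left alone as a likely bad match.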
We should not make assumptions about the datasets being imported. Some users might only import QS, in which case we'd need additional setup flags that the user has to understand and set. Relying on the deduper post-import is the cleanest solution and allows other modules to follow the Single Responsibility Principle.
Right, the intent was to provide importer flags for both geonames and Quattroshapes. It might be better to support it rather than not, and have advanced users (ie us, heh) take advantage of it. The number of ostensible duplicates is pretty significant.
Once the Quattroshapes gazetteer file is linked up with the polygons (so the qs_id values are filled in), this will solve 90%+ of this problem.
@nvkelso sounds good :+1:
Some Quattroshapes records appear to have GeoNames IDs, which might imply that there are duplicate records across the datasets. Investigate the correlation, and if it's 1:1, add a flag to skip over records with non-null IDs.