pelias / geonames

Import pipeline for Geonames into Pelias
https://pelias.io
MIT License

Quality of Postal Codes #358

Open bartek5186 opened 5 years ago

bartek5186 commented 5 years ago

Why doesn't Pelias use the better-quality postal codes from Geonames? http://www.geonames.org/export/zip/

WoF has a weak source for its postal code database: https://github.com/whosonfirst-data/whosonfirst-data/issues/1584

missinglink commented 5 years ago

Hi @bartek5186, can you confirm this is the case globally or is it only better in Poland?

bartek5186 commented 5 years ago

I have worked on PL, CZ and DE. I haven't noticed the problem in other countries yet.

There are also postal codes with a wrong code and a wrong position, for example:

EDIT: You can obtain the lat and lng of the parent location of a specific zip code via parser/findbyid?ids=101841989&lang=pol

image

Isabel-pena commented 3 years ago

I have found some entries in the WOF postal code database that have incomplete data, which produces inconsistent search results. This is an example from ES:

{"id":554829649,"type":"Feature","properties":{"edtf:cessation":"uuuu","edtf:inception":"uuuu","geom:area":0,"geom:bbox":"0.0,0.0,0.0,0.0","geom:latitude":0,"geom:longitude":0,"gp:parent_id":"12602116","iso:country":"ES","mz:hierarchy_label":1,"src:geom":"geoplanet","wof:belongsto":[],"wof:breaches":[],"wof:concordances":{"gp:id":"22664266"},"wof:country":"ES","wof:geomhash":"fc4d4085e55d16b479f231dbf54d3cfb","wof:hierarchy":[],"wof:id":554829649,"wof:lastmodified":1474569770,"wof:name":"09151","wof:parent_id":-1,"wof:placetype":"postalcode","wof:repo":"whosonfirst-data-postalcode-es","wof:superseded_by":[],"wof:supersedes":[],"wof:tags":[]},"bbox":[0,0,0,0],"geometry":{"coordinates":[0,0],"type":"Point"}}

It is even more difficult when you search for a postal code that also exists in another country: you get the result from the other country rather than the one from Spain.

missinglink commented 3 years ago

The WOF dataset contains a lot of those 0,0 postcodes. I believe the WOF team leaves them as placeholders until the correct coordinates become available.

Pelias should not import null island places, so the 0,0 records you pasted will not enter the search index. If you see results with a location of 0,0 in the index, then it's a bug.
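To make the null island check concrete, here is a minimal Python sketch of how such a record could be detected before indexing. It is not the actual Pelias importer code; the function name `is_null_island` is hypothetical, and the record is a trimmed copy of the ES feature pasted above.

```python
import json


def is_null_island(feature: dict) -> bool:
    """Return True when a GeoJSON point feature sits at exactly 0,0."""
    geometry = feature.get("geometry") or {}
    if geometry.get("type") != "Point":
        return False
    coords = geometry.get("coordinates") or []
    return len(coords) >= 2 and coords[0] == 0 and coords[1] == 0


# Trimmed copy of the ES postal code record quoted earlier in this thread.
record = json.loads(
    '{"id":554829649,"type":"Feature",'
    '"properties":{"wof:name":"09151","wof:placetype":"postalcode"},'
    '"geometry":{"coordinates":[0,0],"type":"Point"}}'
)

print(is_null_island(record))
```

A filter like this only skips placeholder records; it does nothing to recover the missing coordinates.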

missinglink commented 3 years ago

I had a quick look at this today and opened up https://github.com/whosonfirst-data/whosonfirst-data-postalcode-pl/issues/1 to discuss with the WOF team.

@bartek5186 I pulled down http://www.geonames.org/export/zip/PL.zip to have a look and I'm not sure the data is very good quality. In many cases the coordinates appear to be duplicated and rounded to two decimal places.

Could you please confirm that the data is actually correct for Poland before we continue?

missinglink commented 3 years ago

head PL.txt
PL  00-001  Warszawa    Mazowieckie     Warszawa            52.25   21  4
PL  00-002  Warszawa    Mazowieckie     Warszawa            52.25   21  4
PL  00-003  Warszawa    Mazowieckie     Warszawa            52.25   21  4
PL  00-004  Warszawa    Mazowieckie     Warszawa            52.25   21  4
PL  00-005  Warszawa    Mazowieckie     Warszawa            52.25   21  4
PL  00-006  Warszawa    Mazowieckie     Warszawa            52.25   21  4
PL  00-007  Warszawa    Mazowieckie     Warszawa            52.25   21  4
PL  00-008  Warszawa    Mazowieckie     Warszawa            52.25   21  4
PL  00-009  Warszawa    Mazowieckie     Warszawa            52.25   21  4
PL  00-010  Warszawa    Mazowieckie     Warszawa            52.25   21  4
head -n1000 PL.txt | tail
PL  01-193  Warszawa    Mazowieckie     Warszawa            52.25   21  4
PL  01-194  Warszawa    Mazowieckie     Warszawa            52.25   21  4
PL  01-195  Warszawa    Mazowieckie     Warszawa            52.25   21  4
PL  01-196  Warszawa    Mazowieckie     Warszawa            52.25   21  4
PL  01-197  Warszawa    Mazowieckie     Warszawa            52.25   21  4
PL  01-198  Warszawa    Mazowieckie     Warszawa            52.25   21  4
PL  01-199  Warszawa    Mazowieckie     Warszawa            52.25   21  4
PL  01-201  Warszawa    Mazowieckie     Warszawa            52.25   21  4
PL  01-202  Warszawa    Mazowieckie     Warszawa            52.25   21  4
PL  01-203  Warszawa    Mazowieckie     Warszawa            52.25   21  4
head -n5000 PL.txt | tail
PL  10-537  Olsztyn Warmińsko-Mazurskie     Olsztyn             53.7833 20.4833 4
PL  10-538  Olsztyn Warmińsko-Mazurskie     Olsztyn             53.7833 20.4833 4
PL  10-539  Olsztyn Warmińsko-Mazurskie     Olsztyn             53.7833 20.4833 4
PL  10-540  Olsztyn Warmińsko-Mazurskie     Olsztyn             53.7833 20.4833 4
PL  10-541  Olsztyn Warmińsko-Mazurskie     Olsztyn             53.7833 20.4833 4
PL  10-542  Olsztyn Warmińsko-Mazurskie     Olsztyn             53.7833 20.4833 4
PL  10-543  Olsztyn Warmińsko-Mazurskie     Olsztyn             53.7833 20.4833 4
PL  10-544  Olsztyn Warmińsko-Mazurskie     Olsztyn             53.7833 20.4833 4
PL  10-545  Olsztyn Warmińsko-Mazurskie     Olsztyn             53.7833 20.4833 4
PL  10-546  Olsztyn Warmińsko-Mazurskie     Olsztyn             53.7833 20.4833 4
head -n10000 PL.txt | tail
PL  40-094  Katowice    Śląskie     Katowice                50.2667 19.0167 4
PL  40-095  Katowice    Śląskie     Katowice                50.2667 19.0167 4
PL  40-096  Katowice    Śląskie     Katowice                50.2667 19.0167 4
PL  40-097  Katowice    Śląskie     Katowice                50.2667 19.0167 4
PL  40-098  Katowice    Śląskie     Katowice                50.2667 19.0167 4
PL  40-100  Katowice    Śląskie     Katowice                50.2667 19.0167 4
PL  40-101  Katowice    Śląskie     Katowice                50.2667 19.0167 4
PL  40-102  Katowice    Śląskie     Katowice                50.2667 19.0167 4
PL  40-103  Katowice    Śląskie     Katowice                50.2667 19.0167 4
PL  40-104  Katowice    Śląskie     Katowice                50.2667 19.0167 4
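The duplication visible in these samples can be measured rather than eyeballed. The sketch below is in Python and assumes the 12-column tab-separated layout described in the Geonames postal code readme (latitude and longitude in the 10th and 11th columns). The function name `coordinate_counts` is hypothetical, and the sample rows are hand-built from the values above with the admin code columns left empty.

```python
from collections import Counter


def coordinate_counts(tsv_text: str) -> Counter:
    """Count how many postal codes share each (lat, lon) pair."""
    counts = Counter()
    for line in tsv_text.splitlines():
        columns = line.split("\t")
        if len(columns) < 12:
            continue  # skip blank or malformed lines
        # Assumed layout: columns 9 and 10 (0-based) hold latitude and longitude.
        counts[(columns[9], columns[10])] += 1
    return counts


# Hand-built rows using values from the output above; admin codes left empty.
sample = "\n".join([
    "PL\t00-001\tWarszawa\tMazowieckie\t\tWarszawa\t\t\t\t52.25\t21\t4",
    "PL\t00-002\tWarszawa\tMazowieckie\t\tWarszawa\t\t\t\t52.25\t21\t4",
    "PL\t00-003\tWarszawa\tMazowieckie\t\tWarszawa\t\t\t\t52.25\t21\t4",
    "PL\t10-537\tOlsztyn\tWarmińsko-Mazurskie\t\tOlsztyn\t\t\t\t53.7833\t20.4833\t4",
    "PL\t10-538\tOlsztyn\tWarmińsko-Mazurskie\t\tOlsztyn\t\t\t\t53.7833\t20.4833\t4",
])

counts = coordinate_counts(sample)
for (lat, lon), total in counts.most_common():
    print(lat, lon, total)
```

Run against the full PL.txt, a high ratio of postal codes to distinct coordinate pairs would confirm that most codes only carry a city-level position.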
Isabel-pena commented 3 years ago

You are right about the 0,0 coordinates. At first I couldn't find these postal codes, but I have since updated the geometry by querying Geonames.

I don't really have many problems with coordinates. I am working with ES postal codes, not PL. For now I update the coordinates in the WOF postalcode-es data with the coordinates from Geonames (I really need to be able to find these postal codes). The worst thing about this data is that too many postal codes have no hierarchy in the GeoJSON: the field is empty, and belongsto has the same issue. I updated this data manually, looking up the hierarchy in admin-es for the cases where the postal code has a parent_id; again, I can complete it with the help of the GeoJSON.

Also, my team and I have problems with postal codes that don't exist in Who's On First but are registered and exist in Spain; some of them are in Geonames. I have now built my index with the WOF Spain data updated by myself and with Geonames. The postal codes whose hierarchy is now fixed appear in searches with the correct locality, localadmin, region and so on; with the original WOF data this doesn't happen. The bad thing is that I can't find the postal codes from Geonames that don't exist in WOF, and we need them for our work.

Is there any way we can update the data and also fix the hierarchy of the postal codes that I have to update manually?
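The coordinate update described above could look roughly like the following Python sketch. It is only an illustration: the helper `fill_missing_point` and the lookup table are hypothetical, and the coordinates in the lookup are placeholder values, not the real position of 09151.

```python
import copy


def fill_missing_point(feature: dict, geonames_coords: dict) -> dict:
    """Return a copy of a 0,0 WOF postal code feature with coordinates filled in.

    geonames_coords maps a postal code to a (lat, lon) tuple. Features that
    already have a position, or whose code is unknown, are returned unchanged.
    """
    code = feature.get("properties", {}).get("wof:name")
    coords = feature.get("geometry", {}).get("coordinates", [])
    if list(coords[:2]) != [0, 0] or code not in geonames_coords:
        return feature
    lat, lon = geonames_coords[code]
    patched = copy.deepcopy(feature)
    patched["geometry"]["coordinates"] = [lon, lat]  # GeoJSON order is lon, lat
    patched["properties"]["geom:latitude"] = lat
    patched["properties"]["geom:longitude"] = lon
    patched["bbox"] = [lon, lat, lon, lat]
    return patched


# Trimmed version of the ES record from this thread.
record = {
    "id": 554829649,
    "type": "Feature",
    "properties": {"wof:name": "09151", "geom:latitude": 0, "geom:longitude": 0},
    "bbox": [0, 0, 0, 0],
    "geometry": {"coordinates": [0, 0], "type": "Point"},
}

# Placeholder coordinates for illustration only, NOT the real location of 09151.
lookup = {"09151": (42.0, -3.0)}

patched = fill_missing_point(record, lookup)
print(patched["geometry"]["coordinates"])
```

This only patches the geometry. The empty wof:hierarchy and wof:belongsto fields would still need to be resolved separately.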

bartek5186 commented 3 years ago

Could you please confirm that the data is actually correct for Poland before we continue? I'm not sure the data is very good quality?

In Poland, some of the bigger cities have multiple postal codes (based on, for example, streets, zones, offices or districts), so this dataset is of low quality: it has no detailed LatLng positions.

In Poland, postal codes are tied to, for example, streets, so it is possible to build a high-quality database.

The Polish PNA (postal codes) dataset is here: https://www.poczta-polska.pl/hermes/uploads/2013/11/spispna.pdf

It has no LatLng positions, but it does have address names, for example: image

That address is located here, for example: 52°14'00.5"N 20°58'37.9"E (52.233480, 20.977189)

Not at: 52.21, 21

This simple LatLng looks like a high-level container for a bigger city like "Warszawa".
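For anyone comparing the two notations above, a small Python sketch converts the degrees/minutes/seconds form to decimal degrees. The function name `dms_to_decimal` is hypothetical, and the pattern only handles the exact `52°14'00.5"N` style used here.

```python
import re

# Matches values such as 52°14'00.5"N (degrees, minutes, seconds, hemisphere).
DMS_PATTERN = re.compile(r"""(\d+)°(\d+)'([\d.]+)"([NSEW])""")


def dms_to_decimal(dms: str) -> float:
    """Convert a degrees/minutes/seconds string to signed decimal degrees."""
    match = DMS_PATTERN.fullmatch(dms.strip())
    if match is None:
        raise ValueError(f"unrecognised DMS value: {dms!r}")
    degrees, minutes, seconds, hemisphere = match.groups()
    value = int(degrees) + int(minutes) / 60 + float(seconds) / 3600
    # South and west are negative in decimal notation.
    return -value if hemisphere in "SW" else value


lat = dms_to_decimal("52°14'00.5\"N")
lon = dms_to_decimal("20°58'37.9\"E")
print(round(lat, 6), round(lon, 6))
```

The result agrees with the decimal pair quoted above to within about 0.00001 degrees, which is street-level precision, whereas the Geonames value for the same city is rounded to two decimal places.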

InteNs commented 2 years ago

For NL, Geonames is also way better; the WOF data is 4 years out of date and incomplete.

Geonames is updated daily from official government sources. Unfortunately it can't be imported into Pelias.

missinglink commented 2 years ago

Which is the official source that geonames uses? You might be better off just using the csv-importer to import those files directly.

We've found the Geonames postcode files to be a mixed bag and generally not very good; NL might be an exception.

InteNs commented 2 years ago

For the Dutch data it uses https://www.cbs.nl (Statistics Netherlands) and www.kadaster.nl (The Netherlands' Cadastre, Land Registry and Mapping Agency), which are both officially related (fully or partially) to our government.

We successfully used the csv-importer for that dataset. Thanks for the heads-up, I didn't know there was a csv-importer :)

bartek5186 commented 2 years ago

I also used the csv-importer for this, and it works great. The built-in postal codes are useless in this case (Europe). I imported a custom-prepared dataset for the whole Europe region via the csv-importer. The postal code data was prepared from official sources and manually reviewed.

I noticed a small bug in the importer. Imported records are named csv:postalcode, but they should be named bdp:postalcode (I set the source name to "bdp" in the importer config file, but that was ignored during the CSV import and the name in the output is csv).

Because I need autocomplete to work with postal codes too, I put multiple codes into name_json. The import file looks like this:

source,popularity,layer,id,lat,lon,name,postalcode,country,name_json
bdp,100,postalcode,71ff447b-972b-4f7d-a8c1-e0c8c02a1a19,53.468958363988,18.760770296251,Grudziądz,86-300,PL,"[""86-300"", "" 86-301"", "" 86-302"", "" 86-303"", "" 86-304"", "" 86-305"", "" 86-306"", "" 86-307"", "" 86-308"", "" 86-309"", "" 86-310"", "" 86-311""]"

Searching works great with multiple codes in name_json, and in the output I get the postal code from the postalcode column.
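A row like the one above can be generated with Python's standard csv and json modules, which take care of the doubled quotes around the JSON array. This is a rough sketch: the helper `build_csv` is hypothetical, and it assumes the alias column is named `name_json`.

```python
import csv
import io
import json

# Assumption: the alias column is called name_json.
FIELDS = ["source", "popularity", "layer", "id", "lat", "lon",
          "name", "postalcode", "country", "name_json"]


def build_csv(records: list) -> str:
    """Serialise records to CSV, storing each alias list as a JSON array."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    for record in records:
        row = dict(record)
        row["name_json"] = json.dumps(row.pop("aliases"))
        writer.writerow(row)
    return buffer.getvalue()


csv_text = build_csv([{
    "source": "bdp",
    "popularity": 100,
    "layer": "postalcode",
    "id": "71ff447b-972b-4f7d-a8c1-e0c8c02a1a19",
    "lat": 53.468958363988,
    "lon": 18.760770296251,
    "name": "Grudziądz",
    "postalcode": "86-300",
    "country": "PL",
    "aliases": ["86-300", "86-301", "86-302"],
}])
print(csv_text)

# Reading the file back recovers the alias list unchanged.
parsed = next(csv.DictReader(io.StringIO(csv_text)))
aliases = json.loads(parsed["name_json"])
```

Generating the array with `json.dumps` also avoids stray leading spaces inside the alias strings, which are easy to introduce when the column is assembled by hand.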

Output: image

missinglink commented 2 years ago

Hi @bartek5186 I had a quick look at the issue you reported and I wasn't able to reproduce the error where the source you provide is not the same as the source of the document.

We actually have a testcase here which ensures that functionality works as expected.

If you're able to reproduce this could you please open a ticket.

bartek5186 commented 2 years ago

I have already done that before.. https://github.com/pelias/csv-importer/issues/89