pelias / whosonfirst

Importer for Who's on First gazetteer
MIT License

wof missing macroregions #414

Closed: nexus1703 closed this issue 5 years ago

nexus1703 commented 5 years ago

Hello,

When running the command pelias download wof for the France project, only 7 macroregions seem to be imported, which leaves a lot of data missing after the download (postal codes, addresses, etc.). This message appears during extraction:

extracted 20203 locality(s)
extracted 5 timezone(s)
extracted 2622 county(s)
extracted 74 campus(s)
extracted 7 macroregion(s)
extracted 170 neighbourhood(s)
extracted 220 localadmin(s)
extracted 1 dependency(s)
extracted 5 disputed(s)
extracted 1 macrocounty(s)
extracted 4148 postalcode(s)
All done!

When checking the imported CSV file, only the new macroregions are imported, not the old ones. I think it's probably an issue with the importer script, because the data is present on the Who's on First website. I'm running Docker on Ubuntu 18.04 and I don't get any specific error.

Thanks for your help.

missinglink commented 5 years ago

Hi, I believe there was an issue with the last build published by the WOF team, which resulted in some documents being missing from the distribution.

I will send them a message now to check the status. I suspect they reverted the distributions to the last known good version, which would mean that the data is slightly old (a week or two at most).

missinglink commented 5 years ago

You can manually check the contents of the WOF distribution file by inspecting the SQLite file; you should be able to use a SQL query to search by ID.
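A minimal sketch of such a lookup, using Python's standard sqlite3 module. It assumes the distribution has an spr table keyed by id, as in the query posted later in this thread; the function name find_by_id and the file path in the usage note are hypothetical.

```python
import sqlite3


def find_by_id(conn, wof_id):
    """Return the spr row for a WOF ID as a dict, or None if it is missing."""
    # Assumption: the distribution exposes an `spr` table with an `id` column.
    cur = conn.execute("SELECT * FROM spr WHERE id = ?", (wof_id,))
    row = cur.fetchone()
    if row is None:
        return None
    # Pair each column name with its value so the result is self-describing.
    columns = [col[0] for col in cur.description]
    return dict(zip(columns, row))
```

For example, after decompressing the distribution to a local file, find_by_id(sqlite3.connect("whosonfirst-data-latest.db"), 85633147) should return the record for that ID, or None if the record is absent from the build.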

missinglink commented 5 years ago

I've made a geocode.earth extract available for you to test against.

https://s3.amazonaws.com/geocodeearth-pelias-data/tmp/whosonfirst-data-latest.db.bz2

I'll leave this up for a couple of weeks. We create our own SQLite distributions for geocode.earth, and for some reason our scripts don't seem to be erroring. I can see there are more macroregion records in our distribution:

2018-11-18 06:59:40    3.0 GiB whosonfirst-data-latest.db.bz2

SELECT
  s.placetype AS pt,
  count(*) AS cnt
FROM ancestors a
JOIN spr AS s USING(id)
WHERE ancestor_id = 85633147
GROUP BY s.placetype
ORDER BY cnt DESC;

pt|cnt
localadmin|47662
locality|35914
neighbourhood|4082
county|3725
macrocounty|330
region|101
borough|25
macroregion|22
timezone|6
country|1
dependency|1
disputed|1

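To repeat this check against another copy of the distribution, the same query can be run from Python. This is only a sketch: it assumes the ancestors and spr tables have the columns used in the query above, and the helper name count_descendants is hypothetical.

```python
import sqlite3

# Same query as above, with the ancestor ID passed as a parameter.
QUERY = """
SELECT s.placetype AS pt, count(*) AS cnt
FROM ancestors a
JOIN spr AS s USING(id)
WHERE a.ancestor_id = ?
GROUP BY s.placetype
ORDER BY cnt DESC
"""


def count_descendants(conn, ancestor_id):
    """Return {placetype: count} for all records listed under ancestor_id."""
    return dict(conn.execute(QUERY, (ancestor_id,)).fetchall())
```

Running count_descendants on two different builds with ancestor ID 85633147 and comparing the resulting dictionaries shows which placetypes lost records, for example 7 macroregions in one build against 22 in the other.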
nexus1703 commented 5 years ago

Hello missinglink, and thanks very much for your quick reply.

I'm using Docker, so it's hard to change the download location, as it's hardcoded in the sql_download.js script in the image. I tried adding the dataHost option to my pelias.json with your URL, but unfortunately it's not supported.

The better fix would be to edit the image and make it point to your file on Amazon rather than to the WOF repo, but for now I've manually uploaded your file to my server and made it read-only so that it doesn't get overwritten. It's kind of a crappy solution, but it seems to work okay and I get the 22 macroregions all right.

By the way, the default pelias.json in the docker project for France has an erroneous country ID. It should be 85633147.

Thanks again for your help.
Pierre
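For anyone repeating this workaround, here is a rough sketch of the manual step in Python: decompress the downloaded .bz2 archive and mark the result read-only. The function name install_db and both paths are hypothetical, and where the importer expects the file depends on your setup.

```python
import bz2
import os
import shutil
import stat


def install_db(archive_path, db_path):
    """Decompress a .bz2 archive to db_path and mark the result read-only."""
    # Stream the data so a multi-gigabyte archive never sits fully in memory.
    with bz2.open(archive_path, "rb") as src, open(db_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    # Read-only for owner, group, and others (mode 0o444).
    os.chmod(db_path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
```

Note that calling install_db a second time on the same db_path will normally fail with a permission error, because the existing file is no longer writable.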

missinglink commented 5 years ago

Good to hear you got it working.

It looks like 136253037 refers to the empire of France, while 85633147 refers to the country of France.

Here's a Wikipedia article which explains the distinction better than I can: https://en.m.wikipedia.org/wiki/Overseas_France
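A quick way to check which ID a project config uses is sketched below. It assumes the place filter is stored under imports.whosonfirst.importPlace in pelias.json, as either a single ID or a list; verify that key against the importer README for your version. The helper names are hypothetical.

```python
import json

FRANCE_COUNTRY_ID = 85633147  # country of France, per this thread
FRANCE_EMPIRE_ID = 136253037  # "empire" record, per this thread


def load_config(path):
    """Read a pelias.json file into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def import_place_ids(config):
    """Return the configured place IDs as a list of ints (empty if unset)."""
    # Assumption: the filter is stored at imports.whosonfirst.importPlace.
    value = config.get("imports", {}).get("whosonfirst", {}).get("importPlace")
    if value is None:
        return []
    if not isinstance(value, list):
        value = [value]
    return [int(v) for v in value]


def uses_country_id(config):
    """True if the config filters on the country record, not the empire."""
    return FRANCE_COUNTRY_ID in import_place_ids(config)
```

For example, uses_country_id(load_config("pelias.json")) returns False for a config that still points at 136253037.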