pelias / whosonfirst

Importer for Who's on First gazetteer
MIT License

Localities missing from GB import #508

Closed: tomtaylor closed this issue 4 years ago

tomtaylor commented 4 years ago

I've just set up a test Pelias installation locally using Docker. I'm using this pelias.json to load in the whole of the UK, and running the following commands:

pelias download all
pelias prepare all
pelias import wof

(I don't need street/address geocoding.)

Most of the places I'd expect have loaded fine, but some are missing, for example Nuneaton, Huddersfield and Sittingbourne. They all exist in my local WOF SQLite database, but aren't present in the Elasticsearch index. They work fine on the geocode.earth online tool.

Take Huddersfield. It's not in the Elasticsearch index:

> curl -I http://localhost:9200/pelias/_doc/whosonfirst:locality:101750573
HTTP/1.1 404 Not Found
content-type: application/json; charset=UTF-8
content-length: 87

While a sibling locality, Holme Valley, has loaded just fine:

> curl -I http://localhost:9200/pelias/_doc/whosonfirst:locality:1360754629
HTTP/1.1 200 OK
content-type: application/json; charset=UTF-8
content-length: 943
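The same existence check can be scripted for several IDs at once. The following is a minimal sketch, assuming the host, index name and `whosonfirst:<layer>:<id>` document ID pattern seen in the curl commands above; the function names `doc_url`, `doc_exists` and `report` are hypothetical helpers, not part of Pelias.

```python
import urllib.error
import urllib.request


def doc_url(layer, wof_id, host="http://localhost:9200", index="pelias"):
    """Build a document URL in the same shape as the curl commands above.

    The default host and index are assumptions copied from this thread.
    """
    return f"{host}/{index}/_doc/whosonfirst:{layer}:{wof_id}"


def doc_exists(url, timeout=5):
    """Send a HEAD request: True on 200, False on 404, re-raise anything else."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise


def report(ids_by_name, layer="locality"):
    """Map each place name to whether its document is present in the index."""
    return {
        name: doc_exists(doc_url(layer, wof_id))
        for name, wof_id in ids_by_name.items()
    }
```

Against a running local index, calling `report({"Huddersfield": 101750573, "Holme Valley": 1360754629})` would be expected to mirror the 404 and 200 responses shown above.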

I've run pelias import wof multiple times, with no errors produced. I've also tried flushing the index.

Is there a way of debugging why they might not be getting loaded?

tomtaylor commented 4 years ago

OK, so I've inspected the GeoJSON in the SQLite database more thoroughly, and it looks like the places that haven't been imported only have rows with is_alt set to true:

# Holme Valley, works fine
sqlite> SELECT id, source, is_alt FROM geojson WHERE id = 1360754629;
1360754629|quattroshapes|0
# Huddersfield, doesn't
sqlite> SELECT id, source, is_alt FROM geojson WHERE id = 101750573;
101750573|quattroshapes|1
101750573|quattroshapes_pg|1
# Nuneaton, doesn't
sqlite> SELECT id, source, is_alt FROM geojson WHERE id = 101750471;
101750471|quattroshapes_pg|1
101750471|whosonfirst|1
# Sittingbourne, doesn't
sqlite> SELECT id, source, is_alt FROM geojson WHERE id = 101853501;
101853501|quattroshapes|1
101853501|quattroshapes_pg|1
101853501|whosonfirst|1
# Hackney, works fine
sqlite> SELECT id, source, is_alt FROM geojson WHERE id = 1158857273;
1158857273|gbr-datalondon|0

It looks like this is expected behaviour with the whosonfirst importer.
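Rather than querying one ID at a time, the whole database can be scanned for IDs that have no row with is_alt = 0. This is a minimal sketch, assuming only the `geojson` table and the `id`, `source` and `is_alt` columns used in the queries above; `demo_connection` is a hypothetical helper that rebuilds the rows quoted in this comment in memory so the query can be tried without the real file.

```python
import sqlite3


def ids_without_primary(conn):
    """Return the sorted ids that have rows in geojson but none with is_alt = 0."""
    rows = conn.execute(
        """
        SELECT id
        FROM geojson
        GROUP BY id
        HAVING SUM(CASE WHEN is_alt = 0 THEN 1 ELSE 0 END) = 0
        ORDER BY id
        """
    )
    return [row[0] for row in rows]


def demo_connection():
    """Hypothetical in-memory table holding the rows quoted in this thread."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE geojson (id INTEGER, source TEXT, is_alt INTEGER)")
    conn.executemany(
        "INSERT INTO geojson VALUES (?, ?, ?)",
        [
            (1360754629, "quattroshapes", 0),     # Holme Valley
            (101750573, "quattroshapes", 1),      # Huddersfield
            (101750573, "quattroshapes_pg", 1),
            (101750471, "quattroshapes_pg", 1),   # Nuneaton
            (101750471, "whosonfirst", 1),
            (101853501, "quattroshapes", 1),      # Sittingbourne
            (101853501, "quattroshapes_pg", 1),
            (101853501, "whosonfirst", 1),
            (1158857273, "gbr-datalondon", 0),    # Hackney
        ],
    )
    return conn
```

To run it against the real distribution, pass `sqlite3.connect("whosonfirst-data-admin-gb-latest.db")` instead of the demo connection.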

I'm now thinking this might be an issue with how the SQLite distribution is generated... @missinglink it looks like you might be working on something related?

missinglink commented 4 years ago

Sounds a lot like the bug I fixed yesterday. https://github.com/pelias/wof/pull/13

Please try downloading the SQLite database again and checking the same IDs. You should find exactly one record with is_alt=0 per ID.

tomtaylor commented 4 years ago

Thanks @missinglink - I don't think whosonfirst-data-admin-gb-latest.db.bz2 has updated yet. I still get the same results with the new file. Is this still rolling out or did something go awry?

missinglink commented 4 years ago

Can you please post a shasum of the database file and paste a query that shows no is_alt=0? I'll have a look tomorrow.
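For anyone without the shasum tool to hand, an equivalent SHA-256 digest can be computed with Python's standard library. This is a sketch only; `sha256_of_file` is a hypothetical helper name, and the file is read in chunks because the database is several hundred megabytes.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it chunk by chunk."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The result for a given file should match the first column printed by `shasum -a 256` for that file.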

missinglink commented 4 years ago

Agh damn I think you're right https://github.com/whosonfirst-data/whosonfirst-data-admin-gb/blob/master/data/101/853/501/101853501-alt-whosonfirst.geojson

I'll figure out a fix

tomtaylor commented 4 years ago

Sure thing, thanks for that.

> shasum -a 256 whosonfirst-data-admin-gb-latest.db.bz2
044dc0e263647a487dc192740f7619ee1536c8cf3f8c927a1d7f09e862cb0c09  whosonfirst-data-admin-gb-latest.db.bz2
> shasum -a 256 whosonfirst-data-admin-gb-latest.db
d6a43a27bc6fd6412400d3b679e5c1a417b58fd7fc59a9cf14c05531c00c992b  whosonfirst-data-admin-gb-latest.db
> sqlite3 whosonfirst-data-admin-gb-latest.db
SQLite version 3.28.0 2019-04-15 14:49:49
Enter ".help" for usage hints.
sqlite> SELECT id, source, is_alt FROM geojson WHERE id = 101750573;
101750573|quattroshapes|1
101750573|quattroshapes_pg|1
sqlite>

missinglink commented 4 years ago

Fix merged in https://github.com/pelias/wof/pull/16, data files are being regenerated by @pelias-bot

tomtaylor commented 4 years ago

Great, thank you!

missinglink commented 4 years ago

shasum -a 256 whosonfirst-data-admin-gb-latest.db
14d758d982e0d2661563ce761fc7d079df981a4eee1cf11d694fa28dbebf4e69  whosonfirst-data-admin-gb-latest.db
sqlite3 whosonfirst-data-admin-gb-latest.db 'SELECT id, source, alt_label, is_alt FROM geojson WHERE id = 101750573;'
101750573|quattroshapes||0
101750573|quattroshapes|quattroshapes|1
101750573|quattroshapes_pg|quattroshapes_pg|1

Looks like it was fixed. The files are generated alphabetically and it's up to 'H', so they'll all get uploaded in the next couple of hours.

missinglink commented 4 years ago

curl 'https://data.geocode.earth/wof/dist/sqlite/whosonfirst-data-admin-gb-latest.db.bz2' | lbunzip2 | tee >(shasum -a 256) > whosonfirst-data-admin-gb-latest.db
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  367M  100  367M    0     0  13.4M      0  0:00:27  0:00:27 --:--:-- 15.8M

14d758d982e0d2661563ce761fc7d079df981a4eee1cf11d694fa28dbebf4e69  -
sqlite3 whosonfirst-data-admin-gb-latest.db 'SELECT id, source, alt_label, is_alt FROM geojson WHERE id = 101750573;'
101750573|quattroshapes||0
101750573|quattroshapes|quattroshapes|1
101750573|quattroshapes_pg|quattroshapes_pg|1
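The decompress-and-hash step of the pipeline above can also be sketched in Python, for systems without lbunzip2 or tee. This is an illustration under stated assumptions, not the project's tooling: `decompress_and_hash` is a hypothetical helper that reads any binary stream of bzip2 data, writes the decompressed bytes to an output stream, and returns the SHA-256 of the decompressed content.

```python
import bz2
import hashlib


def decompress_and_hash(compressed_stream, output_stream, chunk_size=1024 * 1024):
    """Decompress bzip2 data to output_stream and return its SHA-256 hex digest.

    bz2.BZ2File is used because it also handles inputs made of several
    concatenated bzip2 streams.
    """
    digest = hashlib.sha256()
    with bz2.BZ2File(compressed_stream) as plain:
        for chunk in iter(lambda: plain.read(chunk_size), b""):
            digest.update(chunk)        # hash the decompressed bytes
            output_stream.write(chunk)  # and write the same bytes out
    return digest.hexdigest()
```

In use, `compressed_stream` could be the response object from `urllib.request.urlopen` on the download URL above and `output_stream` a local file opened with mode `"wb"`; the returned digest can then be compared with the one posted in this comment.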

missinglink commented 4 years ago

Thanks for the bug report. The store.sqlite3.gz file we are hosting will also need regeneration, so I'll kick that off now; it takes hours to complete.

If the problem is solved for you, please close the GitHub issue. FYI, we recently started an OpenCollective; we are hoping to use the funds to hire someone part-time to keep the community assets/code up to date.

missinglink commented 4 years ago

This issue should now be resolved?

missinglink commented 4 years ago

Please reopen if you find it's not fixed.