pelias / whosonfirst

Importer for Who's on First gazetteer
MIT License

Import Who's on First venues #94

Closed: orangejulius closed this issue 4 years ago

orangejulius commented 8 years ago

Who's on First now includes many venues. The data is split across several hundred repos in the whosonfirst-data GitHub organization, so a big challenge will simply be gathering all the data. Several of the repositories use git-lfs as well.

On the importer side, we are currently able to squeeze all the WOF administrative area records into memory, which obviously won't work with millions of venues.
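To illustrate the memory concern, here is a minimal sketch of streaming records one at a time rather than holding them all in memory. It is not the importer's actual code (the importer is a Node.js project; Python is used here only for illustration). The function names and the flat "one `.geojson` file per record" layout are assumptions.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def iter_wof_records(data_dir: str) -> Iterator[dict]:
    """Lazily yield one parsed GeoJSON record at a time.

    Assumes one record per *.geojson file somewhere under data_dir.
    """
    for path in sorted(Path(data_dir).rglob("*.geojson")):
        with path.open(encoding="utf-8") as handle:
            yield json.load(handle)


def count_by_placetype(records: Iterable[dict]) -> dict:
    """Consume the stream, keeping only small running totals in memory."""
    counts: dict = {}
    for record in records:
        placetype = record.get("properties", {}).get("wof:placetype", "unknown")
        counts[placetype] = counts.get(placetype, 0) + 1
    return counts
```

Because `iter_wof_records` is a generator, only the record currently being processed is held in memory, regardless of how many venue files exist on disk.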

has to be done to allow for dev work

has to be done before production readiness

can be done as follow up improvements

trescube commented 8 years ago

I was poking around in the venue data recently and noticed that there are some Manhattan records with multiple hierarchies that are also placed in New Jersey.

orangejulius commented 8 years ago

Are there enough that reporting them and fixing them manually(-ish) would be difficult?

trescube commented 8 years ago

I found 4090 just in that area but am working on a script to check elsewhere.
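A check along these lines could be sketched as follows. This is not trescube's script, just a hypothetical illustration: it assumes each record carries a `wof:hierarchy` list whose entries may include a `region_id`, and it flags records whose hierarchies disagree about the region. The IDs in the example are placeholders, not real WOF IDs.

```python
def regions_in_hierarchies(record: dict) -> set:
    """Collect the distinct region IDs across all of a record's hierarchies."""
    hierarchies = record.get("properties", {}).get("wof:hierarchy", [])
    return {h["region_id"] for h in hierarchies if "region_id" in h}


def has_conflicting_regions(record: dict) -> bool:
    """True when the hierarchies place the record in more than one region."""
    return len(regions_in_hierarchies(record)) > 1


# Placeholder IDs: 1 stands in for New York, 2 for New Jersey.
suspect = {"properties": {"wof:hierarchy": [{"region_id": 1}, {"region_id": 2}]}}
clean = {"properties": {"wof:hierarchy": [{"region_id": 1}]}}

flagged = [r for r in (suspect, clean) if has_conflicting_regions(r)]
```

A flagged record is only a candidate for review, since some places may legitimately carry more than one hierarchy.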

orangejulius commented 7 years ago

Taking a look at the acceptance tests, there are 5 different issues happening. You can compare against dev2 as of this writing (October 13, 2016) to see the difference.

Daly City

I believe this is a variant on the issue where we almost never return admin areas for autocomplete queries with a focus. There were already venues being returned ahead of Daly City, now there are just more.

4th and King

There's a new entry for the 4th and King transit station in SF. This one is probably ok.

Newfoundland and Labrador

[Screenshot from 2016-10-13 19-21-26]

The scores for the venues that start with "Newfoundland and Labrador" are actually identical to the region's score. Perhaps we should apply a small boost to all admin areas? Even a 1.1x boost here would be enough. I'll investigate later.
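As a rough illustration of how a small multiplicative boost breaks a tie like this one, here is a sketch. It is not Pelias's actual scoring code: the layer names, the venue name, and the score values are made up for the example.

```python
# Assumption: which layers count as "admin areas" for boosting purposes.
ADMIN_LAYERS = {"region", "county", "locality"}


def boosted_score(score: float, layer: str, admin_boost: float = 1.1) -> float:
    """Multiply the score of admin-area results by a small constant factor."""
    return score * admin_boost if layer in ADMIN_LAYERS else score


# Hypothetical results with identical raw scores; the venue happens to come first.
results = [
    {"name": "Newfoundland and Labrador Housing", "layer": "venue", "score": 2.0},
    {"name": "Newfoundland and Labrador", "layer": "region", "score": 2.0},
]

ranked = sorted(
    results,
    key=lambda r: boosted_score(r["score"], r["layer"]),
    reverse=True,
)
```

With the boost applied, the region sorts ahead of the venue even though the raw scores are tied. A boost this small only resolves ties and near-ties, which is why it would not help where one score is significantly higher.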

Maui, Hawaii

[Screenshot from 2016-10-13 19-31-27]

This actually has nothing to do with the duplicate Maui. It appears to be simply because "Maui Maui" is shorter than "Maui County", so its relevance score is higher. Other "Maui XXXX" results show up with a tied score. Here the score is significantly higher for "Maui Maui", so I don't think we can boost our way out of it. One solution might be to add "Maui" as an alt name for the county, but this would mean we can't fix it until next quarter.

New South Wales

We already return the Geonames record for New South Wales first, but it has the name "State of New South Wales". It's boosted by its population, but the WOF record (name: "New South Wales") has no population info. I think this one is ok, and additionally we can and should add the population data to WOF.

Summary

Other than Maui Maui, most of these are easily fixable.

orangejulius commented 6 years ago

I have received word from the WOF team that WOF Venues are pretty low priority for them, as there's lots of other work to be done. At this time enabling venue imports should still be as easy as toggling a config flag (importVenues in pelias.json). We welcome reports of how well this works out for people, but don't intend to support it as a production-ready configuration any time soon.
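For anyone trying this, here is a small sketch of toggling the flag programmatically. It assumes `importVenues` sits under `imports.whosonfirst` in `pelias.json`; check the importer's README for the exact location. The `datapath` value is a placeholder and `enable_venue_import` is a hypothetical helper.

```python
import json


def enable_venue_import(config: dict) -> dict:
    """Return a copy of the config with venue imports switched on.

    Assumes the flag lives at imports.whosonfirst.importVenues.
    """
    updated = json.loads(json.dumps(config))  # cheap deep copy of JSON data
    wof = updated.setdefault("imports", {}).setdefault("whosonfirst", {})
    wof["importVenues"] = True
    return updated


# Placeholder config; the datapath value is only an example.
config = {"imports": {"whosonfirst": {"datapath": "/data/whosonfirst"}}}
print(json.dumps(enable_venue_import(config), indent=2))
```

Editing `pelias.json` by hand to add the same key is equivalent; the helper only shows where the flag is assumed to live.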

orangejulius commented 4 years ago

After some recent discussion it sounds like we have no plans to continue supporting WOF venue downloads going forward. The new data hosting for Who's on First sponsored by Geocode Earth is not going to publish them, and we expect to remove support for this functionality in this importer.