openaustralia / planningalerts

Find out and have your say about what's being built and knocked down in your area.
https://www.planningalerts.org.au

Experiment with alternative geocoding #1292

Open mlandauer opened 5 years ago

mlandauer commented 5 years ago

We are currently very dependent on a grant from Google for geocoding, address autocomplete, maps and streetview for PlanningAlerts. If we had to pay for this ourselves we would be paying around $15,000 AUD / month.

Also, under the Google geocoder's TOS we are bound to use Google Maps for showing the results.

If we were to use a different geocoder with a more liberal TOS, we could display the results with different maps too, using one of the many mapping services out there based on OpenStreetMap or even hosting a map tile server ourselves.

It would be good to investigate/experiment with using http://mappify.io/ which is an Australian-only geocoder that uses GNAF under the hood. There are almost no restrictions on the use of the geocoded data. We could also investigate how tricky it would be to make our own geocoder using the GNAF data.

jamezpolley commented 5 years ago

#1281 contains several examples of addresses which are in the GNAF and findable in other geocoders, but can't be found by Google.

jamezpolley commented 5 years ago

https://github.com/openaustralia/planningalerts/issues?q=is%3Aissue+is%3Aopen+label%3Ageocoding has several other issues with geocoding, including the new developments issue.

mlandauer commented 5 years ago

Another potential service to look at: https://www.addressify.com.au/

mlandauer commented 5 years ago

After a bit more thinking I have a potential plan of attack. Of course, any ideas and suggestions are very welcome. It's a pretty conservative approach, based on the understanding that this is a hard problem to properly understand: we're geocoding all sorts of addresses in all sorts of formats, and different geocoders have different problems. Also, right now we're only finding out about the problematic cases from users who let us know. So, it's not about making a user-visible change in production just yet, but about collecting more systematic data. That data will let us understand how widespread this problem is, how much of it is caused by new land developments which Google doesn't know about yet, and how much is caused by the address parsing of the Google geocoder returning results in crazy places far away from where they should be.

The idea is:

Then, we can switch stuff over in production

jamezpolley commented 5 years ago

Update the geocoder to send the address string to both google geocoder and the mappify geocoder. Return only the result of the google geocoder to the rest of the application

Can we make this more broad than just comparing google and mappify?

I'd be interested in "google as we use it now" vs "Google with postcode and suburb provided as extra hints" or "Google with a bounding box to narrow down the search area".
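As a rough sketch of what those two variants could look like as requests: the Google Geocoding API accepts a `components` filter and a `bounds` viewport bias alongside `address`. The helper name `build_geocode_url` and the example values are made up for illustration; this is not how PlanningAlerts currently builds its requests.

```python
from urllib.parse import urlencode

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(address, api_key, postcode=None, suburb=None, bounds=None):
    """Build a Geocoding API URL with optional hints (hypothetical helper).

    bounds is ((south, west), (north, east)) in decimal degrees.
    """
    params = {"address": address, "key": api_key}

    # Component filters: restrict to Australia, optionally add postcode/suburb.
    components = ["country:AU"]
    if postcode:
        components.append(f"postal_code:{postcode}")
    if suburb:
        components.append(f"locality:{suburb}")
    params["components"] = "|".join(components)

    # Viewport bias: results inside the box are preferred, not guaranteed.
    if bounds:
        (south, west), (north, east) = bounds
        params["bounds"] = f"{south},{west}|{north},{east}"

    return f"{GEOCODE_URL}?{urlencode(params)}"
```

Each variant ("as we use it now", "with hints", "with bounding box") would then be one extra scheme to log against the same address.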

mlandauer commented 5 years ago

Yes, absolutely. The new database schema could support any number of simultaneous checks against different geocoders and schemes with geocoders. So, yes totally possible to do.

I figure though if we just start with the super simple case of two geocoders we'll get a pretty exhaustive list of addresses that are problematic and then we can do a more full comparison with different techniques without having to run every geocoder scheme against every address which would potentially triple our use of the Google geocoder.
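To make the "two geocoders, only one visible" idea concrete, here is a minimal sketch. The function and parameter names are hypothetical, and the geocoders are passed in as plain callables rather than the real Google and Mappify clients. The shadow geocoder is never allowed to affect what the rest of the application sees.

```python
import logging
from typing import Callable, Optional, Tuple

LatLng = Optional[Tuple[float, float]]

logger = logging.getLogger("geocode_comparison")


def geocode_with_shadow(
    address: str,
    primary: Callable[[str], LatLng],
    shadow: Callable[[str], LatLng],
    record: Callable[[str, LatLng, LatLng], None],
) -> LatLng:
    """Geocode with both services, record both results, return only primary."""
    primary_result = primary(address)

    # A failure in the experimental geocoder must not break production.
    try:
        shadow_result = shadow(address)
    except Exception:
        logger.exception("Shadow geocoder failed for %r", address)
        shadow_result = None

    # record() stands in for writing a row to the comparison table.
    record(address, primary_result, shadow_result)
    return primary_result
```

Supporting more schemes later would mean passing a list of shadow callables instead of one, at the cost of extra requests per address.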

LoveMyData commented 5 years ago

I am thinking Area Lookup is the easiest to validate, as we already know which council/LGA we are collecting the data from. E.g. geocode an address with Google, then check the resulting lat/lng with GNAF's area lookup.
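A sketch of that check, assuming we had each council's boundary available locally as a simple polygon of (lat, lng) points. That swaps a plain ray-casting point-in-polygon test in for the hosted area lookup, and the boundary data here is invented for the example.

```python
def point_in_polygon(lat, lng, polygon):
    """Ray-casting test: is (lat, lng) inside the polygon of (lat, lng) points?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        lat1, lng1 = polygon[i]
        lat2, lng2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (lat1 > lat) != (lat2 > lat):
            # Longitude at which this edge crosses the point's latitude.
            cross_lng = lng1 + (lat - lat1) * (lng2 - lng1) / (lat2 - lat1)
            if lng < cross_lng:
                inside = not inside
    return inside


def geocode_is_in_council(latlng, council_boundary):
    """Flag geocoder results that land outside the council we scraped from."""
    lat, lng = latlng
    return point_in_polygon(lat, lng, council_boundary)


# Made-up rectangular "council" for illustration only.
example_boundary = [(-34.0, 150.0), (-34.0, 151.0), (-33.0, 151.0), (-33.0, 150.0)]
```

Real LGA boundaries are multi-polygons with holes, so a proper version would need a GIS library or the area lookup service itself.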

Not sure if PA is ignoring results with location_type "GEOMETRIC_CENTER" or "APPROXIMATE". Ignoring them would help to increase the accuracy.
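For illustration, a filter like this could work on the parsed JSON of a single Google result, where `geometry.location_type` is one of ROOFTOP, RANGE_INTERPOLATED, GEOMETRIC_CENTER or APPROXIMATE. The function name is hypothetical, and whether RANGE_INTERPOLATED is precise enough is a separate judgment call.

```python
# Location types treated as too vague to place an application on a map.
IMPRECISE_LOCATION_TYPES = {"GEOMETRIC_CENTER", "APPROXIMATE"}


def precise_location(result):
    """Return (lat, lng) from one geocoder result, or None if it is imprecise."""
    geometry = result.get("geometry", {})
    if geometry.get("location_type") in IMPRECISE_LOCATION_TYPES:
        return None
    location = geometry.get("location")
    if location is None:
        return None
    return (location["lat"], location["lng"])
```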

GNAF will not solve the issue for new developments. Re-querying every 3 or 6 months will fix most of those after a few tries.

@jamezpolley I did try to use viewport bias and got some very mixed results, but that was over a year ago.

mlandauer commented 5 years ago

The changes made to log instances where the geocoding results from Google and Mappify differ have now been running in production for a week or so. The results are visible at https://www.planningalerts.org.au/geocode_queries (note this is a temporary URL for this work and is probably only going to work for a relatively short while). It has collected around 250 address examples where the geocoders give different results.

The majority of "bad" results are within the same suburb, usually something relatively small like getting the house number wrong. However, in the cases where the differences are large this causes some significant confusion to users. Therefore, in the short term let's focus our attention on those cases.
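One way to separate the "wrong house number" cases from the "wrong part of the country" cases is to sort the logged pairs by the distance between the two results. A minimal sketch using the haversine formula follows; the 2 km threshold is an arbitrary assumption, not something we've tuned.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lng) points."""
    lat1, lng1 = map(math.radians, a)
    lat2, lng2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlng = lng2 - lng1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlng / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def large_differences(rows, threshold_m=2000):
    """Keep (address, google, other) rows whose results are far apart.

    Returns (distance_m, address) tuples, largest distance first.
    """
    flagged = []
    for address, google, other in rows:
        if google is None or other is None:
            continue  # nothing to compare
        distance = haversine_m(google, other)
        if distance > threshold_m:
            flagged.append((distance, address))
    return sorted(flagged, reverse=True)
```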

I think there are a couple of relatively simple strategies we could employ to make the results from the google geocoder more accurate. They both fall under the category of making the geocoder more conservative and flagging an error if the results are not reliable rather than doing its best guess which is often very very wrong.

Originally I wasn't too keen on the idea of doing a pre-processing step as I didn't want to get into the territory of parsing free text addresses. That feels like something you could easily spend a very long time getting right. But then I found out about libpostal, which is a state-of-the-art parser which uses NLP and machine learning across a large corpus (1 billion) of addresses to build a model which works internationally. It's not just a pile of regular expressions.

So, let's experiment with adding libpostal as a pre-processing step and see what happens. I'll split that into a separate issue.

mlandauer commented 5 years ago

I just noticed something super useful that the Google geocoding API returns (but, as far as I'm aware, isn't passed down by geokit): if the result is a "partial" match, it will include a partial_match field in the results. I would guess that in the cases where the suburb is different in the result we'll find that it's a "partial" match, in which case we should just flag this as an error and move on. This approach has the advantage that it doesn't require libpostal.

https://developers.google.com/maps/documentation/geocoding/intro#Results
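A sketch of that check against the raw JSON response, assuming we read the response ourselves rather than going through geokit. The function name is hypothetical; `status`, `results`, `partial_match` and `geometry.location` are fields of the Geocoding API response.

```python
def location_unless_partial(response):
    """Return (lat, lng) of the first result, or None if it can't be trusted.

    A result carrying partial_match means the geocoder could not match the
    whole address, so it is treated as an error instead of a best guess.
    """
    if response.get("status") != "OK" or not response.get("results"):
        return None
    result = response["results"][0]
    if result.get("partial_match"):
        return None
    location = result["geometry"]["location"]
    return (location["lat"], location["lng"])
```

Whether partial matches really line up with the wrong-suburb cases is still only a guess until it's checked against the logged examples.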

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because there has been no activity on it for a year. If you want to keep it open please make a comment and explain why this issue is still relevant. Otherwise it will be automatically closed in a week. Thank you!

mlandauer commented 3 years ago

There are definitely still improvements to be made to the geocoding, so I think this one is worth leaving open.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because there has been no activity on it for about six months. If you want to keep it open please make a comment and explain why this issue is still relevant. Otherwise it will be automatically closed in a week. Thank you!

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because there has been no activity on it for about six months. If you want to keep it open please make a comment and explain why this issue is still relevant. Otherwise it will be automatically closed in a week. Thank you!

katska commented 1 year ago

🤷‍♀️ @mlandauer still worth leaving this open?

stale[bot] commented 7 months ago

This issue has been automatically marked as stale because there has been no activity on it for about six months. If you want to keep it open please make a comment and explain why this issue is still relevant. Otherwise it will be automatically closed in a week. Thank you!

JoannaHill commented 1 month ago

@mlandauer can this issue be closed?