petewarden / dstk

A collection of the best open data sets and open-source tools for data science
http://www.datasciencetoolkit.org/

Google-style geocoder has difficulty parsing less formal addresses #39

Closed zslayton closed 10 years ago

zslayton commented 10 years ago

I've set up an AWS instance running version 0.51 and am having issues getting it to recognize common versions of some addresses.

For example, the Village Voice building in New York City is located at

36 Cooper Square, New York City, New York 10003

However, the following versions of the address all result in a Lat/Lng located in Turkey.

36 Cooper Square, New York City
36 Cooper Square, nyc
36 Cooper Square, nyc, ny
36 Cooper Square, 10003

Example URL:

/maps/api/geocode/json?sensor=false&address=36%20cooper%20sq,%20nyc,%20ny

The resulting JSON looks like:

{
  "status": "OK",
  "results": [
    {
      "address_components": [
        {
          "long_name": "36",
          "types": [
            "administrative_area_level_1",
            "political"
          ],
          "short_name": "36"
        },
        {
          "long_name": "Turkey",
          "types": [
            "country",
            "political"
          ],
          "short_name": "tr"
        }
      ],
      "geometry": {
        "location_type": "APPROXIMATE",
        "viewport": {
          "southwest": {
            "lat": 38.9389,
            "lng": 34.3244
          },
          "northeast": {
            "lat": 40.9389,
            "lng": 36.3244
          }
        },
        "location": {
          "lat": 39.9389,
          "lng": 35.3244
        }
      },
      "types": [
        "administrative_area_level_1",
        "political"
      ]
    }
  ]
}
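
For anyone who wants to script this comparison across several address variants, here is a minimal Python sketch. It only builds the request URL and pulls the country codes out of a response body; it does not perform the HTTP request. The function names and the base URL are hypothetical, and the parser accepts both the wrapped `{"status", "results"}` shape shown above and a bare list of results, which is an assumption about how replies may vary between servers.

```python
import json
from urllib.parse import quote, urlencode


def build_geocode_url(base_url, address):
    """Return the Google-style geocode URL for `address` on a DSTK server."""
    # quote_via=quote encodes spaces as %20, matching the example URL above.
    query = urlencode({"sensor": "false", "address": address}, quote_via=quote)
    return f"{base_url.rstrip('/')}/maps/api/geocode/json?{query}"


def country_codes(response_text):
    """Collect the upper-cased short_name of every 'country' component."""
    payload = json.loads(response_text)
    # Accept either {"status": ..., "results": [...]} or a bare list.
    results = payload["results"] if isinstance(payload, dict) else payload
    codes = []
    for result in results:
        for component in result.get("address_components", []):
            if "country" in component.get("types", []):
                codes.append(component["short_name"].upper())
    return codes
```

A result whose country code is not `US` for a New York address, such as the `TR` that the response above would produce, is a quick signal that the geocoder misread the street number.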
zslayton commented 10 years ago

Interestingly, running the same queries against www.datasciencetoolkit.org (which says it's also running 0.51) yields much better results!

Querying

36 cooper sq, nyc

yields:

[
  {
    "geometry": {
      "viewport": {
        "northeast": {
          "lng": -73.949592590332,
          "lat": 40.770532
        },
        "southwest": {
          "lng": -74.036094665527,
          "lat": 40.699989318848
        }
      },
      "location": {
        "lng": -73.992602,
        "lat": 40.742185
      },
      "location_type": "APPROXIMATE"
    },
    "address_components": [
      {
        "short_name": "New York",
        "types": [
          "locality",
          "political"
        ],
        "long_name": "New York, NY, US"
      },
      {
        "short_name": "US",
        "types": [
          "country",
          "political"
        ],
        "long_name": "USA"
      }
    ],
    "types": [
      "locality",
      "political"
    ]
  }
]

Given that all I did was fire up the US East AMI (ami-9386d1fa), I'm not sure what might be different about my environment.

Thanks, by the way, for starting this project. It's been enormously helpful!

petewarden commented 10 years ago

Thanks for the kind words, and sorry you're hitting problems with those addresses. There are two issues:

1 - I'm guessing the TwoFishes process isn't running on your AMI. The main code falls back on TwoFishes for city-level results on addresses it otherwise can't recognise, but because it runs as a separate service it sometimes doesn't start when the instance boots up. I'm still trying to understand why, but as a temporary fix try logging into the server and running sudo service twofishes start. It will take about a minute before it's ready to start serving, but hopefully that will start getting you the same results as the main site.

2 - Unfortunately, the geocoding logic isn't smart enough to locate the address without the ZIP in a lot of cases. The examples are very helpful, thanks; they'll be useful for improving the code. The bulk of the work for US addresses is actually done by this package: https://github.com/geocommons/geocoder . Most of the logic is in these files, if you fancy taking a poke at it too: https://github.com/geocommons/geocoder/tree/master/lib/geocoder/us
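
Since the service needs about a minute after starting before it can serve requests, a small readiness check can save some guesswork. The following is a rough Python sketch that polls a TCP port until it accepts connections. The function name and its defaults are made up for illustration, and the port TwoFishes listens on is not stated in this thread, so it is left as a parameter you would need to fill in from your own configuration.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=90.0, poll_s=2.0):
    """Poll until a TCP connection to (host, port) succeeds or time runs out.

    Returns True as soon as a connection is accepted, False once the
    deadline has passed without a successful connection.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=poll_s):
                return True
        except OSError:
            # Refused or timed out: give up if past the deadline, else retry.
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll_s)
```

An open port only shows that the process is listening, not that its index has finished loading, so a real geocoding query is still the more reliable final check.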

Let me know if that helps!

zslayton commented 10 years ago

Thanks Pete!

I dug around a bit, and it looks like twofishes fails to start up due to the JVM not having enough heap space. I initially assumed that was because I ran it on an m1.small, but upgrading my instance to an m1.large and rebooting resulted in the same error.

You can see the logs in /var/log/upstart/twofishes.log.
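
To check a log for this kind of failure without reading it by hand, something like the Python sketch below could work. The exact message in twofishes.log is not quoted in this thread, so the two marker strings are an assumption: they are messages the HotSpot JVM commonly prints when it runs out of heap or cannot reserve it at startup. The function name is hypothetical.

```python
# Assumed markers: common HotSpot heap-failure messages, not text
# confirmed to appear in twofishes.log.
HEAP_ERROR_MARKERS = (
    "java.lang.OutOfMemoryError",
    "Could not reserve enough space for object heap",
)


def find_heap_errors(lines):
    """Return (line_number, text) pairs for lines containing a heap marker."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(marker in line for marker in HEAP_ERROR_MARKERS):
            hits.append((number, line.rstrip("\n")))
    return hits
```

Passing an open file object for /var/log/upstart/twofishes.log as `lines` would scan the log line by line.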

I'm going to toy with the JVM flags and see if I can't get it to boot successfully.

zslayton commented 10 years ago

Ha! Fixed.

I had to make a change to ~/sources/dstk/twofishesd.sh

I replaced

java -jar /home/ubuntu/sources/twofishes/bin/twofishes.jar --hfile_basepath /home/ubuntu/sources/twofishes/data/latest/

with

java -Xmx1500M -jar /home/ubuntu/sources/twofishes/bin/twofishes.jar --hfile_basepath /home/ubuntu/sources/twofishes/data/latest/

1500MB of heap might be a bit over-the-top, but it got it running again. There's probably a saner default you could use. I imagine whatever instance type you're using for the main DSTK site has >6GB of memory, so the JVM gets at least 1500MB of heap by default.
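
One way to avoid hard-coding the heap size is to make it configurable. The Python sketch below builds the same java command line with the heap size as a parameter. The paths are the ones quoted above; the function names and the TWOFISHES_HEAP_MB environment variable are invented for this example and are not part of DSTK.

```python
import os

JAR_PATH = "/home/ubuntu/sources/twofishes/bin/twofishes.jar"
HFILE_BASEPATH = "/home/ubuntu/sources/twofishes/data/latest/"


def twofishes_command(heap_mb=1500):
    """Build the java argument list with an explicit maximum heap size."""
    if heap_mb <= 0:
        raise ValueError("heap_mb must be positive")
    return [
        "java",
        f"-Xmx{heap_mb}M",  # JVM options must appear before -jar
        "-jar", JAR_PATH,
        "--hfile_basepath", HFILE_BASEPATH,
    ]


def heap_from_env(environ=os.environ, default_mb=1500):
    """Read the heap size in MB from a hypothetical environment variable."""
    # TWOFISHES_HEAP_MB is a made-up name used only in this sketch.
    return int(environ.get("TWOFISHES_HEAP_MB", default_mb))
```

The resulting list can be handed to `subprocess.run`, or joined with spaces to produce the line for the shell script.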

petewarden commented 10 years ago

Ah, that's great, thanks! If you want to give me a pull request with the change I'll update the main repo, or I can manually apply the patch myself next time I touch the code.

zslayton commented 10 years ago

Happy to! I may not get around to it until this weekend though.

It might also be helpful to put a friendly "Please don't try to use an m1.small" warning next to the EC2 how-to in the documentation page. Should I do that as well?

On Wed, Feb 19, 2014 at 3:44 PM, Pete Warden notifications@github.com wrote:

Ah, that's great, thanks! If you want to give me a pull request with the change I'll update the main repo, or I can manually apply the patch myself next time I touch the code.


petewarden commented 10 years ago

That's a good idea. I know some folks have been experimenting with smalls, but I hadn't tried it myself: https://groups.google.com/forum/#!searchin/dstk-users/small/dstk-users/th1wMQDbzg0/T_I9eDiy6qkJ

zslayton commented 10 years ago

I'm going to close this in light of the merged PR.