petewarden / dstk

A collection of the best open data sets and open-source tools for data science
http://www.datasciencetoolkit.org/

Locating U.S. cities #11

Open wilkens opened 13 years ago

wilkens commented 13 years ago

Geodict has trouble with some U.S. cities (and others, too, I imagine). Specifically, it doesn't seem to consider region names, so it doesn't flag things like "Brooklyn, NY" or "Chicago, IL" as named locations. It does recognize "Brooklyn, United States," but then there's the problem that it doesn't know which state is the one in question (here it defaults to Brooklyn, AL). And of course no one ever writes "Brooklyn, United States."

Looking at the code in geodict_lib.rb, it seems this shouldn't be the case, that regions/states should be matched. But it doesn't work that way when I invoke it from the web-based interface and I'm not a good enough programmer to see what the issue might be.

Note that this isn't related to the database population issue in bug #7 (https://github.com/petewarden/dstk/issues/7), which I've fixed on my machine (with your patch to the populate_database script).

Also (forgive me if this should be a separate bug), the speed-optimized matching from the end of the string forward seems to produce problems with some multi-word city names. For example, "San Francisco, United States" is matched as "Francisco, United States." I imagine taking regions into account would help, but it wouldn't solve the problem. To wit, "New York, NY" and "York, NY."
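One way to picture the multi-word problem is as a longest-match search over the words that precede a recognized region or country. The sketch below is only an illustration of that idea, not the code in geodict_lib.rb: the `CITIES` list, `MAX_CITY_WORDS`, and `longest_city_before` are hypothetical names, and the real library looks names up in its database rather than in an in-memory array.

```ruby
# Hypothetical stand-in for the gazetteer table; the real data lives in a database.
CITIES = ['san francisco', 'francisco', 'new york', 'york', 'brooklyn', 'chicago'].freeze

# Assumed upper bound on how many words a city name may span.
MAX_CITY_WORDS = 3

# Given the words that precede a recognized region or country,
# return the longest trailing run of words that is a known city, or nil.
def longest_city_before(words)
  max = [MAX_CITY_WORDS, words.length].min
  # Try the longest candidate first so "San Francisco" wins over "Francisco".
  max.downto(1) do |n|
    candidate = words.last(n).join(' ').downcase
    return candidate if CITIES.include?(candidate)
  end
  nil
end
```

With this ordering, `longest_city_before(%w[I live in San Francisco])` returns `"san francisco"` rather than `"francisco"`, and `longest_city_before(%w[New York])` returns `"new york"` rather than `"york"`. The cost is up to `MAX_CITY_WORDS` lookups per position instead of one, which is the trade-off against the speed-optimized approach.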

petewarden commented 13 years ago

Thanks for spotting those. I obviously don't have enough coverage in my unit tests. I'll dig into what caused that to break, and make sure I have a fix in the upcoming 0.40 version.

On Thu, Jun 2, 2011 at 6:23 PM, wilkens <reply@reply.github.com> wrote:

petewarden commented 13 years ago

I've just checked in some changes that should hopefully fix this. There was both a case and a whitespace issue with the state identifiers: the lookup needed something like 'CA', but we were passing in 'ca ', so matching places were missed.
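As a rough illustration of that kind of fix (not the actual change that was checked in), a state identifier can be normalized before it is used as a lookup key. The helper names and the small `STATE_NAMES` table here are hypothetical, standing in for whatever geodict_lib.rb queries.

```ruby
# Hypothetical lookup table; the real code queries a database instead.
STATE_NAMES = {
  'CA' => 'California',
  'NY' => 'New York',
  'IL' => 'Illinois'
}.freeze

# Trim surrounding whitespace and upper-case the code so that
# 'ca ' and 'CA' resolve to the same key.
def normalize_state_code(raw)
  raw.to_s.strip.upcase
end

# Returns the state name for a raw identifier, or nil if it is unknown.
def lookup_state(raw)
  STATE_NAMES[normalize_state_code(raw)]
end
```

Here `lookup_state('ca ')` and `lookup_state('CA')` both return `"California"`, whereas indexing the table directly with `'ca '` would return `nil`.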

wilkens commented 13 years ago

Thanks, Pete. I don't have a system up to pull the code for testing, but will keep an eye on the next release.