petewarden / geodict

A simple Python library/tool for pulling location information from unstructured text
http://petewarden.typepad.com/

Trouble with the cities table #2

Open wilkens opened 13 years ago

wilkens commented 13 years ago

I'm seeing something strange in the cities table; it looks as though a lot of cities that are in the source data are missing from the populated geodict database, possibly getting clobbered on import.

Take Brooklyn, for example. In worldcitiespop.csv, grep finds 49 entries for 'brooklyn' (42 of which are in the US); in the geodict database, there are five entries for 'brooklyn', only one of which is in the US (and the US entry is in Alabama). The same seems to be true of other US cities like Rochester and Boston, each of which is found only once in the US (and in an alphabetically early state like AL or CA). Are the others getting clobbered on import? Or am I maybe making a mistake in looking through the database? (I don't have much experience with MySQL.)

The SQL query I'm using is:

SELECT city, country, region_code, population, lat, lon FROM cities WHERE city = 'Brooklyn';
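One way to make the CSV side of this comparison line up with the SQL query is to count exact matches per country and region, since grep also matches substrings such as 'brooklyn heights' while `city = 'Brooklyn'` does not. The sketch below is only an illustration: the column names `Country`, `City`, and `Region` are assumptions about the header of worldcitiespop.csv, and the sample rows are made up for the example rather than copied from the real file.

```python
import csv
import io
from collections import Counter


def count_city_rows(rows, city):
    """Count rows whose city equals `city` (case-insensitive),
    keyed by (country, region).

    `rows` is any iterable of CSV lines that starts with a header line.
    The column names used here are assumed, not confirmed by the thread.
    """
    wanted = city.strip().lower()
    counts = Counter()
    for row in csv.DictReader(rows):
        if row["City"].strip().lower() == wanted:
            counts[(row["Country"].lower(), row["Region"])] += 1
    return counts


# Made-up sample rows in the assumed layout (coordinates are placeholders).
SAMPLE = """Country,City,AccentCity,Region,Population,Latitude,Longitude
us,brooklyn,Brooklyn,AL,,0.0,0.0
us,brooklyn,Brooklyn,NY,,0.0,0.0
us,brooklyn heights,Brooklyn Heights,OH,,0.0,0.0
ca,brooklyn,Brooklyn,07,,0.0,0.0
"""

counts = count_city_rows(io.StringIO(SAMPLE), "Brooklyn")
for (country, region), n in sorted(counts.items()):
    print(country, region, n)
```

Running the same function over the real file (opened with whatever encoding that copy uses) gives a per-region count that can be checked row by row against the output of the SQL query above.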

Other things that might be relevant:

  1. The populate_database.py script produces two warnings when I run it:

    ./populate_database.py:49: Warning: Data truncated for column 'last_word' at row 1 (city, country, region_code, population, lat, lon, last_word))

    ./populate_database.py:49: Warning: Data truncated for column 'city' at row 1 (city, country, region_code, population, lat, lon, last_word))

  2. populate_database.py won't work at all unless I first create the geodict database by hand, even though it looks as though the script is meant to handle that.
  3. System info:

    uname -a

    Darwin wilkens-imac.wustl.edu 10.7.0 Darwin Kernel Version 10.7.0: Sat Jan 29 15:17:16 PST 2011; root:xnu-1504.9.37~1/RELEASE_I386 i386

    mysql --version

    mysql Ver 14.14 Distrib 5.1.56, for apple-darwin10.3.0 (i386) using readline 5.1
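The "Data truncated" warnings in item 1 mean that some values were longer than the column they were inserted into, so MySQL shortened them. A rough way to see which names are affected, and whether any distinct names become identical once shortened, is sketched below. The width `max_len` is a hypothetical parameter, since the thread does not give the real column sizes, and the sample names are only illustrative. Truncation alone would not explain a short name like Brooklyn going missing, so this checks one possible contributor rather than the confirmed cause.

```python
def find_truncated(values, max_len):
    """Return the values longer than max_len characters.

    These are the values a column of that width would cut short.
    """
    return [v for v in values if len(v) > max_len]


def collisions_after_truncation(values, max_len):
    """Group distinct values that become identical when cut to max_len.

    Returns a dict mapping each shortened form to the sorted list of
    original values that collapse onto it (only groups of two or more).
    """
    groups = {}
    for v in values:
        groups.setdefault(v[:max_len], set()).add(v)
    return {k: sorted(g) for k, g in groups.items() if len(g) > 1}


# Illustrative names and a hypothetical column width of 12 characters.
names = ["san pedro de macoris", "san pedro de atacama", "boston"]
too_long = find_truncated(names, 12)
clashes = collisions_after_truncation(names, 12)
print(too_long)
print(clashes)
```

If the second function reports clashes at the real column width, those rows would be worth checking in the database first.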

Any other info I can provide? Happy to do any kind of debugging that might help. Thanks!

petewarden commented 13 years ago

Hi Matthew, first off, wonderful blog; that's such a great project to use Geodict on. It does look like you've uncovered a nasty bug. Sorry you ran into that. I've actually folded this project into the larger Data Science Toolkit repository, so I hope you don't mind, but I've cloned this bug over there: https://github.com/petewarden/dstk/issues#issue/7

I'll get a fix in there for the next release. Hopefully you can switch to using DSTK? I'll let you know as I make progress.

cheers, Pete

wilkens commented 13 years ago

Hi Pete,

Thanks so much for looking into this. Geodict has been a great tool, and I'll gladly move to DSTK. Let me know if there's anything else I can do to help squash this one.

Yours, Matt
