petewarden / dstk

A collection of the best open data sets and open-source tools for data science
http://www.datasciencetoolkit.org/

Cities getting overwritten in geodict/text2places database #7

Open petewarden opened 13 years ago

petewarden commented 13 years ago

I'm seeing something strange in the cities table; it looks as though a lot of cities that are in the source data are missing from the populated geodict database, possibly getting clobbered on import.

Take Brooklyn, for example. In worldcitiespop.csv, grep finds 49 entries for 'brooklyn' (42 of which are in the US); in the geodict database, there are five entries for 'brooklyn', only one of which is in the US (and the US entry is in Alabama). The same seems to be true of other US cities like Rochester and Boston, each of which is found only once in the US (and in an alphabetically early state like AL or CA). Are the others getting clobbered on import? Or am I maybe making a mistake in looking through the database (not much experience with MySQL here).
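One way to make the CSV side of this comparison more precise than grep is to count exact name matches per country and region. This is only a sketch: it assumes worldcitiespop.csv has a header row with `Country`, `City`, and `Region` columns and is Latin-1 encoded, and `count_city_rows` is a hypothetical helper, not part of dstk. Note that grep also matches substrings such as "brooklyn park", so the totals here will likely be lower than grep's.

```python
import csv
from collections import Counter


def count_city_rows(csv_path, city_name, encoding="latin-1"):
    """Count rows whose City field equals city_name, keyed by (country, region).

    Assumes a header row containing 'Country', 'City' and 'Region' columns.
    """
    counts = Counter()
    wanted = city_name.strip().lower()
    with open(csv_path, newline="", encoding=encoding) as handle:
        for row in csv.DictReader(handle):
            if row["City"].strip().lower() == wanted:
                counts[(row["Country"].strip().lower(), row["Region"].strip())] += 1
    return counts
```

Calling `count_city_rows("worldcitiespop.csv", "Brooklyn")` and comparing the keys against the rows returned by the database query should show which country and region combinations went missing.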

The SQL query I'm using is:

SELECT city, country, region_code, population, lat, lon FROM cities WHERE city = 'Brooklyn';

Other things that might be relevant:

The populate_database.py script produces two warnings when I run it:

./populate_database.py:49: Warning: Data truncated for column 'last_word' at row 1 (city, country, region_code, population, lat, lon, last_word))

./populate_database.py:49: Warning: Data truncated for column 'city' at row 1 (city, country, region_code, population, lat, lon, last_word))
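To see which names trigger the truncation messages, the source values can be checked against the column widths before import. The sketch below is an assumption-heavy illustration: the thread does not give the actual widths of `city` and `last_word`, so they are passed in as parameters, and it guesses that `last_word` is the final whitespace-separated token of the city name. `find_truncated` is a hypothetical helper.

```python
def find_truncated(city_names, max_city_chars, max_last_word_chars):
    """Return (name, column) pairs whose value exceeds the given width.

    The widths are caller-supplied guesses; check the real table definition.
    last_word is assumed to be the final whitespace-separated token.
    """
    problems = []
    for name in city_names:
        words = name.split()
        last_word = words[-1] if words else ""
        if len(name) > max_city_chars:
            problems.append((name, "city"))
        if len(last_word) > max_last_word_chars:
            problems.append((name, "last_word"))
    return problems
```

If the list is non-empty for the real widths, those rows are stored in shortened form, which could also make distinct names collide after truncation.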

populate_database.py won't work at all unless I first create the geodict database by hand, even though it looks as though the script is meant to handle that.

System info:

uname -a

Darwin wilkens-imac.wustl.edu 10.7.0 Darwin Kernel Version 10.7.0: Sat Jan 29 15:17:16 PST 2011; root:xnu-1504.9.37~1/RELEASE_I386 i386

mysql --version

mysql Ver 14.14 Distrib 5.1.56, for apple-darwin10.3.0 (i386) using readline 5.1

Any other info I can provide? Happy to do any kind of debugging that might help. Thanks!

wilkens commented 13 years ago

Good(ish) news. I tried the most recent DSTK VMware image (v35); the database of cities supplied with it is still borked, but a simple rerun of the included populate_database.rb script (which includes the change to the primary key made back in April) fixes it. Nice! Thanks.
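For anyone wondering how a primary key can make rows vanish on import, here is a minimal sketch. It is not the dstk schema: the thread does not say what the old or new key was, so the sketch simply assumes the old key left out `region_code`, uses an in-memory SQLite table as a stand-in for MySQL, and assumes duplicates were silently ignored (`INSERT OR IGNORE`). The sample rows and `surviving_rows` are made up for illustration.

```python
import sqlite3

# Made-up sample rows: (city, country, region_code)
ROWS = [
    ("brooklyn", "us", "AL"),
    ("brooklyn", "us", "NY"),
    ("brooklyn", "us", "IA"),
    ("brooklyn", "ca", "07"),
]


def surviving_rows(primary_key_columns):
    """Import ROWS into a table with the given primary key and return what is left."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cities (city TEXT, country TEXT, region_code TEXT, "
        f"PRIMARY KEY ({', '.join(primary_key_columns)}))"
    )
    # Rows that collide with an existing key are dropped without an error.
    conn.executemany("INSERT OR IGNORE INTO cities VALUES (?, ?, ?)", ROWS)
    rows = conn.execute(
        "SELECT city, country, region_code FROM cities "
        "ORDER BY country, region_code"
    ).fetchall()
    conn.close()
    return rows
```

With the key `["city", "country"]` only one US row survives (the first one inserted, AL here), while `["city", "country", "region_code"]` keeps all four. That matches the symptom of a single US entry per city, though the real cause may differ in detail.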

petewarden commented 13 years ago

Thanks for trying that out, and apologies that the VMware image isn't working out of the box. I'll double-check the AMI as well; hopefully I ran the update there.

On Thu, Jun 2, 2011 at 5:48 PM, wilkens <reply@reply.github.com> wrote:

> Good(ish) news. I tried the most recent DSTK VMware image (v35); the database of cities supplied with it is still borked, but a simple rerun of the included populate_database.rb script (which includes the change to the primary key made back in April) fixes it. Nice! Thanks.
