maschinenmensch / edifice

A database of the built environment in Chicago

implement CSV import function #12

Open · derekeder opened this issue 11 years ago

derekeder commented 11 years ago

Cory - what is the preferred command line argument for importing a csv file?

Additional considerations:

  • automatically detecting lat/long columns
  • if no lat/long, automatically geocoding on address fields (this could get pretty hairy, but doable)

maschinenmensch commented 11 years ago

Do you mean importing CSVs from within psql? If so, it's the \copy command, as in:

\copy <table_name> FROM '/path/to/csv' WITH CSV HEADER;

However, you have to have the table already created with the correct number of fields. This can be a major headache sometimes (the building permits dataset comes to mind, with its 100+ fields).
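The same COPY-with-header import can also be driven from Python with psycopg2 once the process is scripted. A minimal sketch, assuming the table already exists with the right columns (the database and table names are placeholders):

import psycopg2

# Placeholder database, table, and file names. As with \copy in psql, the
# target table must already exist with the correct fields.
conn = psycopg2.connect("dbname=edifice")
with conn:  # commits on success
    with conn.cursor() as cur, open("/path/to/data.csv") as f:
        cur.copy_expert("COPY building_permits FROM STDIN WITH CSV HEADER", f)
conn.close()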

I have a rudimentary geocoding function that uses the street centerline file. The addresses need to be split up into street number (as an integer), direction, street name, and street type. That function will geocode to the middle of the street, but Edifice is all about buildings, and it would be nice to try to resolve addresses to a building instead (ditto for lat/lon). A little trickier, but I think it's doable.
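A rough sketch of the address splitting described above, assuming the common "NUMBER DIRECTION NAME TYPE" pattern; the regex and function are illustrative, not the actual geocoding code:

import re

# Illustrative only: split "1060 W ADDISON ST" into the components the
# centerline geocoder expects (number as an integer, direction, name, type).
ADDRESS_RE = re.compile(r"^(\d+)\s+([NSEW])\s+(.+?)\s+(\w+)\.?$")

def split_address(addr):
    m = ADDRESS_RE.match(addr.strip().upper())
    if not m:
        return None
    number, direction, name, street_type = m.groups()
    return int(number), direction, name, street_type

print(split_address("1060 W Addison St"))  # (1060, 'W', 'ADDISON', 'ST')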

jpvelez commented 11 years ago

Check out csvkit (from @onyxfish).

There's a Python function / shell command that can import CSV files into Postgres using SQLAlchemy.

Works great.
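A minimal sketch of that route, assuming the tool in question is csvkit's csvsql command, invoked here from Python (the database URL and file name are placeholders):

import subprocess

# csvsql infers column types from the data, generates CREATE TABLE, and
# (with --insert) loads the rows via SQLAlchemy. "edifice" and the CSV
# file name are placeholders.
subprocess.check_call([
    "csvsql",
    "--db", "postgresql:///edifice",
    "--insert",
    "building_permits.csv",
])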

maschinenmensch commented 11 years ago

Probably easier to grab the data in JSON format - it has the metadata about the number of fields and data types. From there, we could import it into Python and create the table.

maschinenmensch commented 11 years ago

Just kidding. The JSON API call apparently doesn't load all the rows. We'll still use the JSON to create tables with the proper number of fields and data types, but looks like we'll have to grab the CSVs regardless.

mccc commented 11 years ago

A couple things:

1) Are you sure JSON doesn't give you all the rows?

https://data.cityofchicago.org/api/views/ydr8-5enu/rows.json?accessType=DOWNLOAD

Seems to be returning a lot of data...

2) In the case of importing to Postgres, though, it seems to make more sense to import from CSV. It is true that the JSON has some nice fields like "dataTypeName" which can inform the appropriate parameters for CREATE TABLE. But aren't we only going to CREATE TABLE once, or are you really considering generating new SQL tables every time you repopulate the Edifice database? Do the dataset schemas change that often?

3) The Building Permits dataset is one (perhaps among others) which should (I think) obviously be renormalized on import (i.e. we should create a separate contractor table which has a many-to-many relationship with building permits). I don't see any reason to preserve their denormalized design internally. (For example, users may want to be able to view all the buildings on which a single contractor has been involved — this query would be very ugly given the current format, as you have to select on CONTRACTOR_1_NAME, CONTRACTOR_2_NAME ad infinitum).

Even for a Mongo backend, again it seems natural to have Contractors be a separate indexed Collection and to have each document in the Building Permits collection contain a list of Contractor IDs. (Cory, can you add your Mongo schema to the repository so we can keep that stuff in mind?)

mccc commented 11 years ago

Re: 3), I now see that in building_permits_pruned (edifice_initialization_script.sql) the contractors have been elided (hence the "_pruned", I guess). Anyway, I think having our own contractor table (maybe still slightly denormalized, with one row per contractor per permit so we don't have to worry about merging records) would be a good idea. I was showing that dataset to an architect yesterday and all that info was super interesting to her; she had no idea it was publicly available.
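A sketch of the kind of normalized design being proposed here, expressed as SQLAlchemy declarative models; all table and column names are hypothetical, not the actual Edifice schema:

from sqlalchemy import Column, Integer, String, Table, ForeignKey
from sqlalchemy.orm import relationship
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

# Association table giving the many-to-many link between permits and contractors.
permit_contractors = Table(
    "permit_contractors", Base.metadata,
    Column("permit_id", Integer, ForeignKey("building_permits.id")),
    Column("contractor_id", Integer, ForeignKey("contractors.id")),
)

class Contractor(Base):
    __tablename__ = "contractors"
    id = Column(Integer, primary_key=True)
    name = Column(String)

class BuildingPermit(Base):
    __tablename__ = "building_permits"
    id = Column(Integer, primary_key=True)
    permit_number = Column(String)
    # "All permits for contractor X" becomes a simple join instead of a
    # query over CONTRACTOR_1_NAME ... CONTRACTOR_N_NAME columns.
    contractors = relationship(Contractor, secondary=permit_contractors,
                               backref="permits")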

tplagge commented 11 years ago

Re: 1), that's true: rows.json does return all the rows. However, if you request

https://data.cityofchicago.org/api/views/ydr8-5enu.json

instead of

https://data.cityofchicago.org/api/views/ydr8-5enu/rows.json

then you get just the column information, including the field types.

My plan was to use the json header information dictionary to generate a CREATE TABLE command, and then fetch the csv file for importing the data into the table. I committed some untested code to create the tables; it's in the import_json function in setup_edifice.py. There's some kludginess in there to handle the fact that Socrata doesn't distinguish between ints and floats. I didn't get as far as the csv import, unfortunately, and there may be more data types deserving of their own case statement.

Tom
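A rough sketch of that approach, not the actual import_json code; the type mapping and metadata keys are assumptions based on the usual Socrata view format:

import json
import urllib.request

# Map Socrata's dataTypeName values onto Postgres types. NUMERIC papers over
# the fact that Socrata doesn't distinguish ints from floats. The mapping and
# the metadata keys ("columns", "fieldName", "dataTypeName") may need adjusting.
TYPE_MAP = {
    "number": "NUMERIC",
    "text": "TEXT",
    "calendar_date": "TIMESTAMP",
    "checkbox": "BOOLEAN",
}

def create_table_sql(view_url, table_name):
    with urllib.request.urlopen(view_url) as resp:
        view = json.loads(resp.read().decode("utf-8"))
    cols = [
        '"%s" %s' % (col["fieldName"], TYPE_MAP.get(col["dataTypeName"], "TEXT"))
        for col in view["columns"]
        if not col["fieldName"].startswith(":")  # skip Socrata system columns
    ]
    return "CREATE TABLE %s (\n  %s\n);" % (table_name, ",\n  ".join(cols))

print(create_table_sql(
    "https://data.cityofchicago.org/api/views/ydr8-5enu.json",
    "building_permits"))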

jpvelez commented 11 years ago

+1 to the contractor table idea. Go for it, Michael.

derekeder commented 11 years ago

All of the documented datasets from the Google doc have been added to datasets.py as of 66a98cd. Right now, all flat files are given a type of 'csv' and the import skips them. Once this is implemented, we can wire up the process_data function.

mccc commented 11 years ago

I also notice that this JSON data, while cool, doesn't tell us whether a column should be NOT NULL or NULL. That seems like an important factor when I create a table, especially one as large as Building_Permits. I am sort of inclined not to do this weird "create SQL table based on JSON metadata" thing and maybe try to use SQLAlchemy as mentioned by Juan above and see how clever its CSV import is.

R, for example, does an incredible job of figuring out what the .csv datatypes are by actually looking at the data. (It helps that R doesn't distinguish between ints and floats, but, for example, if there are empty columns in a row (which look like ",,,,,"), it successfully marks them as NA, etc.) So maybe SQLAlchemy (which I'd like to start using anyway) does a decent enough job. Worst case scenario is we have to cast some columns manually or do some date parsing in places, but that always happens anyway.

tplagge commented 11 years ago

I've never used SQLAlchemy, but cruising through the documentation, it looks (a) worth learning, and (b) like it could make CSV importing a lot easier. I'm on board.

jpvelez commented 11 years ago

More of an analysis tool, but pandas is really good for dealing with missing values: http://pandas.pydata.org/pandas-docs/dev/missing_data.html
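A quick sketch of what that looks like in practice (file and column names are placeholders):

import pandas as pd

# Hypothetical file and column names. Empty fields in the CSV (",,,,") come
# through as NaN automatically and can then be counted, filled, or dropped.
df = pd.read_csv("building_permits.csv")
print(df.isnull().sum())                          # missing values per column
df = df.dropna(subset=["latitude", "longitude"])  # keep only geocodable rows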

mccc commented 11 years ago

Yes, I think pandas.read_csv() is the closest thing to the R data.frame CSV parsing I described above. It still may be worth trying to pull metadata hints from the JSON, but that's more of a necessity if the data structure (i.e. the number of columns or their datatypes) changes frequently over time. For this project to work at all, however, we will need to rely, to some degree, on relative stasis in the portal's dataset structure.

The advantage of SQLAlchemy likely has less to do with CSV parsing (it's not a part of their 1000-page documentation, AFAICT); instead, the idea is that people who fork our code to create Edifice-like DBs for their own purposes (or even their own city!) can modify our database design at a higher level (e.g. objects in Python as opposed to tweaking low-level SQL code). To me SQLAlchemy seems painfully heavyweight, but I'm considering getting over that in the event that it's as useful as some people claim.
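For what it's worth, the two pieces compose nicely. A sketch, with placeholder names, of letting pandas infer column types from the CSV and then handing the result to Postgres through an SQLAlchemy engine:

import pandas as pd
from sqlalchemy import create_engine

# Placeholder database, file, and column names. pandas infers dtypes by
# looking at the data (much like R's read.csv); to_sql then creates the table
# and loads the rows through SQLAlchemy. Columns the inference gets wrong can
# still be cast or date-parsed explicitly.
engine = create_engine("postgresql:///edifice")
df = pd.read_csv("building_permits.csv", parse_dates=["issue_date"])
df.to_sql("building_permits", engine, if_exists="replace", index=False)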