Do you mean importing CSVs from within psql? If so, it's the \copy command,
as in:
\copy table_name FROM '/path/to/csv' WITH CSV HEADER;
However, you have to have the table already created with the correct number
of fields. This can be a major headache sometimes (the building permits
dataset comes to mind, with its 100+ fields).
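One way around the "table must already exist with the right number of fields" headache is to generate the table from the CSV header itself. A minimal sketch of that idea, using stdlib sqlite3 as a stand-in for Postgres (table and file contents are invented for illustration; every column lands as TEXT and can be cast later):

```python
import csv
import io
import sqlite3  # sqlite3 stands in for Postgres in this sketch

def create_table_from_header(conn, table, csv_text):
    """Create a table whose columns mirror the CSV header, all TEXT,
    then bulk-insert the rows. Avoids hand-writing a 100+ field
    CREATE TABLE before a \\copy-style load."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    cols = ", ".join('"%s" TEXT' % h.strip().replace('"', '') for h in header)
    conn.execute('CREATE TABLE "%s" (%s)' % (table, cols))
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(
        'INSERT INTO "%s" VALUES (%s)' % (table, placeholders), reader)
    return header

conn = sqlite3.connect(":memory:")
header = create_table_from_header(
    conn, "permits",
    "id,issued,fee\n1,2013-01-23,50\n2,2013-02-02,75\n")
```

With real Postgres the INSERT loop would be replaced by COPY/\copy, but the header-driven CREATE TABLE step is the same.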
I have a rudimentary geocoding function that uses the street centerline
file. The addresses need to be split up on street number (as an integer),
direction, street name, street type. That function will geocode to the
middle of the street, but Edifice is all about buildings and it would be
nice to try and resolve addresses to a building instead (ditto for
lat/lon). A little trickier, but I think it's doable.
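The address split described above (street number as an integer, direction, street name, street type) can be sketched with a regex. This is illustrative only; the direction and street-type vocabularies here are small subsets, not the actual domains in the street centerline file:

```python
import re

# Chicago-style address: number, direction, name, street type.
# The direction and street-type lists are illustrative subsets.
ADDR_RE = re.compile(
    r"^\s*(?P<number>\d+)\s+"
    r"(?P<direction>[NSEW])\s+"
    r"(?P<name>.+?)\s+"
    r"(?P<type>ST|AVE|BLVD|RD|DR|PL|CT)\s*$",
    re.IGNORECASE,
)

def split_address(addr):
    """Return (number, direction, name, type), or None if unparseable."""
    m = ADDR_RE.match(addr)
    if not m:
        return None
    return (int(m.group("number")), m.group("direction").upper(),
            m.group("name").upper(), m.group("type").upper())
```

The lazy `.+?` on the name lets multi-word streets like "Lake Shore" parse correctly, since the street type anchors the end of the match.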
On Wed, Jan 23, 2013 at 1:14 AM, Derek Eder notifications@github.com wrote:
Cory - what is the preferred command line argument for importing a csv
file?
Additional considerations:
- automatically detecting lat/long columns
- if no lat/long, automatically geocoding on address fields (this could get pretty hairy, but doable)
—
Reply to this email directly or view it on GitHub: https://github.com/maschinenmensch/edifice/issues/12
There's a python function / shell command that can import csv files to Postgres using SQLalchemy.
Works great.
Juan-Pablo Velez
On Wednesday, January 23, 2013 at 9:30 AM, Cory Mollet wrote:
Probably easier to grab the data in JSON format; it has the metadata about number of fields and data types. From there, we could import into Python and create the table.
Just kidding. The JSON API call apparently doesn't load all the rows. We'll still use the JSON to create tables with the proper number of fields and data types, but looks like we'll have to grab the CSVs regardless.
2) In the case of importing to Postgres, it seems to make more sense to import from csv. That said, the JSON does have some nice fields like "dataTypeName" which can inform the appropriate parameters for CREATE TABLE. But aren't we only going to CREATE TABLE once, or are you really considering generating new SQL tables every time you repopulate the Edifice database? Do the dataset schemas change that often?
3) The Building Permits dataset is one (perhaps among others) which should (I think) obviously be renormalized on import (i.e. we should create a separate contractor table which has a many-to-many relationship with building permits). I don't see any reason to preserve their denormalized design internally. (For example, users may want to be able to view all the buildings on which a single contractor has been involved — this query would be very ugly given the current format, as you have to select on CONTRACTOR_1_NAME, CONTRACTOR_2_NAME ad infinitum).
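The renormalization proposed above can be sketched as a contractor table plus a join table, replacing the CONTRACTOR_1_NAME ... CONTRACTOR_N_NAME columns. Table and column names here are invented for illustration, not the portal's actual schema, and sqlite3 stands in for Postgres:

```python
import sqlite3

# One row per contractor, one row per (permit, contractor) pair,
# instead of a fixed block of CONTRACTOR_N_* columns per permit.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE permit (id INTEGER PRIMARY KEY, address TEXT);
CREATE TABLE contractor (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE permit_contractor (
    permit_id INTEGER REFERENCES permit(id),
    contractor_id INTEGER REFERENCES contractor(id)
);
""")

def add_permit(address, contractors):
    cur = conn.execute("INSERT INTO permit (address) VALUES (?)", (address,))
    pid = cur.lastrowid
    for name in contractors:
        conn.execute("INSERT OR IGNORE INTO contractor (name) VALUES (?)",
                     (name,))
        cid = conn.execute("SELECT id FROM contractor WHERE name = ?",
                           (name,)).fetchone()[0]
        conn.execute("INSERT INTO permit_contractor VALUES (?, ?)", (pid, cid))

add_permit("100 N MAIN ST", ["ACME PLUMBING", "BOB ELECTRIC"])
add_permit("200 S OAK AVE", ["ACME PLUMBING"])

# "All permits a single contractor worked on" becomes a simple join:
rows = conn.execute("""
    SELECT p.address FROM permit p
    JOIN permit_contractor pc ON pc.permit_id = p.id
    JOIN contractor c ON c.id = pc.contractor_id
    WHERE c.name = 'ACME PLUMBING' ORDER BY p.address
""").fetchall()
```

The "one row per contractor per permit" variant mentioned below is just this join table with the contractor attributes inlined, trading deduplication for simpler loading.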
Even for a Mongo backend, again it seems natural to have Contractors be a separate indexed Collection and to have each document in the Building Permits collection contain a list of Contractor IDs. (Cory, can you add your Mongo schema to the repository so we can keep that stuff in mind?)
Re: 3), I now see that in building_permits_pruned (edifice_initialization_script.sql) the contractors have been elided (hence the "_pruned", I guess). Anyway, I think having our own contractor table (maybe still slightly denormalized, with one row per contractor per permit so we don't have to worry about merging records) would be a good idea. I was showing that data set to an architect yesterday and all that info was super-interesting to her—she had no idea it was publicly available.
then you get just the column information, including the field types.
My plan was to use the json header information dictionary to generate a CREATE TABLE command, and then fetch the csv file for importing the data into the table. I committed some untested code to create the tables; it's in the import_json function in setup_edifice.py. There's some kludginess in there to handle the fact that Socrata doesn't distinguish between ints and floats. I didn't get as far as the csv import, unfortunately, and there may be more data types deserving of their own case statement.
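That plan can be sketched roughly as follows. This is not the actual import_json code from setup_edifice.py; the type mapping is a guess at the common cases (Socrata's "number" covers both ints and floats, hence the blanket NUMERIC, which is the kludge mentioned above):

```python
import json

# Map Socrata dataTypeName values to Postgres types.
# Illustrative, not exhaustive; unknown types fall back to TEXT.
SOCRATA_TO_PG = {
    "number": "NUMERIC",
    "text": "TEXT",
    "calendar_date": "TIMESTAMP",
    "checkbox": "BOOLEAN",
}

def create_table_sql(table, view_json):
    """Build a CREATE TABLE statement from Socrata view metadata."""
    cols = []
    for col in json.loads(view_json)["columns"]:
        pg_type = SOCRATA_TO_PG.get(col["dataTypeName"], "TEXT")
        cols.append('"%s" %s' % (col["fieldName"], pg_type))
    return 'CREATE TABLE "%s" (%s);' % (table, ", ".join(cols))

# Hypothetical metadata in the shape Socrata returns:
meta = json.dumps({"columns": [
    {"fieldName": "permit_no", "dataTypeName": "text"},
    {"fieldName": "fee", "dataTypeName": "number"},
    {"fieldName": "issued", "dataTypeName": "calendar_date"},
]})
sql = create_table_sql("building_permits", meta)
```

Each data type deserving its own case, as noted, would become another entry in the mapping.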
On Saturday, February 2, 2013 at 11:36 AM, Michael Castelle wrote:
All of the documented datasets from the Google doc have been added to datasets.py as of 66a98cd. Right now, all flat files are given a type of 'csv' and the import skips them. Once this is implemented, we can wire up the process_data function.
I also notice that this json data, while cool, doesn't tell us whether a column should be NOT NULL or NULL. That seems like an important factor when I create a table, especially one as large as Building_Permits. I am sort of inclined to not do this weird "create SQL table based on json metadata" thing and maybe try to use SQLAlchemy as mentioned by Juan above and see how clever its csv import is.
R, for example, does an incredible job of figuring out what the .csv datatypes are by actually looking at the data. (It helps that R doesn't distinguish between ints and floats; and if there are empty columns in a row, which look like ",,,,,", it successfully marks them as NA, etc.) So maybe SQLAlchemy (which I'd like to start using anyway) does a decent enough job. Worst case scenario is we have to cast some columns manually or do some date parsing in places, but that always happens anyway.
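The R-style inference described above is easy to sketch by hand: try to cast each column's non-empty values, treating empty strings as missing rather than letting them force everything to TEXT. A minimal stdlib version (type names and sample data invented for illustration):

```python
import csv
import io

def infer_type(values):
    """Infer a column type from its non-empty values, R-style:
    empty strings count as missing, not as text."""
    seen = [v for v in values if v != ""]
    if not seen:
        return "TEXT"
    for caster, name in ((int, "INTEGER"), (float, "REAL")):
        try:
            for v in seen:
                caster(v)
            return name
        except ValueError:
            continue
    return "TEXT"

def infer_schema(csv_text):
    """Map each header field to an inferred type by scanning the rows."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    return {h: infer_type([r[i] for r in data])
            for i, h in enumerate(header)}

schema = infer_schema("id,fee,name\n1,10.5,ACME\n2,,BOB\n")
```

Note the empty fee field on the second row still infers REAL, which is exactly the ",,,,," behavior praised above.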
I've never used SQLAlchemy, but cruising through the documentation, it looks (a) worth learning, and (b) like it could make CSV importing a lot easier. I'm on board.
On Feb 23, 2013, at 8:23 AM, Michael Castelle wrote:
Yes, I think pandas.read_csv() is the closest thing to the R data.frame csv parsing I described above. It may still be worth pulling metadata hints from the json, but that's more of a necessity if the data structure (i.e. # of columns or datatypes of columns) changes frequently over time. For this project to work at all, though, we will need to rely, to some degree, on relative stasis in portal dataset structure.
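A minimal sketch of that route, assuming pandas is available (table and column names invented for illustration, sqlite3 standing in for Postgres): read_csv does the R-style dtype guessing and turns empty fields into NaN, and to_sql then creates and fills the table to match.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical sample in the shape of a portal CSV.
csv_text = "permit_no,fee,issued\nA1,10.5,2013-01-23\nA2,,2013-02-02\n"

# read_csv infers dtypes from the data, much like R's read.csv;
# the empty fee field becomes NaN rather than forcing a text column.
df = pd.read_csv(io.StringIO(csv_text))

conn = sqlite3.connect(":memory:")
df.to_sql("permits", conn, index=False)  # creates the table and loads rows
```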
The advantage of SQLAlchemy likely has less to do with csv parsing (it's not a part of their 1000-page documentation AFAICT) — instead the idea is that people who fork our code to create edifice-like DBs for their own purposes (or even their own city!) can modify our database design at a higher level (e.g. objects in Python as opposed to tweaking low-level SQL code). To me SQLAlchemy seems painfully heavyweight, but I'm considering getting over that in the event that it's as useful as some people claim.