maschinenmensch / edifice

A database of the built environment in Chicago

catalog import of data #10

Closed · derekeder closed this issue 11 years ago

derekeder commented 11 years ago

The list of datasets in the readme is a good start, but we need to be a little more explicit. I'm keeping track of datasets and metadata in a Google doc: https://docs.google.com/spreadsheet/ccc?key=0AtbqcVh3dkAqdGdlcWd5MzRYcGJkS1RoQTM3Qzd4dUE

jpvelez commented 11 years ago

Great stuff. What's the best way of storing table metadata and related domain knowledge? (E.g. here's how the county assessor works, here's what we know about how this data is produced.) GitHub wiki? GitHub site?

derekeder commented 11 years ago

I think for now it makes sense to just add it to this spreadsheet and fiddle around with it in there. I've added a 'notes' column.

mccc commented 11 years ago

Just a nerdy historical note: the thing that Juan is describing used to be called a 'Data Dictionary' and was in the late 1970s considered just as important as the actual database. At the time it not only included a narrative description of the data's meaning but was also the only place to find out how many bytes wide your records were, or whether those bytes were to be interpreted as integers or strings.

Strangely (or not so strangely?), the term started to go out of style as relational databases took over, possibly because people confused the RDBMS's own machine-readable metadata (i.e. today's information_schema) with human-readable metadata:

http://books.google.com/ngrams/graph?content=data+dictionary&year_start=1960&year_end=2000&corpus=15&smoothing=3
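For what it's worth, the structural half of the old data dictionary does live in the modern catalog now; a query along these lines (table name assumed) pulls out the types and field widths the old dictionaries used to document:

select column_name, data_type, character_maximum_length
from information_schema.columns
where table_name = 'mytable';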

I suspect that in many organizations (even very large ones) this kind of table metadata / domain knowledge is, at best, stored in Excel or Word. However, in the medium-sized organizations I have been exposed to, this domain knowledge was stored in the heads of 2 or 3 people who had to be interrogated at length (most of whom couldn't remember what the lesser-used tables contained or how they were generated).

mccc commented 11 years ago

That said, there is a facility within Postgres for attaching textual metadata (probably not entire paragraphs, but maybe some quick one-sentence summaries) to tables and columns. I would totally recommend doing this in the import script, at least at the table level.

COMMENT ON TABLE mytable IS 'my comment';

mydatabase=> \d+
                   List of relations
 Schema |  Name   | Type  | Owner | Size  | Description
--------+---------+-------+-------+-------+-------------
 public | mytable | table | mcc   | 24 kB | my comment
[...]

You can also do

COMMENT ON COLUMN mytable.mycolumn IS 'my column comment';

And you'll see those when you run \d+ on a particular table.

If you need to for some reason, you can access the table comments manually like so:

mydatabase=> select pg_catalog.obj_description(c.oid) from pg_class c where relname='mytable';
 obj_description
-----------------
 my comment
(1 row)

I forget how to get the column comments out explicitly, but it's something similar.
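Something along these lines should do it, using pg_catalog.col_description (untested sketch; the attnum > 0 filter skips system columns):

select a.attname, pg_catalog.col_description(c.oid, a.attnum)
from pg_class c
join pg_attribute a on a.attrelid = c.oid
where c.relname = 'mytable' and a.attnum > 0 and not a.attisdropped;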

mccc commented 11 years ago

Hey, I just added a bunch of code to more cleanly create the main database on any PostgreSQL 9.0+ system with PostGIS 2.0.x+ support. Basically, now you do:

setup_edifice.py --init

Which will do everything that you need the postgres superuser to do (it may prompt for a password if you have one set), and:

setup_edifice.py --create

Which will actually create the edifice database structure. Then,

setup_edifice.py --data

Will download, unzip, and import the data. Still working on making that part do the right things, obviously.
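Relatedly, once --data is stable, the import step would be a natural place to attach the table comments discussed above. A minimal sketch of what the script could emit per dataset (the table name and wording here are made up):

COMMENT ON TABLE building_footprints IS 'City of Chicago building footprints; imported by setup_edifice.py --data';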

Any constructive comments on the changes (and/or error messages) would be appreciated!

derekeder commented 11 years ago

Initial cataloging complete.