svetlozarn / plenario-etl

ETL code for Plenario
plenar.io

Strip extra whitespace #3

Open · evz opened this issue 10 years ago

evz commented 10 years ago

@amandalund I noticed that on the meta tables there was a ton of extra whitespace within most of the fields. It seems like it might make sense to get rid of this before the data from those csv files gets inserted into the database, if for no other reason than to avoid having to do it later on.

One nifty thing that could be leveraged here is that the psql command accepts input on stdin, which means that loading a csv file can be as simple as `cat name_of_csv_file.csv | psql -c "COPY tablename FROM STDIN WITH DELIMITER ',' CSV HEADER;"`. This, on its own, isn't that exciting. But if you chain it together with some basic processing, you start to create some interesting possibilities. So, for instance, if you wanted to strip all the extra whitespace from the individual cells in an incoming csv file, you could write a little python processor (one that reads from stdin, does some stuff, and writes the result to stdout) and chain it together with the above command like so: `cat name_of_csv_file.csv | python strip_whitespace.py | psql -c "COPY tablename FROM STDIN WITH DELIMITER ',' CSV HEADER;"`.
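A minimal sketch of such a stdin-to-stdout processor (purely illustrative; `strip_whitespace.py` is just the name used in the command above) might look like:

```python
# strip_whitespace.py -- illustrative sketch of a stdin -> stdout CSV filter.
import csv
import sys

reader = csv.reader(sys.stdin)
writer = csv.writer(sys.stdout)

for row in reader:
    # Strip leading/trailing whitespace from each cell; collapsing
    # internal runs of whitespace is left to later steps if needed.
    writer.writerow([cell.strip() for cell in row])
```

Because it only touches stdin and stdout, it drops straight into the pipeline above without any changes to the psql side.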

Anyway, I don't know if this fits with how things are currently being put together, but it seems like we might want to allow for some super basic pre-processing, generally speaking, when ingesting data, and this method is a pretty lightweight way of accomplishing that.

amandalund commented 10 years ago

I do notice some extra whitespace (mostly an extra newline here and there), but I think what looks like tons of whitespace may just be the way psql displays the queries when the tables have some very wide columns. When I change the output format to unaligned and expanded display (`\a` and `\x`), it's a little easier to see the fields. I hadn't been doing a strip when I read in the data, though; definitely a good idea to add it in.
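For what it's worth, a minimal sketch of doing that strip at read time, assuming the loader reads csv files with Python's csv module (the file name here is just a placeholder):

```python
import csv

# Illustrative only: strip each field as rows are read in,
# before they are handed off to the insert step.
with open("name_of_csv_file.csv", newline="") as f:
    rows = [[field.strip() for field in row] for row in csv.reader(f)]
```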

svetlozarn commented 10 years ago

We should definitely incorporate pre-processing in our load and update processes. My suggestion is to adopt the "make" method that @evz and @derekeder suggested earlier. There are at least two reasons the "make" method would be more efficient than mixing SQL and shell scripting (see the sketch after this list):

  1. Each SQL init script can be kept as a single file.
  2. The pre-processing (make step) can be performed separately from loading into Postgres, and thus can use other computational resources (e.g. a batch-processing queue, Hadoop cluster, etc.).
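A minimal sketch of what that make step might look like, with hypothetical file and target names (not taken from this repo; note that make recipes must be indented with tabs):

```makefile
# Hypothetical make step: build the clean csv first, then load it.
# The clean target can run anywhere (batch queue, Hadoop, etc.);
# only the load target needs to talk to Postgres.

clean_%.csv: %.csv
	python strip_whitespace.py < $< > $@

load: clean_data.csv
	psql -c "COPY tablename FROM STDIN WITH DELIMITER ',' CSV HEADER;" < $<

.PHONY: load
```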