jbenet / transformer

transformer - multiformat data conversion
transform.datadex.io

simple use case #4

Open jbenet opened 10 years ago

jbenet commented 10 years ago

I'm looking for a simple use case -- or test data conversions, to help define the base transformer objects:

If people have concrete suggestions, ideally with some sample data, would be great to post some.

max-mapper commented 10 years ago

OpenAddresses could be really awesome here https://github.com/openaddresses/openaddresses

They have their own JSON definitions: https://github.com/openaddresses/openaddresses/blob/master/sources/us-co-jefferson.json

daguar commented 10 years ago

I'm going to be a bother and give you a less than totally simple use case, because it's one that I've been struggling with in data transformation.

Aggregate, set-based transformations -- simple example: standard deviation. These are tricky if transformation is only by-row or by-cell.

(Feel free to ignore for now! But it's something that @maxogden's work to-date has got me thinking about in the context of Dat.)

jbenet commented 10 years ago

@daguar no, that's totally the sort of use case i want to support directly, thanks!

This can be expressed as conversion 'set-of-numbers-to-standard-deviation' that maps from 'set-of-numbers' to 'standard-deviation'.

So, very soon, you'll be able to do:

> echo '[1, 2, 3, 4, 5, 6]' | transform json-string array set-of-numbers standard-deviation
1.871...

array and set-of-numbers are there because inferring array -> set-of-numbers is non-trivial to implement. Later on, and with ever more difficulty :], we'll be able to do:

> echo '[1, 2, 3, 4, 5, 6]' | transform json-string set-of-numbers standard-deviation
1.871...
> echo '[1, 2, 3, 4, 5, 6]' | transform json-string standard-deviation
1.871...

Oh, and totally other inputs, like

> echo '1\n2\n3\n4\n5\n6' | transform space-delimited-tokens set-of-numbers standard-deviation
> echo '1,2,3,4,5,6' | transform comma-delimited-tokens set-of-numbers standard-deviation

And, for completeness, in js:

> var transformer = require('transformer');
> var stdev = transformer('set-of-numbers', 'standard-deviation');
> stdev([1, 2, 3, 4, 5, 6])
1.871...

:)

(transformer is what you get when you take a programming language and implement it on top of npm)
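The explicit type chain in the commands above (json-string, array, set-of-numbers, standard-deviation) can be pictured as a lookup of one conversion per adjacent pair of types. The following is only a rough sketch of that idea in plain JavaScript: the `conversions` registry and the `chain` helper are hypothetical and are not the transformer module's API. It assumes the sample standard deviation (dividing by n - 1), since that is what yields 1.871... for the input shown.

```javascript
// Hypothetical registry of conversions, keyed by "from->to".
// None of these names come from the transformer module itself.
var conversions = {
  'json-string->array': function (text) {
    var value = JSON.parse(text);
    if (!Array.isArray(value)) throw new TypeError('expected a JSON array');
    return value;
  },
  'array->set-of-numbers': function (items) {
    return items.map(function (item) {
      var n = Number(item);
      if (!isFinite(n)) throw new TypeError('not a number: ' + item);
      return n;
    });
  },
  'set-of-numbers->standard-deviation': function (numbers) {
    // Sample standard deviation (divides by n - 1).
    var mean = numbers.reduce(function (a, b) { return a + b; }, 0) / numbers.length;
    var sumSq = numbers.reduce(function (acc, x) {
      return acc + (x - mean) * (x - mean);
    }, 0);
    return Math.sqrt(sumSq / (numbers.length - 1));
  }
};

// Build one function from an explicit list of type names.
function chain(types) {
  var steps = [];
  for (var i = 0; i + 1 < types.length; i++) {
    var key = types[i] + '->' + types[i + 1];
    if (!conversions[key]) throw new Error('no conversion for ' + key);
    steps.push(conversions[key]);
  }
  return function (input) {
    return steps.reduce(function (value, step) { return step(value); }, input);
  };
}

var stdev = chain(['json-string', 'array', 'set-of-numbers', 'standard-deviation']);
console.log(stdev('[1, 2, 3, 4, 5, 6]').toFixed(3)); // 1.871
```

Dropping the intermediate types from the command line, as in the later examples, would amount to searching this registry for a path instead of being handed one.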

rossfeller commented 10 years ago

Some sample education data from New York State: http://data.nysed.gov/downloads.php

Would love to have an easy way to get this out of MS Access format and into a modern and usable format.

Beyond format conversion, looking specifically at the 2011-2012 dataset, a couple use cases for base transformations stand out:

cspanring commented 10 years ago

Bounding box coordinates (WGS84, lat/lon or lon/lat) for different applications, following different schemas:

[north, west, south, east] <=> [south, west, north, east] <=> [west, south, east, north]

e.g. TileStache <=> Leaflet <=> TileMill/Mapnik
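A rough sketch of how these three orderings could be converted into one another: name each coordinate once, then re-emit it in the target order. The schema labels (`nwse`, `swne`, `wsen`) and the `convertBbox` function are made up for illustration, and the sample coordinates are arbitrary.

```javascript
// Field order for each schema, taken from the three layouts listed above.
// The keys are hypothetical labels, not names used by any of the tools.
var ORDERS = {
  nwse: ['north', 'west', 'south', 'east'],
  swne: ['south', 'west', 'north', 'east'],
  wsen: ['west', 'south', 'east', 'north']
};

function convertBbox(bbox, from, to) {
  if (!ORDERS[from] || !ORDERS[to]) throw new Error('unknown schema');
  if (bbox.length !== 4) throw new Error('expected four coordinates');

  // Step 1: attach a name to every coordinate using the source order.
  var named = {};
  ORDERS[from].forEach(function (name, i) { named[name] = bbox[i]; });

  // Step 2: read the coordinates back out in the target order.
  return ORDERS[to].map(function (name) { return named[name]; });
}

console.log(convertBbox([42.4, -71.2, 42.2, -70.9], 'nwse', 'wsen'));
// [ -71.2, 42.2, -70.9, 42.4 ]
```

Going through a named intermediate means each new schema needs one entry in the table rather than a conversion to every other schema.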

jbenet commented 10 years ago

Via @maxogden

ogd: the python scripts here could be an interesting thing to express using transformer: https://github.com/mrobinson/violins (http://abandonedwig.info/collisions/)

Would be great to have a dat instance running with this data. Will make some Types + Conversions.

jbenet commented 10 years ago

Also, see: https://github.com/jbenet/transformer#more-examples

daguar commented 10 years ago

Thanks @jbenet! Let me throw a related question at you:

A lot of times, a desired transformation is made up of (1) an aggregate calculation, and (2) individual value transformations. This is super common in any normalization technique.

For example, in the case of the standard deviation, you may actually want to have a transformation that converts the initial data set to a data set representing each value's # of standard deviations from the mean.

That will require (a) calculating the standard deviation and mean, and (b) applying a second transformation to each value to convert it to # of SDs away from the mean.

I guess what I'm trying to suss out is that in the versioned-data land of Dat, how does that work?

(Apologies if this is more a Dat question than Transformer one, but I'm trying to suss out the relationship between the two tools and understand the API for each in more detail. Also apologies if this isn't clear, I had trouble articulating it without convolution.)

jbenet commented 10 years ago

@daguar

(a) calculating the standard deviation and mean, and (b) applying a second transformation to each value to convert it to # of SDs away from the mean.

Yeah, this kind of use case is aimed for.

  • Are the mean and standard deviation calculated and stored in individual transformations (perhaps creation of intermediary data sets?) and then passed as hard-coded inputs (eg, standard-deviations-away sd=1.4 mean=90.5) to the second transformation?
  • Or would the total transformation itself actually be passed, in essence, higher-order set-based transformations as inputs?

Both approaches will be doable, which one to use depends on whether you want the intermediate construction to be in the history as well, and the sort of transform applied. For a more concrete case, let's assume you have:

As input a list of numbers:

0.2341
0.4321
0.7645
0.0965
....

And you want, as output a list of # of SDs away from the mean:

0.124
0.234
0.343
0.012
...

The cleanest way to do something like this would be to create a number-list-to-standard-deviations-away conversion that uses two other conversions (number-list-to-mean and number-list-to-standard-deviation). If this is written already (search the modules), you can just:

cat input | transform number-list standard-deviations-away > output

If it isn't written already, the main file of the module would be something like:

var transformer = require('transformer');

// input and output types
var nl = transformer('number-list');
var sda = transformer('standard-deviations-away');

// existing conversions this one builds on
var nl2sd = transformer('number-list', 'standard-deviation');
var nl2m = transformer('number-list', 'mean');

module.exports = new transformer.Conversion(NumberListToStandardDeviationsAway, {
  'id': 'number-list-to-standard-deviations-away'
}, nl, sda);

function NumberListToStandardDeviationsAway(numbers) {
  var mean = nl2m(numbers);  // runs a conversion
  var sd = nl2sd(numbers);   // runs a conversion
  // express each value as its distance from the mean, in standard deviations
  return numbers.map(function (num) {
    return (num - mean) / sd;
  });
}
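To make the two approaches from the question concrete without depending on the transformer module, here is a self-contained sketch. All function names are hypothetical, and the sample standard deviation (n - 1) is an assumption. The first form takes the mean and sd as fixed parameters computed earlier; the second computes them from the set and then applies the same per-value step.

```javascript
function mean(numbers) {
  return numbers.reduce(function (a, b) { return a + b; }, 0) / numbers.length;
}

// Sample standard deviation (divides by n - 1); an assumption of this sketch.
function standardDeviation(numbers) {
  var m = mean(numbers);
  var sumSq = numbers.reduce(function (acc, x) { return acc + (x - m) * (x - m); }, 0);
  return Math.sqrt(sumSq / (numbers.length - 1));
}

// Approach 1: the statistics were computed in an earlier step and are passed
// in as fixed parameters, so the result is a plain per-value conversion.
function standardDeviationsAway(params) {
  return function (value) {
    return (value - params.mean) / params.sd;
  };
}

// Approach 2: a single set-based conversion that computes the statistics
// itself and then applies the per-value step to every element.
function numberListToStandardDeviationsAway(numbers) {
  var perValue = standardDeviationsAway({
    mean: mean(numbers),
    sd: standardDeviation(numbers)
  });
  return numbers.map(perValue);
}

// Approach 1, using the hard-coded parameters from the question.
var fixed = standardDeviationsAway({ mean: 90.5, sd: 1.4 });

// Approach 2, on a small list.
var result = numberListToStandardDeviationsAway([1, 2, 3, 4, 5]);
console.log(result.map(function (z) { return z.toFixed(3); }).join(', '));
// -1.265, -0.632, 0.000, 0.632, 1.265
```

With the first form, the mean and sd exist as values of their own and can be recorded as an intermediate step; with the second, only the input and output lists are visible.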

Notes:

that's all well and good. how do I use that with Dat?

The interfacing of Dat with Transformer is still very much tbd. But imagine things like:

# export, apply transform, import
dat cat | transformer .... | dat import

# transform within dat
dat --transform <col-name> <type>

Or applying these directly in the dat web gui.

Apologies if this is more a Dat question than Transformer one

Not at all! This is where these questions belong.

I'm trying to suss out the relationship between the two tools and understand the API for each in more detail.

The API for transformer isn't set yet, I want to get it to be as simple as possible. I'm not satisfied with it yet, but it's getting closer. There's lots of things complicating this API (modularity, clarity, readability, flexibility, automation (#10), stream-friendliness (#3), callback-friendliness (#11), etc).

Hope this helps and didn't cause more confusion! :) Please ask if it did; I want to simplify all this as much as possible.

jbenet commented 10 years ago

Btw, some super early examples at http://transform.datadex.io/browser/

waldoj commented 10 years ago

I'm not sure how well this applies, but I'm working on a data transformation process now that would sure benefit from Dat.

I'm turning fixed-width text from a state agency into CSV and JSON. (Here's the GitHub project.) To do so, I've created table maps as YAML that provide the character range for each field, based on the file structure information provided by the agency. There's one YAML file per data file. The YAML keeps getting more complex (I mean it's pretty simple, but still) to accommodate increasing complexity in the data. I need to turn the YYYYMMDD date strings into ISO-8601, turn the NNNNNNNNN postal code fields into NNNNN-NNNN (or NNNNN, if the last four digits are all 0s), and all those usual kinds of unglamorous things.

Anyhow, turning fixed-width data from legacy systems into useful data is a problem that I encounter pretty regularly. That means field mapping and data sanitizing.
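For illustration, the field mapping and sanitizing described here might look roughly like the sketch below. The table map is written as a plain JavaScript object standing in for the YAML, and the field names, character offsets, and sample line are all invented; nothing here is taken from the actual project.

```javascript
// Hypothetical table map: field name -> [start, end) character offsets.
var tableMap = {
  name: [0, 10],
  filed: [10, 18], // YYYYMMDD
  zip: [18, 27]    // NNNNNNNNN
};

// YYYYMMDD -> YYYY-MM-DD (ISO 8601 calendar date). No validation is done.
function toIsoDate(yyyymmdd) {
  return yyyymmdd.slice(0, 4) + '-' + yyyymmdd.slice(4, 6) + '-' + yyyymmdd.slice(6, 8);
}

// NNNNNNNNN -> NNNNN-NNNN, or NNNNN when the last four digits are all 0s.
function formatZip(zip9) {
  var base = zip9.slice(0, 5);
  var plus4 = zip9.slice(5, 9);
  return plus4 === '0000' ? base : base + '-' + plus4;
}

// Per-field cleanup steps; fields without an entry are only trimmed.
var cleaners = { filed: toIsoDate, zip: formatZip };

function parseLine(line, map) {
  var record = {};
  Object.keys(map).forEach(function (field) {
    var raw = line.slice(map[field][0], map[field][1]).trim();
    record[field] = cleaners[field] ? cleaners[field](raw) : raw;
  });
  return record;
}

console.log(parseLine('ACME CORP 20140312229020000', tableMap));
// { name: 'ACME CORP', filed: '2014-03-12', zip: '22902' }
```

Keeping the character ranges and the per-field cleaners in separate tables is one way to stop the map itself from growing as the data gets more complicated.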