johnkerl / miller

Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
https://miller.readthedocs.io

Prep JSON data for stan #392

Open xvzftube opened 3 years ago

xvzftube commented 3 years ago

I have been stringing a shell script together with mlr to prepare data for Stan. I wanted to open this as a feature request: as opposed to my csv2json.sh script, maybe a flag --json-cells-to-arrays, or any other more suitable name.

wget https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv

mlr --csv --ojson --jlistwrap cut -f mpg,wt mtcars.csv | ./csv2json.sh > data.json

csv2json.sh is a jq shell script:


jq '. as $in
| reduce (.[0] | keys_unsorted[]) as $k ( {}; 
    .[$k] = ($in|map(.[$k] | (tonumber? // .))))'

For reference, this page shows the JSON format that CmdStan expects: https://mc-stan.org/docs/2_25/cmdstan-guide/example-model-and-data.html
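
For readers less familiar with jq, the reshaping that csv2json.sh performs can be sketched in Python. This is only an illustration under stated assumptions: the names `to_number` and `records_to_columns` are hypothetical, the sample rows are copied from the first three mtcars values, and the number coercion only approximates jq's `tonumber? // .` (for example, Python's `float` also accepts strings such as "nan").

```python
import json


def to_number(value):
    """Return value as an int or float when a string parses as one, else unchanged."""
    if isinstance(value, str):
        for cast in (int, float):
            try:
                return cast(value)
            except ValueError:
                pass
    return value


def records_to_columns(records):
    """Turn a list of row dicts into one dict of per-column lists.

    Like the jq filter, column names are taken from the first record only.
    """
    if not records:
        return {}
    return {
        key: [to_number(rec.get(key)) for rec in records]
        for key in records[0]
    }


# Hypothetical sample: three rows shaped like a JSON list of records.
rows = [
    {"mpg": 21, "wt": 2.62},
    {"mpg": 21, "wt": 2.875},
    {"mpg": 22.8, "wt": 2.32},
]

print(json.dumps(records_to_columns(rows)))
# prints {"mpg": [21, 21, 22.8], "wt": [2.62, 2.875, 2.32]}
```

The result is a single JSON object with one array per column, which is the shape the jq filter above builds.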

johnkerl commented 3 years ago

In Miller 6 (the as-yet-unreleased Go port) there is now support for JSON arrays. So this works:

mlr --icsv --ojson --from mtcars.csv cut -f mpg,wt then put -q '
  for (k, v in $*) {
    @output_record[k][NR] = v;
  }
  end {
    emit @output_record
  }
'
{
  "mpg": [21, 21, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8, 16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 30.4, 33.9, 21.5, 15.5, 15.2, 13.3, 19.2, 27.3, 26, 30.4, 15.8, 19.7, 15, 21.4],
  "wt": [2.62, 2.875, 2.32, 3.215, 3.44, 3.46, 3.57, 3.19, 3.15, 3.44, 3.44, 4.07, 3.73, 3.78, 5.25, 5.424, 5.345, 2.2, 1.615, 1.835, 2.465, 3.52, 3.435, 3.84, 3.845, 1.935, 2.14, 1.513, 3.17, 2.77, 3.57, 2.78]
}
johnkerl commented 3 years ago

I can also make a verb which does this kind of thing ... or maybe just a recipe item for the Miller docs -- ?

Part of me is tempted to make STAN a file format, so that one could write mlr --icsv --ostan cut -f mpg,wt mtcars.csv. However, STAN isn't a separate file format; it's just JSON. On the third hand ... it would be really neat to have an "un-stan" functionality which would convert the mpg and wt arrays back into tabular format .....

johnkerl commented 3 years ago

Really this is a kind of sideways display. CC @aborruso and @ashmishr with regard to https://github.com/johnkerl/miller/issues/321.

johnkerl commented 3 years ago

A way to reuse this code more easily:

$ cat mkstan.mlr
for (k, v in $*) {
  @output_record[k][NR] = v;
}
end {
  emit @output_record
}

Then

$ mlr --from whatever-file.dat --ojson cut -f x,y then put -q -f mkstan.mlr
johnkerl commented 3 years ago

Anyway.

johnkerl commented 3 years ago

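Thinking more, and having read more: There's more to Stan format than just single-dimensional arrays. So I think I'll do:

  • The mkstan.mlr will work with Miller 6 (let me know if you want me to make you a binary)

  • I'll make an unstan.mlr as well.

  • I'll make new columns-to-arrays and arrays-to-columns verbs which will be source-code implementations of mkstan.mlr and unstan.mlr. So mlr --icsv --ojson cut -f mpg,wt then columns-to-arrays mtcars.csv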

johnkerl commented 3 years ago

For reference (since the Miller 6 port is some weeks/months away from being done):

mkstan.mlr

# ================================================================
# Sample CSV input:
#
#   $ cat input.csv
#   a,b
#   1,4
#   2,5
#   3,6
#
# Invocation:
#
#   $ mlr --icsv --ojson put -q -f mkstan.mlr input.csv
#
# Sample JSON output:
#
#   {
#     "a": [1, 2, 3],
#     "b": [4, 5, 6]
#   }
# ================================================================

for (k, v in $*) {
  @output_record[k][NR] = v;
}
end {
  emit @output_record
}

unstan.mlr

# ================================================================
# Sample JSON input:
#
#   $ cat stan.json
#   {
#     "a": [1, 2, 3],
#     "b": [4, 5, 6]
#   }
#
# Invocation:
#
#   $ mlr --ijson --ocsv put -q -f unstan.mlr stan.json
#
# Output:
#
#   a,b
#   1,4
#   2,5
#   3,6
# ================================================================

# Find array length
n = 0;
for (k, v in $*) {
  n = max(n, length(v));
}
keys = keys($*);

# Emit one record per array entry
for (int i = 1; i <= n; i+=1) {
  map output_record = {};
  for (k in keys) {
    output_record[k] = $[k][i];
  }
  emit output_record;
}
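
The inverse direction that unstan.mlr takes (find the longest array, then emit one record per index) can be sketched in Python as well. This is a rough illustration, not a port of the Miller code: the function name `columns_to_records` is hypothetical, and padding shorter arrays with `None` (an empty CSV cell) is an assumption made here, not something the thread specifies.

```python
import csv
import sys


def columns_to_records(columns):
    """Turn a dict of per-column lists into a list of row dicts.

    Rows run up to the longest column; shorter columns are padded with
    None (an assumption of this sketch).
    """
    n = max((len(values) for values in columns.values()), default=0)
    return [
        {key: (values[i] if i < len(values) else None)
         for key, values in columns.items()}
        for i in range(n)
    ]


# Same sample data as in the unstan.mlr header comment.
stan_data = {"a": [1, 2, 3], "b": [4, 5, 6]}

writer = csv.DictWriter(sys.stdout, fieldnames=list(stan_data), lineterminator="\n")
writer.writeheader()
writer.writerows(columns_to_records(stan_data))
# prints:
# a,b
# 1,4
# 2,5
# 3,6
```
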
git314 commented 3 years ago

Thinking more, and having read more: There's more to Stan format than just single-dimensional arrays. So I think I'll do:

  • The mkstan.mlr will work with Miller 6 (let me know if you want me to make you a binary)

  • I'll make an unstan.mlr as well.

  • I'll make new columns-to-arrays and arrays-to-columns verbs which will be source-code implementations of mkstan.mlr and unstan.mlr. So mlr --icsv --ojson cut -f mpg,wt then columns-to-arrays mtcars.csv

Thanks for all of the thought you put into this. I like the idea of the new verbs.