johnkerl / miller

Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
https://miller.readthedocs.io

Open task for feature requests #77

Closed: johnkerl closed this issue 2 years ago

johnkerl commented 8 years ago

Feel free to open a new task, or just add discussion as a comment here.

jungle-boogie commented 8 years ago

Hi,

This is mostly thinking out loud...

I very much like your tool because it does so much so quickly. There are many tools in the data-processing field, but they rely on higher-level languages and, I'm guessing as a consequence, run a bit slower for me. I have a 7.5MB, 80,000-line file, and Miller can compute a very simple request on it (just count-distinct, then sort -nr count, on postal codes) in 3/4 of a second, producing 42,314 lines.

Other utilities you may be interested in:

- https://github.com/dbohdan/sqawk
- https://github.com/sordina/SQLBong
- https://github.com/harelba/q

And so on.

Have you considered something like sort -flag limit 20? What about JSON output or a table output?

thanks!

johnkerl commented 8 years ago

For sort -flag limit 20 do you mean for example (using 5 or 2 instead of 20 to save screen space here):

$ cat ../data/big.dkvp | mlr sort -nr x then head -n 5
a=zee,b=wye,i=853631,x=0.999999276362359191,y=0.987837822170852586
a=zee,b=eks,i=312012,x=0.999999214429407979,y=0.986796915059348967
a=hat,b=pan,i=616674,x=0.999997778329997256,y=0.962660392263280373
a=eks,b=hat,i=889555,x=0.999997151549904206,y=0.952126099240093504
a=eks,b=hat,i=586109,x=0.999995659571139006,y=0.927050412132893853
$ cat ../data/big.dkvp | mlr top -a -n 5 -f x
a=zee,b=wye,i=853631,x=0.999999276362359191,y=0.987837822170852586
a=zee,b=eks,i=312012,x=0.999999214429407979,y=0.986796915059348967
a=hat,b=pan,i=616674,x=0.999997778329997256,y=0.962660392263280373
a=eks,b=hat,i=889555,x=0.999997151549904206,y=0.952126099240093504
a=eks,b=hat,i=586109,x=0.999995659571139006,y=0.927050412132893853
$ cat ../data/big.dkvp | mlr top -a -n 2 -f x -g a
a=wye,b=wye,i=835662,x=0.999991802964355747,y=0.862232421926330983
a=wye,b=zee,i=558352,x=0.999986721668386203,y=0.776831080567478738
a=eks,b=hat,i=889555,x=0.999997151549904206,y=0.952126099240093504
a=eks,b=hat,i=586109,x=0.999995659571139006,y=0.927050412132893853
a=zee,b=wye,i=853631,x=0.999999276362359191,y=0.987837822170852586
a=zee,b=eks,i=312012,x=0.999999214429407979,y=0.986796915059348967
a=pan,b=pan,i=444121,x=0.999994222540405708,y=0.902898236598306481
a=pan,b=hat,i=743855,x=0.999987163580948080,y=0.784258304994673638
a=hat,b=pan,i=616674,x=0.999997778329997256,y=0.962660392263280373
a=hat,b=hat,i=60978,x=0.999995048623529725,y=0.916782215664527445

The idea is that you can run the sort output through head to keep the desired number of records. But that runs longer than needed: it sorts all records and then throws away all but the top few. So you can instead use Miller's top, which only ever retains the top n (e.g. 5 or 20) records in memory at any one time. The first and second examples just above produce the same output. The third example is a reminder that top works with category fields too, if you want.

johnkerl commented 8 years ago

For "table output" do you mean for example

% cat ../data/big.dkvp | mlr --opprint top -a -n 2 -f x -g a
a   b   i      x                    y
wye wye 835662 0.999991802964355747 0.862232421926330983
wye zee 558352 0.999986721668386203 0.776831080567478738
eks hat 889555 0.999997151549904206 0.952126099240093504
eks hat 586109 0.999995659571139006 0.927050412132893853
zee wye 853631 0.999999276362359191 0.987837822170852586
zee eks 312012 0.999999214429407979 0.986796915059348967
pan pan 444121 0.999994222540405708 0.902898236598306481
pan hat 743855 0.999987163580948080 0.784258304994673638
hat pan 616674 0.999997778329997256 0.962660392263280373
hat hat 60978  0.999995048623529725 0.916782215664527445

This is what the pprint format is for.

johnkerl commented 8 years ago

Regarding JSON, that's another story, and one deserving of a separate write-up. Miller handles tabular data -- iterables of insertion-ordered hashmaps -- lists of records which in turn are lists of fields which are key-value pairs -- these are all various ways to say the same thing. JSON (and XML for that matter) is a far, far richer format -- supporting lists of maps containing lists of lists of lists, or what have you. JSON and XML are recursive data structures, and a tool like Miller is never going to go very far down that road.

One answer is to point you at https://stedolan.github.io/jq/.

Another is that Miller could handle JSON or XML, as long as it's understood that it'll only take lists of maps (i.e. tabular data) and nothing richer. And within that scope, there are two ways to do it: one is JSON support within Miller, and another is pre/post-filtering with something outside of Miller like sed.

To be very specific, a little CSV data such as

a,x
zee,3.1
wye,7.2

could be represented in DKVP by

a=zee,x=3.1
a=wye,x=7.2

and in JSON by

[
  {
    "a": "zee",
    "x": "3.1"
  },
  {
    "a": "wye",
    "x": "7.2"
  }
]

It would be just some regular-expression work to convert JSON to DKVP: remove the square brackets and double-quotes, treat the curly braces as record-delimiters, use colon as pair-separator, and so on.
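
For instance, here's a rough awk sketch along those lines (a sketch only, assuming flat, well-formed input exactly like the above, and a hypothetical file name; anything richer needs a real JSON parser):

$ awk -v RS='}' '
  {
    gsub(/[][{"\n ]/, "")          # strip brackets, braces, quotes, whitespace
    gsub(/:/, "=")                 # colon pair-separator becomes =
    sub(/^,/, ""); sub(/,$/, "")   # trim stray commas at record boundaries
    if (length($0)) print
  }' data.json
a=zee,x=3.1
a=wye,x=7.2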

Likewise, the same data could be represented in XML by

<table>
  <record>
    <a>zee</a>
    <x>3.1</x>
  </record>
  <record>
    <a>wye</a>
    <x>7.2</x>
  </record>
</table>

and again this could be turned into DKVP with some automated reformatting.
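
As a rough illustration (again just a sketch, tied to exactly the flat layout above and a hypothetical file name; real XML wants a real parser):

$ awk -F'[<>]' '
  $2 == "record"  { line = ""; sep = "" }                   # open tag: start a new record
  $2 ~ /^[ax]$/   { line = line sep $2 "=" $3; sep = "," }  # tag name and text become key=value
  $2 == "/record" { print line }                            # close tag: emit the record
' data.xml
a=zee,x=3.1
a=wye,x=7.2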

Again, I'll write this up in more detail sometime soon, but for now that's a sketch.

jungle-boogie commented 8 years ago

Hi @johnkerl,

RE: top vs. head: Yes, your examples are perfect. I had not known either head or top was built in. I was piping through the OS head, which took much longer than either of your solutions. In my case, mlr --csv --rs lf count-distinct -f ZIPCode then top -a -n 20 -f count 5-digitCommercial.csv was slightly faster than Miller's head option.

RE: table output: opprint does format very nicely, but I was asking about an actual table drawn around the data, like a table drawn with --- and ||| around the data.

RE: JSON output: Thanks for your thoughts!

Komosa commented 8 years ago

@jungle-boogie for a nicely formatted ASCII table you may use https://github.com/onyxfish/csvkit. It is written in Python, but you are not going to display too many rows, right? :)
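
For example (assuming csvkit is installed; its csvlook command draws the bordered table, though exact border styles vary by csvkit version):

$ printf 'shape,count\ncircle,3\nsquare,5\n' | csvlook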

As a side note: I really like miller, mostly because it is maintained well :+1:

jungle-boogie commented 8 years ago

Hi @Komosa, yes, csvkit is what I was thinking about with ASCII tables. And yes, you're right: you won't be displaying a table around more than a few dozen rows.

Komosa commented 8 years ago

General question: what do you do when some fields are unavailable (i.e. rows are not homogeneous)? I encounter this problem with stats1.

johnkerl commented 8 years ago

Miller intends to handle heterogeneity as a primary feature. I missed a few spots such as stats1; these are fixed in head and will be included in the next release.

jpc4242 commented 8 years ago

Support for compressed input files (I'm thinking .gz; others probably also apply). This would make it possible to manage directories with hundreds of .csv.gz files using "mlr cat". It would also be very handy for join.

johnkerl commented 8 years ago

True. zcat ... | mlr ... is a poor-man's substitute.
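
Spelled out (with a hypothetical file name):

$ zcat mydata.csv.gz | mlr --icsvlite cat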

jpc4242 commented 8 years ago

(Before anything else: GREAT TOOL!!)

Yes, when working with CSV, if you use pipes you lose the notion of "first line". I.e. zcat 1.csv.gz 2.csv.gz | mlr --icsvlite ..... treats the header line of file 2 as just data, read according to the header of file 1.

Workaround (maybe you want to include it in the FAQ):

$ for i in *.csv.gz; do zcat $i | mlr --icsvlite cat; done | mlr --idkvp .....

My proposal would be to add an additional "compression layer" and be able to do something like:

mlr --input-compress=xz --output-compress=gz --icsvlite --onidx .......

But I just "landed" in your project so I don't know how easy that would be.

Also, I don't know if I should open an issue for the following: dots in field names for filter.

I have a CSV file with fields like "pfx.name". Trying to use that in filter gives "Syntax Error!" with every escaping/quoting I could think of. Probably "mlr rename" is a workaround, but an enhancement would be to be able to use them directly. For example, something like in bash: mlr filter '${pfx.name} == "value"'

johnkerl commented 8 years ago

Re escaping: nice! I should have done this sooner but didn't think of it; glad you asked!

https://github.com/johnkerl/miller/commit/f4b95c439c9efe813b0c1d17cb048233a997e1a3

This will go out in the next minor release (maybe a week from now).

johnkerl commented 8 years ago

Re compression: I'll look around for libraries I can portably use.

johnkerl commented 8 years ago

Meanwhile there's an xargs trick which would do well in the FAQ for gzip-file handling.
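
One plausible shape for it (just a sketch, with hypothetical file names; like the for-loop workaround above, it runs mlr once per file so each file's header is handled correctly):

$ ls *.csv.gz | xargs -I{} sh -c 'zcat {} | mlr --icsvlite cat' | mlr --idkvp cat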

johnkerl commented 8 years ago

P.S. thanks for the compliment -- glad it's working well for you!

johnkerl commented 8 years ago

2nd half of field-name escapes: https://github.com/johnkerl/miller/commit/bc1d5d966855eb38a8a269e82cfdd489f4f01c01

johnkerl commented 8 years ago

Compression support:

jpc4242 commented 8 years ago

CSV unquote on input? I know we have options for output quoting (always, minimal, ...), but I didn't find an option to remove quotes on the input. Rationale: I'm trying to avoid this:

$ cat kk1
id,name,category
34,"miller, the useful tool",data parsing
$ mlr --icsvlite --ifs comma cat kk1
mlr: Header-data length mismatch in file kk1 at line 2.
johnkerl commented 8 years ago

--csvlite doesn't do quoting. It's my non-standard, ad-hoc way to do heterogeneous CSV, without the performance-impacting messiness of handling quoting, embedded commas, etc.

--csv is RFC-compliant CSV & handles double-quoted input & has options for what quoting behavior you want on output:

mlr -h
...
Double-quoting for CSV output:
  --quote-all        Wrap all fields in double quotes
  --quote-none       Do not wrap any fields in double quotes, even if they have
                     OFS or ORS in them
  --quote-minimal    Wrap fields in double quotes only if they have OFS or ORS
                     in them (default)
  --quote-numeric    Wrap fields in double quotes only if they have numbers
                     in them

and

$ mlr --csv --rs lf cat kk1
id,name,category
34,"miller, the useful tool",data parsing
jpc4242 commented 8 years ago

MY FAULT! I forgot about the "--rs" option. Thanks again, jpc

johnkerl commented 8 years ago

No worries. :D

P.S. I will admit, there are a LOT of options. ;)

aaronwolen commented 8 years ago

This is such a great tool; I rely on it more and more every day. Since you're soliciting feature requests, I'd love to be able to reshape data with miller, converting between "wide" and "long" formats. For example:

Wide format

         time           X          Y           Z
1  2009-01-01  0.65473572  2.4520609 -1.46570942
2  2009-01-02 -0.89248112  0.2154713 -2.05357735
3  2009-01-03  0.98012375  1.3179287  4.64248357
4  2009-01-04  0.35397376  3.3765645 -0.25237774
5  2009-01-05  2.19357813  1.3477511  0.09719105

Long format

         time stock       price
1  2009-01-01     X  0.65473572
2  2009-01-02     X -0.89248112
3  2009-01-03     X  0.98012375
4  2009-01-04     X  0.35397376
5  2009-01-05     X  2.19357813
6  2009-01-01     Y  2.45206093
7  2009-01-02     Y  0.21547134
8  2009-01-03     Y  1.31792866
9  2009-01-04     Y  3.37656453
10 2009-01-05     Y  1.34775108
11 2009-01-01     Z -1.46570942
12 2009-01-02     Z -2.05357735
13 2009-01-03     Z  4.64248357
14 2009-01-04     Z -0.25237774
15 2009-01-05     Z  0.09719105

The reshape package for R is a great model.

johnkerl commented 8 years ago

awesome :)

johnkerl commented 8 years ago

mlr reshape with wide-to-long is super-easy, and is already in head (no docs yet, but soon). The long-to-wide is just a smidge more work (and is non-streaming: it will require all input records to be read before producing any output records).

johnkerl commented 8 years ago
$ cat input.tbl
time       X           Y         Z
2009-01-01 0.65473572  2.4520609 -1.46570942
2009-01-02 -0.89248112 0.2154713 -2.05357735
2009-01-03 0.98012375  1.3179287 4.64248357
2009-01-04 0.35397376  3.3765645 -0.25237774
2009-01-05 2.19357813  1.3477511 0.09719105

$ mlr --pprint reshape -i X,Y,Z -o item,price input.tbl
time       item price
2009-01-01 X    0.65473572
2009-01-01 Y    2.4520609
2009-01-01 Z    -1.46570942
2009-01-02 X    -0.89248112
2009-01-02 Y    0.2154713
2009-01-02 Z    -2.05357735
2009-01-03 X    0.98012375
2009-01-03 Y    1.3179287
2009-01-03 Z    4.64248357
2009-01-04 X    0.35397376
2009-01-04 Y    3.3765645
2009-01-04 Z    -0.25237774
2009-01-05 X    2.19357813
2009-01-05 Y    1.3477511
2009-01-05 Z    0.09719105

$ mlr --pprint reshape -r '^[A-Z]$' -o item,price input.tbl
time       item price
2009-01-01 X    0.65473572
2009-01-01 Y    2.4520609
2009-01-01 Z    -1.46570942
2009-01-02 X    -0.89248112
2009-01-02 Y    0.2154713
2009-01-02 Z    -2.05357735
2009-01-03 X    0.98012375
2009-01-03 Y    1.3179287
2009-01-03 Z    4.64248357
2009-01-04 X    0.35397376
2009-01-04 Y    3.3765645
2009-01-04 Z    -0.25237774
2009-01-05 X    2.19357813
2009-01-05 Y    1.3477511
2009-01-05 Z    0.09719105
johnkerl commented 8 years ago

OK @aaronwolen this is in head: see also http://johnkerl.org/miller/doc/reference.html#reshape

Are you a pull-from-head-and-compile person, or should I cut a tag for you to try it out?

aaronwolen commented 8 years ago

Wow, that was fast! Compiling isn't a problem. I'll put it through the wringer and report back.

aaronwolen commented 8 years ago

It works perfectly and (not surprisingly) is incredibly fast. I put together a crude benchmark with a few popular packages for data shaping, and only data.table could match Miller's performance.

This is a killer addition to an awesome tool. Thanks!

johnkerl commented 8 years ago

Glad to hear it!!! :)

BTW, re cutting a release ... I've made the perhaps-mistake of biting off several distinct large pieces of work (JSON I/O, out-of-stream variables for mlr put, and awk-style begin/end/pattern-action blocks for mlr put) all of which are in some state of flux, and don't plan to cut a new release until 4.0.0 which will surely be some weeks away.

johnkerl commented 8 years ago

P.S. @aaronwolen you should re-pull from head. There is a memory-management bugfix in reshape, of which I only happened to observe symptoms in JSON format.

aaronwolen commented 8 years ago

What do you think about working with nested columns?

 x     y
 1     a
 2 d,e,f
 3   g,h

Where something like mlr unnest --fs comma -f y would produce

x     y
1     a
2     d
2     e
2     f
3     g
3     h
johnkerl commented 8 years ago

Yes, this is like @Komosa's explode on https://github.com/johnkerl/miller/issues/96.

It's very easy to do as long as the punctuators are defined and different from = and ,. If they aren't, it's undefined where one field ends and the next one starts.

johnkerl commented 8 years ago

The difference between your example and @Komosa's is that yours replicates across multiple records, whereas his replicates across multiple fields within the same record.

aaronwolen commented 8 years ago

His example is somewhat complicated by the regex capturing, but being able to deal with replicates in both directions would be really useful. Here's a real-world example I was working on today, which contains 2 different measures of association between pairs of genetic markers (rs*):

rs10    rs2237570,0.3,-0.72;rs117465896,0.75,-0.88
rs10041592  rs1628466,0.21,0.92;rs1729036,0.24,1;rs79177551,0.2,0.85
rs10033329  rs4835294,0.2,0.46

There are 2 fields, tab-delimited. Records in the second field provide the association results for all markers relevant to the marker in field 1. Converting this format into a "long" format requires 2 operations. First comes the unnesting step I described above, which spreads the semicolon-separated replicates across new records:

rs10        rs2237570,0.3,-0.72
rs10        rs117465896,0.75,-0.88
rs10041592  rs1628466,0.21,0.92
rs10041592  rs1729036,0.24,1
rs10041592  rs79177551,0.2,0.85
rs10033329  rs4835294,0.2,0.46

followed by a separating step (as described by @Komosa) to spread the comma-separated replicates across new fields:

rs10        rs2237570    0.3   -0.72
rs10        rs117465896  0.75  -0.88
rs10041592  rs1628466    0.21  0.92
rs10041592  rs1729036    0.24  1
rs10041592  rs79177551   0.2   0.85
rs10033329  rs4835294    0.2   0.46
johnkerl commented 8 years ago

sweet

johnkerl commented 8 years ago

OK looking at this more today. These are both super-simple; working on them now.

johnkerl commented 8 years ago

@aaronwolen and @Komosa there are explode-variants on https://github.com/johnkerl/miller/commit/9b85f1f22f81683a1dc50b84d43ffba97535d85a. I think this covers them all; please let me know if I'm wrong.

The implode-variants (undoing explode; necessarily non-streaming) are up next & will be almost as easy. This is a lot like reshape, as it turns out.

aaronwolen commented 8 years ago

Still testing it out and learning the syntax, but it looks great so far. Could you clarify how the nested record/field separators should be specified?

I see --nested-ps (for key-value pairs, I'm assuming) and --nested-fs for fields, but no --nested-rs for records?

I created a nested test file based on your colored-shapes.dkvp that looks like:

shape   color
circle  red|green|purple|orange|yellow|blue
square  orange|red|yellow|blue|purple|green
triangle    yellow|blue|red|purple|orange|green

The following works fine if colors are ;-separated but I'm not sure how to specify a different FS.

mlr --csvlite --fs tab \
    nest --explode --values --across-records -f color  \
    color-shapes-nested.tsv
johnkerl commented 8 years ago
$ cat mydata.tsv
shape   color
circle  red|green|purple|orange|yellow|blue
square  orange|red|yellow|blue|purple|green
triangle        yellow|blue|red|purple|orange|green

$ mlr --csvlite --fs tab nest --explode --values --across-records -f color --nested-fs '|' mydata.tsv
shape   color
circle  red
circle  green
circle  purple
circle  orange
circle  yellow
circle  blue
square  orange
square  red
square  yellow
square  blue
square  purple
square  green
triangle        yellow
triangle        blue
triangle        red
triangle        purple
triangle        orange
triangle        green

$ mlr --csvlite --fs tab nest --explode --values --across-records -f color --nested-fs pipe mydata.tsv
shape   color
circle  red
circle  green
circle  purple
circle  orange
circle  yellow
circle  blue
square  orange
square  red
square  yellow
square  blue
square  purple
square  green
triangle        yellow
triangle        blue
triangle        red
triangle        purple
triangle        orange
triangle        green

My bad: I had inadvertently failed to invoke the part that replaces pipe with |. You should get it if you re-pull from head.

johnkerl commented 8 years ago

Also, I'm still working on some logic which explodes across fields by putting the exploded fields where the original was, rather than at the end of the record.

johnkerl commented 8 years ago

OK @aaronwolen , done with the in-place mods. This is currently in head:

Input:

$ cat w.tsv
a b
rs10  rs2237570,0.3,-0.72;rs117465896,0.75,-0.88
rs10041592  rs1628466,0.21,0.92;rs1729036,0.24,1;rs79177551,0.2,0.85
rs10033329  rs4835294,0.2,0.46

First step:

$ mlr --csvlite --fs tab nest --explode --values --across-records -f b w.tsv
a b
rs10  rs2237570,0.3,-0.72
rs10  rs117465896,0.75,-0.88
rs10041592  rs1628466,0.21,0.92
rs10041592  rs1729036,0.24,1
rs10041592  rs79177551,0.2,0.85
rs10033329  rs4835294,0.2,0.46

Second step:

$ mlr --csvlite --fs tab nest --explode --values --across-records -f b \
    then nest --nested-fs comma --explode --values --across-fields -f b w.tsv
a b_1 b_2 b_3
rs10  rs2237570 0.3 -0.72
rs10  rs117465896 0.75  -0.88
rs10041592  rs1628466 0.21  0.92
rs10041592  rs1729036 0.24  1
rs10041592  rs79177551  0.2 0.85
rs10033329  rs4835294 0.2 0.46

along with:

$ mlr nest -h
Usage: mlr nest [options]
Explodes specified field values into separate fields/records, or reverses this.
Options:
  --explode,--implode   One is required.
  --values,--pairs      One is required.
  --across-records,--across-fields One is required.
  -f {field name}       Required.
  --nested-fs {string}  Defaults to ";". Field separator for nested values.
  --nested-ps {string}  Defaults to ":". Pair separator for nested key-value pairs.
Please use "mlr --usage-separator-options" for information on specifying separators.

Examples:

  mlr nest --explode --values --across-records -f x
  with input record "x=a;b;c,y=d" produces output records
    "x=a,y=d"
    "x=b,y=d"
    "x=c,y=d"
  Use --implode to do the reverse.

  mlr nest --explode --values --across-fields -f x
  with input record "x=a;b;c,y=d" produces output records
    "x_1=a,x_2=b,x_3=c,y=d"
  Use --implode to do the reverse.

  mlr nest --explode --pairs --across-records -f x
  with input record "x=a:1;b:2;c:3,y=d" produces output records
    "a=1,y=d"
    "b=2,y=d"
    "c=3,y=d"

  mlr nest --explode --pairs --across-fields -f x
  with input record "x=a:1;b:2;c:3,y=d" produces output records
    "a=1,b=2,c=3,y=d"

Notes:
* With --pairs, --implode doesn't make sense since the original field name has
  been lost.
* The combination "--implode --values --across-records" is non-streaming:
  no output records are produced until all input records have been read. In
  particular, this means it won't work in tail -f contexts. But all other flag
  combinations result in streaming (tail -f friendly) data processing.
* It's up to you to ensure that the nested-fs is distinct from your data's IFS:
  e.g. by default the former is semicolon and the latter is comma.
aaronwolen commented 8 years ago

Not sure how I missed this but I've been trying out the latest version and the nest functionality works perfectly. This has really simplified my workflow. Thanks, @johnkerl!

johnkerl commented 8 years ago

Back at you @aaronwolen -- thanks for helping make Miller better with some great feature requests!!

severach commented 8 years ago

Can we get another quoting option, "--quote-original"? Better spreadsheets detect quotes around numbers and import them as text instead of numbers, without lots of hassle. Quote-original would keep the same quoting as the source document.

Starting with "FOO,BAR","1:30",1:40,1.50,"0616685"

Should not result in any of the following

"FOO,BAR",1:30,1:40,1.50,0616685 1:30 might be converted to a time when it should be text 0616685 will become numeric and lose the leading zero

"FOO,BAR","1:30","1:40","1.50","0616685" 1:40 won't be converted to a time 1.50 will become text and won't behave as a dollar amount should

With only 4 columns I can manually specify text. I can't with hundreds of columns.
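
To sketch what I'm after, with the proposed (not-yet-existing) flag and some hypothetical header names:

$ cat report.csv
name,t1,t2,amt,code
"FOO,BAR","1:30",1:40,1.50,"0616685"
$ mlr --csv --quote-original cat report.csv
name,t1,t2,amt,code
"FOO,BAR","1:30",1:40,1.50,"0616685"

That is: fields quoted on input stay quoted on output, and unquoted fields stay unquoted.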

johnkerl commented 8 years ago

OK. There are two issues here: conversion or non-conversion of strings to other things (like integers), and preservation of double-quotes.

For the latter issue, I can think about putting a was-quoted bit in the stream-record fields. Then, for fields which were present in the input data, quote them if that bit is set and if --quote-original is specified on the command line. For fields not present in the input data -- e.g. those computed using the put DSL -- I'm less interested in OR'ing or AND'ing the was-quoted bits of the fields. E.g. with mlr put '$c = $a . $b' if field a was quoted and b wasn't, should c be quoted or not?

For the former issue, Miller is already quite insistent on preserving text when it can. Verbs in general don't convert to number unless they need to -- e.g. stats1 -f fields, or sort -n. The exceptions are filter and put, which by default do type-inferencing: for input x=123, field x will be parsed as an integer. But this can be suppressed using put -S and filter -S, where (for example) field x would have the string value "123" and you can cast using int($x) as you choose.
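
A quick illustration of the difference (just a sketch; DKVP input on stdin, with . being the DSL's string-concatenation operator):

$ echo 'x=123' | mlr put '$y = $x + 1'       # x is type-inferred as an integer
x=123,y=124
$ echo 'x=123' | mlr put -S '$y = $x . "4"'  # x is kept as the string "123"
x=123,y=1234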

@severach can you give me some example Miller commands in your use case(s), to help me see which of these two (or both) are at play here?

severach commented 8 years ago

My report generator made a CSV file of about 164MB. No spreadsheet can handle a file this large. My plan was to extract with mlr about 6 of the 50 or so fields of interest, then use the spreadsheet to trim out unwanted records. This plan was halted when the quotes were removed from some of the numeric fields.

At 6 fields I could have told the spreadsheet which were numeric and which were text but this wouldn't be practical for a lot more fields or files sent to others who aren't aware of the need for keeping data types proper.

johnkerl commented 8 years ago

Did you try --quote-all? And, can you copy/paste any miller commands you tried?

severach commented 8 years ago

Correction to above: This plan was halted when quotes were removed from fields that must be text and contain some numbers that are sure to be mangled by spreadsheets.

--quote-all is no good as that would put quotes on the numeric values. Quotes on dollar amounts make them import as text. They don't justify right or add properly.

I'm using the --csv option. There are no appropriate sample commands, since the quote options are performing exactly as specified; it's just that none of them work for me.

The idea here is that the spreadsheet recognizes quotes and always imports quoted text as text, no matter what the contents, and always imports unquoted text as money, counts, times, dates, and other numeric fields, only falling back to text when all attempts at recognizing some type of number have been exhausted. The generator has been designed to take advantage of this feature: it puts quotes on the fields that need to be text and leaves quotes off the fields that need to be numbers.

Imagine a field that contains both text and numbers, like a facility code:

A01
005
B01
006

Leaving the quotes out allows some values to come in as numbers. The leading zeros are destructively trimmed, and the values are turned into numbers that justify right. With enough data, even stuff that looks like text can get converted to some wacky number, because it happens to look like one of the many spreadsheet number formats.

johnkerl commented 8 years ago

OK from "My plan was to extract with mlr about 6 of the 50 or so fields of interest ..." the command I'm guessing is mlr cut. Sounds like computing fields is a non-issue since you're just cutting out the ones you need so we don't need to go down that road. Then a --quote-original combined with per-field was-quoted-on-input flag would suffice to solve your problem.

This sounds pretty easy -- not this week because RL but maybe next week I can draft something up -- ?