Closed: johnkerl closed this issue 2 years ago.
Hi,
This is mostly thinking out loud...
I very much like your tool because it does so much so quickly. There are many tools in the data-processing field, but they rely on higher-level languages and, I'm guessing as a consequence, they run a bit slower for me. I have a 7.5MB, 80,000-line file, and Miller can compute a very simple request (just count-distinct then sort -nr count
on postal codes) in 3/4 of a second, yielding 42,314 lines.
Other utilities you may be interested in: https://github.com/dbohdan/sqawk https://github.com/sordina/SQLBong https://github.com/harelba/q And so on.
Have you considered something like sort -flag limit 20? What about JSON output or a table output?
thanks!
For sort -flag limit 20, do you mean, for example (using 5 or 2 instead of 20 to save screen space here):
$ cat ../data/big.dkvp | mlr sort -nr x then head -n 5
a=zee,b=wye,i=853631,x=0.999999276362359191,y=0.987837822170852586
a=zee,b=eks,i=312012,x=0.999999214429407979,y=0.986796915059348967
a=hat,b=pan,i=616674,x=0.999997778329997256,y=0.962660392263280373
a=eks,b=hat,i=889555,x=0.999997151549904206,y=0.952126099240093504
a=eks,b=hat,i=586109,x=0.999995659571139006,y=0.927050412132893853
$ cat ../data/big.dkvp | mlr top -a -n 5 -f x
a=zee,b=wye,i=853631,x=0.999999276362359191,y=0.987837822170852586
a=zee,b=eks,i=312012,x=0.999999214429407979,y=0.986796915059348967
a=hat,b=pan,i=616674,x=0.999997778329997256,y=0.962660392263280373
a=eks,b=hat,i=889555,x=0.999997151549904206,y=0.952126099240093504
a=eks,b=hat,i=586109,x=0.999995659571139006,y=0.927050412132893853
$ cat ../data/big.dkvp | mlr top -a -n 2 -f x -g a
a=wye,b=wye,i=835662,x=0.999991802964355747,y=0.862232421926330983
a=wye,b=zee,i=558352,x=0.999986721668386203,y=0.776831080567478738
a=eks,b=hat,i=889555,x=0.999997151549904206,y=0.952126099240093504
a=eks,b=hat,i=586109,x=0.999995659571139006,y=0.927050412132893853
a=zee,b=wye,i=853631,x=0.999999276362359191,y=0.987837822170852586
a=zee,b=eks,i=312012,x=0.999999214429407979,y=0.986796915059348967
a=pan,b=pan,i=444121,x=0.999994222540405708,y=0.902898236598306481
a=pan,b=hat,i=743855,x=0.999987163580948080,y=0.784258304994673638
a=hat,b=pan,i=616674,x=0.999997778329997256,y=0.962660392263280373
a=hat,b=hat,i=60978,x=0.999995048623529725,y=0.916782215664527445
The idea is that you can run the sort output through head to keep the desired number of records. But that runs longer than needed: it sorts all records and then throws away all but the top few. So you can instead use Miller's top, which only ever retains in memory the top n (e.g. 5 or 20) records at any one time. The output is the same between the first and second examples just above. The third example is a reminder that top works with category fields too, if you want.
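For intuition, the bounded-memory retention that top does can be sketched in a few lines of Python with a min-heap; this is just an illustration of the technique, not Miller's actual C implementation:

```python
import heapq

def top_n(records, n, key):
    """Stream records, retaining only the current top-n by key in memory."""
    heap = []  # min-heap of (key, index, record); the unique index breaks ties
    for i, rec in enumerate(records):
        item = (key(rec), i, rec)
        if len(heap) < n:
            heapq.heappush(heap, item)
        elif item[0] > heap[0][0]:
            # new record beats the smallest retained one: swap it in
            heapq.heapreplace(heap, item)
    # emit in descending order, like mlr top
    return [rec for _, _, rec in sorted(heap, reverse=True)]

records = [{"x": v} for v in [0.2, 0.9, 0.5, 0.7, 0.1]]
print(top_n(records, 2, key=lambda r: r["x"]))  # the two largest-x records
```

(Python's `heapq.nlargest` does the same thing in one call; the explicit heap just makes the bounded memory use visible.)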
For "table output" do you mean for example
% cat ../data/big.dkvp | mlr --opprint top -a -n 2 -f x -g a
a b i x y
wye wye 835662 0.999991802964355747 0.862232421926330983
wye zee 558352 0.999986721668386203 0.776831080567478738
eks hat 889555 0.999997151549904206 0.952126099240093504
eks hat 586109 0.999995659571139006 0.927050412132893853
zee wye 853631 0.999999276362359191 0.987837822170852586
zee eks 312012 0.999999214429407979 0.986796915059348967
pan pan 444121 0.999994222540405708 0.902898236598306481
pan hat 743855 0.999987163580948080 0.784258304994673638
hat pan 616674 0.999997778329997256 0.962660392263280373
hat hat 60978 0.999995048623529725 0.916782215664527445
This is what the pprint format is for.
Regarding JSON, that's another story, and one deserving of a separate write-up. Miller handles tabular data -- iterables of insertion-ordered hashmaps, i.e. lists of records which in turn are lists of fields which are key-value pairs; these are all various ways to say the same thing. JSON (and XML for that matter) is a far, far richer format -- supporting lists of maps containing lists of lists of lists, or what have you. JSON and XML are recursive data structures, and a tool like Miller is never going to go very far down that road.
One answer is to point you at https://stedolan.github.io/jq/.
Another is that Miller could handle JSON or XML, as long as it's understood that it'll only take lists of maps (i.e. tabular data) and nothing richer. And within that scope, there are two ways to do it: one is JSON support within Miller, and another is pre/post-filtering with something outside of Miller, like sed.
To be very specific, a little CSV data such as
a,x
zee,3.1
wye,7.2
could be represented in DKVP by
a=zee,x=3.1
a=wye,x=7.2
and in JSON by
[
{
"a": "zee",
"x": "3.1"
},
{
"a": "wye",
"x": "7.2"
}
]
It would be just some regular-expression work to convert JSON to DKVP: remove the square brackets and double-quotes, treat the curly braces as record-delimiters, use colon as pair-separator, and so on.
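As a concrete sketch of that pre-filtering idea (hypothetical, outside of Miller), here it is done with a real JSON parser rather than regexes, which sidesteps quoting edge cases:

```python
import json

def json_to_dkvp(text):
    """Convert a JSON array of flat maps (i.e. tabular data) to DKVP lines.
    Assumes no nesting -- exactly the lists-of-maps restriction discussed."""
    records = json.loads(text)
    return "\n".join(
        ",".join(f"{k}={v}" for k, v in rec.items()) for rec in records
    )

doc = '[{"a": "zee", "x": "3.1"}, {"a": "wye", "x": "7.2"}]'
print(json_to_dkvp(doc))  # two DKVP lines, one per record
```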
Likewise, the same data could be represented in XML by
<table>
<record>
<a>zee</a>
<x>3.1</x>
</record>
<record>
<a>wye</a>
<x>7.2</x>
</record>
</table>
and again this could be turned into DKVP with some automated reformatting.
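The XML flavor of the same pre-filter can be sketched similarly (again just an illustration, assuming the table/record layout shown above):

```python
import xml.etree.ElementTree as ET

def xml_to_dkvp(text):
    """Flatten <table><record>...</record></table> XML into DKVP lines.
    Each child element of a record becomes one key=value field."""
    root = ET.fromstring(text)
    lines = []
    for record in root.findall("record"):
        lines.append(",".join(f"{child.tag}={child.text}" for child in record))
    return "\n".join(lines)

doc = ("<table><record><a>zee</a><x>3.1</x></record>"
       "<record><a>wye</a><x>7.2</x></record></table>")
print(xml_to_dkvp(doc))  # two DKVP lines, one per <record>
```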
Again, I'll write this up in more detail sometime soon, but for now that's a sketch.
Hi @johnkerl,
RE: top vs. head
Yes, your examples are perfect. I had not known either head or top was built in. I was piping through the OS head, which made the results take much longer than either of your solutions.
In my case, mlr --csv --rs lf count-distinct -f ZIPCode then top -a -n 20 -f count 5-digitCommercial.csv
was slightly faster than Miller's head option.
RE: table output: opprint does format very nicely, but I was asking about an actual border around the data, like a table drawn with --- and ||| around the data.
RE: json output Thanks for your thoughts!
@jungle-boogie for a nicely formatted ASCII table you may use https://github.com/onyxfish/csvkit . It is written in Python, but you are not going to display too many rows, right? :)
As a side note: I really like miller, mostly because it is maintained well :+1:
Hi @Komosa, yes, csvkit is what I was thinking about with ASCII tables. And yes, you're right, you won't draw a table around more than a few dozen rows.
General question: what do you do when some fields are unavailable (i.e. rows are not homogeneous)? I encounter this problem with stats1.
Miller intends to handle heterogeneity as a primary feature. I missed a few spots, such as stats1, which are fixed in head and will be included in the next release.
Support for compressed input files (I'm thinking .gz; others probably also apply). This would make it possible to manage directories with hundreds of .csv.gz files using "mlr cat". Also very handy for join.
True. zcat ... | mlr ... is a poor man's substitute.
(Before anything else: GREAT TOOL!!)
Yes, when working with CSV, if you use pipes you lose the reference to the "first line". I.e. zcat 1.csv.gz 2.csv.gz | mlr --icsvlite ..... treats the header line of file 2 as just data under the header of file 1.
Workaround (maybe you want to include it in the FAQ): for i in *.csv.gz; do zcat $i | mlr --icsvlite cat; done | mlr --idkvp .....
My proposal would be to add an additional "compression layer" and be able to do something like:
mlr --input-compress=xz --output-compress=gz --icsvlite --onidx .......
But I just "landed" in your project so I don't know how easy that would be.
Also, I don't know if I should open an issue for the following: dots in field names for filter.
I have a CSV file with fields like "pfx.name". Trying to use that in filter gives "Syntax Error!" with every escaping/quoting I could think of. "mlr rename" is probably a workaround, but an enhancement would be to be able to use such names directly -- for example, something like in bash: mlr filter '${pfx.name} == "value"'
Re escaping: nice! I should have done this sooner but didn't think of it; glad you asked!
https://github.com/johnkerl/miller/commit/f4b95c439c9efe813b0c1d17cb048233a997e1a3
This will go out in the next minor release (maybe a week from now).
Re compression: I'll look around for libraries I can portably use.
Meanwhile there's an xargs trick which would do well in the FAQ for gzip-file handling.
P.S. thanks for the compliment -- glad it's working well for you!
2nd half of field-name escapes: https://github.com/johnkerl/miller/commit/bc1d5d966855eb38a8a269e82cfdd489f4f01c01
CSV unquote on input? I know we have options for output quoting (always, minimal, ...), but I didn't find an option to remove quotes on the input. Rationale: I'm trying to avoid this:
--csvlite doesn't do quoting. It's my non-standard, ad-hoc way to do heterogeneous CSV, without the performance-impacting messiness of handling quoting, embedded commas, etc.
--csv is RFC-compliant CSV: it handles double-quoted input and has options for what quoting behavior you want on output:
mlr -h
...
Double-quoting for CSV output:
--quote-all Wrap all fields in double quotes
--quote-none Do not wrap any fields in double quotes, even if they have
OFS or ORS in them
--quote-minimal Wrap fields in double quotes only if they have OFS or ORS
in them (default)
--quote-numeric Wrap fields in double quotes only if they have numbers
in them
and
$ mlr --csv --rs lf cat kk1
id,name,category
34,"miller, the useful tool",data parsing
MY FAULT! I forgot about the "--rs" option. Thanks again, jpc
No worries. :D
P.S. I will admit, there are a LOT of options. ;)
This is such a great tool; I rely on it more and more every day. Since you're soliciting feature requests, I'd love to be able to reshape data with miller, converting between "wide" and "long" formats. For example:
Wide format
time X Y Z
1 2009-01-01 0.65473572 2.4520609 -1.46570942
2 2009-01-02 -0.89248112 0.2154713 -2.05357735
3 2009-01-03 0.98012375 1.3179287 4.64248357
4 2009-01-04 0.35397376 3.3765645 -0.25237774
5 2009-01-05 2.19357813 1.3477511 0.09719105
Long format
time stock price
1 2009-01-01 X 0.65473572
2 2009-01-02 X -0.89248112
3 2009-01-03 X 0.98012375
4 2009-01-04 X 0.35397376
5 2009-01-05 X 2.19357813
6 2009-01-01 Y 2.45206093
7 2009-01-02 Y 0.21547134
8 2009-01-03 Y 1.31792866
9 2009-01-04 Y 3.37656453
10 2009-01-05 Y 1.34775108
11 2009-01-01 Z -1.46570942
12 2009-01-02 Z -2.05357735
13 2009-01-03 Z 4.64248357
14 2009-01-04 Z -0.25237774
15 2009-01-05 Z 0.09719105
The reshape package for R is a great model.
awesome :)
mlr reshape with wide-to-long is super-easy, and is already in head (no docs yet, but soon). Long-to-wide is just a smidge more work (and is non-streaming: it requires all input records to be read before producing any output records).
$ cat input.tbl
time X Y Z
2009-01-01 0.65473572 2.4520609 -1.46570942
2009-01-02 -0.89248112 0.2154713 -2.05357735
2009-01-03 0.98012375 1.3179287 4.64248357
2009-01-04 0.35397376 3.3765645 -0.25237774
2009-01-05 2.19357813 1.3477511 0.09719105
$ mlr --pprint reshape -i X,Y,Z -o item,price input.tbl
time item price
2009-01-01 X 0.65473572
2009-01-01 Y 2.4520609
2009-01-01 Z -1.46570942
2009-01-02 X -0.89248112
2009-01-02 Y 0.2154713
2009-01-02 Z -2.05357735
2009-01-03 X 0.98012375
2009-01-03 Y 1.3179287
2009-01-03 Z 4.64248357
2009-01-04 X 0.35397376
2009-01-04 Y 3.3765645
2009-01-04 Z -0.25237774
2009-01-05 X 2.19357813
2009-01-05 Y 1.3477511
2009-01-05 Z 0.09719105
$ mlr --pprint reshape -r '^[A-Z]$' -o item,price input.tbl
time item price
2009-01-01 X 0.65473572
2009-01-01 Y 2.4520609
2009-01-01 Z -1.46570942
2009-01-02 X -0.89248112
2009-01-02 Y 0.2154713
2009-01-02 Z -2.05357735
2009-01-03 X 0.98012375
2009-01-03 Y 1.3179287
2009-01-03 Z 4.64248357
2009-01-04 X 0.35397376
2009-01-04 Y 3.3765645
2009-01-04 Z -0.25237774
2009-01-05 X 2.19357813
2009-01-05 Y 1.3477511
2009-01-05 Z 0.09719105
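Long-to-wide is the non-streaming direction; a hypothetical Python sketch (an illustration, not Miller's internals) shows why every input record must be buffered before any output row is complete:

```python
def long_to_wide(records, key_field, value_field):
    """Pivot long-format records to wide format. Non-streaming by nature:
    a given output row isn't complete until all its input records are seen."""
    out = {}
    for rec in records:
        # the remaining fields (e.g. time) identify the output row
        group = tuple((k, v) for k, v in rec.items()
                      if k not in (key_field, value_field))
        row = out.setdefault(group, dict(group))
        row[rec[key_field]] = rec[value_field]
    return list(out.values())

long_recs = [
    {"time": "2009-01-01", "item": "X", "price": 0.65},
    {"time": "2009-01-01", "item": "Y", "price": 2.45},
    {"time": "2009-01-02", "item": "X", "price": -0.89},
    {"time": "2009-01-02", "item": "Y", "price": 0.22},
]
print(long_to_wide(long_recs, "item", "price"))
```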
OK @aaronwolen this is in head: see also http://johnkerl.org/miller/doc/reference.html#reshape
Are you a pull-from-head-and-compile person, or should I cut a tag for you to try it out?
Wow, that was fast! Compiling isn't a problem. I'll put it through the wringer and report back.
It works perfectly and (not surprisingly) is incredibly fast. I put together a crude benchmark with a few popular packages for data shaping and only data.table could match miller's performance.
This is a killer addition to an awesome tool. Thanks!
Glad to hear it!!! :)
BTW, re cutting a release ... I've made the perhaps-mistake of biting off several distinct large pieces of work (JSON I/O, out-of-stream variables for mlr put, and awk-style begin/end/pattern-action blocks for mlr put), all of which are in some state of flux, and I don't plan to cut a new release until 4.0.0, which will surely be some weeks away.
P.S. @aaronwolen you should re-pull from head. There is a memory-management bugfix in reshape, of which I only happened to observe symptoms in JSON format.
What do you think about working with nested columns?
x y
1 a
2 d,e,f
3 g,h
Where something like mlr unnest --fs comma -f y would produce
x y
1 a
2 d
2 e
2 f
3 g
3 h
Yes, this is like @komosa's explode on https://github.com/johnkerl/miller/issues/96.
It's very easy to do as long as the punctuators are defined, and different from = and ,. If they aren't, it's undefined where one field ends and the next one starts.
The difference between your example & @komosa's is yours replicates across multiple records, whereas his replicates across multiple fields within the same record.
His example is somewhat complicated by the regex capturing, but being able to deal with replicates in both directions would be really useful. Here's a real-world example I was working on today, which contains 2 different measures of association between pairs of genetic markers (rs*):
rs10 rs2237570,0.3,-0.72;rs117465896,0.75,-0.88
rs10041592 rs1628466,0.21,0.92;rs1729036,0.24,1;rs79177551,0.2,0.85
rs10033329 rs4835294,0.2,0.46
There are 2 fields, tab-delimited. Entries in the second field provide the association results for all markers relevant to the marker in field 1. Converting this format into a "long" format requires 2 operators. First, the unnesting step I described above, which spreads the semicolon-separated replicates across new records:
rs10 rs2237570,0.3,-0.72
rs10 rs117465896,0.75,-0.88
rs10041592 rs1628466,0.21,0.92
rs10041592 rs1729036,0.24,1
rs10041592 rs79177551,0.2,0.85
rs10033329 rs4835294,0.2,0.46
followed by a separating step (as described by @komosa) to spread the comma-separated replicates across new fields:
rs10 rs2237570 0.3 -0.72
rs10 rs117465896 0.75 -0.88
rs10041592 rs1628466 0.21 0.92
rs10041592 rs1729036 0.24 1
rs10041592 rs79177551 0.2 0.85
rs10033329 rs4835294 0.2 0.46
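The two steps can be sketched outside Miller too -- a hypothetical Python illustration of the record-wise and field-wise explosions (not what Miller itself does internally; this version appends the numbered fields at the end of the record rather than in place):

```python
def explode_across_records(rec, field, sep=";"):
    """One record becomes several: one per sep-delimited value of the field."""
    for part in rec[field].split(sep):
        new = dict(rec)
        new[field] = part
        yield new

def explode_across_fields(rec, field, sep=","):
    """One field becomes several numbered fields within the same record."""
    new = dict(rec)
    parts = new.pop(field).split(sep)
    for i, part in enumerate(parts, 1):
        new[f"{field}_{i}"] = part
    return new

rec = {"a": "rs10", "b": "rs2237570,0.3,-0.72;rs117465896,0.75,-0.88"}
rows = [explode_across_fields(r, "b") for r in explode_across_records(rec, "b")]
print(rows)  # two records, each with b_1, b_2, b_3
```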
sweet
OK looking at this more today. These are both super-simple; working on them now.
@aaronwolen and @Komosa there are explode-variants on https://github.com/johnkerl/miller/commit/9b85f1f22f81683a1dc50b84d43ffba97535d85a. I think this covers them all; please let me know if I'm wrong.
The implode variants (undoing explode; necessarily non-streaming) are up next and will be almost as easy. This is a lot like reshape, as it turns out.
Still testing it out and learning the syntax but it looks great so far. Could you clarify how the nested record/field separators should be specified?
I see --nested-ps (for key-value pairs, I'm assuming) and --nested-fs for fields, but no --nested-rs for records?
I created a nested test file based on your colored-shapes.dkvp that looks like:
shape color
circle red|green|purple|orange|yellow|blue
square orange|red|yellow|blue|purple|green
triangle yellow|blue|red|purple|orange|green
The following works fine if the colors are ;-separated, but I'm not sure how to specify a different FS.
mlr --csvlite --fs tab \
nest --explode --values --across-records -f color \
color-shapes-nested.tsv
$ cat mydata.tsv
shape color
circle red|green|purple|orange|yellow|blue
square orange|red|yellow|blue|purple|green
triangle yellow|blue|red|purple|orange|green
$ mlr --csvlite --fs tab nest --explode --values --across-records -f color --nested-fs '|' mydata.tsv
shape color
circle red
circle green
circle purple
circle orange
circle yellow
circle blue
square orange
square red
square yellow
square blue
square purple
square green
triangle yellow
triangle blue
triangle red
triangle purple
triangle orange
triangle green
$ mlr --csvlite --fs tab nest --explode --values --across-records -f color --nested-fs pipe mydata.tsv
shape color
circle red
circle green
circle purple
circle orange
circle yellow
circle blue
square orange
square red
square yellow
square blue
square purple
square green
triangle yellow
triangle blue
triangle red
triangle purple
triangle orange
triangle green
My bad: I had inadvertently failed to invoke the part that replaces pipe with |. You should get it if you re-pull from head.
Also, I'm still working on some logic which explodes across fields by putting the exploded fields where the original was, rather than at the end of the record.
OK @aaronwolen, done with the in-place mods. This is currently in head:
Input:
$ cat w.tsv
a b
rs10 rs2237570,0.3,-0.72;rs117465896,0.75,-0.88
rs10041592 rs1628466,0.21,0.92;rs1729036,0.24,1;rs79177551,0.2,0.85
rs10033329 rs4835294,0.2,0.46
First step:
$ mlr --csvlite --fs tab nest --explode --values --across-records -f b w.tsv
a b
rs10 rs2237570,0.3,-0.72
rs10 rs117465896,0.75,-0.88
rs10041592 rs1628466,0.21,0.92
rs10041592 rs1729036,0.24,1
rs10041592 rs79177551,0.2,0.85
rs10033329 rs4835294,0.2,0.46
Second step:
$ mlr --csvlite --fs tab nest --explode --values --across-records -f b
then nest --nested-fs comma --explode --values --across-fields -f b w.tsv
a b_1 b_2 b_3
rs10 rs2237570 0.3 -0.72
rs10 rs117465896 0.75 -0.88
rs10041592 rs1628466 0.21 0.92
rs10041592 rs1729036 0.24 1
rs10041592 rs79177551 0.2 0.85
rs10033329 rs4835294 0.2 0.46
along with:
$ mlr nest -h
Usage: mlr nest [options]
Explodes specified field values into separate fields/records, or reverses this.
Options:
--explode,--implode One is required.
--values,--pairs One is required.
--across-records,--across-fields One is required.
-f {field name} Required.
--nested-fs {string} Defaults to ";". Field separator for nested values.
--nested-ps {string} Defaults to ":". Pair separator for nested key-value pairs.
Please use "mlr --usage-separator-options" for information on specifying separators.
Examples:
mlr nest --explode --values --across-records -f x
with input record "x=a;b;c,y=d" produces output records
"x=a,y=d"
"x=b,y=d"
"x=c,y=d"
Use --implode to do the reverse.
mlr nest --explode --values --across-fields -f x
with input record "x=a;b;c,y=d" produces output records
"x_1=a,x_2=b,x_3=c,y=d"
Use --implode to do the reverse.
mlr nest --explode --pairs --across-records -f x
with input record "x=a:1;b:2;c:3,y=d" produces output records
"a=1,y=d"
"b=2,y=d"
"c=3,y=d"
mlr nest --explode --pairs --across-fields -f x
with input record "x=a:1;b:2;c:3,y=d" produces output records
"a=1,b=2,c=3,y=d"
Notes:
* With --pairs, --implode doesn't make sense since the original field name has
been lost.
* The combination "--implode --values --across-records" is non-streaming:
no output records are produced until all input records have been read. In
particular, this means it won't work in tail -f contexts. But all other flag
combinations result in streaming (tail -f friendly) data processing.
* It's up to you to ensure that the nested-fs is distinct from your data's IFS:
e.g. by default the former is semicolon and the latter is comma.
Not sure how I missed this but I've been trying out the latest version and the nest functionality works perfectly. This has really simplified my workflow. Thanks, @johnkerl!
Back at you @aaronwolen -- thanks for helping make Miller better with some great feature requests!!
Can we get another quoting option, "--quote-original"? Better spreadsheets detect quotes around numbers and import them as text instead of numbers without lots of hassle. Quote-original would keep the same quoting as the source document.
Starting with "FOO,BAR","1:30",1:40,1.50,"0616685"
Should not result in either of the following:
"FOO,BAR",1:30,1:40,1.50,0616685 -- 1:30 might be converted to a time when it should be text, and 0616685 will become numeric and lose the leading zero.
"FOO,BAR","1:30","1:40","1.50","0616685" -- 1:40 won't be converted to a time, and 1.50 will become text and won't behave as a dollar amount should.
With only a handful of columns I can manually specify text. I can't with hundreds of columns.
OK. There are two issues here: conversion or non-conversion of strings to other things (like integers), and preservation of double quotes.
For the latter issue, I can think about putting a was-quoted bit in the stream-record fields. Then, for fields which were present in the input data, quote them if that bit is set and if --quote-original is specified on the command line. For fields not present in the input data -- e.g. those computed using the put DSL -- I'm less interested in OR'ing or AND'ing the was-quoted bits of the fields. E.g. with mlr put '$c = $a . $b', if field a was quoted and b wasn't, should c be quoted or not?
For the former issue, Miller is already quite insistent on preserving text when it can. Verbs in general don't convert to numbers unless they need to -- e.g. stats1 -f fields, or sort -n. The exceptions are filter and put, which by default do type inference: for input x=123, field x will be parsed as an integer. But this can be suppressed using put -S and filter -S, where (for example) field x would have the string value "123" and you can cast using int($x) at your own choosing.
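A tiny sketch of that default inference order (an illustration only, not necessarily Miller's exact rules):

```python
def infer(value):
    """Try int, then float, then fall back to the original string --
    illustrating default type inference; -S would skip this entirely."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value

# note how "0616685" loses its leading zero once inferred as an integer
print(infer("123"), infer("1.5"), infer("0616685"), infer("A01"))
```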
@severach can you give me some example Miller commands in your use case(s), to help me see which of these two (or both) are at play here?
My report generator made a CSV file about 164MB. No spreadsheet can handle a file this large. My plan was to extract with mlr about 6 of the 50 or so fields of interest then use the spreadsheet to trim out unwanted records. This plan was halted when the quotes were removed from some of the numeric fields.
At 6 fields I could have told the spreadsheet which were numeric and which were text but this wouldn't be practical for a lot more fields or files sent to others who aren't aware of the need for keeping data types proper.
Did you try --quote-all? And can you copy/paste any Miller commands you tried?
Correction to the above: this plan was halted when quotes were removed from fields that must be text and that contain some numbers which are sure to be mangled by spreadsheets.
--quote-all is no good, as that would put quotes on the numeric values. Quotes on dollar amounts make them import as text: they don't justify right or add properly.
I'm using the --csv option. There's no appropriate sample commands since the quote options are performing exactly as specified. It's just that none work for me.
The idea here is that the spreadsheet recognizes quotes: it always imports quoted text as text, no matter the contents, and always imports unquoted values as money, counts, times, dates, and other numeric fields, only falling back to text when all attempts at recognizing some type of number have been exhausted. The generator has been designed to take advantage of this feature: it puts quotes on the fields that need to be text and leaves quotes off the fields that need to be numbers.
Imagine a field that contains both text and numbers, like a facility code: A01, 005, B01, 006.
Leaving the quotes out allows some values to come in as numbers. The leading zeros are destructively trimmed and the values turned into numbers that justify right. With enough data, even stuff that looks like text can get converted to some wacky number, because it looks like one of the many spreadsheet number formats.
OK, from "My plan was to extract with mlr about 6 of the 50 or so fields of interest ..." the command I'm guessing is mlr cut. Sounds like computing fields is a non-issue since you're just cutting out the ones you need, so we don't need to go down that road. Then a --quote-original combined with a per-field was-quoted-on-input flag would suffice to solve your problem.
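To make the was-quoted-bit idea concrete, here's a hypothetical Python sketch (an illustration, not Miller's C internals): parse each CSV field while remembering whether it was quoted, then re-emit with the original quoting intact:

```python
def parse_csv_line(line):
    """Split one CSV line, recording a was-quoted bit per field.
    Handles double-quoted fields with embedded commas and "" escapes.
    Sketch only: malformed input (e.g. an unterminated quote) isn't handled."""
    fields, i, n = [], 0, len(line)
    while i <= n:
        if i < n and line[i] == '"':
            i += 1
            buf = []
            while True:
                if line[i] == '"':
                    if i + 1 < n and line[i + 1] == '"':
                        buf.append('"')   # "" escape inside a quoted field
                        i += 2
                    else:
                        i += 1            # closing quote
                        break
                else:
                    buf.append(line[i])
                    i += 1
            fields.append(("".join(buf), True))
            i += 1  # step past the comma (or past end-of-line)
        else:
            j = line.find(",", i)
            if j == -1:
                fields.append((line[i:], False))
                break
            fields.append((line[i:j], False))
            i = j + 1
    return fields

def emit_csv_line(fields):
    """Re-emit fields, quoting exactly those that were quoted on input."""
    return ",".join('"%s"' % value.replace('"', '""') if quoted else value
                    for value, quoted in fields)

line = '"FOO,BAR","1:30",1:40,1.50,"0616685"'
print(emit_csv_line(parse_csv_line(line)))  # round-trips with quoting intact
```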
This sounds pretty easy -- not this week because of RL, but maybe next week I can draft something up?
Feel free to open a new task, or just add discussion as a comment here.