johnkerl / miller

Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
https://miller.readthedocs.io

DSL parsing of map*() broken #1025

Open gromgit opened 2 years ago

gromgit commented 2 years ago

I was trying out one of the emit example commands from the documentation, and ran into an unexpected parse failure:

$ head -5 f.dat
a,date,open,high,low,close,volume,wap,bid,ask,status,currency,market
1,2022-05-12,54.200,54.500,53.400,53.550,3404922,-1.000,53.550,53.600,,HKD,XHKG
2,2022-05-12,75.900,76.100,75.600,75.750,2063001,-1.000,75.750,75.800,,HKD,XHKG
3,2022-05-12,8.400,8.430,8.340,8.350,14994100,-1.000,8.350,8.360,,HKD,XHKG
4,2022-05-12,21.500,21.500,20.700,20.700,1614083,-1.000,20.650,20.700,,HKD,XHKG

$ mlr --from /tmp/f.dat put 'emit >  "/tmp/data-".$a, mapexcept($*, "a")'
mlr: cannot parse DSL expression.
Parse error on token "(" at line 1 column 35.
Please check for missing semicolon.
Expected one of:
  $ ; ,

As far as I know, the DSL expression is syntactically correct, so I'm assuming it's a parser bug rather than a documentation issue. I also get a parse error if I substitute mapselect in the above expression.

johnkerl commented 2 years ago

Hi @gromgit --

TL;DR assign the mapexcept bit to a temp variable and emit that:

$ mlr --csv --from f.dat put 'temp = mapexcept($*, "a"); emit >  "/tmp/data-".$a, temp'
a,date,open,high,low,close,volume,wap,bid,ask,status,currency,market
1,2022-05-12,54.200,54.500,53.400,53.550,3404922,-1.000,53.550,53.600,,HKD,XHKG
2,2022-05-12,75.900,76.100,75.600,75.750,2063001,-1.000,75.750,75.800,,HKD,XHKG
3,2022-05-12,8.400,8.430,8.340,8.350,14994100,-1.000,8.350,8.360,,HKD,XHKG
4,2022-05-12,21.500,21.500,20.700,20.700,1614083,-1.000,20.650,20.700,,HKD,XHKG
######################################################## /tmp/data-
1=a,2=date,3=open,4=high,5=low,6=close,7=volume,8=wap,9=bid,10=ask,11=status,12=currency,13=market
1=1,2=2022-05-12,3=54.200,4=54.500,5=53.400,6=53.550,7=3404922,8=-1.000,9=53.550,10=53.600,12=HKD,13=XHKG
1=2,2=2022-05-12,3=75.900,4=76.100,5=75.600,6=75.750,7=2063001,8=-1.000,9=75.750,10=75.800,12=HKD,13=XHKG
1=3,2=2022-05-12,3=8.400,4=8.430,5=8.340,6=8.350,7=14994100,8=-1.000,9=8.350,10=8.360,12=HKD,13=XHKG
1=4,2=2022-05-12,3=21.500,4=21.500,5=20.700,6=20.700,7=1614083,8=-1.000,9=20.650,10=20.700,12=HKD,13=XHKG

######################################################## /tmp/data-1
date,open,high,low,close,volume,wap,bid,ask,status,currency,market
2022-05-12,54.200,54.500,53.400,53.550,3404922,-1.000,53.550,53.600,,HKD,XHKG

######################################################## /tmp/data-2
date,open,high,low,close,volume,wap,bid,ask,status,currency,market
2022-05-12,75.900,76.100,75.600,75.750,2063001,-1.000,75.750,75.800,,HKD,XHKG

######################################################## /tmp/data-3
date,open,high,low,close,volume,wap,bid,ask,status,currency,market
2022-05-12,8.400,8.430,8.340,8.350,14994100,-1.000,8.350,8.360,,HKD,XHKG

######################################################## /tmp/data-4
date,open,high,low,close,volume,wap,bid,ask,status,currency,market
2022-05-12,21.500,21.500,20.700,20.700,1614083,-1.000,20.650,20.700,,HKD,XHKG

The longer reason is here:

https://miller.readthedocs.io/en/latest/reference-dsl-output-statements/#emit1-and-emitemitpemitf

This is ultimately because when I was first creating Miller -- & emit was there from the start, before local variables, or for-loops, or any of these relatively more powerful syntaxes -- I packed a lot of syntax (too much) into emit. And I did it as a keyword, not as a function.

What most parsers (Miller's included) do is have a "lookahead of one symbol" -- LR(1) being the jargon. So after the emit keyword, the 'what comes next' and the 'one ahead of that' need to be unambiguous.

Since emit is a keyword, with no parentheses, and I added the ability to emit multiple oosvars, possibly indexed, etc., there are too many possibilities for the parser to handle with regard to parentheses, commas, etc.

In Miller 5 there were LR(1) reduce-reduce conflicts, and I understood less then. I somehow got the Lemon parser to handle them with a rule like "accept conflicts by using first-found rule", which was a huge hack.

In Miller 6, being a little less clueless about parsing, I allowed no shift-reduce or reduce-reduce conflicts in the grammar.

See also https://miller.readthedocs.io/en/latest/new-in-miller-6/#emit-statements

The result is what we have at https://miller.readthedocs.io/en/latest/reference-dsl-output-statements/#emit1-and-emitemitpemitf -- namely that you can use emit1 to put the grammatical complexity in the emittable & the keys, or, use a temp variable (a syntactically simpler emittable) with emit.
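
For comparison with the temp-variable form shown earlier, here is a minimal sketch of the emit1 route. It assumes Miller 6 is installed as `mlr`; `sample.csv` is a hypothetical file holding a trimmed-down version of the data above. As far as I understand the linked docs, emit1 accepts a function call as its emittable but does not take a redirect, so the records go to standard output rather than to per-key files.

```shell
# Sketch only: assumes Miller 6 (`mlr`) is on PATH.
# sample.csv is a hypothetical, trimmed-down copy of the data in this thread.
cat > sample.csv <<'EOF'
a,date,close
1,2022-05-12,53.550
2,2022-05-12,75.750
EOF

if command -v mlr >/dev/null 2>&1; then
  # emit1 takes the map-valued function call directly, with no temp variable.
  # -q suppresses the original records so only the emitted ones are printed.
  mlr --csv --from sample.csv put -q 'emit1 mapexcept($*, "a")'
else
  echo "mlr not found; skipping the emit1 demo" >&2
fi
```

If the output has to go to a file per value of `a`, the temp-variable form with `emit > ...` remains the one to use.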

Beyond this temp-variable workaround, the question is what to do now to get syntactic support for all the richness of emittables, keys, and redirects in one expression, in a way that's LR(1)-parseable.

Given that the emit syntax was, in hindsight, very poorly thought out, really the best I can do moving forward is make (yikes!) yet another pair of emit variants -- emitv2 and emitpv2 -- which would have the syntactic structure that emit and emitp should have had all along. Namely:

emit([@var1, @var2], ["key1", "key2"]);
emit([@var1, @var2], ["key1", "key2"]) > "/tmp/data-".$a;

johnkerl commented 2 years ago

@gromgit also I'll update the docs to use the temp-var workaround

gromgit commented 2 years ago

Thanks much for the explanation and workaround, @johnkerl!