johnkerl / miller

Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
https://miller.readthedocs.io

miller evaluates all records even when not needed #1653

Open balki opened 2 weeks ago

balki commented 2 weeks ago

In the example below, only the first 5 records are needed, but system in put has run for all the records, as we can see in the tmp file.

❯ {rm /tmp/1; echo index; seq 10} | mlr --c2p  put '$v = system("echo hello; echo err >> /tmp/1")' then head -n 5; nl /tmp/1
index v
1     hello
2     hello
3     hello
4     hello
5     hello
     1  err
     2  err
     3  err
     4  err
     5  err
     6  err
     7  err
     8  err
     9  err
    10  err

When head is moved ahead of put, it works fine.

❯ {rm /tmp/1; echo index; seq 10} | mlr --c2p head -n 5 then put '$v = system("echo hello; echo err >> /tmp/1")' ; nl /tmp/1 
index v
1     hello
2     hello
3     hello
4     hello
5     hello
     1  err
     2  err
     3  err
     4  err
     5  err

It appears that each verb is run on all records before moving on to the next verb. Can miller be made lazy? I understand it will not be possible when stats/grouping is used. But for the simple case I thought it would work lazily.

johnkerl commented 2 weeks ago

There is indeed laziness and some early-out logic when head is in the verb list -- however there is some batching (default 500 rows at a time) which was necessary for performance in the port from C to Go ....

If we're getting readahead of over 500 records then that's a bug though ...

johnkerl commented 2 weeks ago

(In C it was record-at-a-time lazy ... in Go it's 500-records-at-a-time lazy ....)
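
To illustrate the difference between record-at-a-time and batch-at-a-time laziness described above, here is a minimal toy model in plain shell. It is not Miller's code and does not call mlr; the function name run_chain and its three arguments (batch size, head count, total input records) are made up for this sketch. It counts how many records a side-effecting stage (standing in for put with system) touches before a following head stage has what it needs, assuming the chain can only stop between batches.

```shell
# Toy model only: run_chain and its arguments are hypothetical, not Miller internals.
# Usage: run_chain BATCH_SIZE HEAD_N TOTAL
# Prints how many records the side-effecting stage processed.
run_chain() {
  batch_size=$1; head_n=$2; total=$3
  processed=0; kept=0; next=1

  # Keep pulling batches until input runs out or head is satisfied.
  while [ "$next" -le "$total" ] && [ "$kept" -lt "$head_n" ]; do
    # The "reader" hands over up to batch_size records at once.
    batch_end=$(( next + batch_size - 1 ))
    if [ "$batch_end" -gt "$total" ]; then batch_end=$total; fi

    # The "put" stage runs its side effect on every record in the batch.
    i=$next
    while [ "$i" -le "$batch_end" ]; do
      processed=$(( processed + 1 ))
      i=$(( i + 1 ))
    done

    # The "head" stage keeps only what it still needs from this batch.
    batch_len=$(( batch_end - next + 1 ))
    want=$(( head_n - kept ))
    if [ "$batch_len" -lt "$want" ]; then want=$batch_len; fi
    kept=$(( kept + want ))

    next=$(( batch_end + 1 ))
  done

  echo "$processed"
}

run_chain 500 5 10     # batch larger than the input: prints 10
run_chain 1 5 10       # record-at-a-time: prints 5
run_chain 500 5 2000   # stops after the first full batch: prints 500
```

Under this model, the 10 side effects in the original report are expected with a batch size of 500, since all 10 records fit in the first batch. It also shows why readahead should be bounded by roughly one batch on larger inputs.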

johnkerl commented 2 weeks ago

OTOH this looks odd to me:

❯ {rm /tmp/1; echo index; seq 10} | mlr --c2p head -n 5 then put '$v = system("echo hello; echo err >> /tmp/1")' ; nl /tmp/1 

🤔 👀

balki commented 2 weeks ago

> (In C it was record-at-a-time lazy ... in Go it's 500-records-at-a-time lazy ....)

Thanks for clarifying. Makes sense. I was running the commands below on my logs and found it took a long time (11 seconds) when head was used after put, but the other way around was instant. I think I should just move filter and head as early as possible.

❯ mlr --l2p --tz America/Toronto put '$ts = sec2localtime($ts); $cn = system(format("geoiplookup {} | grep Country", $request.remote_ip))' then filter '$status == 200' then flatten then cut -of ts,cn,request.remote_ip,request.uri then head caddy.log | wc -l
11

~/tmp/millerexp took 11s
❯ mlr --l2p --tz America/Toronto filter '$status == 200' then head then put '$ts = sec2localtime($ts); $cn = system(format("geoiplookup {} | grep Country", $request.remote_ip))' then filter '$status == 200' then flatten then cut -of ts,cn,request.remote_ip,request.uri caddy.log | wc -l
11

johnkerl commented 1 week ago

> it took a long time (11 seconds) when head was used after put but the other way was instant

@balki this needs fixing for sure.