h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

awk #160

Open JohannesBuchner opened 3 years ago

JohannesBuchner commented 3 years ago

awk is a small DSL that can parse text relatively quickly. It is installed by default on many Unix-like systems, requires little code, and is easy to integrate into shell pipelines.

I placed some solutions for the groupby questions here: https://gist.github.com/JohannesBuchner/442e09b7c77c7150a4885c715eb17e6b Some of them may be correct.

mawk used to be faster than gawk; I am not sure the difference is still significant.

The median-related question has sorting in the solution, which can be parallelized. Not sure if there is a more elegant solution.
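As a rough illustration of the sort-based approach (not necessarily what the gist does), the sketch below sorts rows by group key and then by value, so awk only has to hold one group's values at a time. The file layout (a header row, key in column 1, numeric value in column 2), the function name `median_by_group`, and the awk program are all assumptions made for the example. With GNU sort, the sorting step can additionally use the `--parallel` option.

```python
import os
import subprocess

# Hypothetical awk program: expects input sorted by key, then by value.
AWK_MEDIAN = """
function emit() {
    if (n == 0) return
    m = (n % 2) ? v[(n + 1) / 2] : (v[n / 2] + v[n / 2 + 1]) / 2
    print key "," m
}
{ k = $1 "" }                        # force string comparison of keys
NR == 1 || k != key { emit(); key = k; n = 0 }
{ v[++n] = $2 }                      # collect values of the current group
END { emit() }
"""

def median_by_group(path: str) -> str:
    """Return 'key,median' lines for a CSV with a header row."""
    env = {**os.environ, "LC_ALL": "C"}  # byte-wise ordering keeps equal keys adjacent
    # Drop the header, then sort by key (column 1) and numerically by value (column 2).
    tail = subprocess.Popen(["tail", "-n", "+2", path], stdout=subprocess.PIPE)
    sort = subprocess.Popen(
        ["sort", "-t", ",", "-k1,1", "-k2,2n"],
        stdin=tail.stdout, stdout=subprocess.PIPE, env=env,
    )
    tail.stdout.close()  # let tail receive SIGPIPE if sort exits early
    out = subprocess.check_output(
        ["awk", "-F", ",", AWK_MEDIAN], stdin=sort.stdout, text=True
    )
    sort.stdout.close()
    tail.wait()
    sort.wait()
    return out
```

Only the sort needs to see the whole dataset; the awk pass is streaming and keeps a single group in memory.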

JohannesBuchner commented 3 years ago

This should work OK for very large datasets, in particular those much larger than RAM.

jangorecki commented 3 years ago

Thank you, I will try it out. AFAIU it prints the result to stdout. What is the best way to capture it in an in-memory variable? Piping into a file on a RAM disk? Also, the last question should include a count by group, not just the sum.

JohannesBuchner commented 3 years ago

Not sure I understand; stdout is in RAM. If you want to store it in a Python program, perhaps subprocess.check_output is easiest.
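A minimal sketch of that suggestion, assuming a comma-separated file with a header row, a group key in column 1, and a numeric value in column 2. The awk program and the helper name `awk_capture` are made up for illustration and are not taken from the gist.

```python
import subprocess

# Hypothetical query: sum column 2 per distinct value of column 1, skipping the header.
AWK_SUM_BY_KEY = 'NR > 1 { s[$1] += $2 } END { for (k in s) print k "," s[k] }'

def awk_capture(program: str, path: str) -> str:
    """Run awk on a comma-separated file and return everything it printed."""
    return subprocess.check_output(["awk", "-F", ",", program, path], text=True)
```

Calling `awk_capture(AWK_SUM_BY_KEY, "data.csv")` would return the whole result as one string held in memory; the order of the groups is whatever awk's `for (k in s)` loop produces, which is unspecified.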

JohannesBuchner commented 3 years ago

Updated the last command to include count.

JohannesBuchner commented 3 years ago

For very large responses, perhaps reading with a pipe (also possible with subprocess) is useful, to avoid using much memory.
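Reading through a pipe could look roughly like the following sketch, which yields awk's output one line at a time instead of collecting it all first. The helper name `awk_stream` and the comma-separated input are assumptions for the example.

```python
import subprocess
from typing import Iterator

def awk_stream(program: str, path: str) -> Iterator[str]:
    """Yield awk's output line by line so the full result is never held in memory."""
    with subprocess.Popen(
        ["awk", "-F", ",", program, path], stdout=subprocess.PIPE, text=True
    ) as proc:
        assert proc.stdout is not None
        for line in proc.stdout:
            yield line.rstrip("\n")
    # Leaving the with-block waits for awk, so the exit status is available here.
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, proc.args)
```

A caller can then process rows as they arrive, for example `for row in awk_stream('NR > 1 { print $2 }', "data.csv"): ...`.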

jangorecki commented 3 years ago

The problem is that printing to the console adds overhead, so piping the output into a file should be preferred.

jangorecki commented 3 years ago

Also, every command reads the data from disk, which is another overhead that should be reduced. Ideally, the data would be read once and all commands then run in sequence, each query producing its own output file.
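One way to approximate "read once, write one file per query" is to let a single awk program accumulate the state for several queries in one pass and redirect each result to its own file. This is only a sketch: the two queries (sum and count of column 2 by column 1), the variable names `sum_file` and `count_file`, and the function name are assumptions, not the benchmark's actual questions.

```python
import subprocess

# Hypothetical awk program answering two queries in a single pass over the input.
# sum_file and count_file are awk variables supplied with -v.
AWK_TWO_QUERIES = """
NR > 1 { s[$1] += $2; n[$1] += 1 }
END {
    for (k in s) print k "," s[k] > sum_file
    for (k in n) print k "," n[k] > count_file
}
"""

def run_queries_single_pass(path: str, sum_file: str, count_file: str) -> None:
    """Read the CSV once and write each query's result to its own file."""
    subprocess.run(
        [
            "awk", "-F", ",",
            "-v", f"sum_file={sum_file}",
            "-v", f"count_file={count_file}",
            AWK_TWO_QUERIES, path,
        ],
        check=True,
    )
```

Note that this measures all queries together, so per-query timings would no longer be separable. Values passed with `-v` undergo escape processing in awk, so output paths containing backslashes would need extra care.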

JohannesBuchner commented 3 years ago

OK, if you want to remove the I/O time, RAM disks are probably a good solution.

JohannesBuchner commented 3 years ago

I am not sure whether you want to look at the output or not. If not, then you can pipe it to /dev/null, which will avoid the console printing overhead.
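If only the run time matters, the same effect as redirecting to /dev/null can be had from Python with `subprocess.DEVNULL`. The sketch below times one awk invocation while discarding its output; the function name is hypothetical, and the measured wall time includes process start-up and reading the input file.

```python
import subprocess
import time

def time_awk_discarding_output(program: str, path: str) -> float:
    """Return the wall-clock seconds awk needs when its stdout goes to /dev/null."""
    start = time.perf_counter()
    subprocess.run(
        ["awk", "-F", ",", program, path],
        stdout=subprocess.DEVNULL,  # same effect as `> /dev/null` in a shell
        check=True,
    )
    return time.perf_counter() - start
```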

jangorecki commented 3 years ago

Any idea if this is the most recent version? https://github.com/ploxiln/mawk-2

JohannesBuchner commented 3 years ago

I simply installed the Ubuntu package, which is mawk 1.3.3.