quixdb / squash

Compression abstraction library and utilities
https://quixdb.github.io/squash/
MIT License
405 stars 53 forks

Pareto frontier? #230

Open bknowles opened 7 years ago

bknowles commented 7 years ago

I learned about Squash from http://jdlm.info/articles/2017/05/01/compression-pareto-docker-gnuplot.html. One of the key things that article brings to the table is reducing the amount of data to be charted and graphed: it eliminates every point for which some other point is both faster and achieves a smaller compressed size.

As your corpus grows, and as you add more compression algorithms to test against, I think it's going to become more and more important to reduce the amount of data you present in your graphs. So I think this principle will be very useful to you.
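For concreteness, the filter described above can be sketched in a few lines. This is a hypothetical illustration, not Squash's actual benchmark code; the field names (`codec`, `speed`, `ratio`) are assumptions:

```python
from typing import NamedTuple, List

class Result(NamedTuple):
    codec: str
    speed: float  # e.g. MB/s; higher is better
    ratio: float  # compression ratio; higher is better

def pareto_frontier(results: List[Result]) -> List[Result]:
    """Keep only results not dominated on (speed, ratio).

    A result is dropped when some other result is both faster and
    compresses better.
    """
    # Walk points from fastest to slowest; a point survives only if
    # its ratio beats the best ratio seen among all faster points.
    frontier = []
    best_ratio = float("-inf")
    for r in sorted(results, key=lambda r: r.speed, reverse=True):
        if r.ratio > best_ratio:
            frontier.append(r)
            best_ratio = r.ratio
    return frontier

data = [
    Result("fast", 400.0, 2.1),
    Result("balanced", 120.0, 2.9),
    Result("dominated", 100.0, 2.5),  # slower AND worse than "balanced"
    Result("dense", 15.0, 4.2),
]
print([r.codec for r in pareto_frontier(data)])  # → ['fast', 'balanced', 'dense']
```

The sort-then-scan approach is O(n log n), which is plenty for a few thousand benchmark points.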

nemequ commented 7 years ago

Did you see the Optimal Codecs chart, which is pretty much what you describe?

As for the other charts, I think it's going to be a long time before this becomes a problem; see Rich Geldreich's amazing charts for an example with a lot of data points that are still quite useful.

Execution time of the benchmark is a limiting factor here; it already takes around 24 hours to run on a fast machine, and over a week on some of the slower machines. I want to keep adding more codecs to Squash, but I plan to reduce the amount of data in the main benchmark (by replacing its corpus with the Squash Compression Corpus when it's ready). Additional data should go in its own benchmark.

bknowles commented 7 years ago

I had seen the Optimal Codecs chart, but while each data point identifies which compression algorithm was used and what compression/decompression speed was achieved, what gets lost in that chart is how the selected few "best" algorithms compare at each data point: all codecs get munged together on the blue line, the green line, or the black line.

I know you have to slice the data in multiple different ways, and this chart is just one way to look at it. But I found the Pareto frontier to be a particularly interesting concept that helped reduce the complexity of the charts at http://jdlm.info/articles/2017/05/01/compression-pareto-docker-gnuplot.html, and I was thinking it might help in your case too, since you're comparing considerably more compression algorithms and compression settings.

Anyway, it was just a thought. If it's not helpful to you, then please feel free to close this issue without further comment.

Thanks again!

nemequ commented 7 years ago

You're right about everything getting munged together; it would be very nice to provide a visual indication of which codec each point represents without having to hover over the point in question. I'm not sure how to do that with Highcharts, though…

I don't see a lot of benefit in showing only Pareto-frontier results in the other charts, though, since you can get the same information by looking for points with no other point better on both axes. I'm hardly a data visualization expert, though.

One thing that might be interesting is a way to only show results on the Pareto frontier for other variables; for example, a version of the "compression ratio vs. compression speed" chart which only showed items which are on the Pareto frontier of "compression ratio vs. decompression speed".
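That cross-filtering idea could be sketched roughly like this. Again, this is a hypothetical illustration with made-up field names, not anything from the Squash benchmark itself: select the configurations on the frontier of one axis pair, then plot them on a different axis pair:

```python
from typing import NamedTuple, List, Tuple

class Run(NamedTuple):
    codec: str
    compress_speed: float    # MB/s; higher is better
    decompress_speed: float  # MB/s; higher is better
    ratio: float             # compression ratio; higher is better

def frontier_on(runs: List[Run], x: str, y: str) -> List[Run]:
    """Runs not dominated on the two named higher-is-better fields."""
    keep = []
    best_y = float("-inf")
    # Fastest-first on x; survive only by improving on y.
    for r in sorted(runs, key=lambda r: getattr(r, x), reverse=True):
        if getattr(r, y) > best_y:
            keep.append(r)
            best_y = getattr(r, y)
    return keep

def compression_chart_points(runs: List[Run]) -> List[Tuple[float, float, str]]:
    """Points for a 'ratio vs. compression speed' chart, restricted to
    the Pareto frontier of 'ratio vs. decompression speed'."""
    selected = frontier_on(runs, "decompress_speed", "ratio")
    return [(r.compress_speed, r.ratio, r.codec) for r in selected]

runs = [
    Run("a", 100.0, 900.0, 2.0),
    Run("b", 50.0, 400.0, 3.0),
    Run("c", 80.0, 300.0, 2.2),  # dominated by "b" on (decompress, ratio)
]
print([p[2] for p in compression_chart_points(runs)])  # → ['a', 'b']
```

Note that "c" is dropped from the compression-speed chart even though it compresses faster than "b", because the filter is applied on the decompression axes.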