voltrondata-labs / arrowbench

R package for benchmarking

Add option to drop disk caches before running a benchmark #134

Closed · alistaire47 closed this 1 year ago

alistaire47 commented 1 year ago

Closes #126. The implementation is based on the one in the conbench runner, meaning it will attempt to drop disk caches, but if that fails it will (a) not error and (b) not try again (in Python this is tracked by setting an attribute, which I translated to R as an option()).
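
For illustration, here is a minimal R sketch of that try-once-then-give-up behavior. The helper name, option name, and shell command are assumptions for this example (not the exact arrowbench or conbench code), and it assumes a Linux host where sudo works non-interactively:

```r
# Hypothetical sketch: try to sync and drop the OS page cache once; if the
# attempt fails, record that in an option() so later calls silently no-op.
sync_and_drop_caches <- function() {
  if (isTRUE(getOption("arrowbench.drop_caches_failed"))) {
    return(invisible(FALSE))  # a previous attempt failed; don't try again
  }
  status <- suppressWarnings(system(
    "sync && echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null",
    ignore.stdout = TRUE, ignore.stderr = TRUE
  ))
  if (status != 0L) {
    # (a) don't error, (b) don't try again on subsequent calls
    options(arrowbench.drop_caches_failed = TRUE)
    return(invisible(FALSE))
  }
  invisible(TRUE)
}
```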

Resetting is in run_one, so if drop_caches is set to TRUE, they will be cleared once per case (not per benchmark-code or iteration). This is different from the conbench runner, which drops caches in each iteration. I can change this here if we like (though it will look more like how we handle profiling and less like other global options), but from previous discussion this seems like a thing we haven't quite figured out the best way to handle yet, and per-case is very much on the table so we could ignore warmup iterations. If we want to enable both and move beyond boolean, we could do that too. Opinions welcome!
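
As a rough sketch of what per-case placement means (simplified and hypothetical, reusing the sync_and_drop_caches() helper sketched above; the real run_one() signature has more to it):

```r
# Hypothetical, simplified shape of run_one() with per-case cache dropping:
# caches are cleared once before a case starts, not before every iteration.
run_one <- function(bm, case_params, n_iter = 1L, drop_caches = FALSE) {
  if (isTRUE(drop_caches)) {
    sync_and_drop_caches()
  }
  lapply(seq_len(n_iter), function(i) {
    # ... run and time one iteration of this case ...
  })
}
```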

The drop_caches global option is inserted in result.optional_benchmark_info.options, but not in result.tags, so this will not break any histories. Arguably it should, but presently cpu_count is the only global option that is (although the way we're running right now it's always null anyway); lib_path and mem_alloc are not. Because we're not actually dropping caches anywhere (Python, R), it doesn't seem like it's worth breaking this now. If we want to set any of these global options differently (or run more than one value), we should make a separate story to refactor. Or maybe just make them not-global parameters of the particular benchmarks where they matter; options and global_options are only runner things, not Conbench things.
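
To make that placement concrete, roughly (field names follow the description above; this is illustrative, not a verified result schema):

```r
# Where the new global option is (and is not) recorded on a result:
result$optional_benchmark_info$options$drop_caches  # recorded here with the other run options
result$tags$drop_caches                             # not set, so histories keyed on tags are unaffected
```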

jonkeane commented 1 year ago

> Resetting is in run_one, so if drop_caches is set to TRUE, they will be cleared once per case (not per benchmark-code or iteration). This is different from the conbench runner, which drops caches in each iteration. I can change this here if we like (though it will look more like how we handle profiling and less like other global options), but from previous discussion this seems like a thing we haven't quite figured out the best way to handle yet, and per-case is very much on the table so we could ignore warmup iterations. If we want to enable both and move beyond boolean, we could do that too. Opinions welcome!

I think we should implement this at both the case and the iteration level (though this can be exclusive to one layer or the other, or off totally — I can't foresee a circumstance where we'll need to clear the cache both at the case level and at the iteration level!). For today we want to mimic what we are already running (not just for history, but also because there is benefit in matching what Python is doing, and it will defer having to figure out the answer to warmup run differences). In the (hopefully near-term!) future, I would love to toggle this (in both R and Python) to be at the case level, experiment with a few runs, and investigate what would be needed to support running these like this generally. I suspect this will include some expansion of what we're measuring (mean, min, max, median, cf. #640) — that's great, and we should do it at some point, but having the option to run a real experiment with cache clearing only at the case level, analyze the data, and make test cases for changes or expansion to our metrics would be really fantastic (as opposed to needing to implement all of them right now as quickly as possible to unblock this).

> Because we're not actually dropping caches anywhere (Python, R), it doesn't seem like it's worth breaking this now. If we want to set any of these global options differently (or run more than one value), we should make a separate story to refactor. Or maybe just make them not-global parameters of the particular benchmarks where they matter; options and global_options are only runner things, not Conbench things.

I was under the impression that we did drop caches for both Python and R. Looking at labs/benchmarks and the json there: https://github.com/voltrondata-labs/benchmarks/blob/5ea34d76951be9a323683344c5233310eb867908/benchmarks.json#L9

Which should trigger for Python: https://github.com/conbench/conbench/blob/175bc404b2f39f1518efef8e33a20848b4c4bac5/conbench/runner.py#L350 and for R: https://github.com/voltrondata-labs/benchmarks/blob/5ea34d76951be9a323683344c5233310eb867908/benchmarks/_benchmark.py#L253

That this is confusing (and spans as much of our stack as it does!) is exactly the kind of cleanup that I'm looking forward to with our use-the-arrowbench-runner-to-run-arrowbench project that this is part of 😄!

alistaire47 commented 1 year ago

> Because we're not actually dropping caches anywhere (Python, R), it doesn't seem like it's worth breaking this now. If we want to set any of these global options differently (or run more than one value), we should make a separate story to refactor. Or maybe just make them not-global parameters of the particular benchmarks where they matter; options and global_options are only runner things, not Conbench things.
>
> I was under the impression that we did drop caches for both Python and R. Looking at labs/benchmarks and the json there: https://github.com/voltrondata-labs/benchmarks/blob/5ea34d76951be9a323683344c5233310eb867908/benchmarks.json#L9
>
> Which should trigger for Python: https://github.com/conbench/conbench/blob/175bc404b2f39f1518efef8e33a20848b4c4bac5/conbench/runner.py#L350 and for R: https://github.com/voltrondata-labs/benchmarks/blob/5ea34d76951be9a323683344c5233310eb867908/benchmarks/_benchmark.py#L253

Oh interesting. And that makes me realize that even though we're using run_one() right now, we're really making a separate call for each iteration—otherwise the clearing would be at the case level instead of the iteration level.

> I think we should implement this at both the case and the iteration level (though this can be exclusive to one layer or the other, or off totally — I can't foresee a circumstance where we'll need to clear the cache both at the case level and at the iteration level!). For today we want to mimic what we are already running (not just for history, but also because there is benefit in matching what Python is doing, and it will defer having to figure out the answer to warmup run differences). In the (hopefully near-term!) future, I would love to toggle this (in both R and Python) to be at the case level, experiment with a few runs, and investigate what would be needed to support running these like this generally. I suspect this will include some expansion of what we're measuring (mean, min, max, median, cf. #640) — that's great, and we should do it at some point, but having the option to run a real experiment with cache clearing only at the case level, analyze the data, and make test cases for changes or expansion to our metrics would be really fantastic (as opposed to needing to implement all of them right now as quickly as possible to unblock this).

I think for now I can switch it from logical to categorical to cover all three cases (no clearing, at the case level, at the iteration level) reasonably easily. Longer-term, I'm increasingly convinced cache clearing should be part of the benchmark, i.e. if you want it, call (parameterized, if you like) sync_and_drop_caches() directly in setup() (for case-level) or before_each() (for iteration-level) to use benchrun structure (arrowbench for now has setup() but not before_each()).
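
A rough sketch of what the categorical version could look like (the helper name and option plumbing are illustrative, not the final API):

```r
# Hypothetical helper: drop caches only when the configured level matches the
# level we are currently at; NULL (the default) means never drop.
maybe_drop_caches <- function(level,
                              drop_caches = getOption("arrowbench.drop_caches", NULL)) {
  stopifnot(is.null(drop_caches) || drop_caches %in% c("case", "iteration"))
  if (identical(drop_caches, level)) {
    sync_and_drop_caches()
  }
  invisible(NULL)
}

# e.g. maybe_drop_caches("case") once in setup(), and
#      maybe_drop_caches("iteration") in before_each() once that exists
```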

jonkeane commented 1 year ago

> I think for now I can switch it from logical to categorical to cover all three cases (no clearing, at the case level, at the iteration level) reasonably easily. Longer-term, I'm increasingly convinced cache clearing should be part of the benchmark, i.e. if you want it, call (parameterized, if you like) sync_and_drop_caches() directly in setup() (for case-level) or before_each() (for iteration-level) to use benchrun structure (arrowbench for now has setup() but not before_each()).

Yeah, that sounds right. In an ideal world we would supply a micro-binary that just does this cleaning, has no install drama, and just works (so I think it simply cannot be Python, unfortunately). But for now copy | pasting these shell commands isn't so bad.

Parameterizing this will allow us to do the testing we need to gain confidence in what the right setting is (for our benchmarks). My heart + gut says it should be dropped only per case, and that we should do good + smart things about the inter-iteration differences that doing so might cause.

alistaire47 commented 1 year ago

> I think for now I can switch it from logical to categorical to cover all three cases (no clearing, at the case level, at the iteration level) reasonably easily. Longer-term, I'm increasingly convinced cache clearing should be part of the benchmark, i.e. if you want it, call (parameterized, if you like) sync_and_drop_caches() directly in setup() (for case-level) or before_each() (for iteration-level) to use benchrun structure (arrowbench for now has setup() but not before_each()).
>
> Yeah, that sounds right. In an ideal world we would supply a micro-binary that just does this cleaning, has no install drama, and just works (so I think it simply cannot be Python, unfortunately). But for now copy | pasting these shell commands isn't so bad.

To be clear, I'm advocating still keeping a sync_and_drop_caches() util that benchmarks can call. If that calls a microbinary instead of reimplementing the same thing, that's cool. But I don't want each benchmark copypasta-ing the body of that function.

jonkeane commented 1 year ago

> To be clear, I'm advocating still keeping a sync_and_drop_caches() util that benchmarks can call. If that calls a microbinary instead of reimplementing the same thing, that's cool.

*nods* Yeah, if that's the case let's make something that doesn't have the drama needed to get it working (anything we've tried so far with Python ends up having installation scripts | functions that are longer than the body of the copy-pastaed version!). I'm absolutely no fan of copy-pasta, but we can't keep bashing our heads on install drama that blocks us from getting done what we need to. But also, once we have this here, we'll be fine for a little while with Python and R sharing the same thing. We can defer building the cache dropper in Go or Rust or something that gives us better isolation + easy distributability for a bit (and possibly forever, depending on whether what we're benchmarking needs that or whether it would even be helpful at all).

alistaire47 commented 1 year ago

Ok, refactored so drop_caches can be set to NULL (the default), meaning do nothing, "case", or "iteration". This is different from the same parameter in labs/benchmarks and the conbench runner (where it's boolean and TRUE means per-iteration), but it can reproduce that functionality and the differences are additive. Left the default as NULL both because it's like the other two and because it sounds more in line with our longer-term plans. We'll need to set drop_caches = "iteration" in https://github.com/voltrondata-labs/arrow-benchmarks-ci/pull/99 to reproduce the current behavior.
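
For example, assuming drop_caches is passed like the other global options (the exact call shape may differ from the final code, and bm stands in for some benchmark object):

```r
run_benchmark(bm, drop_caches = "iteration")  # reproduces the current conbench-runner behavior (boolean TRUE)
run_benchmark(bm, drop_caches = "case")       # clear once per case, before warmup and timed iterations
run_benchmark(bm)                             # default NULL: never drop caches
```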

alistaire47 commented 1 year ago

Alright, ensured global options like drop_caches get passed through from run.BenchmarkDataFrame() and populate the correct places on the results. This stuff is quite a mess, but it's not clear how to make it less so without rebuilding a lot and/or breaking things, so for now I've just tried to keep it all working and not increase the complication.

Also discovered and patched a related bug breaking run.BenchmarkDataFrame() when specifying n_iter: n_iter can be double-specified in run_benchmark() (once in params and once in n_iter), so now a value in params will take precedence. (In theory this lets you run different numbers of iterations for different cases, which is nice. In practice, it unbreaks run.BenchmarkDataFrame() without changing get_default_parameters() significantly, which would have broader implications.)
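
A minimal sketch of the precedence rule described here (not the actual diff; the helper is hypothetical):

```r
# If the case parameters already carry an n_iter value, prefer it over the
# top-level n_iter argument so the two specifications can't conflict.
resolve_n_iter <- function(params, n_iter) {
  if (!is.null(params[["n_iter"]])) params[["n_iter"]] else n_iter
}

resolve_n_iter(list(n_iter = 5L), n_iter = 1L)  # 5
resolve_n_iter(list(),            n_iter = 1L)  # 1
```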