voltrondata-labs / benchmarks

Language-independent Continuous Benchmarking (CB) for Apache Arrow
MIT License

Fix cpu_count handling for R benchmarks #130

Open alistaire47 opened 1 year ago

alistaire47 commented 1 year ago

Currently for R benchmarks, this repo passes cpu_count = NULL to run_one() (code), which then does not set the number of CPUs or threads anywhere (it omits that part of the script it creates). When run through higher-level arrowbench interfaces, cpu_count = NULL gets translated by get_default_parameters() to c(1L, parallel::detectCores()). That expands to two cases, which would be a problem for run_one(), since it is meant to run a single case.
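To make the mismatch above concrete, here is a minimal Python sketch of the two behaviors. It is not the arrowbench or benchmarks code: `expand_default_cpu_counts()` is a hypothetical stand-in for the expansion that get_default_parameters() is described as doing, and `resolve_single_cpu_count()` is a hypothetical helper showing one way the caller could pick a single explicit value before handing it to run_one(). `os.cpu_count()` stands in for parallel::detectCores(); the two may not agree on every machine.

```python
import os


def expand_default_cpu_counts(cpu_count, detected_cores=None):
    """Hypothetical stand-in for the default expansion described above.

    None becomes two cases, [1, n_cores]; an explicit value stays one case.
    """
    if detected_cores is None:
        detected_cores = os.cpu_count() or 1
    if cpu_count is None:
        return [1, detected_cores]
    return [int(cpu_count)]


def resolve_single_cpu_count(cpu_count, detected_cores=None):
    """Hypothetical helper: always return exactly one explicit value.

    None resolves to the detected core count, which matches the implicit
    default the benchmarks are running with today.
    """
    if detected_cores is None:
        detected_cores = os.cpu_count() or 1
    return detected_cores if cpu_count is None else int(cpu_count)


if __name__ == "__main__":
    # The expansion yields two cases for None, which a single-case runner
    # cannot accept; the resolver yields one value that can also be tagged.
    print(expand_default_cpu_counts(None, detected_cores=8))
    print(resolve_single_cpu_count(None, detected_cores=8))
```

Resolving to one value on the caller's side would keep run_one() single-case and give a concrete number to record in tags, at the cost of this repo having to know the machine's core count itself.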

In practice, not calling arrow:::SetCpuThreadPoolCapacity() means we're running with the default, which is the number of cores on the machine (pyarrow.cpu_count()). We should move to setting this explicitly and recording the value in tags. Right now the cpu_count key is in tags, but the value is empty. Changing this will break histories, but we should be able to adjust old records based on machine_info.cpu_core_count or machine_info.cpu_thread_count (I'm not exactly sure which we want, but they may not differ for any of the machines we're running on anyway).
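The cleanup of old records could look roughly like the sketch below. It assumes a record is a plain dict with `tags` and `machine_info` keys and that machine_info values may be stored as strings; the real database schema may differ, and `backfill_cpu_count()` is a hypothetical name. Which machine_info field to use is left as a parameter, since that choice is still open.

```python
import copy


def backfill_cpu_count(record, source_key="cpu_core_count"):
    """Return a copy of `record` with an empty tags.cpu_count filled in.

    Assumed record shape (not confirmed against the real schema):
        {"tags": {"cpu_count": None, ...},
         "machine_info": {"cpu_core_count": "8", "cpu_thread_count": "16"}}
    Records that already carry a cpu_count, or that lack the source field,
    are returned unchanged.
    """
    tags = record.get("tags", {})
    if tags.get("cpu_count") not in (None, ""):
        return record  # already has an explicit value; leave it alone

    raw = record.get("machine_info", {}).get(source_key)
    if raw in (None, ""):
        return record  # nothing to backfill from

    updated = copy.deepcopy(record)  # do not mutate the caller's record
    updated.setdefault("tags", {})["cpu_count"] = int(raw)
    return updated


if __name__ == "__main__":
    old = {
        "tags": {"name": "example", "cpu_count": None},
        "machine_info": {"cpu_core_count": "8", "cpu_thread_count": "16"},
    }
    print(backfill_cpu_count(old)["tags"])
    print(backfill_cpu_count(old, source_key="cpu_thread_count")["tags"])
```

Because tags are part of what identifies a history, any backfill like this would need to be applied to all affected records at the same time as the switch to explicit values, so that old and new results still line up.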

Because of the shift to running arrowbench directly from arrow-benchmarks-ci, it may be more pragmatic to break things as we switch over and then do the cleanup, but I'm opening this issue here because the problem is presently here, even if the fix ends up being some tweaks in arrowbench defaults and some database cleanup.