optimagic-dev / optimagic

optimagic is a Python package for numerical optimization. It is a unified interface to optimizers from SciPy, NlOpt and other packages. optimagic's minimize function works just like SciPy's, so you don't have to adjust your code. You simply get more optimizers for free. On top you get diagnostic tools, parallel numerical derivatives and more.
https://optimagic.readthedocs.io/

Improve runtime measures for criterion plot and benchmarking plots #547

Open janosg opened 3 weeks ago

janosg commented 3 weeks ago

Current Situation / Problem you want to solve

The proposal in this issue concerns the functions criterion_plot, profile_plot and convergence_plot.

Each runtime measure serves a purpose:

n_evaluations and n_batches measure important aspects but also have a big drawback: they focus exclusively on evaluations of the objective function and ignore all time that is spent on evaluating derivatives. This is not a problem as long as only derivative-free or only derivative-based optimizers are compared. But as soon as a derivative-free optimizer is compared with a derivative-based one, the measures become misleading: a derivative-based optimizer that needs 10 objective evaluations plus 10 derivative evaluations looks cheaper than a derivative-free one that needs 30 objective evaluations, even though it may spend much more total time.

Describe the solution you'd like

Step 1: Introduce new runtime measures

All relevant functions will get a runtime_measure argument which can be:

We also keep the legacy measures "n_evaluations" and "n_batches".
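As a rough usage sketch (the exact set of accepted string values is part of this proposal, not the current optimagic API; results stands for one or more optimization results and the import path is assumed):

import optimagic as om

# Hypothetical call once the proposed runtime_measure argument exists;
# "function_time" is one of the new measures discussed in Step 2 below.
fig = om.criterion_plot(results, runtime_measure="function_time")
fig.show()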

Step 2: Introduce an optional cost model

While "function_time" and "batch_function" time allow to ignore optimizer overhead, they are not deterministic nor comparable across machines. In order to achieve this, we optionally allow a user to pass a CostModel as runtime_measure. Using a CostModel allows to reproduce all existing measures except for walltime. Moreover, it allows to get reproducible and hardware agnostic runtime measures for almost any situation.

A cost model looks as follows:

from dataclasses import dataclass

@dataclass(frozen=True)
class CostModel:
    fun: float | None = None
    jac: float | None = None
    fun_and_jac: float | None = None

    label: str | None = None

    def aggregate_batch_times(self, times: list[float]) -> float:
        return sum(times)

The attributes fun, jac, and fun_and_jac allow a user to provide runtimes of the user-provided functions. Those could be actual times in seconds or normalized values (e.g. 1 for fun). None means that the actual measured runtime is used.

The attribute label is used as the x-axis label in plots.

The method aggregate_batch_times takes a list of times (which might be measured runtimes or times replaced according to the other attributes) and returns a scalar value. The default implementation assumes that no parallelization is used.
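To make the replacement and aggregation steps concrete, here is a minimal sketch (not optimagic's actual internals; the helper runtime_from_history and the history format are hypothetical) of how a cost model could turn a history of evaluations into a single runtime measure:

def runtime_from_history(history, cost_model):
    # history: list of (kind, measured_time, batch_id) tuples, where kind is
    # "fun", "jac", or "fun_and_jac" -- this format is assumed for illustration.
    def effective_time(kind, measured):
        # Use the cost model's value if given, otherwise the measured runtime.
        override = getattr(cost_model, kind)
        return measured if override is None else override

    # Group the (possibly replaced) times by batch and aggregate each batch.
    batches = {}
    for kind, measured, batch_id in history:
        batches.setdefault(batch_id, []).append(effective_time(kind, measured))
    return sum(cost_model.aggregate_batch_times(times) for times in batches.values())

With the default CostModel, this simply sums measured times; the examples below replace those times with counts.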

To see the cost model in action, let's reproduce a few existing measures:

n_evaluations_cost_model = CostModel(fun=1, jac=0, fun_and_jac=0, label="evaluations of the objective function")
function_time_cost_model = CostModel(label="seconds")

@dataclass(frozen=True)
class PerfectParallelizationCostModel(CostModel):
    def aggregate_batch_times(self, times: list[float]) -> float:
        return max(times)

n_batches_cost_model = PerfectParallelizationCostModel(fun=1, jac=0, fun_and_jac=0, label="batch evaluations of the objective function")

The zero values for jac and fun_and_jac make the problems of n_evaluations and n_batches very apparent.
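As an illustration of how a custom cost model might be passed to a plotting function, a hedged sketch (assuming criterion_plot and CostModel are available as above; the relative cost jac=5 is made up, and accepting a CostModel as runtime_measure is the proposed, not current, behavior):

# Hypothetical: weight derivative evaluations five times as heavily as
# objective evaluations, independent of the machine the results ran on.
fig = criterion_plot(
    results,
    runtime_measure=CostModel(fun=1, jac=5, fun_and_jac=6, label="fun-equivalents"),
)
fig.show()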

Potential variations

Questions

timmens commented 2 weeks ago

Very nice proposal! :tada:

This definitely fills a small but relevant gap. Some comments:

Regarding your question, I am unsure whether I understand it correctly. If, for example, I have a benchmark with two functions that have different runtimes of their derivative, I could use the "function_time" runtime measure, or not? And for a profile_plot we would have one function and different optimizers; I would've suspected that here again, "function_time" should work for a fair comparison?

janosg commented 2 weeks ago

Regarding your question, I am unsure whether I understand it correctly. If, for example, I have a benchmark with two functions that have different runtimes of their derivative, I could use the "function_time" runtime measure, or not? And for a profile_plot we would have one function and different optimizers; I would've suspected that here again, "function_time" should work for a fair comparison?

Yes, function time would be a fair comparison, but it is hardware-specific and not fully reproducible. In benchmarking you often want reproducible results and potentially even want to compare benchmark results generated on different computers. So we need the CostModel solution to work for benchmarks as well, and unfortunately there could be cases where each problem has a different cost model.
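A minimal sketch of that situation, with hypothetical problem names and made-up relative costs; how per-problem cost models would actually be passed to the benchmarking functions is an open design question, not something settled in the proposal above:

# The relative cost of jac vs. fun differs across problems, so a single
# CostModel cannot be fair for both; one conceivable shape is a mapping from
# problem name to cost model (hypothetical, not an agreed-upon API).
cost_models = {
    "rosenbrock": CostModel(fun=1, jac=2, fun_and_jac=3, label="fun-equivalents"),
    "helical_valley": CostModel(fun=1, jac=10, fun_and_jac=11, label="fun-equivalents"),
}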