openml / openml-python

OpenML's Python API for a World of Data and More 💫
http://openml.github.io/openml-python/

Why is computation time not reported if n_jobs != 1 or != None? #895

Closed NicolasHug closed 3 years ago

NicolasHug commented 4 years ago

I'm running a big benchmark suite with RandomizedSearchCV(n_jobs=-1).

Unfortunately, computation time is reported only if n_jobs is None or 1.

I don't understand the reasoning in https://github.com/openml/openml-python/issues/229. Why isn't the interpretation left up to the user?


As a side note: n_jobs=None can be overridden with a context manager:

from joblib import parallel_backend
from sklearn.model_selection import RandomizedSearchCV

with parallel_backend('loky', n_jobs=-1):
    # estimator, param_distributions, and the fit call omitted for brevity
    RandomizedSearchCV(n_jobs=None)

This is equivalent to just calling RandomizedSearchCV(n_jobs=-1).

With the latter, openml won't report computation time, but as far as I understand, the former will run just fine and report the computation time. So it seems that the check isn't properly enforced anyway.

CC @amueller

amueller commented 4 years ago

ping @janvanrijn @mfeurer ;)

amueller commented 4 years ago

Should we maybe additionally add a wallclock_time_millis_training, which can always be computed?

amueller commented 4 years ago

The reason it is wrong for n_jobs != 1 is that internally it uses process_time, which does not count any of the subprocess time, and it is not measuring wall-clock time.
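
To make this concrete, here is a minimal, self-contained illustration using only the standard library and joblib (not openml-python code, just a demonstration of the measurement gap): the parent process's process_time barely moves even though real CPU work happened in the worker processes.

import time
from joblib import Parallel, delayed

def busy(n=2_000_000):
    # burn some CPU inside a worker process
    return sum(i * i for i in range(n))

wall_start, cpu_start = time.time(), time.process_time()
Parallel(n_jobs=2)(delayed(busy)() for _ in range(8))
print('wall-clock:', time.time() - wall_start)          # elapsed time of the parallel work
print('parent CPU:', time.process_time() - cpu_start)   # close to zero, worker CPU not counted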

mfeurer commented 4 years ago

ping @janvanrijn @mfeurer ;)

I'll come back to you after the ICML deadline.

mfeurer commented 4 years ago

Thanks for raising this issue, it seems that there are indeed one or two problems here.

I believe the reason why the wallclock time is not reported if the number of cores is -1 is that we can't figure out how many cores it was executed on, so the number is of limited use. Currently, this is a very restrictive check that can be circumvented in plenty of ways (as you showed). Do you have any suggestions on how to improve on this?

Should we maybe additionally add a wallclock_time_millis_training, which can always be computed?

That exists and is computed if n_jobs != -1.

In order to get the times of each base run you can check the optimization trace, which should have the time for each model fit. However, we currently don't seem to store the refit time correctly (or at all?), which to me seems like the biggest bug here.

NicolasHug commented 4 years ago

Do you have any suggestions on how to improve on this?

I think you can use effective_n_jobs from joblib: https://github.com/joblib/joblib/blob/master/joblib/parallel.py#L366
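
For reference, a small usage sketch of joblib's public helper (nothing OpenML-specific):

from joblib import effective_n_jobs

print(effective_n_jobs(1))    # 1
print(effective_n_jobs(-1))   # resolves to the number of available CPUs, e.g. 8
print(effective_n_jobs(-2))   # all CPUs but one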

mfeurer commented 4 years ago

Yet another issue we have to think about is the recent use of OpenMP in scikit-learn, which might make it harder for us to get a useful estimate of the time used.

mfeurer commented 3 years ago

Sorry that this has stalled for so long, but now it's finally time to pick this up and finish it!

I think we basically have the following cases here which we need to consider:

  1. estimators that don't involve any parallelism, for example simple decision trees
  2. estimators that do parallelization inside themselves via BLAS or OpenMP, for example SGD or HistGradientBoosting
  3. estimators that do parallelization via joblib, for example RandomForest
  4. HPO algorithms that call an underlying algorithm multiple times via joblib

and IIRC we can measure the following things:

  1. CPU time for the whole run
  2. Wallclock time for the whole run
  3. CPU time for each individual run in HPO
  4. Wallclock time for each individual run in HPO

That means we can do the following things for cases 1-4:

  1. Easy, we can measure both CPU time and wallclock time
  2. Tricky. We can measure both CPU time and wallclock time, but for wallclock time we won't know how many CPUs were involved. In case OpenML is started on several processes or machines (no idea if this is realistic) we also don't get reliable estimates of the process usage any more.
  3. Hard. We can easily measure CPU and wallclock time as long as n_jobs=1. We can still measure wallclock time for any fixed n_jobs>=1, as we'd know how many cores are used. In case of n_jobs==-1 we won't know how many cores are being used, but we could use effective_n_jobs to get an estimate. With more than one job, CPU time is not measurable as we don't have access to the CPU time of the individual worker processes.
  4. Easier again. As each individual job measures its own time, we can gather all individual times at the end and add them up to obtain the total time taken (a sketch of this bookkeeping follows below).
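
A rough sketch of the per-job bookkeeping described in case 4, using plain joblib and scikit-learn (illustrative only, not the openml-python implementation; the candidate list stands in for whatever configurations an HPO loop such as RandomizedSearchCV would sample):

import time
from joblib import Parallel, delayed
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
candidates = [DecisionTreeClassifier(max_depth=d) for d in (2, 4, 8, None)]

def fit_and_time(estimator, X, y):
    # each worker times its own fit, so the numbers remain valid for n_jobs > 1
    wall_start, cpu_start = time.time(), time.process_time()
    estimator.fit(X, y)
    return time.time() - wall_start, time.process_time() - cpu_start

timings = Parallel(n_jobs=-1)(delayed(fit_and_time)(est, X, y) for est in candidates)
total_wall = sum(wall for wall, _ in timings)
total_cpu = sum(cpu for _, cpu in timings)
print(total_wall, total_cpu)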

As @NicolasHug pointed out, one can override the behavior via a context manager. Another caveat is that when using a scheduler-worker system such as dask, one does not necessarily get all available CPUs, or the jobs might just sit in a queue, making the wallclock time of the overall run completely useless.

Therefore, I propose to do the following:

  1. Document what we're doing
  2. Make sure we cannot be cheated by the context manager (see the sketch after this list)
  3. Implement storing the refit time for HPO as asked for in #248
  4. Figure out what to do with dask - can we somehow store which backend was used?
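
For point 2, one possible direction (only a sketch, not what openml-python actually does) would be to query joblib's active backend instead of trusting the estimator's n_jobs parameter:

from joblib import parallel_backend, effective_n_jobs

# effective_n_jobs(None) resolves against the currently active joblib backend,
# so a parallel_backend context manager can no longer enable parallelism
# without the check noticing it (assuming joblib's default behavior).
print(effective_n_jobs(None))        # 1 with joblib's defaults
with parallel_backend('loky', n_jobs=-1):
    print(effective_n_jobs(None))    # resolves to the number of available CPUs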

What do you think about this @NicolasHug @amueller @PGijsbers

PGijsbers commented 3 years ago

I'd be careful not to spend too much time on this, as it could become a very complicated (if not impossible) project on its own: we would have to account for different parallelization strategies/packages, and would also need to start capturing hardware information, etc. However, making the proposed changes and then clearly documenting under which conditions what is measured, and how to interpret this data, still seems like a worthwhile change to me.

mfeurer commented 3 years ago

We followed the suggestion of @NicolasHug to just log the CPU and wallclock time and give the user the possibility and duty to interpret those. To simplify matters we added a lengthy example.
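
For completeness, a rough sketch of how a user might now inspect both numbers after a local run; the runtime key names below ('usercpu_time_millis_training', 'wall_clock_time_millis_training') are assumptions on my side, so please check the example in the documentation for the exact names:

import openml
from sklearn.ensemble import RandomForestClassifier

task = openml.tasks.get_task(31)          # task id 31 is just a placeholder
clf = RandomForestClassifier(n_jobs=-1)
run = openml.runs.run_model_on_task(clf, task)

# fold_evaluations maps measure name -> repeat -> fold -> value
print(run.fold_evaluations.get('usercpu_time_millis_training'))
print(run.fold_evaluations.get('wall_clock_time_millis_training'))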