A comparative benchmarking library for scikit-learn's estimators
sklearn_benchmarks is a library to benchmark scikit-learn's estimators against competing implementations. It can be used through a Python API or through a command-line interface, as described below.
See benchmark results here.
In order to set up the environment, you need to have conda installed. See instructions here.
To get a local copy up and running follow these simple example steps:
$ git clone -b master --single-branch https://github.com/mbatoul/sklearn_benchmarks
$ cd sklearn_benchmarks
$ conda env create --file bench_environment.yml
$ conda activate sklbench_dev
$ pip install .
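You can check that the command-line interface is available by printing the help of the run command (the full option lists are reproduced at the end of this document):
$ python -m sklearn_benchmarks.cli run --help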
The benchmark script consumes the configuration file (config.yml by default), which contains the specification of the scikit-learn estimators to benchmark, their hyperparameters and the datasets. For example:
- name: sklearn_KNeighborsClassifier_brute_force
  estimator: sklearn.neighbors.KNeighborsClassifier
  strategy: hp_match
  predict_with_onnx: True
  parameters:
    n_neighbors:
      - 1
      - 5
      - 100
    algorithm:
      - brute
    n_jobs:
      - -1
      - 1
  metrics:
    - accuracy_score
  datasets:
    - generator: sklearn.datasets.make_classification
      generator_parameters:
        n_classes: 2
        n_redundant: 0
        n_features: 100
      n_samples_train:
        - 100_000
      n_samples_test:
        - 1
        - 1000
    - generator: sklearn.datasets.make_classification
      generator_parameters:
        n_classes: 2
        n_redundant: 0
        n_features: 2
      n_samples_train:
        - 100_000
      n_samples_test:
        - 1
        - 1000
Each entry of the estimators list describes an estimator to benchmark:
- name: identifies the estimator; it is used as the name of the benchmark run result file
- estimator: path to the estimator class
- predict_with_onnx: when set to True, predictions will also be made with ONNX (for estimators respecting the scikit-learn API only)
- parameters: grid of hyperparameters to explore in the benchmarks (see the sketch below)
- metrics: the metrics used to compute the scores (functions of the sklearn.metrics module)
- datasets: list of datasets
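As an illustration only (this is a sketch, not the library's actual internals), the first estimator entry above roughly amounts to timing fit and predict for every combination of the hyperparameter grid on data produced by the dataset generator, here using the largest n_samples_test value (1000):

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ParameterGrid
from sklearn.neighbors import KNeighborsClassifier

# Hyperparameter grid taken from the `parameters` entry above.
grid = ParameterGrid(
    {"n_neighbors": [1, 5, 100], "algorithm": ["brute"], "n_jobs": [-1, 1]}
)

# First `datasets` entry: draw n_samples_train + n_samples_test samples
# from the generator, then split them.
n_train, n_test = 100_000, 1_000
X, y = make_classification(
    n_samples=n_train + n_test, n_classes=2, n_redundant=0, n_features=100
)
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]

for params in grid:
    estimator = KNeighborsClassifier(**params)
    estimator.fit(X_train, y_train)                 # what the benchmark would time
    y_pred = estimator.predict(X_test)              # what the benchmark would time
    print(params, accuracy_score(y_test, y_pred))   # reported metric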
Benchmark results are stored in the output directory (set with the output_dir entry in the benchmarking section of the configuration file or with the --output-dir CLI option). Each run result folder is named with a timestamp corresponding to the date and time of the run (e.g. 20220313T171732).

Run benchmarks for all estimators with profiling
python -m sklearn_benchmarks.cli run --profile
Run benchmarks for a pair of estimators with SLURM
python -m sklearn_benchmarks.cli run --estimator sklearn_Ridge --estimator sklearnex_Ridge --slurm slurm_config.yml
Example of SLURM config file:
logs_dir: slurm_logs
parameters:
  timeout_min: 70
  slurm_partition: normal
  slurm_gpus_per_task: 0
  slurm_additional_parameters:
    hint: nomultithread
  slurm_exclusive: True
  slurm_cpus_per_task: 40
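The parameter names in this file resemble the keyword arguments of submitit's AutoExecutor. As an illustration only (this README does not document the internals, so a submitit-based backend is an assumption here), such a configuration could be applied roughly like this:

import submitit
import yaml

# Sketch only: assumes a submitit-based SLURM backend, which is not confirmed
# by this README. `run_one_benchmark` is a placeholder for the real work.
with open("slurm_config.yml") as f:
    slurm_config = yaml.safe_load(f)

executor = submitit.AutoExecutor(folder=slurm_config["logs_dir"])
executor.update_parameters(**slurm_config["parameters"])

def run_one_benchmark():
    # Placeholder for the actual benchmark function.
    return "done"

job = executor.submit(run_one_benchmark)
print(job.result())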
Run HPO benchmarks only
python -m sklearn_benchmarks.cli run --strategy hpo
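The --time-budget option documented below can be combined with this, for example to set a time budget of 600 seconds for the HPO benchmarks:
python -m sklearn_benchmarks.cli run --strategy hpo --time-budget 600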
Set a custom output directory
python -m sklearn_benchmarks.cli run --output-dir ~/dev/sklbench_results
Generate reports for one run
python -m sklearn_benchmarks.cli report results/local/20220303T132031
The run directory passed as an argument to the report command must be valid: it must contain a benchmarking folder with CSV files, as well as the files env_info.txt, time_most_recent_run.txt and versions.txt.
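For illustration, a valid run directory could look like the following (the CSV file name shown is hypothetical; per the configuration section above, result files are named after the name of each estimator entry):

results/local/20220303T132031/
    benchmarking/
        sklearn_KNeighborsClassifier_brute_force.csv
        ...
    env_info.txt
    time_most_recent_run.txt
    versions.txt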
Generate reports for all local runs
python -m sklearn_benchmarks.cli report results/local
The results/local folder will be traversed to identify all the valid run directories it contains (see above). A report will be generated for each of them. Complex folder trees with multiple levels of nesting are supported.
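As an illustration only (not the library's actual code), the traversal amounts to something like the following, using the validity criteria described above:

from pathlib import Path

REQUIRED_FILES = {"env_info.txt", "time_most_recent_run.txt", "versions.txt"}

def find_valid_run_dirs(root):
    # A run directory is valid if it contains a `benchmarking` folder
    # and the three metadata files listed above.
    for candidate in sorted(p for p in Path(root).rglob("*") if p.is_dir()):
        has_benchmarking = (candidate / "benchmarking").is_dir()
        has_metadata = REQUIRED_FILES <= {f.name for f in candidate.iterdir()}
        if has_benchmarking and has_metadata:
            yield candidate

for run_dir in find_valid_run_dirs("results/local"):
    print(run_dir)  # a report would be generated for each of these directories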
run
Usage: python -m sklearn_benchmarks.cli run [OPTIONS]
Run benchmarks for estimators specified in the configuration file.
Options:
-r, --output-dir TEXT Path to directory where benchmark results
should be stored.
-p, --profile Activate profiling of functions.
-c, --config TEXT Path to benchmarks configuration file.
-e, --estimator TEXT Select estimator(s) to benchmark from
configuration file. By default, they will all
be run.
-s, --strategy [hp_match|hpo] Select estimators by benchmarking strategies.
-tb, --time-budget INTEGER Custom time budget for HPO benchmarks in
seconds.
-sl, --slurm TEXT Run benchmarks using SLURM with configuration
from the given config file.
-h, --help Show this message and exit.
report
Usage: python -m sklearn_benchmarks.cli report [OPTIONS] [DIR]...
Generate HTML reports.
Options:
-c, --config TEXT Path to benchmarks configuration file.
-h, --help Show this message and exit.