scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

HistGradientBoosting pickle portability between 64bit and 32bit arch #27952

Closed stuartlynn closed 9 months ago

stuartlynn commented 11 months ago

Describe the bug

HistGradientBoosting models use np.intp to represent the feature_idx in TreePredictor nodes:

https://github.com/scikit-learn/scikit-learn/blob/0f8a7775ad248b9aa4be63291ae71d9212a46e6c/sklearn/ensemble/_hist_gradient_boosting/common.pyx#L19-L36

This seems to cause problems when a pickled HistGradientBoosting model that was trained in a 64-bit environment is used in a 32-bit environment (such as Pyodide, which is where I encountered this issue).
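To make the mismatch concrete, here is a minimal NumPy-only sketch. The two-field dtypes are simplified stand-ins chosen for illustration, not the real node dtype; only the field name `feature_idx` comes from the traceback. The 32-bit side is simulated by hard-coding np.int32, since np.intp cannot be changed at runtime.

```python
import numpy as np

# np.intp is the platform's pointer-sized integer:
# 8 bytes on a 64-bit build, 4 bytes on a 32-bit build such as Pyodide.
print("np.intp itemsize on this platform:", np.dtype(np.intp).itemsize)

# Simplified stand-ins for the node dtype (the real one has more fields).
dtype_64bit = np.dtype([("value", np.float64), ("feature_idx", np.int64)])
dtype_32bit = np.dtype([("value", np.float64), ("feature_idx", np.int32)])

# Nodes as they would be stored in a pickle written on a 64-bit machine.
nodes = np.zeros(3, dtype=dtype_64bit)
nodes["feature_idx"] = [0, 2, 1]

# A 32-bit build expects a 4-byte index, so the dtypes do not match ...
dtypes_match = nodes.dtype == dtype_32bit

# ... but converting keeps the values as long as the indices fit in 32 bits.
converted = nodes.astype(dtype_32bit)
```

Pickle stores the exact dtype of the array, so the 8-byte field travels with the file and only clashes with the compiled code once it is loaded on the narrower platform.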

I know that for a while the other Tree models in sklearn had a similar problem, but I am not 100% sure what the solution was.

Would changing the type to be np.uint32 be an acceptable solution here?

Steps/Code to Reproduce

Steps to reproduce

  1. Train a model in Python on a 64-bit system
  2. Pickle the output
  3. Load that pickle in a 32-bit Python environment like Pyodide
  4. Attempt to run the prediction on the loaded model

See this repo for a full example: https://github.com/stuartlynn/hist_gradient_boost_bug

Expected Results

The Pyodide code runs and gives the expected output.

Actual Results

Error message

Running the above gives the following error message when trying to execute the Pyodide code:

PythonError: Traceback (most recent call last):
  File "/lib/python311.zip/_pyodide/_base.py", line 571, in eval_code_async
    await CodeRunner(
  File "/lib/python311.zip/_pyodide/_base.py", line 394, in run_async
    coroutine = eval(self.code, globals, locals)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<exec>", line 61, in <module>
  File "/lib/python3.11/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", l
    return self._loss.link.inverse(self._raw_predict(X).ravel())
                                   ^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", l
    self._predict_iterations(
  File "/lib/python3.11/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", l
    raw_predictions[:, k] += predict(X)
                             ^^^^^^^^^^
  File "/lib/python3.11/site-packages/sklearn/ensemble/_hist_gradient_boosting/predictor.py", line 71,
    _predict_from_raw_data(
  File "sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx", line 18, in sklearn.ensemble._hist_gr
ValueError: Buffer dtype mismatch, expected 'intp_t' but got 'long long' in 'const node_struct.feature_

    at new_error (/Users/slynn/tmp/demoland_onnx_test/runner/node_modules/.pnpm/pyodide@0.24.1/node_mod
    at wasm://wasm/02250ad6:wasm-function[295]:0x158827
    at wasm://wasm/02250ad6:wasm-function[452]:0x15fcd5
    at _PyCFunctionWithKeywords_TrampolineCall (/Users/slynn/tmp/demoland_onnx_test/runner/node_modules
    at wasm://wasm/02250ad6:wasm-function[1057]:0x1a3091
    at wasm://wasm/02250ad6:wasm-function[3387]:0x289e4d
    at wasm://wasm/02250ad6:wasm-function[2037]:0x1e3f77
    at wasm://wasm/02250ad6:wasm-function[1064]:0x1a3579
    at wasm://wasm/02250ad6:wasm-function[1067]:0x1a383a
    at wasm://wasm/02250ad6:wasm-function[1068]:0x1a38dc
    at wasm://wasm/02250ad6:wasm-function[3200]:0x2685c5
    at wasm://wasm/02250ad6:wasm-function[3201]:0x26e3d0
    at wasm://wasm/02250ad6:wasm-function[1070]:0x1a3a04
    at wasm://wasm/02250ad6:wasm-function[1065]:0x1a3694
    at wasm://wasm/02250ad6:wasm-function[440]:0x15f45e
    at Module.callPyObjectKwargs (/Users/slynn/tmp/demoland_onnx_test/runner/node_modules/.pnpm/pyodide@0.24.1/node_modules/pyodide/pyodide.asm.js:9:81732)
    at Module.callPyObject (/Users/slynn/tmp/demoland_onnx_test/runner/node_modules/.pnpm/pyodide@0.24.1/node_modules/pyodide/pyodide.asm.js:9:82066)
    at Timeout.wrapper [as _onTimeout] (/Users/slynn/tmp/demoland_onnx_test/runner/node_modules/.pnpm/pyodide@0.24.1/node_modules/pyodide/pyodide.asm.js:9:58562)
    at listOnTimeout (node:internal/timers:569:17)
    at process.processTimers (node:internal/timers:512:7) {
  type: 'ValueError',
  __error_address: 116329376
}

Things I have already checked

Hacky fix

What I found to work is the following: in Pyodide, after loading the model, manually changing the dtype of the nodes of each predictor makes the model run fine. There is an example of this in the example repo.

import joblib
import numpy as np

Y_DTYPE = np.float64
X_DTYPE = np.float64
X_BINNED_DTYPE = np.uint8  # hence max_bins == 256
# dtype for gradients and hessians arrays
G_H_DTYPE = np.float32
X_BITSET_INNER_DTYPE = np.uint32

PREDICTOR_RECORD_DTYPE_2 = np.dtype([
    ('value', Y_DTYPE),
    ('count', np.uint32),
    ('feature_idx', np.int32),
    ('num_threshold', X_DTYPE),
    ('missing_go_to_left', np.uint8),
    ('left', np.uint32),
    ('right', np.uint32),
    ('gain', Y_DTYPE),
    ('depth', np.uint32),
    ('is_leaf', np.uint8),
    ('bin_threshold', X_BINNED_DTYPE),
    ('is_categorical', np.uint8),
    # The index of the corresponding bitsets in the Predictor's bitset arrays.
    # Only used if is_categorical is True
    ('bitset_idx', np.uint32)
])

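model = joblib.load("/model.joblib")

# model._predictors holds one list of predictors per boosting iteration
# (one predictor per class for multiclass models), so convert all of them.
for predictors_at_iteration in model._predictors:
    for predictor in predictors_at_iteration:
        predictor.nodes = predictor.nodes.astype(PREDICTOR_RECORD_DTYPE_2)

model.predict(data)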

Versions

python version 3.11.3 (main, May 15 2023, 10:43:03) [Clang 14.0.6 ]
sklearn version 1.3.1

System:
    python: 3.11.3 (main, May 15 2023, 10:43:03) [Clang 14.0.6 ]
executable: /Users/slynn/miniconda3/envs/demoland/bin/python
   machine: macOS-10.16-x86_64-i386-64bit

Python dependencies:
      sklearn: 1.3.1
          pip: 23.3
   setuptools: 68.0.0
        numpy: 1.25.2
        scipy: 1.11.3
       Cython: None
       pandas: 1.5.3
   matplotlib: None
       joblib: 1.3.2
threadpoolctl: 3.2.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
    num_threads: 10
         prefix: libomp
       filepath: /Users/slynn/miniconda3/envs/demoland/lib/python3.11/site-packages/sklearn/.dylibs/libomp.dylib
        version: None

       user_api: blas
   internal_api: openblas
    num_threads: 10
         prefix: libopenblas
       filepath: /Users/slynn/miniconda3/envs/demoland/lib/python3.11/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: Nehalem

       user_api: blas
   internal_api: openblas
    num_threads: 10
         prefix: libopenblas
       filepath: /Users/slynn/miniconda3/envs/demoland/lib/python3.11/site-packages/scipy/.dylibs/libopenblas.0.dylib
        version: 0.3.21.dev
threading_layer: pthreads
   architecture: Nehalem

lesteve commented 11 months ago

A while ago, I worked on fixing something similar for the trees; see https://github.com/scikit-learn/scikit-learn/pull/21552 for context.

I am pretty sure at the time I realised that other estimators were problematic but I left them for later.

From my notes: the common approach is to try to convert attributes at unpickling time in __setstate__, so that Cython functions, which are pickier about types, can be called correctly.
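As a rough illustration of that idea, the sketch below converts a structured array inside __setstate__. `ToyPredictor` and `EXPECTED_DTYPE` are hypothetical names invented for this example and are not scikit-learn code; the expected dtype is hard-coded to int32 to mimic a 32-bit build, whereas real code would presumably rely on np.intp.

```python
import pickle

import numpy as np

# Assumption: the dtype the compiled code expects on the loading platform.
# int32 is hard-coded here to mimic a 32-bit build.
EXPECTED_DTYPE = np.dtype([("value", np.float64), ("feature_idx", np.int32)])


class ToyPredictor:
    """Hypothetical stand-in for an object owning a structured `nodes` array."""

    def __init__(self, nodes):
        self.nodes = nodes

    def __setstate__(self, state):
        # Restore the attributes as usual, then normalise the node dtype so
        # that type-strict compiled code would accept the array.
        self.__dict__.update(state)
        if self.nodes.dtype != EXPECTED_DTYPE:
            self.nodes = self.nodes.astype(EXPECTED_DTYPE)


# Simulate a pickle written on a 64-bit machine (8-byte feature_idx).
dtype_64bit = np.dtype([("value", np.float64), ("feature_idx", np.int64)])
nodes = np.zeros(2, dtype=dtype_64bit)
nodes["feature_idx"] = [3, 7]

# pickle calls __setstate__ on load, so the conversion happens automatically.
restored = pickle.loads(pickle.dumps(ToyPredictor(nodes)))
```

Doing the conversion at unpickling time means callers never see the foreign dtype, so no workaround is needed after loading.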

In the meantime, your workaround seems completely fine. I would recommend using .astype(PREDICTOR_RECORD_DTYPE_2, casting='same_kind') (the default is casting='unsafe') to fail early if the dtypes (the pickle dtype and the expected target dtype) are not compatible.
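The difference between the two casting modes can be seen in a small NumPy-only sketch. The dtypes below are simplified, made-up stand-ins (the `incompatible` one is deliberately wrong), not the real node dtype.

```python
import numpy as np

source_dtype = np.dtype([("value", np.float64), ("feature_idx", np.int64)])
compatible = np.dtype([("value", np.float64), ("feature_idx", np.int32)])
# Deliberately wrong target: 'value' would go from float to int.
incompatible = np.dtype([("value", np.int32), ("feature_idx", np.int32)])

nodes = np.zeros(2, dtype=source_dtype)
nodes["value"] = [0.5, 1.5]

# int64 -> int32 stays within the same kind, so this succeeds.
ok = nodes.astype(compatible, casting="same_kind")

# float64 -> int32 changes kind, so 'same_kind' refuses it ...
try:
    nodes.astype(incompatible, casting="same_kind")
    rejected = False
except TypeError:
    rejected = True

# ... whereas the default casting="unsafe" silently truncates the values.
truncated = nodes.astype(incompatible)
```

Note that 'same_kind' still allows an integer to be narrowed, so it guards against structurally wrong dtypes rather than against out-of-range values.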

And needless to say a PR making it work for HistGradientBoosting would be more than welcome!

stuartlynn commented 11 months ago

Thanks! That's super useful. Will try and use this as a guide to put together a PR.

lesteve commented 11 months ago

Sounds good!

Just curious, can you tell a bit more about your use case? Maybe you want to show the prediction of a HistGradientBoosting for pedagogical reasons inside Pyodide but it is too expensive to train inside Pyodide?

For completeness, I have been involved in making scikit-learn work better in Pyodide and I am curious what people use it for :wink:

stuartlynn commented 10 months ago

Hey, sorry for the delay in replying. The project we are working on is this one: https://urban-analytics-technology-platform.github.io/demoland-web/

The goal is to let policy makers change land use details in a city and see how that affects several key indicator variables (air pollution, house prices, etc.). We developed the model and train it on UK-wide data, but at inference time we only need to apply it to smaller areas. So we train outside Pyodide and use Pyodide to get the predictions in the browser, where they can be visualized.

The core modeling package also has to be available in regular old Python, so we kind of need a solution that works for both. My first attempt at this was using Pyodide to train a model and then store it and use that, but then we end up with two pickle files, one for Pyodide and one for regular Python, which is just a little harder to manage. We also envision training larger models in future and would rather do that outside Pyodide.

I was actually surprised how well scikit worked in pyodide, it was just this one little hiccup but everything else was pretty smooth

lesteve commented 10 months ago

OK, super interesting, thanks for the info!

I was actually surprised how well scikit worked in pyodide, it was just this one little hiccup but everything else was pretty smooth

Glad to hear that, if you ever bump into other issues, don't hesitate to report them!