A while ago, I worked on fixing something similar for the trees; see https://github.com/scikit-learn/scikit-learn/pull/21552 for context.
I am pretty sure that at the time I realised other estimators were also problematic, but I left them for later.
From my notes: the common approach is to convert attributes at unpickling time in `__setstate__`, so that Cython functions, which are more picky about types, can be called correctly.
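As a rough illustration of that approach (the class and dtype below are made up for the example, not the actual scikit-learn implementation), the idea is to re-cast the pickled node array to the dtype the current platform expects:

```python
import numpy as np

# Illustrative node dtype; the real one is PREDICTOR_RECORD_DTYPE in
# sklearn/ensemble/_hist_gradient_boosting/common.pyx.
NODE_DTYPE = np.dtype([
    ("value", np.float64),
    ("feature_idx", np.intp),  # 4 bytes on a 32-bit build, 8 bytes on 64-bit
])

class ToyTreePredictor:
    def __init__(self, nodes):
        self.nodes = nodes

    def __setstate__(self, state):
        # Runs at unpickling time: restore the attributes, then re-cast the
        # node array so the compiled prediction code sees the dtype layout it
        # expects on this platform.
        self.__dict__.update(state)
        if self.nodes.dtype != NODE_DTYPE:
            self.nodes = self.nodes.astype(NODE_DTYPE)
```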
In the meantime, your workaround seems completely fine. I would recommend using `.astype(PREDICTOR_RECORD_DTYPE_2, casting='same_kind')` (the default is `casting='unsafe'`) so that it fails early if the dtypes (the pickled dtype and the expected target dtype) are not compatible.
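For a quick standalone illustration of the difference (not tied to scikit-learn): under `casting='same_kind'`, only safe casts or casts within a kind (like int64 to int32) are allowed, so incompatible conversions fail instead of silently proceeding:

```python
import numpy as np

a = np.arange(3, dtype=np.int64)
a.astype(np.int32, casting="same_kind")    # fine: stays within the integer kind
a.astype(np.float64, casting="same_kind")  # fine: safe widening cast

b = np.array([1.5, 2.5], dtype=np.float64)
try:
    b.astype(np.int64, casting="same_kind")  # rejected: float -> int crosses kinds
except TypeError as exc:
    print(exc)
```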
And needless to say, a PR making it work for `HistGradientBoosting` would be more than welcome!
Thanks! That's super useful. Will try and use this as a guide to put together a PR.
Sounds good!
Just curious, can you tell us a bit more about your use case? Maybe you want to show the predictions of a `HistGradientBoosting` model for pedagogical reasons inside Pyodide, but it is too expensive to train inside Pyodide?
For completeness, I have been involved in making scikit-learn work better in Pyodide and I am curious what people use it for :wink:
Hey, sorry for the delay in replying. The project we are working on is this one: https://urban-analytics-technology-platform.github.io/demoland-web/
The goal is to let policy makers change land-use details in a city and see how that affects several key indicator variables (air pollution, house prices, etc.). We developed the model and train it on UK-wide data, but at inference time we only need to apply it to smaller areas. So we train outside Pyodide and use Pyodide to get the predictions in the browser, where they can be visualized.
The core modeling package also has to be available in regular old Python, so we kind of need a solution that works for both. My first attempt was to use Pyodide to train a model, store it, and use that, but then we end up with two pickle files, one for Pyodide and one for regular Python, which is just a little harder to manage. We also envision training larger models in the future and would rather do that outside Pyodide.
I was actually surprised at how well scikit-learn worked in Pyodide; it was just this one little hiccup, but everything else was pretty smooth.
OK, super interesting, thanks for the info!
> I was actually surprised at how well scikit-learn worked in Pyodide; it was just this one little hiccup, but everything else was pretty smooth.
Glad to hear that! If you ever bump into other issues, don't hesitate to report them!
Describe the bug
`HistGradientBoosting` models use `np.intp` to represent the `feature_idx` in `TreePredictor` nodes: https://github.com/scikit-learn/scikit-learn/blob/0f8a7775ad248b9aa4be63291ae71d9212a46e6c/sklearn/ensemble/_hist_gradient_boosting/common.pyx#L19-L36
This seems to cause issues when a `HistGradientBoosting` model pickled in a 64-bit environment is loaded in a 32-bit environment (like Pyodide, which is where I encountered this issue).
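A quick standalone check (not scikit-learn specific) shows why the pickled layout differs between the two platforms:

```python
import numpy as np

# np.intp tracks the platform's pointer size, so a structured dtype built with
# it has a different itemsize on 32-bit and 64-bit Python builds.
print(np.dtype(np.intp).itemsize)  # 8 on a 64-bit build, 4 on 32-bit (e.g. Pyodide)
```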
I know that the other tree models in sklearn had a similar problem for a while, but I am not 100% sure what the solution was.
Would changing the type to `np.uint32` be an acceptable solution here?

Steps/Code to Reproduce
Steps to reproduce
See this repo for a full example: https://github.com/stuartlynn/hist_gradient_boost_bug
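The repro is roughly the following (a sketch; file names and data are illustrative, and the linked repo has the full working version):

```python
# --- Run on a 64-bit machine ---
import pickle
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0])

model = HistGradientBoostingRegressor(max_iter=10).fit(X, y)
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# --- Then, inside 32-bit Pyodide ---
# with open("model.pkl", "rb") as f:
#     model = pickle.load(f)
# model.predict(X)  # fails with a dtype/buffer mismatch, because feature_idx
#                   # was pickled with the 64-bit np.intp layout
```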
Expected Results
The Pyodide code runs and gives the expected output.
Actual Results
Running the above gives the following error message when trying to execute the Pyodide code:

Error message
Things I have already checked
Hacky fix
What I found to work is the following: in Pyodide, after loading the model, manually re-cast the dtypes of the nodes of the predictors, and the model then runs fine. There is an example of this in the example repo.
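In outline, the workaround looks something like this (a sketch assuming the private `_predictors` attribute and `PREDICTOR_RECORD_DTYPE` from `sklearn.ensemble._hist_gradient_boosting.common`; the example repo has the actual code):

```python
import pickle
from sklearn.ensemble._hist_gradient_boosting.common import PREDICTOR_RECORD_DTYPE

with open("model.pkl", "rb") as f:
    model = pickle.load(f)

# Re-cast every predictor's node array to the record dtype expected on the
# current (32-bit) platform before calling predict().
for predictors_at_iteration in model._predictors:
    for predictor in predictors_at_iteration:
        predictor.nodes = predictor.nodes.astype(PREDICTOR_RECORD_DTYPE)
```

As suggested above, passing `casting='same_kind'` to `astype` should make this fail early if any field cannot be converted compatibly.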
Versions