Support for multi-output `KNeighborsRegressor` models

jpmml / jpmml-sklearn

Java library and command-line application for converting Scikit-Learn pipelines to PMML

GNU Affero General Public License v3.0


Support for multi-output `KNeighborsRegressor` models #172

Closed: PFloyd0 closed this issue 2 years ago

PFloyd0 commented 2 years ago

Sorry, there are some errors when I convert my model to PMML. It seems the data size has changed, but I do not know why.

Thank you!

vruusmann commented 2 years ago

The Java stack trace indicates that the K-neighbors regressor object inconsistency is detected on line 71 of KNeighborsUtil.java.

This does not match the latest state of the SkLearn2PMML/JPMML-SkLearn stack (ie. should be line 65): https://github.com/jpmml/jpmml-sklearn/blob/1.7.1/pmml-sklearn/src/main/java/sklearn/neighbors/KNeighborsUtil.java#L65

In other words, please upgrade your SkLearn2PMML package version to the latest (should be 0.78.1), and re-run your experiment.

PFloyd0 commented 2 years ago

Hello, I have already updated my sklearn2pmml package from 0.78.0 to 0.78.1 but this problem still occurs. Look forward to your reply. Thank you!

vruusmann commented 2 years ago

@PFloyd0 Take a look at the Java exception stack trace that you just posted - it still points to line 71.

It means that your SkLearn2PMML package update didn't work.

vruusmann commented 2 years ago

> It means that your SkLearn2PMML package update didn't work.

@PFloyd0 My bad, I'm taking back the above comment.

The SkLearn2PMML package is currently based on JPMML-SkLearn 1.6.X codebase (not the recent 1.7.X codebase), so we use legacy line numbers: https://github.com/jpmml/jpmml-sklearn/blob/1.6.X/src/main/java/sklearn/neighbors/KNeighborsUtil.java#L71

vruusmann commented 2 years ago

In short, there is a mismatch between the KNeighborsRegressor._fit_X attribute (reports 60'000 instances) and the KNeighborsRegressor._y attribute (reports 30'000 instances).

Can you print the shapes of these two attributes, and check whether you see the same mismatch in your Python environment?

Second, what are your offline_rss and offline_location data matrix types? They don't seem to be Pandas data matrix types (pandas.DataFrame and pandas.Series, respectively), because the PMMLPipeline.fit(X, y) method is unable to extract feature and target names.

Are they raw Numpy arrays?

TLDR: What gets printed to console?

print(knn_reg._fit_X.shape)
print(knn_reg._y.shape) # Alternatively, do `len(knn_reg._y)`

print(offline_rss.shape)
print(offline_rss.__class__)
print(offline_location.shape) # Alternatively, do `len(offline_location)`
print(offline_location.__class__)

PFloyd0 commented 2 years ago

Yes, they are raw numpy arrays. I load them from a .mat file. Should I convert them to a dataframe before training?

vruusmann commented 2 years ago

print(offline_location.shape)
(30000, 2)

This is the error - the shape of the y variable is (30000, 2) (ie. 60k values), but it should be (30000, 1) (ie. 30k values).

Is this intentional - are you trying to fit a multi-output KNN regressor model?

The SkLearn2PMML/JPMML-SkLearn stack assumes that regressor models are for a single output column only. Therefore, it sees a Numpy array with 60k elements, and assumes that it's (60000, 1).

The converter should be checking the dimensionality of the embedded _y variable, and raising a targeted exception if the target is not a 1D array-like object.
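For illustration, the kind of check described above can be sketched on the Python side. This is only a sketch, not the converter's actual code (which is Java): the function name `check_single_output` is hypothetical, and the zero-filled arrays merely stand in for `offline_rss` and `offline_location` with the shapes reported in this thread.

```python
import numpy as np

def check_single_output(X, y):
    """Raise ValueError unless y holds exactly one target value per row of X.

    Hypothetical helper, not part of SkLearn2PMML or JPMML-SkLearn.
    """
    X = np.asarray(X)
    y = np.asarray(y)
    n_samples = X.shape[0]
    # Accept (n,) and (n, 1); reject multi-output targets such as (n, 2).
    if y.ndim == 2 and y.shape[1] == 1:
        y = y.ravel()
    if y.ndim != 1:
        raise ValueError(
            f"Expected a 1D target, got shape {y.shape} "
            f"({y.size} values for {n_samples} samples)"
        )
    if y.shape[0] != n_samples:
        raise ValueError(f"Expected {n_samples} target values, got {y.shape[0]}")
    return y

# Stand-ins with the shapes from this thread: 4 input columns, 2 output columns.
offline_rss = np.zeros((30000, 4))
offline_location = np.zeros((30000, 2))

print(offline_location.size)  # 60000 elements in total, twice the row count
try:
    check_single_output(offline_rss, offline_location)
except ValueError as e:
    print(e)
```

Running a check like this before calling `PMMLPipeline.fit(X, y)` would surface the multi-output target in Python, rather than as a size mismatch during conversion.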

PFloyd0 commented 2 years ago

Ok, thank you very much. The inputs are four intensity values from a Bluetooth device and the output is location information, so I need two coordinates. I am still thinking about how to convert the output. Anyway, thank you again for your help.

vruusmann commented 2 years ago

> The inputs are four intensity values from a Bluetooth device and the output is location information, so I need two coordinates.

Your model is basically a giant look-up table?

You could use a helper object "location" (some unique integer). So, you'd first map from 4D to "location", and then from "location" to 2D.

The latter transformation could be implemented using the PMMLPipeline.predict_transformer attribute. You'd currently need two look-up tables there - one for the "location -> x" mapping, and another for the "location -> y" mapping.
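A minimal sketch of how the "location" helper and the two look-up tables could be built from the 2D targets. The function name `build_location_tables` and the sample coordinates are hypothetical, and the sketch assumes that identical (x, y) pairs recur in the training data, which is what makes the look-up-table view workable. It deliberately stops short of wiring the tables into PMMLPipeline.predict_transformer, since the thread does not show that step.

```python
def build_location_tables(locations):
    """Assign an integer id to each distinct (x, y) pair.

    Returns (ids, id_to_x, id_to_y): the per-row location ids plus the two
    look-up tables. Hypothetical helper, not part of SkLearn2PMML.
    """
    pair_to_id = {}
    ids = []
    for x, y in locations:
        pair = (float(x), float(y))
        # Reuse the id of a pair seen before, otherwise allocate the next integer.
        ids.append(pair_to_id.setdefault(pair, len(pair_to_id)))
    id_to_x = {i: pair[0] for pair, i in pair_to_id.items()}
    id_to_y = {i: pair[1] for pair, i in pair_to_id.items()}
    return ids, id_to_x, id_to_y

# Made-up coordinates; rows of a (n, 2) Numpy array would work the same way.
locations = [(0.0, 0.0), (1.5, 2.0), (0.0, 0.0), (3.0, 1.0)]
ids, id_to_x, id_to_y = build_location_tables(locations)

print(ids)      # [0, 1, 0, 2]
print(id_to_x)  # {0: 0.0, 1: 1.5, 2: 3.0}
print(id_to_y)  # {0: 0.0, 1: 2.0, 2: 1.0}
```

The model would then be fitted on `ids` as a single-column target. Because the ids are arbitrary labels, the first step has to predict one discrete id rather than an average of neighbouring ids, so this work-around does not reproduce the coordinate averaging of a multi-output KNN regressor.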

vruusmann commented 2 years ago

In principle, the SkLearn2PMML/JPMML-SkLearn stack is internally pretty close to multi-output support. It's already well supported on the model evaluation side.

The biggest obstacle right now is that the org.jpmml.converter.Schema class does not support the multi-label use case (there needs to be an org.jpmml.converter.MultiSchema subclass for that). It is pretty high on my internal TODO list. A quick GitHub search didn't reveal any public issues about it, though.

PFloyd0 commented 2 years ago

It works. Thank you very much!

vruusmann commented 2 years ago

> It works.

What works? The suggested work-around using PMMLPipeline.predict_transformer?

Anyway, KNN models are prime examples of the multi-output use case.

This is already well supported on the JPMML-Evaluator side. What is missing are some lines of code in the JPMML-Converter and JPMML-SkLearn libraries.

I will work on this this spring. Don't close the issue, otherwise I might forget!