onnx / sklearn-onnx

Convert scikit-learn models and pipelines to ONNX
Apache License 2.0

LocalOutlierFactor not implemented #736

Open penguinyaro opened 3 years ago

penguinyaro commented 3 years ago

Requesting support be added for sklearn.neighbors.LocalOutlierFactor (https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html#sklearn.neighbors.LocalOutlierFactor). It is another anomaly detection method.

xadupre commented 3 years ago

That does not seem too complicated. Is it urgent?

penguinyaro commented 3 years ago

I'm working with a team to export our trained models (even though LOF doesn't get "trained" in the usual ML sense) to ONNX in hopes that we can pass them to a team using Java. We are still in the earlier stages of development. That is to say, if this could be done within a week, we would be ecstatic. If it takes a month, I think we would still work with it. If it took any longer, we would be looking for other solutions.

Really do appreciate your responsiveness @xadupre.

penguinyaro commented 3 years ago

Additionally, during implementation, I would like to request that either negative_outlier_factor_ (https://github.com/scikit-learn/scikit-learn/blob/2beed55847ee70d363bdbfe14ee4401438fba057/sklearn/neighbors/_lof.py#L130) or score_samples() (https://github.com/scikit-learn/scikit-learn/blob/2beed55847ee70d363bdbfe14ee4401438fba057/sklearn/neighbors/_lof.py#L460) be accessible as an output.

xadupre commented 3 years ago

I exposed the methods predict and decision_function. The option score_samples can be set to True to add a third result for the method score_samples. I'll look into negative_outlier_factor_ once everything else is fixed. It would be great if you have time to test it before it gets merged.
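
For reference, a minimal sketch of how that option is passed at conversion time (assuming a fitted novelty=True model called model with three input features; the option is keyed by the model's id):

from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

initial_types = [('float_input', FloatTensorType([None, 3]))]  # None = any number of rows
onx = convert_sklearn(
    model,
    initial_types=initial_types,
    options={id(model): {'score_samples': True}})  # requests the extra score_samples output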

xadupre commented 3 years ago

The first PR was merged. The converter is now available. About negative_outlier_factor_: it is a constant and I would prefer not to add an option to expose it as a result. However, I created a more generic function which lets you add any constant to the graph and retrieve it as an output. See the example in PR #743.
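
To illustrate, a short sketch of that helper (add_output_initializer from skl2onnx.helpers.onnx_helper, also used in the full example further down; onx is an already converted model and the output name is arbitrary):

from skl2onnx.helpers.onnx_helper import add_output_initializer

# expose the fitted model's negative_outlier_factor_ as an extra constant output
new_onx = add_output_initializer(
    onx,
    'negative_outlier_factor',
    model.negative_outlier_factor_.reshape(-1, 1))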

xadupre commented 3 years ago

The converter was released on PyPI. Closing the issue. Feel free to reopen.

penguinyaro commented 3 years ago

@xadupre thanks for implementing this. Sorry I wasn't able to check it sooner, but I was able to get it working as follows:

import numpy as np
import pandas as pd

from sklearn.neighbors import LocalOutlierFactor

import onnxruntime as rt
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

"""
Classifier setup
"""

# Construct an array to process through LOF
A = np.random.rand(25,3) # uniform on interval [0, 1)
A[-1] = A[-1]*-10 # this should make the last entry an outlier

# Define the classifier
classifier_lof = LocalOutlierFactor(contamination='auto', novelty=True)
# Use LOF to classify the data
classifier_lof.fit(A)  # with novelty=True we fit first and then call predict separately
abm_mask = classifier_lof.predict(A) == -1  # -1 marks outliers
scores = -1 * classifier_lof.negative_outlier_factor_ # this gives us the scores for the predictions

# we use `abm_mask` and `scores` to report findings
print('abnormal mask:')
print(abm_mask)  # True means outlier
print('anomaly scores:')
print(scores)  # a larger number means more anomalous

"""
ONNX creation
"""
lof_types = [('float_input', FloatTensorType([None, A.shape[1]]))]  # A.shape[1] float columns, any number of rows

options = {id(classifier_lof): {'score_samples': True}}
onx_lof = convert_sklearn(classifier_lof, initial_types=lof_types, options=options)

with open("lof.onnx", "wb") as f:
    f.write(onx_lof.SerializeToString())

"""
ONNX use
"""
sesh_lof = rt.InferenceSession("new_lof.onnx")

# these are the outputs we can get from the model
cols = [o.name for o in sesh_lof.get_outputs()]
# label == predictions (-1 outlier, 1 inlier)
# scores == what was thresholded to make the predictions
# score_samples == negative raw anomaly score for a **new** point

# name of the model's single input
input_name = sesh_lof.get_inputs()[0].name
C = sesh_lof.run(cols, {input_name: A.astype(np.float32)})  # one run returns all the requested outputs

# just pack in dataframe for nice viewing
pd.DataFrame(np.hstack(C), columns=cols)

"""
Adding in `negative_outlier_factor_` - the anomaly scores for the training points
*if you score training points with `novelty=True`, you may get unexpected results*
"""

from skl2onnx.helpers.onnx_helper import add_output_initializer

new_onx = add_output_initializer(
    onx_lof,
    'negative_outlier_factor',
    classifier_lof.negative_outlier_factor_.reshape(-1,1)) # reshape here so it stacks nicely later

with open("new_lof.onnx", "wb") as f:
    f.write(new_onx.SerializeToString())

sesh_lof = rt.InferenceSession("new_lof.onnx")
input_name = sesh_lof.get_inputs()[0].name
cols = [o.name for o in sesh_lof.get_outputs()]  # now includes negative_outlier_factor
C = sesh_lof.run(cols, {input_name: A.astype(np.float32)})  # one run returns all four outputs

pd.DataFrame(np.hstack(C), columns=cols)

While this works, our use case for LOF is a bit different from this. In practice we only use novelty=False, and then the constant negative_outlier_factor_ holds the scores for the training points. In this sense, training our LOF models is also our classification step, which is quite different from the usual paradigm. That is usually:

data set --> model = trained_model

then

new data point --> trained_model = prediction and/or anomaly_score

Our LOF setup:

data set --> model = predictions and/or anomaly_scores (one per training point)

I think this might be possible with an onnxruntime.InferenceSession if it were possible to expose fit_predict(), i.e. in Python:

classifier_lof = LocalOutlierFactor(contamination='auto', novelty=False)
# Use LOF to classify the data
abm_mask = classifier_lof.fit_predict(A) == -1 # LOF does training and classification in the same step
scores = -1 * classifier_lof.negative_outlier_factor_ # this gives us the scores for the predictions

Note that with novelty=False, there won't be a score_samples() option.
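
For what it's worth, one workaround I can sketch with the helper above (untested, and assuming add_output_initializer can be applied again to the already augmented model): keep the novelty=True model for conversion and bake the training-time fit_predict labels into the graph as one more constant output next to negative_outlier_factor, so a consumer that only reads the ONNX file still gets the per-training-point predictions. The output name training_labels is arbitrary.

from sklearn.neighbors import LocalOutlierFactor
from skl2onnx.helpers.onnx_helper import add_output_initializer

# training-time labels from a novelty=False LOF fitted on the same data
labels_train = LocalOutlierFactor(contamination='auto', novelty=False).fit_predict(A)

# attach them to the already augmented model as another constant output
onx_with_labels = add_output_initializer(
    new_onx,
    'training_labels',
    labels_train.reshape(-1, 1))

with open("lof_with_training_labels.onnx", "wb") as f:
    f.write(onx_with_labels.SerializeToString())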

xadupre commented 3 years ago

Sorry for the delay; the training was initially out of scope. One way to solve your issue is to add a function to replace a constant in the ONNX graph and to implement the training algorithm for LOF. But I'm still hesitating between implementing the training just for this model and designing a more generic API which could make a whole pipeline trainable.

penguinyaro commented 3 years ago

Either of those solutions sounds like it would work.

I have a very shallow understanding of ONNX, but I'll try to study some more on this to see if I can contribute somehow.

penguinyaro commented 3 years ago

After poking around a bit, my feeling is that generic training would be good, but training for LOF would have to be implemented specifically using ONNX nodes.

@xadupre Is this correct thinking?

xadupre commented 3 years ago

The training is specific to LOF; I did not have time to think about a generic design, but both can be handled separately.