penguinyaro opened this issue 3 years ago

Requesting support be added for sklearn.neighbors.LocalOutlierFactor (https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html#sklearn.neighbors.LocalOutlierFactor). It is another anomaly detection method.
That does not seem too complicated. Is it urgent?
I'm working with a team to export our trained models (even though LOF doesn't get "trained" in the usual ML sense) to ONNX in hopes that we can pass them to a team using Java. We are still in the early stages of development. That is to say, if this could be done within a week, we would be ecstatic. If it takes a month, I think we would still work with it. If it took any longer, we would be looking for other solutions.
Really do appreciate your responsiveness @xadupre.
Additionally, during implementation, I would like to request that either negative_outlier_factor_
(https://github.com/scikit-learn/scikit-learn/blob/2beed55847ee70d363bdbfe14ee4401438fba057/sklearn/neighbors/_lof.py#L130) or score_samples()
(https://github.com/scikit-learn/scikit-learn/blob/2beed55847ee70d363bdbfe14ee4401438fba057/sklearn/neighbors/_lof.py#L460) be accessible as an output.
I exposed the methods predict and decision_function. The option score_samples can be set to True to add a third result from the score_samples method. I'll look into negative_outlier_factor_ once everything else is fixed. It would be great if you had time to test it before it gets merged.
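For reference, skl2onnx options are keyed by model instance; a minimal sketch of enabling the extra output (the model and input names here are illustrative):
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
# hypothetical fitted LocalOutlierFactor `model` with 3 features;
# score_samples=True adds a third output next to label and scores
onx = convert_sklearn(
    model,
    initial_types=[('input', FloatTensorType([None, 3]))],
    options={id(model): {'score_samples': True}})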
The first PR was merged. The converter is now available. About negative_outlier_factor_, it is a constant and I prefer not to add an option to expose it as a result. However, I created a more generic function which lets you add any constant into the graph and retrieve it as an output. See the example in PR #743.
The converter has been released on PyPI. Closing the issue; feel free to reopen.
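(Upgrading with `pip install --upgrade skl2onnx` should pick up the new converter.)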
@xadupre thanks for implementing this. Sorry I wasn't able to check it sooner, but I was able to get it working here as follows:
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor
import onnxruntime as rt
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
"""
Classifier setup
"""
# Construct an array to process through LOF
A = np.random.rand(25,3) # uniform on interval [0, 1)
A[-1] = A[-1]*-10 # this should make the last entry an outlier
# Define the classifier
classifier_lof = LocalOutlierFactor(contamination='auto', novelty=True)
# Use LOF to classify the data
classifier_lof.fit(A) # with novelty=True we fit first, then call predict separately
abm_mask = classifier_lof.predict(A) == -1 # -1 marks an outlier
scores = -1 * classifier_lof.negative_outlier_factor_ # this gives us the scores for the predictions
# we use `abm_mask` and `scores` to report findings
print('abnormal mask:')
print(abm_mask) # True means outlier
print('anomaly scores')
print(scores) # a larger number means more anomalous
"""
ONNX creation
"""
lof_types = [('float_input', FloatTensorType([None, A.shape[1]]))] # one float input with A's column count and a dynamic number of rows
options = {id(classifier_lof): {'score_samples': True}}
onx_lof = convert_sklearn(classifier_lof, initial_types=lof_types, options=options)
with open("lof.onnx", "wb") as f:
f.write(onx_lof.SerializeToString())
"""
ONNX use
"""
sesh_lof = rt.InferenceSession("lof.onnx")
# these are the outputs we can get from the model
cols = [o.name for o in sesh_lof.get_outputs()]
# label == predictions (-1 outlier, 1 inilier)
# scores == what was thresholded to make the predictions
# score_samples == negative raw anomaly score for a **new** point
input_name = sesh_lof.get_inputs()[0].name
C = [sesh_lof.run([name], {input_name: A.astype(np.float32)})[0] for name in cols]
# just pack in dataframe for nice viewing
pd.DataFrame(np.hstack(C), columns=cols)
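As an aside, onnxruntime can also return every output in one call by passing None as the output list; assuming the three outputs named above, this is equivalent:
# run(None, ...) returns all outputs in graph order, so the loop collapses to one call
label, lof_scores, lof_score_samples = sesh_lof.run(None, {input_name: A.astype(np.float32)})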
"""
Adding in `negative_outlier_factor_` - the anomaly scores for the training points
*if you score training points with `novelty=True`, you may get unexpected results*
"""
from skl2onnx.helpers.onnx_helper import add_output_initializer
new_onx = add_output_initializer(
onx_lof,
'negative_outlier_factor',
classifier_lof.negative_outlier_factor_.reshape(-1,1)) # reshape here so it stacks nicely later
with open("new_lof.onnx", "wb") as f:
f.write(new_onx.SerializeToString())
sesh_lof = rt.InferenceSession("new_lof.onnx")
input_name = sesh_lof.get_inputs()[0].name
cols = [out.name for out in sesh_lof.get_outputs()]
C = [sesh_lof.run([name], {input_name: A.astype(np.float32)})[0] for name in cols]
pd.DataFrame(np.hstack(C), columns=cols)
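A quick sanity check (illustrative, assuming the third output is indeed named score_samples) that the exported scores track scikit-learn up to float32 rounding:
# ONNX computes in float32, so compare with a loose tolerance
print(np.allclose(C[cols.index('score_samples')].ravel(),
                  classifier_lof.score_samples(A), atol=1e-4))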
While this works, our use case for LOF is a bit different. In practice we only use novelty=False, and then the constant negative_outlier_factor_ is the score for the training points. In this sense the training of our LOF models is also our classification step, which is quite different from the usual paradigm. That is usually:
data set --> model = trained_model
then
new data point --> trained_model = prediction and/or anomaly_score
Our LOF setup:
data set --> model = predictions and/or anomaly_scores
I think this might be possible with an onnxruntime.InferenceSession if it were possible to expose fit_predict(), i.e. in Python:
classifier_lof = LocalOutlierFactor(contamination='auto', novelty=False)
# Use LOF to classify the data
abm_mask = classifier_lof.fit_predict(A) == -1 # LOF does training and classification in the same step
scores = -1 * classifier_lof.negative_outlier_factor_ # this gives us the scores for the predictions
Note that with novelty=False, there won't be a score_samples() option.
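For what it's worth, scikit-learn enforces that restriction at the API level, so a quick check looks like:
# with novelty=False the novelty-only API is blocked
try:
    classifier_lof.score_samples(A)
except AttributeError as e:
    print(e)  # score_samples is only available when novelty=True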
Sorry for the delay, the training was initially out of scope. One way to solve your issue is to add a function that replaces a constant in the ONNX graph and to implement the training algorithm for LOF. But I'm still hesitating between implementing the training just for this model and designing a more generic API which could make a whole pipeline trainable.
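To make the first idea concrete, here is a rough sketch of what "replace a constant" could look like with the plain onnx API; the function name and initializer lookup are illustrative, not an existing skl2onnx API, and it assumes the constant is stored as a named initializer with a compatible shape:
import numpy as np
from onnx import numpy_helper

def replace_initializer(model, name, new_array):
    # swap the named initializer (a stored constant) for a new tensor in place
    for init in model.graph.initializer:
        if init.name == name:
            init.CopyFrom(numpy_helper.from_array(new_array.astype(np.float32), name=name))
            return model
    raise ValueError("no initializer named %r" % name)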
Either of those solutions sounds like it would work.
I have a very shallow understanding of ONNX, but I'll try to study some more on this to see if I can contribute somehow.
After poking around a bit, my feeling is that generic training would be good, but training for LOF would have to be implemented specifically using ONNX nodes.
@xadupre Is this correct thinking?
The training is specific to LOF. I did not have time to think about a generic design, but both can be handled separately.