jpmml / sklearn2pmml

Python library for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

Get attribute of PCA object and custom predict function #77


bbzzzz commented 6 years ago

Hi Villu,

I am building an anomaly detection classifier based on PCA.

I need to

1) Extract singular values (lambdas)

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

standard_scaler = StandardScaler()
centered_training_data = standard_scaler.fit_transform(train)
pca = PCA()
pca.fit(centered_training_data)
lambdas = pca.singular_values_

2) Calculate distance vectors based on the PCA-transformed data and the eigenvalues

import numpy as np

from sklearn.base import TransformerMixin

class CalcDist(TransformerMixin):

    def __init__(self, lambdas):
        self.lambdas = lambdas

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Scale the squared PCA scores by the singular values
        dist = X * X / self.lambdas
        return dist

3) Take the first q and the last r elements of the distance vectors, transform them, and output the result

class PCC(TransformerMixin):

    def __init__(self, q, r):
        self.q = q
        self.r = r

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Sum of the first q components, and of the components from index r onwards
        major_comp = np.sum(X[:, :self.q], axis=1)
        minor_comp = np.sum(X[:, self.r:], axis=1)
        # column_stack (rather than dstack) yields the expected (n_samples, 2) shape
        return np.column_stack((major_comp, minor_comp))

I would like to use PCC as my classifier. My PMMLPipeline would be:

mapper = DataFrameMapper([
    (list(train), [ContinuousDomain(), StandardScaler(), PCA(), CalcDist(lambdas)])
])
pipeline = PMMLPipeline([("mapper", mapper), ("classifier", PCC(q, r))])

Alternatively, PCC can be moved into the mapper and connected to a DecisionTreeClassifier():

mapper = DataFrameMapper([
    (list(train), [ContinuousDomain(), StandardScaler(), PCA(), CalcDist(lambdas), PCC(q, r)])
])
pipeline = PMMLPipeline([("mapper", mapper), ("classifier", DecisionTreeClassifier())])

Would any of this be possible?

Thanks, Bohan

vruusmann commented 6 years ago

Would any of this be possible?

Everything seems possible/doable.

If the pipeline is simplified a bit, then it should be possible to implement custom PMML converters for CalcDist and PCC classes (as exemplified by the SkLearn2PMML-Plugin project).

Calculate distance vectors based on the PCA-transformed data and the eigenvalues

It's technically difficult to transfer the PCA.singular_values_ attribute value from one pipeline step to another. Therefore, you should make the CalcDist class a subclass of PCA:

from sklearn.decomposition import PCA

class CalcDist(PCA):

    def transform(self, X):
        # Project into PCA space first, then scale by the singular values
        Xt = super(CalcDist, self).transform(X)
        dist = Xt * Xt / self.singular_values_
        return dist

Or, if you don't want to subclass PCA directly, then keep a PCA object as a pca_ attribute, and interact with it directly in the CalcDist.fit(X) and CalcDist.transform(X) methods, as sketched below.
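A minimal sketch of this composition-based alternative (the n_components constructor parameter is an assumption, not something prescribed above):

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import PCA

class CalcDist(BaseEstimator, TransformerMixin):

    def __init__(self, n_components=None):
        self.n_components = n_components

    def fit(self, X, y=None):
        # The trailing underscore marks a fitted attribute, per Scikit-Learn conventions
        self.pca_ = PCA(n_components=self.n_components)
        self.pca_.fit(X)
        return self

    def transform(self, X):
        Xt = self.pca_.transform(X)
        return Xt * Xt / self.pca_.singular_values_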

This class could then be renamed to something like SVDist as well?

I would like to use PCC as my classifier.

Please excuse my ignorance, but how should the output of PCC.transform(X) be interpreted (in terms of anomaly score)? A sample is more anomalous if the difference between the first component ("q" component) and the second component ("r" component) is greater?

I'm asking this, because I'd like to better understand how to encode the PCC class using one of the top-level PMML model elements.

bbzzzz commented 6 years ago

Hi Villu,

Thank you for your reply.

Here I am trying to implement the algorithm proposed in this paper.

Technically speaking, dist = X * X / self.singular_values_ is not a distance. It is a p-dimensional vector, where p is the number of singular_values_.

major_component is the sum of the first q elements of dist; minor_component is the sum of the last r elements of dist.

major_component and minor_component will both be used as scores. If either of the two is greater than its threshold, we predict a sample as anomalous.

If connected with a DecisionTreeClassifier(), major_component and minor_component will be used as features.
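A sketch of that decision rule, assuming a fitted pipeline whose final step outputs the two component scores as columns; the threshold names c_q and c_r are hypothetical:

c_q, c_r = 10.0, 5.0  # hypothetical thresholds, to be chosen e.g. on validation data
scores = pipeline.predict(X_test)  # shape (n_samples, 2): major and minor components
is_anomaly = (scores[:, 0] > c_q) | (scores[:, 1] > c_r)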

Does this make sense?

vruusmann commented 6 years ago

I am trying to implement the algorithm proposed in this paper.

Thanks for the reference - now I can relate to your idea more closely.

In principle, "PCC" stands for "Principal Component Classifier". The first outlier category ("q") represents instances that are outliers with respect to one or more of the original variables. The second outlier category ("r") represents instances that are inconsistent with the correlation structure of the data, but are not outliers with respect to the original variables.

The PCC would be a regression-type model, because it outputs two numeric scores. Do you know the "q" and "r" threshold values at the time of training and exporting the model? If so, then we could turn PCC into a classification-type model, which would output two booleans instead (e.g. "is_outlier(q)" and "is_outlier(r)").

Anyway, from the API perspective, all this logic could be captured into one Scikit-Learn class:

import numpy as np

from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.decomposition import PCA

class PCC(BaseEstimator, RegressorMixin):

    def __init__(self, n, q, r):
        self.n = n
        self.q = q
        self.r = r

    def fit(self, X, y=None):
        # Fitted attributes are created in fit, per Scikit-Learn conventions
        self.pca_ = PCA(n_components=self.n)
        self.pca_.fit(X)
        return self

    def predict(self, X):
        # Project into PCA space, then scale the squared scores by the singular values
        Xt = self.pca_.transform(X)
        dist = Xt * Xt / self.pca_.singular_values_
        major_comp = np.sum(dist[:, :self.q], axis=1)
        minor_comp = np.sum(dist[:, self.r:], axis=1)
        return np.column_stack((major_comp, minor_comp))
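For the classification-type variant mentioned earlier, a minimal sketch could subclass the PCC above; the threshold parameters c_q and c_r are hypothetical names, assumed to be known at training time:

class PCCClassifier(PCC):

    def __init__(self, n, q, r, c_q, c_r):
        super(PCCClassifier, self).__init__(n, q, r)
        self.c_q = c_q
        self.c_r = c_r

    def predict(self, X):
        scores = super(PCCClassifier, self).predict(X)
        # Boolean outputs: "is_outlier(q)" and "is_outlier(r)"
        return np.column_stack((scores[:, 0] > self.c_q, scores[:, 1] > self.c_r))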

The above code violates some of Scikit-Learn's API conventions, because the PCC.predict(X) method returns a 2-d array (whereas most regressors return a 1-d array). It should be possible to work around this somehow (by inheriting from a different base class?), because Scikit-Learn already provides several multi-output classifiers and regressors.

I want to encapsulate everything (PCA fitting and distance calculation) into one Python class, because this way my PMML converter can see and analyze all the information together, and generate the most compact and efficient PMML representation possible. For example, I've got a feeling that the PCC prediction logic can be mapped directly to the RegressionTable element (see http://dmg.org/pmml/v4-3/Regression.html#xsdElement_RegressionTable).

vruusmann commented 6 years ago

Using the above "all-in-one" PCC class, the pipeline would be simplified to the following:

from sklearn.preprocessing import StandardScaler
from sklearn2pmml import PMMLPipeline
from sklearn2pmml.decoration import ContinuousDomain
from sklearn_pandas import DataFrameMapper

pipeline = PMMLPipeline([
    ("mapper", DataFrameMapper([
        (df_X.columns.values, [ContinuousDomain(), StandardScaler()])
    ])),
    ("pcc", PCC(n, q, r))
])
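A hypothetical end-to-end usage sketch, assuming a training frame df_X (no y is needed, since PCC.fit ignores it):

from sklearn2pmml import sklearn2pmml

pipeline.fit(df_X)
sklearn2pmml(pipeline, "PCC.pmml")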