jpmml / sklearn2pmml

Python library for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

Add PMMLCountVectorizer and PMMLTfIdfVectorizer classes #74

Closed nejatb closed 3 years ago

nejatb commented 6 years ago

Hi Villu,

Following issue #68, we have been attempting to add support for a normalized TfidfVectorizer, since our TfidfVectorizer is part of the pipeline. The way I went about it is rather hacky: I created a Python class called NormalizedFeatureExtraction in which I copied the logic of Scikit-Learn's TfidfVectorizer, but I enforced normalization by removing the norm parameter from the init method (and the self.norm attribute) and instead normalizing the vectors wherever needed. However, to satisfy the needs of the Java side, I still return None for

    @property
    def norm(self):
        return None

and I kept your implementation of the JPMML TfidfVectorizer converter as its corresponding Java side. I wanted to ask if there is a better way to do this. And also, when making this change to the Python side, do I have to make any changes to the Java side as well? Is there any reason why the JPMML TfidfVectorizer converter throws an exception when norm is not None? Could having normalization the way I have added it, without any change to the Java side, potentially cause any logic problems? Many thanks for your clarifications.

vruusmann commented 6 years ago

Is there any reason why the JPMML TfidfVectorizer converter throws an exception when norm is not None?

This exception is thrown to inform you that normalization in general, and the "norm" attribute in particular, is not supported at the moment. If the converter simply ignored this fact, then the PMML representation of the Scikit-Learn pipeline wouldn't be exact/correct, and you'd be getting mismatching predictions later on.

I remember that I have discussed the background of this "not supported" decision someplace else before. In brief, Scikit-Learn calculates term frequencies all in one go, whereas (J)PMML calculates them one by one, as they are needed (kind of lazy evaluation approach). It would be terribly inefficient to perform normalization over the complete vocabulary - (J)PMML would need to invoke term frequency calculation both for "active" and "inactive" terms. In most cases, the majority of terms fall into the "inactive" category.
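To make the difference concrete, here is a simplified, hypothetical sketch (not JPMML's actual implementation): the eager approach must compute a term frequency for every vocabulary entry before it can normalize over the full vector, whereas the lazy approach only touches the terms that the downstream model actually references.

```python
from collections import Counter

# Hypothetical 5-term vocabulary; real vocabularies hold tens of thousands
# of terms, most of which any given model never references.
vocabulary = ["alpha", "beta", "gamma", "delta", "epsilon"]
document = "alpha beta alpha".split()

# Eager (Scikit-Learn style): compute frequencies for the whole vocabulary,
# which is a prerequisite for normalizing over the complete vector.
eager_tf = {term: document.count(term) for term in vocabulary}

# Lazy ((J)PMML style): compute a frequency only when a term is requested.
counts = Counter(document)
def lazy_tf(term):
    return counts[term]

# A model that references only two "active" terms triggers two lookups,
# instead of five (or fifty thousand) up-front computations.
active_terms = ["alpha", "gamma"]
lazy_values = {term: lazy_tf(term) for term in active_terms}
print(eager_tf)     # frequencies for all 5 terms
print(lazy_values)  # frequencies for the 2 active terms only
```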

You should consider rearranging your Scikit-Learn pipeline to perform normalization after you've identified which terms are "active":

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

pipeline = Pipeline([
  ("generate terms", TfidfVectorizer()), # All terms
  ("select terms", SelectKBest(k = 500)), # Identify/select 500 "active" terms
  ("normalize terms", Normalizer()), # Perform normalization over the subset of 500 "active" terms
  ("classifier", RandomForestClassifier())
])

Would the explicit normalization using the sklearn.preprocessing.Normalizer transformation work for you?

Some estimator types contain "internal" normalization logic, so the explicit normalization step might be unnecessary/redundant.
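For reference, sklearn.preprocessing.Normalizer with norm = "l2" rescales each sample row to unit Euclidean length. A plain-Python sketch of that per-row computation:

```python
import math

def l2_normalize(row):
    """Rescale a row vector to unit Euclidean length, which is what
    sklearn.preprocessing.Normalizer(norm = "l2") does per sample."""
    norm = math.sqrt(sum(value * value for value in row))
    if norm == 0.0:  # leave all-zero rows untouched, as Normalizer does
        return list(row)
    return [value / norm for value in row]

row = [3.0, 4.0]
print(l2_normalize(row))  # [0.6, 0.8]
```

Placed after SelectKBest, this computation runs over 500 values per sample rather than over the complete vocabulary.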

And also, when making this change to the Python side, do I have to make any changes to the Java side as well?

The Java side must reproduce the prediction logic of the Scikit-Learn/Python side.

In other words, if some Python class attribute is significant from the prediction logic perspective, then the converter must look it up, and generate appropriate PMML markup for it.

Could having normalization the way I have added it, without any change to the Java side, potentially cause any logic problems?

When in doubt, compare Scikit-Learn predictions with (J)PMML predictions over a representative data set.
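A minimal sketch of such a check, assuming you can obtain both sets of predictions as equal-length sequences (e.g. from Pipeline.predict on the Python side and from a (J)PMML evaluator on the Java side):

```python
def predictions_match(sklearn_preds, pmml_preds, tol=1e-6):
    """Compare two prediction sequences elementwise: numeric values are
    compared with a tolerance, class labels with plain equality."""
    if len(sklearn_preds) != len(pmml_preds):
        return False
    for a, b in zip(sklearn_preds, pmml_preds):
        if isinstance(a, float) or isinstance(b, float):
            if abs(a - b) > tol:
                return False
        elif a != b:
            return False
    return True

print(predictions_match([0.1, 0.9], [0.1000001, 0.9]))   # True
print(predictions_match(["yes", "no"], ["yes", "yes"]))  # False
```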

vruusmann commented 6 years ago

I refactored this issue to address a more generic usability concern.

Namely, it's rather tricky to get the parametrization of Scikit-Learn's CountVectorizer and TfidfVectorizer classes correct - there are a number of constructor parameters (e.g. norm, tokenizer) that only accept fixed values.

The solution would be to introduce PMMLCountVectorizer and PMMLTfidfVectorizer subclasses, respectively, that automatically fill in the "fixed value" arguments and only query the end user for the "variable value" arguments.
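A minimal sketch of what such a subclass might look like (hypothetical, not the eventual sklearn2pmml implementation; the choice of exposed parameters is illustrative): it declares only the "variable value" parameters and pins the "fixed value" ones in the call to the superclass constructor:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

class PMMLTfidfVectorizer(TfidfVectorizer):
    """Hypothetical convenience subclass: pins constructor arguments that
    must hold fixed values for PMML conversion, and exposes only the
    freely configurable ones."""

    def __init__(self, max_features=None, stop_words=None, ngram_range=(1, 1)):
        super().__init__(
            max_features=max_features,
            stop_words=stop_words,
            ngram_range=ngram_range,
            norm=None,       # fixed: (J)PMML does not support normalization
            lowercase=True,  # fixed: illustrative "fixed value" argument
        )

vectorizer = PMMLTfidfVectorizer(max_features=500)
print(vectorizer.norm)  # None - the end user cannot override it
```

Note that only the parameters declared in the subclass's __init__ signature show up in get_params(), so cloning and grid search see the reduced parameter set.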