jpmml / jpmml-sklearn

Java library and command-line application for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

Invalid markup for `TfIdfVectorizer.sublinear_tf = True` attribute #157

Closed amoldavsky closed 3 years ago

amoldavsky commented 3 years ago

We were training a SKlearn model to flag spammy and fraudulent conversations on our platform and came across an issue using TFIDF and the SGDClassifier.

It seems to be a bug either with jpmml-sklearn or with jpmml-evaluator.

Here is an example repo illustrating the problem: https://github.com/amoldavsky/jpmml-tfodf-SGDClassifier-bug

All frameworks and versions are defined in requirements.txt. For the JPMML jar I am using the latest release, 1.5.7, but I also tried a few versions down with the same issue.

I am doing TF-IDF + SGDClassifier. When doing a sample prediction in Python code, all inputs are classified as 0, while doing the same predictions via JPMML I get all 1s.

The test data used for both cases is in jpmml-test-input.csv

Prediction in Python code: https://github.com/amoldavsky/jpmml-tfodf-SGDClassifier-bug/blob/master/train.py#L69

Prediction via JPMML:

./jpmml-test.sh

The jar is a copy of your example from the README with a few lines of debug printout.

amoldavsky commented 3 years ago

The JPMML prediction will work if you do:

  1. comment out sublinear_tf = True in TfidfVectorizer on line 37
  2. replace SGDClassifier() with LogisticRegression() on line 43
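For context, the failing configuration can be sketched roughly like this (illustrative texts and labels, not the exact code from the repo):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Failing configuration: sublinear_tf=True combined with SGDClassifier.
# Prediction via JPMML starts working after switching sublinear_tf off
# and/or swapping in LogisticRegression.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(sublinear_tf=True)),
    ("classifier", SGDClassifier(random_state=13)),
])

texts = ["free money click here now", "let us schedule the quarterly review"]
labels = [1, 0]
pipeline.fit(texts, labels)
print(pipeline.predict(texts))
```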

There is surely some issue around sublinear_tf, in addition to TFIDF + SGDClassifier not playing well together...

Am I doing something unsupported or is there an actual bug here?

vruusmann commented 3 years ago

If one can solve a prediction mismatch issue by toggling a boolean attribute in Python model training code, then it indicates a problem with the JPMML conversion vertical.

Anyway, thanks for taking the time to identify the problematic attribute.

I can't recall implementing this attribute before. Perhaps it's some newer functionality (introduced into Scikit-Learn over the past year or so?). Will investigate.

vruusmann commented 3 years ago

I can't recall implementing this attribute before.

Nah, I must be getting old, because I can't remember what I coded not that long ago: https://github.com/jpmml/jpmml-sklearn/blob/1.6.8/src/main/java/sklearn/feature_extraction/text/TfidfVectorizer.java#L88-L91

amoldavsky commented 3 years ago

haha! Fair enough, it happens to all of us. Thank you for the super swift response; I was not expecting to see a reply for a while (as is usually the case with open-source projects, unless they are sponsored and have a huge community).

As for what's actually going wrong, I am not really sure. I am still absorbing and familiarizing myself with the internals of your framework, but I am happy to dive into things / test if needed.

vruusmann commented 3 years ago

Both SGDClassifier and LogisticRegression are linear classifiers, which use the same PMML conversion code.

The functional difference between the two is that SGDClassifier is non-probabilistic (only gives a class label), whereas LogisticRegression is probabilistic (class label plus the associated probability distribution).
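This difference can be observed directly in scikit-learn: SGDClassifier only exposes predict_proba when it is configured with a probabilistic loss (a quick sanity check, assuming a recent scikit-learn version):

```python
from sklearn.linear_model import LogisticRegression, SGDClassifier

# LogisticRegression always exposes predict_proba:
print(hasattr(LogisticRegression(), "predict_proba"))

# SGDClassifier with the default hinge loss does not; it only gains
# predict_proba with a probabilistic loss such as "modified_huber":
print(hasattr(SGDClassifier(loss="hinge"), "predict_proba"))
print(hasattr(SGDClassifier(loss="modified_huber"), "predict_proba"))
```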

I would be more worried if you were reporting that LogisticRegression is predicting invalid probabilities.

I'm currently running my own tests using the Sentiment dataset. It will be more informative than pseudo-random text.

amoldavsky commented 3 years ago

Makes sense!

The pseudo-random text is to keep the model training close to our original testing. It is an obfuscated sample of the real data. But I assume you would probably be adding a unit test for this case afterwards, so generalizing it to an open data set makes perfect sense.

vruusmann commented 3 years ago

Everything is fine when sublinear_tf = False.

If this attribute is toggled, then the original markup (centered around the TextIndex element) is surrounded with two Apply elements to convey the computation ln(term_frequency(sentence, term)) + 1:

<Apply function="+">
    <Apply function="ln">
        <TextIndex textField="document">
            <FieldRef field="term"/>
        </TextIndex>
    </Apply>
    <Constant dataType="double">1.0</Constant>
</Apply>

However, if the sentence does not contain a term, then the evaluation of the TextIndex element returns 0 (i.e. a zero count), which then trips the surrounding logarithm computation.

numpy.log(0) yields -infinity as a result.

Scikit-Learn appears to work around this situation by using sparse data matrices: the frequency of an unrecognized term is simply not stored (rather than being stored as 0), and numpy.log is applied only to the stored non-zero cells.
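This skip-the-zeros behaviour is easy to demonstrate with a sparse matrix (a minimal sketch of the sublinear transform, not the actual TfidfTransformer code):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Term-count matrix; the zero cell stands for a term absent from a document.
counts = csr_matrix(np.array([[3.0, 0.0], [1.0, 2.0]]))

# Dense ln(x) + 1 trips over the zero cell:
with np.errstate(divide="ignore"):
    dense = np.log(counts.toarray()) + 1.0
print(dense)  # the absent-term cell becomes -inf

# The sparse route only touches the stored (non-zero) entries,
# so absent terms stay at 0 instead of becoming -inf:
sublinear = counts.copy()
sublinear.data = np.log(sublinear.data) + 1.0
print(sublinear.toarray())
```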

vruusmann commented 3 years ago

The PMML markup for sublinear_tf = True should reflect Scikit-Learn's behaviour: the logarithm function should be applied only if the term was actually found in the sentence (i.e. count > 0).
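One way to express such a guard is to wrap the computation into PMML's built-in if function (a sketch of the idea, not necessarily the exact markup that the fixed converter will emit):

```xml
<Apply function="if">
    <Apply function="greaterThan">
        <TextIndex textField="document">
            <FieldRef field="term"/>
        </TextIndex>
        <Constant dataType="double">0.0</Constant>
    </Apply>
    <!-- then: ln(term_frequency) + 1 -->
    <Apply function="+">
        <Apply function="ln">
            <TextIndex textField="document">
                <FieldRef field="term"/>
            </TextIndex>
        </Apply>
        <Constant dataType="double">1.0</Constant>
    </Apply>
    <!-- else: an absent term contributes 0 -->
    <Constant dataType="double">0.0</Constant>
</Apply>
```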

vruusmann commented 3 years ago

@amoldavsky Quick tip about comparing Scikit-Learn and (J)PMML predictions.

Suppose you have training data in Sentiment.csv text file. You then run a Scikit-Learn script which generates two new files - SGDSentiment.pmml containing the PMML model and SGDSentiment.csv containing Scikit-Learn predictions.

Now, the JPMML-Evaluator executable JAR file contains a special-purpose application called org.jpmml.evaluator.example.TestingExample, which asserts that Sentiment.csv + SGDSentiment.pmml = SGDSentiment.csv:

$ java -cp pmml-evaluator-example-executable-1.5.7.jar org.jpmml.evaluator.example.TestingExample --model SGDSentiment.pmml --input Sentiment.csv --expected-output SGDSentiment.csv --separator ","

It's easy to verify a million-row dataset this way; if there's even a single mismatching prediction (outside of the specified acceptance criteria), it will be printed to the console.

amoldavsky commented 3 years ago

Got it! Will incorporate into our testing process.

BTW another small thing: JPMML fails with CSV input where there is only a single column. This is why you see an idx column in my input file: https://github.com/amoldavsky/jpmml-tfodf-SGDClassifier-bug/blob/master/jpmml-test-input.csv

If you remove it, leaving only a single column, you should expect an exception. I debugged it a while back and do not remember exactly the line where the exception was fired, but it was in JPMML's CSV parser code. If that does not ring a bell, I can produce an example.

amoldavsky commented 3 years ago

Opened an issue on JPMML evaluator with link to example: https://github.com/jpmml/jpmml-evaluator/issues/212

vruusmann commented 3 years ago

Will fix this PMML markup generation issue in the next JPMML-SkLearn/SkLearn2PMML release, which is at least a week away.

In the meantime, simply try living without sublinear_tf = True.

Another thing you might try is transpiling TF-IDF based linear classifiers to Java bytecode using the JPMML-Transpiler library/command-line tool. Depending on the size of your TF-IDF vocabulary, this should give 10-100x performance improvement!

amoldavsky commented 3 years ago

This is a big improvement we are talking about.

Wouldn't we hit the same issue with JPMML-Transpiler, because the issue is with jpmml-sklearn? Or is the idea to take a PMML file and convert it into compiled Java classes? I would still need to produce the PMML with something.

vruusmann commented 3 years ago

This is a big improvement we are talking about.

In very simplified terms, with JPMML-Evaluator the prediction time scales linearly with the number of terms in the vocabulary, whereas with JPMML-Transpiler the prediction time scales linearly with the length of the sentence being scored.

The difference between these two approaches is not very noticeable when dealing with small vocabularies (say, less than 200 terms). The difference will be very noticeable with larger vocabularies (say, more than 10'000 terms).

The JPMML-Transpiler library converts your arbitrary-size TF-IDF classifier into a fixed-size (~20 kB) Java class (all terms and their contributions/weights are stored in an external resources file). This class loads instantly (cf. parsing a 10 MB PMML XML file) and consumes a minimal amount of memory.

Wouldn't we hit the same issue with JPMML-Transpiler because the issue is with pmml-sklearn?

Yes, it's essential to fix the PMML markup first, because everything downstream from here depends on the PMML markup being correct.

But I believe you can continue your testing work with TfIdfVectorizer.sublinear_tf = True being unavailable for some time.

If the magnitude of term frequencies is really important, then there's a workaround of applying the log transform separately (e.g. using the sklearn2pmml.preprocessing.ExpressionTransformer pre-processing step).

amoldavsky commented 3 years ago

I just tested it: the sublinear_tf property worked well with the new model it was applied to! Thank you for taking care of this once again; you have given a good bump to our ROC scores 👍