Closed: amoldavsky closed this issue 3 years ago.
The JPMML prediction will work if you either:
- toggle `sublinear_tf = True` in `TfidfVectorizer` on line 37, or
- replace `SGDClassifier()` with `LogisticRegression()` on line 43.

There is surely some issue around `sublinear_tf`, in addition to TFIDF + SGDClassifier not doing well...
Am I doing something unsupported, or is there an actual bug here?
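For context, a minimal standalone sketch of what the `sublinear_tf` option changes (the function name `sublinear_scale` is hypothetical, not the repo's code): it replaces a raw term frequency `tf` with `1 + ln(tf)`, so counts grow logarithmically instead of linearly.

```python
import math

def sublinear_scale(tf):
    """Sublinear TF scaling, as done by TfidfVectorizer(sublinear_tf=True):
    replaces a raw term frequency tf with 1 + ln(tf)."""
    return 1.0 + math.log(tf)

# Raw counts grow linearly; sublinear-scaled counts grow logarithmically.
print(sublinear_scale(1))   # 1.0
print(sublinear_scale(10))  # ~3.30
```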
If one can solve a prediction mismatch issue by toggling a boolean attribute in Python model training code, then it indicates a problem with the JPMML conversion vertical.
Anyway, thanks for taking the time to identify the problematic attribute.
I can't recall implementing this attribute before. Perhaps it's some newer functionality (introduced into Scikit-Learn over the past year or so?). Will investigate.
> I can't recall implementing this attribute before.
Nah, I must be getting old, because I can't remember what I coded not that long ago: https://github.com/jpmml/jpmml-sklearn/blob/1.6.8/src/main/java/sklearn/feature_extraction/text/TfidfVectorizer.java#L88-L91
haha! Fair enough, it happens to all of us. Thank you for the super swift response; I was not expecting to see a reply for a while (as is usually the case with open-source projects, unless they are sponsored and have a huge community).
The question is what's going wrong. I am not really sure; I am still absorbing and familiarizing myself with the internals of your framework, but happy to dive into things / test if needed.
Both `SGDClassifier` and `LogisticRegression` are linear classifiers, which use the same PMML conversion code. The functional difference between the two is that `SGDClassifier` is non-probabilistic (it only gives a class label), whereas `LogisticRegression` is probabilistic (a class label plus the associated probability distribution). I would be more worried if you were reporting that `LogisticRegression` is predicting invalid probabilities.
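To illustrate that distinction, a toy sketch (hypothetical function names, not the Scikit-Learn API): both classifiers share the same linear decision function; the probabilistic one merely pushes it through a logistic link.

```python
import math

def decision(x, coef, intercept):
    # Shared linear part: w . x + b
    return sum(w * xi for w, xi in zip(coef, x)) + intercept

def predict_label(x, coef, intercept):
    # SGDClassifier-style: class label only
    return 1 if decision(x, coef, intercept) > 0 else 0

def predict_proba(x, coef, intercept):
    # LogisticRegression-style: probability distribution over the two classes
    p = 1.0 / (1.0 + math.exp(-decision(x, coef, intercept)))
    return (1.0 - p, p)

x, coef, b = [1.0, 2.0], [0.5, -0.25], 0.1
print(predict_label(x, coef, b))   # 1
print(predict_proba(x, coef, b))   # roughly (0.475, 0.525)
```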
I'm currently running my own tests using the Sentiment dataset; it will be more informative than pseudo-random text.
Makes sense!
The pseudo-random text is to keep the model training close to our original testing; it is an obfuscated sample of the real data. But I assume you would probably be adding a unit test for this case afterwards, so generalizing it to an open data set makes perfect sense.
Everything is fine when `sublinear_tf = False`.
If this attribute is toggled, then the original markup (centered around the `TextIndex` element) is surrounded with two `Apply` elements to convey the computation `ln(term_frequency(sentence, term)) + 1`:
```xml
<Apply function="+">
  <Apply function="ln">
    <TextIndex textField="document">
      <FieldRef field="term"/>
    </TextIndex>
  </Apply>
  <Constant dataType="double">1.0</Constant>
</Apply>
```
However, if the sentence does not contain a term, then the evaluation of the `TextIndex` element returns `0` (i.e. a zero count), which then trips the surrounding logarithm computation: `numpy.log(0)` yields `-infinity` as a result.
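The failure mode is easy to reproduce in isolation; one zero count poisons the result with `-inf` when the logarithm is applied unconditionally:

```python
import numpy as np

tf = np.array([0.0, 1.0, 3.0])  # term counts; the first term is absent

with np.errstate(divide="ignore"):  # numpy warns on log(0) otherwise
    naive = np.log(tf) + 1.0        # ln(tf) + 1, applied to every cell

print(naive)                  # [-inf  1.  2.09861229]
print(np.isneginf(naive[0]))  # True
```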
Scikit-Learn appears to work around this situation by using sparse data matrices: the frequency of an unrecognized term is `NaN` (not `0`), and `numpy.log` simply skips those cells.
The PMML markup for `sublinear_tf = True` should reflect Scikit-Learn's behaviour: the logarithm function should be applied only if the term was actually found in the sentence (i.e. count > 0).
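In other words, the generated expression should behave like this guarded version (a sketch of the intended semantics, not the actual generated markup):

```python
import math

def sublinear_tf_term(count):
    # Apply ln(count) + 1 only when the term was actually found;
    # an absent term contributes 0, matching Scikit-Learn's sparse behaviour.
    return math.log(count) + 1.0 if count > 0 else 0.0

print(sublinear_tf_term(0))  # 0.0 (no -inf)
print(sublinear_tf_term(3))  # ~2.0986
```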
@amoldavsky A quick tip about comparing Scikit-Learn and (J)PMML predictions.
Suppose you have training data in a `Sentiment.csv` text file. You then run a Scikit-Learn script which generates two new files: `SGDSentiment.pmml`, containing the PMML model, and `SGDSentiment.csv`, containing the Scikit-Learn predictions.
Now, the JPMML-Evaluator executable JAR file contains a special-purpose application called `org.jpmml.evaluator.example.TestingExample`, which asserts that `Sentiment.csv + SGDSentiment.pmml = SGDSentiment.csv`:
```
$ java -cp pmml-evaluator-example-executable-1.5.7.jar org.jpmml.evaluator.example.TestingExample --model SGDSentiment.pmml --input Sentiment.csv --expected-output SGDSentiment.csv --separator ","
```
It's easy to verify a million-row dataset this way; if there's even a single mismatching prediction (outside of the specified acceptance criteria), it will be printed to the console.
Got it! Will incorporate into our testing process.
BTW, another small thing: JPMML fails with CSV input where there is a single column header. This is why you see an `idx` column in my input file:
https://github.com/amoldavsky/jpmml-tfodf-SGDClassifier-bug/blob/master/jpmml-test-input.csv
If you remove it, leaving only a single column, you should expect an exception. I debugged it a while back and do not remember exactly the line where the exception was fired, but it was in JPMML's CSV parser code. If that does not ring a bell, I can produce an example.
Opened an issue on JPMML evaluator with link to example: https://github.com/jpmml/jpmml-evaluator/issues/212
Will fix this PMML markup generation issue in the next JPMML-SkLearn/SkLearn2PMML release, which is at least a week away.
In the meantime, simply try living without `sublinear_tf = True`.
Another thing you might try is transpiling TF-IDF based linear classifiers to Java bytecode using the JPMML-Transpiler library/command-line tool. Depending on the size of your TF-IDF vocabulary, this should give 10-100x performance improvement!
This is a big improvement we are talking about.
Wouldn't we hit the same issue with JPMML-Transpiler, because the issue is with jpmml-sklearn? Or is the idea to take a PMML file and convert it into compiled Java classes? I would still need to produce the PMML with something.
> This is a big improvement we are talking about.
In very simplified terms, with JPMML-Evaluator the prediction time scales linearly with the number of terms in the vocabulary, whereas with JPMML-Transpiler the prediction time scales linearly with the length of the sentence being scored.
The difference between these two approaches is not very noticeable when dealing with small vocabularies (say, fewer than 200 terms). It will be very noticeable with larger vocabularies (say, more than 10,000 terms).
The JPMML-Transpiler library converts your arbitrary-size TF-IDF classifier into a fixed size (~20 kB) Java class (all terms and their contributions/weights are stored in an external resources file). This class loads instantly (cf. parsing a 10 MB PMML XML file), and consumes very minimal amounts of memory.
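A toy sketch of why the scaling differs (hypothetical weights; both functions compute the same linear score): one loop visits every vocabulary term per prediction, the other only visits the tokens of the sentence being scored.

```python
weights = {"free": 1.5, "winner": 2.0, "hello": -0.5}  # term -> coefficient

def score_vocab_scan(sentence):
    # Evaluator-style: work proportional to the vocabulary size
    tokens = sentence.split()
    return sum(w * tokens.count(term) for term, w in weights.items())

def score_token_scan(sentence):
    # Transpiled-style: work proportional to the sentence length
    return sum(weights.get(token, 0.0) for token in sentence.split())

s = "hello free stuff"
print(score_vocab_scan(s), score_token_scan(s))  # 1.0 1.0
```

With a 3-term vocabulary the two are indistinguishable; with 10,000+ terms the vocabulary scan does orders of magnitude more work per sentence.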
> Wouldn't we hit the same issue with JPMML-Transpiler because the issue is with jpmml-sklearn?
Yes, it is essential to fix the PMML markup first, because everything downstream from here depends on the PMML markup being correct.
But I believe you can continue your testing work with `TfidfVectorizer.sublinear_tf = True` being unavailable for some time.
If the magnitude of term frequencies is really important, then there's a workaround of applying the log transform separately (e.g. using the `sklearn2pmml.preprocessing.ExpressionTransformer` pre-processing step).
I just tested: the `sublinear_tf` property worked well with the new model it was applied to! Thank you for taking care of it once again; you have given a good bump to our ROC scores 👍
We were training a SKlearn model to flag spammy and fraudulent conversations on our platform and came across an issue using TFIDF and the SGDClassifier. It seems to be a bug with either `jpmml-sklearn` or `jpmml-evaluator`.
Here is an example repo illustrating the problem: https://github.com/amoldavsky/jpmml-tfodf-SGDClassifier-bug
All frameworks and versions are defined in `requirements.txt`. For the JPMML jar I am using the latest release, 1.5.7, but I also tried a few versions down, with the same issue.
I am doing TF-IDF + SGDClassifier; when doing a sample prediction in Python code I get all inputs classified as 0s, while doing the same predictions via JPMML I get all 1s.
The test data used for both cases is in `jpmml-test-input.csv`.
Prediction in Python code: https://github.com/amoldavsky/jpmml-tfodf-SGDClassifier-bug/blob/master/train.py#L69
Prediction via JPMML:
The jar is a copy of your example from the README with a few lines of debug printout.