jpmml / jpmml-evaluator

Java Evaluator API for PMML
GNU Affero General Public License v3.0

SelectKBest leading to Logistic Regression probability discrepancies between scikit-learn and jpmml-evaluator #93

Closed johncliu closed 6 years ago

johncliu commented 6 years ago

Similar to #82, I noticed a sizable inconsistency when I incorporated SelectKBest feature selection with the Logistic Regression classifier and altered the number of features k.

I'm using the following sklearn snippet:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn2pmml import sklearn2pmml, PMMLPipeline
from sklearn2pmml.feature_extraction.text import Splitter

vectorizer = CountVectorizer(ngram_range=(1, 1), token_pattern=None, tokenizer=Splitter())
feature_selector = SelectKBest(chi2,k=95)
classifier = LogisticRegression()
pipeline = PMMLPipeline([
    ("vectorizer", vectorizer),
    ("feature_selector", feature_selector),
    ("classifier", classifier)
])
pipeline.fit(x, y)
sklearn2pmml(pipeline, "pipeline.pmml", with_repr=True)

and the following jpmml-evaluator snippet (Scala):

import org.jpmml.evaluator.ModelEvaluatorFactory
import scala.collection.JavaConverters._

val pmml = org.jpmml.model.PMMLUtil.unmarshal(new java.io.FileInputStream("pipeline.pmml"))
val pipeline = ModelEvaluatorFactory.newInstance.newModelEvaluator(pmml)
val inputName = pipeline.getInputFields.get(0).getName

val cleanText = "  bathroom is clean..... now on to more enjoyable tasks......"
val outputMap = pipeline.evaluate(Map(inputName -> cleanText).asJava).asScala

For the sentence above, the class 1 predictions for different values of k are:

k     sklearn   pmml
90    0.3987    0.3737
95    0.3921    0.3488
100   0.3891    0.3891
105   0.3859    0.6257
110   0.3830    0.8382

When I run the pipeline without feature selection, the results match perfectly. I ran this across multiple datasets and saw the same strange behavior each time. Below is a 100-line training file (extracted from the UofMich Sentiment Analysis Challenge corpus on Kaggle) that I used to generate the results above:

sentiment,text
0,                     is so sad for my APL friend.............
0,                   I missed the New Moon trailer...
1,              omg its already 7:30 :O
0,          .. Omgaga. Im sooo  im gunna CRy. I've been at this dentist since 11.. I was suposed 2 just get a crown put on (30mins)...
0,         i think mi bf is cheating on me!!!       T_T
0,         or i just worry too much?        
1,       Juuuuuuuuuuuuuuuuussssst Chillin!!
0,       Sunny Again        Work Tomorrow  :-|       TV Tonight
1,      handed in my uniform today . i miss you already
1,      hmmmm.... i wonder how she my number @-)
0,      I must think about positive..
1,      thanks to all the haters up in my face all day! 112-102
0,      this weekend has sucked so far
0,     jb isnt showing in australia any more!
0,     ok thats it you win.
0,    <-------- This is the way i feel right now...
0,"    awhhe man.... I'm completely useless rt now. Funny, all I can do is twitter. http://myloc.me/27HX"
1,    Feeling strangely fine. Now I'm gonna go listen to some Semisonic to celebrate
0,    HUGE roll of thunder just now...SO scary!!!!
0,    I just cut my beard off. It's only been growing for well over a year. I'm gonna start it over. @shaunamanu is happy in the meantime.
0,    Very sad about Iran.
0,    wompppp wompp
1,    You're the only one who can see this cause no one else is following me this is for you because you're pretty awesome
0,   <---Sad level is 3. I was writing a massive blog tweet on Myspace and my comp shut down. Now it's all lost *lays in fetal position*
0,   ...  Headed to Hospitol : Had to pull out of the Golf Tourny in 3rd place!!!!!!!!!!! I Think I Re-Ripped something !!! Yeah THAT !!
0,   BoRinG   ): whats wrong with him??     Please tell me........   :-/
0,   can't be bothered. i wish i could spend the rest of my life just sat here and going to gigs. seriously.
0,"   Feeeling like shit right now. I really want to sleep, but nooo I have 3 hours of dancing and an art assignment to finish. "
1,"   goodbye exams, HELLO ALCOHOL TONIGHT "
0,   I didn't realize it was THAT deep. Geez give a girl a warning atleast!
0,   I hate it when any athlete appears to tear an ACL on live television.
0,   i miss you guys too     i think i'm wearing skinny jeans a cute sweater and heels   not really sure   what are you doing today
0,  -- Meet your Meat http://bit.ly/15SSCI
0,   My horsie is moving on Saturday morning.
0,   No Sat off...Need to work 6 days a week 
0,   Really Dont Like Doing my Room Its So Boring  Sick Of Doing My Wardrobe Out Cant Waiit Till I Have My Walk In One  Yay
0,"   SOX!     Floyd was great, but relievers need a scolding!"
0,   times by like a million
1,   uploading pictures on friendster 
0,   what type of a spaz downloads a virus? my brother that's who :\ MSN is now fucked forever    :'(
0,  &&Fightiin Wiit The Babes...
1,  (: !!!!!! - so i wrote something last week. and i got a call from someone in the new york office... http://tumblr.com/xcn21w6o7
0,  *enough said*
1,"  ... Do I need to even say it?  Do I?  Well, here I go anyways:  CHRIS CORNELL IN CHICAGO!  ... TONIGHT!    "
1,  ... health class (what a joke!)
1,  @ginaaa <3 GO TO THE SHOW TONIGHT
0,  @Spiral_galaxy @YMPtweet  it really makes me sad when i look at Muslims reality now
0, - All Time Low shall be my motivation for the rest of the week.
0,"  and the entertainment is over, someone complained properly..   @rupturerapture experimental you say? he should experiment with a melody"
0,  another year of Lakers .. That's neither magic nor fun ...
0,  baddest day eveer. 
1,  bathroom is clean..... now on to more enjoyable tasks......
1,  boom boom pow
0,  but i'm proud.
0,  congrats to helio though
0,  David must be hospitalized for five days end of July (palatine tonsils). I will probably never see Katie in concert. 
0,  friends are leaving me 'cause of this stupid love  http://bit.ly/ZoxZC
1,  go give ur mom a hug right now. http://bit.ly/azFwv
1,  Going To See Harry Sunday Happiness 
0,  Hand quilting it is then...
0,  hate u ...  leysh t9ar5 ... =((((((( ..
1,-  I always get what I want
1,  I bend backwards  
1,  i get off work sooooon! i miss cody booo. haven't seen him in foreverr!
1,  I hate allergies. Should I get my hair cut tomorrow? I'm taking a public poll...
0, - I love you guys so much that it hurts. http://tumblr.com/xkh1z19us
0,  I miss Earl
0,  I miss New Jersey
0,"  I missed the first hour of SYTYCD last night, and I can't find it online!"
0,  I need a U2 fix NOW!
0,  I never thought I'd become second choice...
0,  I think I may be too friendly...lol... o well...
0,  I think Manuel (my Basil plant) only has days to live   
0,  I wanna be at home @ church...I wonder wht they are doing?
0,  i wanna make my own pizza
0,"  i want a 120gb harddrive, or a 37 inch tv, or a new guitar.  anyonefeeling generous?  =p   x"
0,  i want a hug
0,  I want Miley to tour Australia
0,  I wanted to sleep in this morning but a mean kid through a popsicle stick at me head. I wish I could fly away like those squirrels
0,  i was too slow to get $1 Up tix
0,"  I will send sunshine to Northern Ireland, are you going swimming today @kezbat"
0,  I wish I could go to T4 On The Beach :'(    Would be great to see @Shontelle_Layne & @DanMerriweather   
0,"  i would be so much happier if the walls of my bedroom were painted white,"
0,  idk wat 2 do who can i trust me im sorry 4 all da pain i have caused nebody ima take dis time out 2 straighten myself out i luv yall
0,  I'm finding the intercept slope..and banging my head against the wall..Math brain heads come save me
1,  I'm really going to bed now...
0,  im sick  'cough cough'
0,  in cab headed to the airport!  going home.... <christy>
0,  In case I feel emo in camp (feeling a wee bit of it alr)...am bringing in the Human Rights Watch World Report 2009..hope it'll work
1,  Jin has a twitter.
0,  jonas day is almost over... 
0,  Jus Got Hom Fr. TDa Funeral... I'm So Sad! I Cried So Much Times! Much Love Grandpa!<3 I Never Got To Say My Last "Goodbye" to Him.
1,  just gonna smile...cuz it is what it is..and im not sure what more they could want..
1,"  Just got home, and I got to see my friend Zahra whom I haven't seen since We graduated!!!  That makes me so happy."
0, - Longest night ever.. ugh! http://tumblr.com/xwp1yxhi6
0,  mi momacita won't let me go to my bf's bball game!!! grrr!!!
0,  Mom says I have to get a new phone IMMEDIATELY....off to T-Mobile.  she paying....
0,  My new car was stolen....by my mother who wanted to go pose at church.
0,"  no hang out with the girls 2day. 2moro, hope so......"

which I read with this Python snippet:

import pandas as pd

# The file shown above has two columns; skiprows=1 drops its header row.
data = pd.read_csv("snippet.csv", sep=',', skiprows=1,
                   quotechar='"', names=['sentiment', 'text'])

y = data["sentiment"].head(100)
x = data["text"].head(100)

johncliu commented 6 years ago

My setup for reference:

python:  3.5.4
sklearn:  0.18.1
sklearn.externals.joblib: 0.10.3
pandas:  0.21.1
sklearn_pandas:  1.6.0
sklearn2pmml:  0.28.0

java -cp /root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/guava-20.0.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/istack-commons-runtime-3.0.5.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jaxb-core-2.3.0.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jaxb-runtime-2.3.0.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jcommander-1.48.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jpmml-converter-1.2.6.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jpmml-lightgbm-1.1.3.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jpmml-sklearn-1.4.2.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jpmml-xgboost-1.2.4.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/pmml-agent-1.3.8.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/pmml-model-1.3.8.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/pmml-model-metro-1.3.8.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/pmml-schema-1.3.8.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/pyrolite-4.19.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/serpent-1.18.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/slf4j-api-1.7.25.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/slf4j-jdk14-1.7.25.jar org.jpmml.sklearn.Main --pkl-pipeline-input /tmp/pipeline-buwtw3w9.pkl.z --pmml-output pipeline.pmml

vruusmann commented 6 years ago

> When I run the pipeline without feature selection, the results match perfectly.

Very interesting observation.

What happens if you replace the "direct use" of SelectKBest with an "indirect use" of SelectorProxy(SelectKBest())? The meta-selector class SelectorProxy shields you from the internals of the actual feature selection logic.

Please try rearranging your code like this, and report back!

from sklearn2pmml import SelectorProxy

pipeline = PMMLPipeline([
    ("vectorizer", vectorizer),
    ("feature_selector", SelectorProxy(feature_selector)), # THIS!
    ("classifier", classifier)
])

johncliu commented 6 years ago

Using SelectorProxy(feature_selector), the results align perfectly between sklearn and jpmml-sklearn:

k     sklearn     pmml
90    0.8055624   0.8055624
95    0.8011768   0.8011768
100   0.8011904   0.8011904
105   0.7944084   0.7944084
110   0.7970723   0.7970723
150   0.7964169   0.7964169

If that's the suggested workaround, we'll go with it. Thanks!

vruusmann commented 6 years ago

> Using SelectorProxy(feature_selector), the results align perfectly between sklearn and jpmml-sklearn:

Thanks for reporting back such great news!

The results between Scikit-Learn and (J)PMML should actually agree to the 14th or 15th decimal place (you're only checking the first seven decimal places). In the future, if you continue your research and happen to find a discrepancy around the 12th or 13th decimal place, then please let me know about it.
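
A check to that precision could be sketched as follows. This is only an illustration: the helper name `first_mismatch` and the default tolerance of 1e-13 are assumptions, not part of any JPMML or Scikit-Learn API, and the sample values are the rounded class 1 probabilities from the first table in this thread (real checks should use the unrounded outputs).

```python
import math


def first_mismatch(sklearn_probs, pmml_probs, abs_tol=1e-13):
    """Return the index of the first pair of probabilities that differ by
    more than abs_tol, or None if every pair agrees.

    first_mismatch and abs_tol=1e-13 are illustrative choices (assumptions).
    """
    if len(sklearn_probs) != len(pmml_probs):
        raise ValueError("both sequences must have the same length")
    for i, (a, b) in enumerate(zip(sklearn_probs, pmml_probs)):
        # rel_tol=0.0 makes this a purely absolute comparison
        if not math.isclose(a, b, rel_tol=0.0, abs_tol=abs_tol):
            return i
    return None


# Rounded values reported for k = 90, 95, 100, 105, 110 without SelectorProxy
sklearn_probs = [0.3987, 0.3921, 0.3891, 0.3859, 0.3830]
pmml_probs = [0.3737, 0.3488, 0.3891, 0.6257, 0.8382]

print(first_mismatch(sklearn_probs, pmml_probs))  # 0 (the k = 90 row already differs)
```

An absolute tolerance is used here because probabilities are bounded by 0 and 1; a relative tolerance would loosen the check for values close to 1 and tighten it near 0.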

> If that's the suggested workaround, we'll go with it.

Apparently, the JPMML-SkLearn library handles the SelectKBest(score_func = chi2) case incorrectly.

There are several other bug reports about Scikit-Learn and (J)PMML prediction mismatches, and all of these pipelines appear to contain the SelectKBest(score_func = chi2) step:
https://github.com/jpmml/sklearn2pmml/issues/69#issue-276313176
https://github.com/jpmml/sklearn2pmml/issues/68#issuecomment-346227053

johncliu commented 6 years ago

Yup, SelectKBest could also be causing the discrepancies in #68 and #69. I'll take a look at SelectKBest.java to see if I can track it down, but in the meantime I'll close this ticket given the workaround with SelectorProxy(). Thanks!
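
One possible starting point for that investigation, sketched here under assumptions, is to list which tokens SelectKBest keeps on the Scikit-Learn side so they can be compared by hand with the fields referenced in the generated PMML. The helper name `selected_tokens` is hypothetical. It expects a dict shaped like `CountVectorizer.vocabulary_` (token to column index) and a boolean mask shaped like the result of `SelectKBest.get_support()`; the toy inputs below are made up for illustration.

```python
def selected_tokens(vocabulary, support):
    """Map a boolean support mask back to token names, in column order.

    vocabulary: dict of token -> column index
                (the shape of CountVectorizer.vocabulary_)
    support:    sequence of booleans, one per column
                (the shape of SelectKBest.get_support())
    selected_tokens is a hypothetical helper, not a library function.
    """
    if len(vocabulary) != len(support):
        raise ValueError("vocabulary and support must cover the same columns")
    # Sort tokens by their column index so the output follows feature order
    by_column = sorted(vocabulary.items(), key=lambda item: item[1])
    return [token for token, column in by_column if support[column]]


# Toy inputs (made up); with a fitted pipeline one would pass
# vectorizer.vocabulary_ and feature_selector.get_support() instead.
vocabulary = {"bathroom": 0, "clean": 1, "is": 2, "now": 3}
support = [True, False, False, True]

print(selected_tokens(vocabulary, support))  # ['bathroom', 'now']
```

Running this for two neighbouring values of k and diffing the lists shows which tokens enter the model at each step.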