Text normalization fails due to excessive whitespace filtering

vinluvie commented 5 years ago

Hi,

I saved a RandomForestClassifier model in Python, loaded it in Scala, and compared the predictions on my test data: 18% of them differ between the two. I have tried various settings and could not get them to match. Would you mind taking a look at my code to see if I have a bug? My data set has three columns: title, summary and target. I would like to apply TfidfVectorizer to title and summary, and then run the result through the RandomForestClassifier:

feature_def = gen_features(
    columns=["summary", "title"],
    classes=[
        {
            "class": TfidfVectorizer,
            "max_df": max_df,
            "ngram_range": (ngram_min, ngram_max),
            "max_features": max_features,
            "stop_words": "english",
            "norm": None,
            "preprocessor": None,
            "strip_accents": None,
            "token_pattern": None,
            "tokenizer": Splitter(),
        }
    ],
)

mapper = DataFrameMapper(feature_def, input_df=True, sparse=True)
pmml_pipeline = PMMLPipeline(
    [("tf-idf", mapper), ("classifier", RandomForestClassifier())]  # All terms
)
pmml_pipeline.fit(X, Y)
sklearn2pmml(pmml_pipeline, "model.pmml", with_repr=True, debug=True)

On the Scala side, this is my code:

import java.util

import scala.collection.JavaConverters._
import scala.collection.mutable

import org.dmg.pmml.FieldName
import org.jpmml.evaluator.{EvaluatorUtil, FieldValue, InputField, ModelEvaluatorFactory, TargetField}
import org.jpmml.model.PMMLUtil

val pmml = PMMLUtil.unmarshal(inputStream)
val evaluator = ModelEvaluatorFactory.newInstance().newModelEvaluator(pmml)
val inputFields: util.List[InputField] = evaluator.getInputFields
val target: TargetField = evaluator.getTargetFields.get(0)
val tname = target.getName
val arguments = mutable.Map[FieldName, FieldValue]()
// bufferedSource has the test data
for (line <- bufferedSource.getLines) {
  // Column order in the test file: label;summary;title
  val cols = line.split(";").map(_.trim)
  val target = cols(0)
  val summary = cols(1)
  val title = cols(2)
  arguments.clear()

  // Prepare only the two text fields that the model expects
  inputFields.forEach(field => {
    if (field.getName.getValue == "summary") {
      arguments.put(field.getName, field.prepare(summary))
    } else if (field.getName.getValue == "title") {
      arguments.put(field.getName, field.prepare(title))
    }
  })

  val result = evaluator.evaluate(arguments.asJava)
  val output = result.get(tname)
  val o = EvaluatorUtil.decode(output).asInstanceOf[String]
  // println, so that each record ends up on its own line
  println(s"${title};${summary};${target};${o}")
}

Thank you very much, really appreciate your work

vinluvie commented 5 years ago

Some additional info from the PMML file:

<MiningBuildTask>
        <Extension>PMMLPipeline(steps=[('tf-idf', DataFrameMapper(default=False, df_out=False,
        features=[('summary', [TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=&lt;class 'numpy.float64'&gt;, encoding='utf-8', input='content',
        lowercase=True, max_df=0.95, max_features=10000, min_df=1,
        ngram_range=(1, 2), norm=None, preprocessor=None, smooth_i....feature_extraction.text.Splitter object at 0x1a21321898&gt;,
        use_idf=True, vocabulary=None)])],
        input_df=True, sparse=True)),
       ('classifier', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])</Extension>
    </MiningBuildTask>

I notice that it lists only one TfidfVectorizer in the DataFrameMapper, for "summary", but I do see tf-idf@1 and tf-idf@2 in the rest of the XML, so I am not sure whether the TfidfVectorizer is really applied to the two columns separately.
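
One way to check which input column each tf-idf function is applied to is to scan the PMML file for the elements that call it. The sketch below is only an illustration: the XML fragment is hand-written, the assumption that the functions are invoked through Apply elements with a FieldRef child has not been verified against a real JPMML-SkLearn file, and the helper name tfidf_functions_by_field is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration only (an assumption about the layout);
# a real PMML file is much larger and may be structured differently.
FRAGMENT = """
<TransformationDictionary xmlns="http://www.dmg.org/PMML-4_3">
  <DerivedField name="tf-idf(summary, web)" optype="continuous" dataType="float">
    <Apply function="tf-idf@1">
      <FieldRef field="summary"/>
      <Constant>web</Constant>
    </Apply>
  </DerivedField>
  <DerivedField name="tf-idf(title, intern)" optype="continuous" dataType="float">
    <Apply function="tf-idf@2">
      <FieldRef field="title"/>
      <Constant>intern</Constant>
    </Apply>
  </DerivedField>
</TransformationDictionary>
"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts in front of tags."""
    return tag.rsplit("}", 1)[-1]


def tfidf_functions_by_field(pmml_text):
    """Map each 'tf-idf@N' function name to the set of fields it is applied to."""
    usage = {}
    for elem in ET.fromstring(pmml_text).iter():
        if local_name(elem.tag) != "Apply":
            continue
        function = elem.get("function", "")
        if not function.startswith("tf-idf@"):
            continue
        for child in elem:
            if local_name(child.tag) == "FieldRef":
                usage.setdefault(function, set()).add(child.get("field"))
    return usage


usage = tfidf_functions_by_field(FRAGMENT)
for function, fields in sorted(usage.items()):
    print(function, sorted(fields))
```

If each function name maps to exactly one column, the two vectorizers are being applied separately.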

vruusmann commented 5 years ago

Sorry, there is not enough information in this issue report for me to do any serious troubleshooting (would need to see the actual data). The problem, if any, should reside on the SkLearn2PMML/JPMML-SkLearn side, not the JPMML-Evaluator side.

Some things that you might try:
1) Always perform PMMLPipeline.verify(X) before generating the PMML file. The so-called model verification mechanism should be able to detect and report Python vs. Java mismatching predictions automatically.
2) Simplify your problem, and see if any of those simpler configurations work. For example, train only using the "summary" field, then train only using the "title" field.
3) Manually inspect those 18% of failing data rows. I bet they have something in common. Find it out, and change the configuration of the TfidfVectorizer step accordingly.

PS. What's your definition of a failure? A wrong class label, or a wrong probability value (after 13th decimal place)?
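
To make that distinction concrete, here is a minimal Python sketch that separates rows with a wrong class label from rows where only the probabilities drift. The function name classify_mismatches, the input layout (one list of per-class probabilities per row) and the default tolerance of 1e-13 are assumptions chosen for illustration; none of them come from either library.

```python
from typing import List, Tuple


def classify_mismatches(
    py_proba: List[List[float]],
    java_proba: List[List[float]],
    tol: float = 1e-13,
) -> Tuple[List[int], List[int]]:
    """Return (rows with a different winning class, rows with only a probability drift)."""
    label_rows, proba_rows = [], []
    for i, (p, j) in enumerate(zip(py_proba, java_proba)):
        # The predicted label is the class with the highest probability
        if p.index(max(p)) != j.index(max(j)):
            label_rows.append(i)
        # Same label, but at least one probability differs by more than tol
        elif any(abs(a - b) > tol for a, b in zip(p, j)):
            proba_rows.append(i)
    return label_rows, proba_rows


# Made-up probabilities: row 0 agrees, row 1 flips the label, row 2 only drifts
py = [[0.9, 0.1], [0.4, 0.6], [0.8, 0.2]]
java = [[0.9, 0.1], [0.7, 0.3], [0.7, 0.3]]

label_rows, proba_rows = classify_mismatches(py, java)
print("wrong label:", label_rows)
print("probability drift only:", proba_rows)
```

Rows in the first list are real failures; rows in the second list may be acceptable, depending on how large the drift is.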

vinluvie commented 5 years ago

Actually, I tried evaluator.verify(), and it gave me an exception about the values 0.0 and 0.1. PMMLPipeline.verify(X) on the Python side gave me no error.

The failure I was talking about is a mismatch of the predicted value: I ran the same test data set against the PMMLPipeline in Python, and the values differ from the evaluator's.

vruusmann commented 5 years ago

Nevertheless, I cannot do anything until I have access to sample data.

The JPMML-SkLearn library has adequate test coverage in this area, and everything is reproducible/works as advertised: https://github.com/jpmml/jpmml-sklearn/blob/master/src/test/resources/main.py#L430-L448

What are you doing differently? Did you try the second suggestion, which is simplifying your pipeline from two TfIdfs to one TfIdf?

vinluvie commented 5 years ago

Here is some sample data, with ; as the separator. summary and title are the two columns that I want to apply TfidfVectorizer to individually, and discipline is the label.

discipline;summary;title
Academic experience;qwertyuiop;Research Assistant
Software Developer;qwertyuiop;Staff Software Engineer
Software Developer;qwertyuiop;Web Development Intern
Academic experience;qwertyuiop;Graduate Researcher
Software Developer;Building and maintaining a single page web application for viewing high throughput genomic data https tumormap ucsc edu Responsibilities include Fully understand and accurately explain entire pipeline from tertiary data to visualization Build and maintain ability to transition between maps via a selection of samples Reflection functionality MeteorJS MongoDB Research build maintain Spatial Correlation Analysis functionality Python Scipy sklearn Provide endpoints for computation server Python Flask Unit tests unit tests unit tests Python Pytest Mentor students on computational biology projects Write technical documentation for methods provided by the software Python Sphinx Implement prototypes for hierarchical viewer leafletJS plotlyJS Python Flask;Junior Software Engineer
Software Developer;qwertyuiop;Programmer
Academic experience;qwertyuiop;Intern
Software Developer;qwertyuiop;Infrastructure Developer
Academic experience;Conducted an industry analysis to help identify trends from past five years market data about sales and pricing Made Comparisons between the data of internal competitors to help do SWOT analysis;Summer Intern
Other engineering;Managed the scheduling of recording sessions and maintenance of studio equipment Executed the planning setup engineering breakdown editing and mixing of projects Built good relations with clients and maintained effective lines of communication;Audio Engineer
Software Developer;qwertyuiop;Web Developer
Software Developer;Developed a script to read a csv file containing success or failures of different recipes on a variety of tools and generate a heat map showing success rates for recipes across tools Created Excel workbook to transform measurement data into a different coordinate system to enable more meaningful data analysis Linked data from data warehouse and measurement data provided by members of the testing lab to identify flaws in the recipe tool or measurements;Engineering Intern
Software Developer;qwertyuiop;Assistant Engineer
Recruiting;qwertyuiop;IT Technical Recruiter
Management;qwertyuiop;President
Software Developer;qwertyuiop;Research Intern
Software Developer;qwertyuiop;Frontend Developer
Software Developer;qwertyuiop;Software Consultant
Data Science;qwertyuiop;Data Analyst
Software Developer;I worked in backend group to maintain a Restaurant Recommendation Website Developed a web service using Java servlet REST API to fetch restaurant data from Yelp API Utilized MySQL MongoDB to store user preference and restaurant information Designed g a content based recommendation algorithm to match similar restaurants based on categories Improved precision of recommendation by ordering restaurants based on distance and stars Deployed the application to Amazon EC for better performance Tested the web service and app with unit tests JUnit and load tests JMeter;Software Engineer
Software Developer;qwertyuiop;System analyst
Management;Responsible for managing commercial residential technology solutions and management;President
Sales;qwertyuiop;Account Executive
QA;qwertyuiop;Test Engineer
Academic experience;qwertyuiop;intern
Software Developer;qwertyuiop;Software Engineer Internship
Software Developer;qwertyuiop;Research Intern

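My latest code in Python is:

import numpy
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn_pandas import DataFrameMapper, gen_features
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.feature_extraction.text import Splitter
from sklearn2pmml.pipeline import PMMLPipeline

train_data = pd.read_csv("sample.csv", sep=";")
train_data["title"] = train_data["title"].astype("str")
train_data["summary"] = train_data["summary"].astype("str")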
feature_def = gen_features(
    columns=["summary", "title"],
    classes=[
        {
            "class": TfidfVectorizer,
            "max_df": max_df,
            "ngram_range": (ngram_min, ngram_max),
            "max_features": max_features,
            "stop_words": "english",
            "norm": None,
            "preprocessor": None,
            "strip_accents": None,
            "dtype": numpy.float32,
            "tokenizer": Splitter(),
            "analyzer": "word",
            "use_idf": True,
            "binary": False,
        }
    ],
)

mapper = DataFrameMapper(feature_def, input_df=True, df_out=True)
pmml_pipeline = PMMLPipeline(
    [("mapper1", mapper), ("classifier", RandomForestClassifier())]  # All terms
)

X = train_data.drop("discipline", axis=1)
Y = train_data["discipline"]
pmml_pipeline.fit(X, Y)

sklearn2pmml(pmml_pipeline, "model.pmml", with_repr=True, debug=True)

vinluvie commented 5 years ago

As always, thank you very much.

vruusmann commented 5 years ago

Simplified your Python code to the following:

from pandas import DataFrame
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.feature_extraction.text import Splitter
from sklearn2pmml.pipeline import PMMLPipeline

import numpy
import pandas

train_data = pandas.read_csv("sample.csv", sep=";")
train_data["title"] = train_data["title"].astype("str")
train_data["summary"] = train_data["summary"].astype("str")
train_data["discipline"] = train_data["discipline"].astype("str")

pmml_pipeline = PMMLPipeline([
    ("tfidf", TfidfVectorizer(analyzer = "word", preprocessor = None, lowercase = True, strip_accents = None, token_pattern = None, tokenizer = Splitter(), stop_words = "english", ngram_range = (1, 2), norm = None, dtype = numpy.float32)),
    ("classifier", RandomForestClassifier())  # All terms
])

pmml_pipeline.fit(train_data["summary"], train_data["discipline"])

sklearn2pmml(pmml_pipeline, "model.pmml", with_repr=True, debug=False)

proba = DataFrame(pmml_pipeline.predict_proba(train_data["summary"]))
proba.to_csv("python-output.csv", sep = ";")

Indeed, six or seven of the 27 data rows give different predictions between Scikit-Learn and Java.

vinluvie commented 5 years ago

I tried running it again with sklearn 0.19.0, but I still get mismatches. Here are the versions that I ran with:

python: 3.6.6
sklearn: 0.19.0
sklearn.externals.joblib: 0.11
pandas: 0.23.4
sklearn_pandas: 1.7.0
sklearn2pmml: 0.39.0
java: 1.8.0_171

vruusmann commented 5 years ago

@vinluvie Please post all further comments regarding this issue here: https://github.com/jpmml/jpmml-sklearn/issues/89

This is a JPMML-SkLearn problem.

vruusmann commented 5 years ago

Unlocking, as this really is a JPMML-Evaluator problem - the scoring works correctly in 1.4.2, but not in 1.4.3.

vruusmann commented 5 years ago

The regression happened in commit 7baf37d9f6

vruusmann commented 5 years ago

The root cause of this issue is excessive SAX whitespace filtering: https://github.com/jpmml/jpmml-model/issues/19

It's possible to make JPMML-Evaluator 1.4.X work by disabling the SAX whitespace filtering during PMML document unmarshalling.

Default:

PMML pmml = org.jpmml.model.PMMLUtil.unmarshal(is);

Temporary workaround:

// Bad
// Source source = SAXUtil.createFilteredSource(is, new ImportFilter(), new WhitespaceFilter());

// Good
Source source = SAXUtil.createFilteredSource(is, new ImportFilter());

PMML pmml = org.jpmml.model.JAXBUtil.unmarshalPMML(source);
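
For readers wondering why whitespace filtering would change predictions at all, the Python sketch below illustrates the general failure mode. It is not the JPMML code. It assumes that the affected content is a whitespace-only cell, such as a replacement string consisting of a single space in a text normalization table; the element names regex and replacement and both helper functions are made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical normalization row: replace every run of non-letters with one space
ROW = "<row><regex>[^a-z]+</regex><replacement> </replacement></row>"


def read_cells(xml_text, drop_whitespace_only):
    """Read the cells of a row, optionally discarding whitespace-only text."""
    cells = {}
    for cell in ET.fromstring(xml_text):
        text = cell.text or ""
        if drop_whitespace_only and text.strip() == "":
            text = ""  # this is what an over-eager whitespace filter does
        cells[cell.tag] = text
    return cells


def normalize(text, cells):
    """Lowercase the text and apply the regex replacement from the row."""
    return re.sub(cells["regex"], cells["replacement"], text.lower())


sample = "Unit tests, unit tests"
kept = normalize(sample, read_cells(ROW, drop_whitespace_only=False))
lost = normalize(sample, read_cells(ROW, drop_whitespace_only=True))

print(kept.split())  # four separate tokens
print(lost.split())  # one fused token, so no term in the vocabulary matches
```

With the replacement string reduced to an empty string, the words are glued together, term frequencies come out different, and the downstream classifier sees different feature values for any row that contains real text.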

jpmml / jpmml-evaluator

Text normalization fails due to excessive whitespace filtering #136