vinluvie closed this issue 5 years ago
Some additional info from the PMML file:
<MiningBuildTask>
<Extension>PMMLPipeline(steps=[('tf-idf', DataFrameMapper(default=False, df_out=False,
features=[('summary', [TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
lowercase=True, max_df=0.95, max_features=10000, min_df=1,
ngram_range=(1, 2), norm=None, preprocessor=None, smooth_i....feature_extraction.text.Splitter object at 0x1a21321898>,
use_idf=True, vocabulary=None)])],
input_df=True, sparse=True)),
('classifier', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
oob_score=False, random_state=None, verbose=0,
warm_start=False))])</Extension>
</MiningBuildTask>
I notice that the DataFrameMapper repr lists only one TfidfVectorizer, for "summary", but I do see tf-idf@1 and tf-idf@2 in the rest of the XML, so I am not sure whether the TfidfVectorizer is really applied to the two columns separately.
Sorry, there is not enough information in this issue report for me to do any serious troubleshooting (would need to see the actual data). The problem, if any, should reside on the SkLearn2PMML/JPMML-SkLearn side, not the JPMML-Evaluator side.
Some things that you might try:
1) Always perform PMMLPipeline.verify(X) before generating the PMML file. The so-called model verification mechanism should be able to detect and report Python vs. Java mismatching predictions automatically.
2) Simplify your problem, and see if any of those simpler configurations work. For example, train only using the "summary" field, then train only using the "title" field.
3) Manually inspect those 18% of failing data rows. I bet they have something in common. Find it out, and change the configuration of the TfidfVectorizer step accordingly.
PS. What's your definition of a failure? A wrong class label, or a wrong probability value (after 13th decimal place)?
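To make the third suggestion and the question above concrete, here is a minimal sketch of how mismatching rows could be split into "wrong class label" and "wrong probability only" groups. The function name classify_mismatches and the list-of-lists input format are assumptions made for illustration; they are not part of SkLearn2PMML or JPMML-Evaluator. The 1e-13 tolerance mirrors the "13th decimal place" mentioned above.

```python
import math


def classify_mismatches(py_proba, java_proba, abs_tol=1e-13):
    """Compare per-row class probabilities from Python and Java.

    Returns two lists of row indices:
      - rows where the most probable class differs (wrong class label)
      - rows where the label agrees but some probability differs by more
        than abs_tol (wrong probability value only)
    """
    label_mismatch, proba_mismatch = [], []
    for i, (p_row, j_row) in enumerate(zip(py_proba, java_proba)):
        # Index of the most probable class on each side
        p_label = max(range(len(p_row)), key=p_row.__getitem__)
        j_label = max(range(len(j_row)), key=j_row.__getitem__)
        if p_label != j_label:
            label_mismatch.append(i)
        elif not all(
            math.isclose(p, j, rel_tol=0.0, abs_tol=abs_tol)
            for p, j in zip(p_row, j_row)
        ):
            proba_mismatch.append(i)
    return label_mismatch, proba_mismatch


# Hypothetical probabilities for three rows and two classes
python_side = [[0.9, 0.1], [0.4, 0.6], [0.8, 0.2]]
java_side = [[0.9, 0.1], [0.7, 0.3], [0.7, 0.3]]
labels, probas = classify_mismatches(python_side, java_side)
print(labels, probas)  # prints: [1] [2]
```

The row indices returned this way can then be used to pull the failing rows out of the input data and look for what they have in common.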
Actually, I tried evaluator.verify(), and it gave me an exception about the values 0.0 and 0.1. PMMLPipeline.verify(X) on the Python side gave me no error.
The failure I was talking about is a mismatch in the predicted value: I ran the same test data set against the PMMLPipeline in Python, and the value is different from what the evaluator returns.
Nevertheless, I cannot do anything until I have access to sample data.
The JPMML-SkLearn library has adequate test coverage in this area, and everything is reproducible/works as advertised: https://github.com/jpmml/jpmml-sklearn/blob/master/src/test/resources/main.py#L430-L448
What are you doing differently? Did you try the second suggestion, which is simplifying your pipeline from two TfIdfs to one TfIdf?
Here is some sample data, with ; as the separator. summary and title are the two columns that I want to apply TfidfVectorizer to individually, and discipline is the label.
discipline;summary;title
Academic experience;qwertyuiop;Research Assistant
Software Developer;qwertyuiop;Staff Software Engineer
Software Developer;qwertyuiop;Web Development Intern
Academic experience;qwertyuiop;Graduate Researcher
Software Developer;Building and maintaining a single page web application for viewing high throughput genomic data https tumormap ucsc edu Responsibilities include Fully understand and accurately explain entire pipeline from tertiary data to visualization Build and maintain ability to transition between maps via a selection of samples Reflection functionality MeteorJS MongoDB Research build maintain Spatial Correlation Analysis functionality Python Scipy sklearn Provide endpoints for computation server Python Flask Unit tests unit tests unit tests Python Pytest Mentor students on computational biology projects Write technical documentation for methods provided by the software Python Sphinx Implement prototypes for hierarchical viewer leafletJS plotlyJS Python Flask;Junior Software Engineer
Software Developer;qwertyuiop;Programmer
Academic experience;qwertyuiop;Intern
Software Developer;qwertyuiop;Infrastructure Developer
Academic experience;Conducted an industry analysis to help identify trends from past five years market data about sales and pricing Made Comparisons between the data of internal competitors to help do SWOT analysis;Summer Intern
Other engineering;Managed the scheduling of recording sessions and maintenance of studio equipment Executed the planning setup engineering breakdown editing and mixing of projects Built good relations with clients and maintained effective lines of communication;Audio Engineer
Software Developer;qwertyuiop;Web Developer
Software Developer;Developed a script to read a csv file containing success or failures of different recipes on a variety of tools and generate a heat map showing success rates for recipes across tools Created Excel workbook to transform measurement data into a different coordinate system to enable more meaningful data analysis Linked data from data warehouse and measurement data provided by members of the testing lab to identify flaws in the recipe tool or measurements;Engineering Intern
Software Developer;qwertyuiop;Assistant Engineer
Recruiting;qwertyuiop;IT Technical Recruiter
Management;qwertyuiop;President
Software Developer;qwertyuiop;Research Intern
Software Developer;qwertyuiop;Frontend Developer
Software Developer;qwertyuiop;Software Consultant
Data Science;qwertyuiop;Data Analyst
Software Developer;I worked in backend group to maintain a Restaurant Recommendation Website Developed a web service using Java servlet REST API to fetch restaurant data from Yelp API Utilized MySQL MongoDB to store user preference and restaurant information Designed g a content based recommendation algorithm to match similar restaurants based on categories Improved precision of recommendation by ordering restaurants based on distance and stars Deployed the application to Amazon EC for better performance Tested the web service and app with unit tests JUnit and load tests JMeter;Software Engineer
Software Developer;qwertyuiop;System analyst
Management;Responsible for managing commercial residential technology solutions and management;President
Sales;qwertyuiop;Account Executive
QA;qwertyuiop;Test Engineer
Academic experience;qwertyuiop;intern
Software Developer;qwertyuiop;Software Engineer Internship
Software Developer;qwertyuiop;Research Intern
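My latest Python code is:
import numpy
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn_pandas import DataFrameMapper, gen_features
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.feature_extraction.text import Splitter
from sklearn2pmml.pipeline import PMMLPipeline

# Assumed values, taken from the TfidfVectorizer repr in the PMML file above
max_df = 0.95
ngram_min, ngram_max = 1, 2
max_features = 10000

train_data = pd.read_csv("sample.csv", sep=";")
train_data["title"] = train_data["title"].astype("str")
train_data["summary"] = train_data["summary"].astype("str")

# One TfidfVectorizer definition, generated separately for each column
feature_def = gen_features(
    columns=["summary", "title"],
    classes=[
        {
            "class": TfidfVectorizer,
            "max_df": max_df,
            "ngram_range": (ngram_min, ngram_max),
            "max_features": max_features,
            "stop_words": "english",
            "norm": None,
            "preprocessor": None,
            "strip_accents": None,
            "dtype": numpy.float32,
            "tokenizer": Splitter(),
            "analyzer": "word",
            "use_idf": True,
            "binary": False,
        }
    ],
)
mapper = DataFrameMapper(feature_def, input_df=True, df_out=True)
pmml_pipeline = PMMLPipeline(
    [("mapper", mapper), ("classifier", RandomForestClassifier())]  # All terms
)
X = train_data.drop("discipline", axis=1)
Y = train_data["discipline"]
pmml_pipeline.fit(X, Y)
sklearn2pmml(pmml_pipeline, "model.pmml", with_repr=True, debug=True)
As always, thank you very much.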
Simplified your Python code to the following:
from pandas import DataFrame
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.feature_extraction.text import Splitter
from sklearn2pmml.pipeline import PMMLPipeline
import numpy
import pandas
train_data = pandas.read_csv("sample.csv", sep=";")
train_data["title"] = train_data["title"].astype("str")
train_data["summary"] = train_data["summary"].astype("str")
train_data["discipline"] = train_data["discipline"].astype("str")
pmml_pipeline = PMMLPipeline([
    ("tfidf", TfidfVectorizer(
        analyzer = "word", preprocessor = None, lowercase = True,
        strip_accents = None, token_pattern = None, tokenizer = Splitter(),
        stop_words = "english", ngram_range = (1, 2), norm = None,
        dtype = numpy.float32)),
    ("classifier", RandomForestClassifier()) # All terms
])
pmml_pipeline.fit(train_data["summary"], train_data["discipline"])
sklearn2pmml(pmml_pipeline, "model.pmml", with_repr=True, debug=False)
proba = DataFrame(pmml_pipeline.predict_proba(train_data["summary"]))
proba.to_csv("python-output.csv", sep = ";")
Indeed, six or seven data rows out of a total of 27 data rows are giving different predictions between Scikit-Learn and Java.
I did try running it again with sklearn 0.19.0, but I still get mismatches. Here are the versions that I ran with:
python: 3.6.6
sklearn: 0.19.0
sklearn.externals.joblib: 0.11
pandas: 0.23.4
sklearn_pandas: 1.7.0
sklearn2pmml: 0.39.0
java: 1.8.0_171
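A version list like the one above can also be collected programmatically, so that it can be pasted into a report as is. This is only a sketch: the helper name collect_versions is hypothetical, the distribution names are the PyPI names as I understand them, and importlib.metadata requires Python 3.8 or newer (newer than the 3.6.6 listed above).

```python
import sys
from importlib.metadata import PackageNotFoundError, version


def collect_versions(distributions):
    """Return a dict mapping each distribution name to its installed version."""
    report = {"python": sys.version.split()[0]}
    for name in distributions:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            # Keep missing packages in the report instead of failing
            report[name] = "not installed"
    return report


# Distribution names assumed from PyPI
packages = ["scikit-learn", "joblib", "pandas", "sklearn-pandas", "sklearn2pmml"]
for name, ver in collect_versions(packages).items():
    print(f"{name}: {ver}")
```

The Java version is not covered by this helper and still has to be reported separately (for example from the output of java -version).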
@vinluvie Please post all further comments regarding this issue here: https://github.com/jpmml/jpmml-sklearn/issues/89
This is a JPMML-SkLearn problem.
Unlocking, as this really is a JPMML-Evaluator problem - the scoring works correctly in 1.4.2, but not in 1.4.3.
The regression happened in commit 7baf37d9f6
The root cause of this issue is excessive SAX whitespace filtering: https://github.com/jpmml/jpmml-model/issues/19
It's possible to make JPMML-Evaluator 1.4.X work by disabling the SAX whitespace filtering during PMML document unmarshalling.
Default:
PMML pmml = org.jpmml.model.PMMLUtil.unmarshal(is);
Temporary workaround:
// Bad
// Source source = SAXUtil.createFilteredSource(is, new ImportFilter(), new WhitespaceFilter());
// Good
Source source = SAXUtil.createFilteredSource(is, new ImportFilter());
PMML pmml = org.jpmml.model.JAXBUtil.unmarshalPMML(source);
Hi,
I have saved a RandomForestClassifier model in Python and loaded it in Scala. When I compare the results on my test data, 18% of the predictions differ. I have tried different settings and could not get them to match; would you mind taking a look at my code to see if I have a bug? I have a data set with three columns: title, summary and target. I would like to apply TfidfVectorizer to title and summary, and then run them through the RandomForestClassifier.
On the Scala side, this is my code:
Thank you very much, really appreciate your work