Closed: rostIvan closed this issue 1 year ago.
I'm intrigued by this issue, but cannot look deeper into it unless I'm provided with a fully reproducible test case. Specifically, I need data to fit a pipeline locally. Maybe it's about language (UA, non-latin), which breaks some regexes?
To triangulate the issue a bit more:
Is the issue specific to SVC? Does it happen if you replace SVC with e.g. LogisticRegression?
With LogisticRegression the issue persists
pipeline = Pipeline([
('vec', TfidfVectorizer(
lowercase=True,
# stop_words=stop_words_ua,
ngram_range=(1, 2),
norm=None,
)),
("clf", LogisticRegression(multi_class="ovr"))
])
{Label=ProbabilityDistribution{result=1, probability_entries=[-1=0.051933355493484615, 0=1.5082486472225713E-5, 1=0.9480515620200431]}, probability(-1)=0.051933355493484615, probability(0)=1.5082486472225713E-5, probability(1)=0.9480515620200431}
1
{Label=ProbabilityDistribution{result=1, probability_entries=[-1=0.051933355493484615, 0=1.5082486472225713E-5, 1=0.9480515620200431]}, probability(-1)=0.051933355493484615, probability(0)=1.5082486472225713E-5, probability(1)=0.9480515620200431}
1
{Label=ProbabilityDistribution{result=0, probability_entries=[-1=1.330182050313439E-5, 0=0.8808597118587532, 1=0.1191269863207437]}, probability(-1)=1.330182050313439E-5, probability(0)=0.8808597118587532, probability(1)=0.1191269863207437}
0
{probability(1)=0.9942817718574339, probability(0)=1.3516436779607912E-5, probability(-1)=0.005704711705786472}
probability(1)=0.9942817718574339
{probability(1)=7.33566669438287E-4, probability(0)=3.1791754536221345E-6, probability(-1)=0.9992632541551081}
probability(-1)=0.9992632541551081
{probability(1)=2.8958519676396626E-4, probability(0)=0.9995890921678172, probability(-1)=1.2132263541879416E-4}
probability(0)=0.9995890921678172
Does the issue happen if you don't use stop words?
Seems so, yeah:
pipeline = Pipeline([
('vec', TfidfVectorizer(
lowercase=True,
# stop_words=stop_words_ua,
ngram_range=(1, 2),
norm=None,
)),
("clf", svm.SVC(kernel='linear', probability=True))
])
{Label=VoteProbabilityDistribution{result=-1, vote_entries=[-1=2.0, 1=1.0]}}
-1
{Label=VoteProbabilityDistribution{result=-1, vote_entries=[-1=2.0, 1=1.0]}}
-1
{Label=VoteProbabilityDistribution{result=1, vote_entries=[0=1.0, 1=2.0]}}
1
{predicted_Label=1, probability_-1=0.3333333333333333, probability_0=0.0, probability_1=0.6666666666666666, probability=0.6666666666666666}
1
{predicted_Label=-1, probability_-1=0.6666666666666666, probability_0=0.0, probability_1=0.3333333333333333, probability=0.6666666666666666}
-1
{predicted_Label=0, probability_-1=0.0, probability_0=0.6666666666666666, probability_1=0.3333333333333333, probability=0.6666666666666666}
0
Does the issue happen if you keep the pipeline config exactly the same, but train using english language (not UA)?
Not sure, I haven't tested it
> Maybe it's about language (UA, non-latin), which breaks some regexes?

Potentially, it could also be something with emojis.

> I need data to fit a pipeline locally

I have provided the data via the direct email listed in your GitHub profile.

> Potentially, it could also be something with emojis

That's a very good suggestion!
PMML performs TF(-IDF) on tokens that have been stripped of leading and trailing punctuation characters. JPMML-Evaluator uses custom code for identifying punctuation characters, and it is a possibility that emojis are not handled by it.
If so, then the bug is somewhere here (the list should be extended with more character classes): https://github.com/jpmml/jpmml-model/blob/1.5.15/pmml-model/src/main/java/org/jpmml/model/TermUtil.java#L45-L61
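To illustrate the kind of trimming involved, here is a minimal Python sketch (an approximation for discussion, not the actual TermUtil logic) that strips leading/trailing Unicode punctuation from a token. Emoji fall into the Symbol categories ("S*"), not the Punctuation categories ("P*"), so a punctuation-only trimmer would leave them attached to the token:

```python
import unicodedata

def trim_punctuation(token):
    # Strip leading/trailing characters whose Unicode general category
    # starts with "P" (punctuation) -- a rough analogue of PMML-style
    # term trimming, not the real TermUtil implementation.
    chars = list(token)
    while chars and unicodedata.category(chars[0]).startswith("P"):
        chars.pop(0)
    while chars and unicodedata.category(chars[-1]).startswith("P"):
        chars.pop()
    return "".join(chars)

print(trim_punctuation('"Pfizer,'))  # -> Pfizer
print(trim_punctuation('добре!'))    # -> добре
# U+1F600 GRINNING FACE has category "So" (Symbol, other), not "P*",
# so it survives punctuation trimming:
print(trim_punctuation('добре\U0001F600'))
```

If the real punctuation-identification code is similarly category-based, extending it with symbol classes would be the natural fix.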
@rostIvan I've been experimenting with the data files that you sent to me privately, and my conclusion is that this is a JPMML-SkLearn bug related to the encoding of text tokenization instructions (the TfidfVectorizer.tokenizer attribute).
This issue has got nothing to do with the PMML evaluation side. The JPMML-Evaluator library is "correctly" following incorrect tokenization instructions, and therefore fails. The PMML4S library is "incorrectly" using internal text tokenization instructions (ignoring the ones encoded in the PMML document) and, on the surface, appears to be giving more correct predictions, but is fundamentally completely off the tracks.
I'll detail my investigative actions below.
My error diagnostics procedure:
Generating a clean test data file. The original data file contains sentences with an "unexpected" comma character, which confuses my naive CSV parser. These sentences will be removed:
$ grep -v '^"' data.csv > data-clean.csv
Updating the Scikit-Learn pipeline to use a probabilistic classifier (e.g. LogisticRegression) instead of a non-probabilistic one (e.g. SVC). Checking PMML predicted probabilities against Scikit-Learn predicted probabilities will be the criterion for deciding whether the text tokenization happens correctly or not.
from pandas import DataFrame
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.feature_extraction.text import Matcher, Splitter
from sklearn2pmml.pipeline import PMMLPipeline
import pandas
df = pandas.read_csv("data-clean.csv")
X = df['Text']
y = df['Label']
with open("stopwords_ua.txt") as file:
stop_words_ua = file.readlines()
stop_words_ua = [stop_word_ua.strip() for stop_word_ua in stop_words_ua]
pipeline = PMMLPipeline([
('vec', TfidfVectorizer(
lowercase=True,
stop_words=stop_words_ua,
ngram_range=(1, 2),
norm=None
)),
("clf", LogisticRegression())
])
pipeline.fit(X, y)
label = DataFrame(pipeline.predict(X), columns = ["Label"])
label_proba = DataFrame(pipeline.predict_proba(X), columns = ["probability(-1)", "probability(0)", "probability(1)"])
label = pandas.concat((label, label_proba), axis = 1)
label.to_csv("pipeline.csv", index = False)
sklearn2pmml(pipeline, "pipeline.pmml")
Finally, checking probabilities using the org.jpmml.evaluator.example.TestingExample command-line application:
$ java -cp pmml-evaluator-example-executable-1.5-SNAPSHOT.jar org.jpmml.evaluator.example.TestingExample --model pipeline.pmml --input data-clean.csv --expected-output pipeline.csv --separator ","
Current results (JPMML-SkLearn 1.6.28 plus JPMML-Evaluator 1.5.15):
Config:
tfidf = TfidfVectorizer(
lowercase=True,
stop_words=stop_words_ua,
ngram_range=(1, 2),
norm=None,
tokenizer = None
)
There are 3450 conflicts (for a validation data set of 3720 data records).
Config:
from sklearn2pmml.feature_extraction.text import Splitter
tfidf = TfidfVectorizer(
lowercase=True,
stop_words=stop_words_ua,
ngram_range=(1, 2),
norm=None,
tokenizer = Splitter(),
max_features=100
)
There are 14 conflicts. Many problematic sentences appear to contain Ukrainian-specific(?) whitespace character(s). For example, in "Pfizer і Moderna", the first whitespace is a space character, but the second one is something else.
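A quick way to hunt for such characters (a generic Python sketch, not part of the diagnostics above) is to list every character that counts as whitespace but is not a plain ASCII space, tab, or newline:

```python
import unicodedata

def find_odd_whitespace(text):
    # Report (index, char, codepoint, name) for every whitespace character
    # that is not one of the ordinary ASCII whitespace characters.
    return [(i, ch, "U+%04X" % ord(ch), unicodedata.name(ch, "?"))
            for i, ch in enumerate(text)
            if ch.isspace() and ch not in " \t\r\n"]

# Hypothetical reproduction: a NO-BREAK SPACE (U+00A0) hiding in the sentence.
print(find_odd_whitespace("Pfizer і\u00a0Moderna"))
# -> [(8, '\xa0', 'U+00A0', 'NO-BREAK SPACE')]
```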
Config:
from sklearn2pmml.feature_extraction.text import Matcher
tfidf = TfidfVectorizer(
lowercase=True,
stop_words=stop_words_ua,
ngram_range=(1, 2),
norm=None,
tokenizer = Matcher(),
max_features=100
)
There are 14 conflicts again.
@rostIvan TLDR: When working with Ukrainian text, you'd need to specify a custom text tokenizer (one of sklearn2pmml.feature_extraction.text.Splitter or sklearn2pmml.feature_extraction.text.Matcher), plus sanitize/standardize the whitespace.
Both custom text tokenizers allow you to override the regular expression. Please experiment; perhaps you can find a regular expression that captures Ukrainian whitespace characters as well. Maybe replacing the latin-style regex Splitter(word_separator_re = "\s+") with the unicode-style regex Splitter(word_separator_re = "(?u)\s+") will suffice?
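Alternatively, the whitespace could be sanitized before fitting. A minimal sketch (the function name is mine, and whether this fully resolves the PMML-side mismatch is untested here): in Python 3, \s is Unicode-aware by default, so collapsing whitespace runs into a single ASCII space also catches NO-BREAK SPACE and similar characters:

```python
import re

def sanitize_whitespace(text):
    # Collapse any run of Unicode whitespace (incl. NO-BREAK SPACE U+00A0,
    # NARROW NO-BREAK SPACE U+202F, etc.) into a single ASCII space.
    return re.sub(r"\s+", " ", text).strip()

print(sanitize_whitespace("Pfizer і\u00a0Moderna"))  # -> Pfizer і Moderna
```

Applying this to the Text column before `pipeline.fit(X, y)` would make the training-time and PMML-time tokenizations see identical separators.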
Finally, the PMML4S library is not doing a correct job by any means - it simply ignores the regex pattern that is enclosed in the PMML document, and uses an internal/hard-coded one. The correct behaviour would be to make incorrect predictions :-)
@vruusmann, that's right. When I added tokenizer=Splitter(), it now works OK:
pipeline = Pipeline([
('vec', TfidfVectorizer(
lowercase=True,
tokenizer=Splitter(),
stop_words=stop_words_ua,
ngram_range=(1, 2),
norm=None,
)),
("clf", svm.SVC(kernel='linear', probability=True))
])
pipeline.fit(X, y)
public static void main(String[] args) throws Exception {
final Evaluator evaluator = loadModel();
System.out.println(predict(evaluator, "Чудово"));
System.out.println(predict(evaluator, "Погано"));
System.out.println(predict(evaluator, "Зробив друге щеплення Pfizer."));
}
private static Evaluator loadModel() throws Exception {
Evaluator evaluator = new LoadingModelEvaluatorBuilder()
.load(new FileInputStream(MODEL_PMML))
.build();
return evaluator.verify();
}
private static int predict(final Evaluator evaluator, final String text) {
final Map<FieldName, ?> evaluate = evaluator.evaluate(
Collections.singletonMap(
FieldName.create("Text"),
FieldValueUtil.create(text)
)
);
System.out.println(evaluate);
final Object value = EvaluatorUtil.decodeAll(evaluate).get("Label");
return (int) value;
}
{Label=VoteProbabilityDistribution{result=1, vote_entries=[-1=1.0, 1=2.0]}}
1
{Label=VoteProbabilityDistribution{result=-1, vote_entries=[-1=2.0, 1=1.0]}}
-1
{Label=VoteProbabilityDistribution{result=0, vote_entries=[0=2.0, 1=1.0]}}
0
Thank you for investigating this :)
> that's right, when I added tokenizer=Splitter() now it works ok
If you want to build an integration test, then I'd suggest replacing SVC with LogisticRegression (or some other probabilistic classifier), and asserting that the predicted probabilities are correct within 1e-13.
I did exactly this, and I found that 14 data records out of 3720 are incorrect (with SVC they would likely appear to be OK).
I'm re-opening this issue, because I'd like to figure out how to make these 14 data records behave correctly. Around 10 of them suffer from an irregular whitespace character...
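The 1e-13 criterion can be checked mechanically. Below is a sketch with toy in-memory data standing in for the Scikit-Learn probabilities (pipeline.csv) and the JPMML-Evaluator output; rtol=0 makes the tolerance strictly absolute:

```python
import numpy
import pandas

proba_columns = ["probability(-1)", "probability(0)", "probability(1)"]

# Toy stand-ins: first record matches, second record conflicts.
expected = pandas.DataFrame([[0.1, 0.2, 0.7],
                             [0.9, 0.05, 0.05]], columns=proba_columns)
actual = pandas.DataFrame([[0.1, 0.2, 0.7],
                           [0.5, 0.25, 0.25]], columns=proba_columns)

# A data record "conflicts" if any of its probabilities differs by more than 1e-13:
conflicts = ~numpy.isclose(expected[proba_columns], actual[proba_columns],
                           rtol=0, atol=1e-13).all(axis=1)
print("conflicting data records:", int(conflicts.sum()))  # -> 1
```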
What can be done right now:
Closing as non-actionable.
Looks like an input issue (very extravagant Unicode whitespace characters, which don't match the "whitespace" RegEx pattern) rather than a SkLearn2PMML/Scikit-Learn level technical issue.
Hello, I'm trying to use jpmml/jpmml-evaluator, and it seems I am faced with this issue. I saved a Scikit-Learn pipeline as a PMML model, and I'm not sure if Evaluator.evaluate() works as expected. Here the expected predictions are [1, -1, 0] respectively, and it works, but when I do it from Java, this Java code's output looks different. But at the same time, when I use pmml4s, I get: