jpmml / jpmml-sklearn

Java library and command-line application for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

The tokenization of non-latin text (Ukrainian) is not reproducible between Scikit-Learn and PMML #165

Closed rostIvan closed 1 year ago

rostIvan commented 3 years ago

Hello, I'm trying to use jpmml/jpmml-evaluator and it seems I have run into an issue. I saved a Scikit-Learn pipeline as a PMML model, and I'm not sure if Evaluator.evaluate() works as expected.


from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn2pmml import make_pmml_pipeline, sklearn2pmml
from pypmml import Model  # assumption: the PyPMML binding of PMML4S, used for Model.fromFile below

# df (a DataFrame with "Text" and "Label" columns) and stop_words_ua are loaded elsewhere
X = df['Text']
y = df['Label']

svc_clf = svm.SVC(kernel='linear')
pipeline = Pipeline([
    ('vec', TfidfVectorizer(
        lowercase=True,
        stop_words=stop_words_ua,
        ngram_range=(1, 2),
        norm=None,
    )),
    ("clf", svc_clf)
])
pipeline.fit(X, y)
print(pipeline.predict(['Чудово'])) # [1]
print(pipeline.predict(['Погано'])) # [-1]
print(pipeline.predict(['Зробив друге щеплення Pfizer.'])) # [0]

pmml_pipeline = make_pmml_pipeline(
    pipeline,
    active_fields=["Text"],
    target_fields=["Label"]
)
print(pmml_pipeline.predict(['Чудово'])) # [1]
print(pmml_pipeline.predict(['Погано'])) # [-1]
print(pmml_pipeline.predict(['Зробив друге щеплення Pfizer.'])) # [0]

sklearn2pmml(pmml_pipeline, "model/model.pmml")

model = Model.fromFile('model/model.pmml')
print(model.predict(['Чудово'])) # [1, 0.6666666666666666, 0.3333333333333333, 0.0, 0.6666666666666666]
print(model.predict(['Погано'])) # [-1, 0.6666666666666666, 0.6666666666666666, 0.0, 0.3333333333333333]
print(model.predict(['Зробив друге щеплення Pfizer.'])) # [0, 0.6666666666666666, 0.0, 0.6666666666666666, 0.3333333333333333]

Here the expected predictions are [1, -1, 0] respectively, and it works. But when I do it from Java:

    public static void main(String[] args) throws Exception {
        final Evaluator evaluator = loadModel();
        System.out.println(predict(evaluator, "Чудово")); // 1
        System.out.println(predict(evaluator, "Погано")); // 1
        System.out.println(predict(evaluator, "Зробив друге щеплення Pfizer.")); // 0

        System.out.println(predict_(evaluator, "Чудово")); // 1
        System.out.println(predict_(evaluator, "Погано")); // 1
        System.out.println(predict_(evaluator, "Зробив друге щеплення Pfizer.")); // 0
    }

    private static Evaluator loadModel() throws Exception {
        Evaluator evaluator = new LoadingModelEvaluatorBuilder()
            .load(new FileInputStream(MODEL_PMML))
            .build();
        return evaluator.verify();
    }

    private static int predict(final Evaluator evaluator, final String text) {
        final Map<FieldName, ?> evaluate = evaluator.evaluate(
            Collections.singletonMap(
                FieldName.create("Text"),
                FieldValueUtil.create(text)
            )
        );
        System.out.println(evaluate);
        final Object value = EvaluatorUtil.decodeAll(evaluate).get("Label");
        return (int) value;
    }

    private static int predict_(final Evaluator evaluator, final String text) {
        Map<String, String> features = new HashMap<>();
        features.put("Text", text);
        final List<InputField> inputFields = evaluator.getInputFields();
        Map<FieldName, FieldValue> arguments = new LinkedHashMap<>();
        for (InputField inputField : inputFields) {
            FieldName inputName = inputField.getName();
            String value = features.get(inputName.toString());
            FieldValue inputValue = inputField.prepare(value);
            arguments.put(inputName, inputValue);
        }
        Map<FieldName, ?> results = evaluator.evaluate(arguments);
        Map<String, ?> resultRecord = EvaluatorUtil.decodeAll(results);
        Integer yPred = (Integer) resultRecord.get("Label");
        System.out.printf("PMML output %s\n", results);
        return yPred;
    }

The output of this Java code looks like this:

{Label=VoteProbabilityDistribution{result=1, vote_entries=[-1=1.0, 1=2.0]}}
1
{Label=VoteProbabilityDistribution{result=1, vote_entries=[-1=1.0, 1=2.0]}}
1
{Label=VoteProbabilityDistribution{result=0, vote_entries=[0=2.0, 1=1.0]}}
0
PMML output {Label=VoteProbabilityDistribution{result=1, vote_entries=[-1=1.0, 1=2.0]}}
1
PMML output {Label=VoteProbabilityDistribution{result=1, vote_entries=[-1=1.0, 1=2.0]}}
1
PMML output {Label=VoteProbabilityDistribution{result=0, vote_entries=[0=2.0, 1=1.0]}}
0

But at the same time, when I use the PMML4S library:

    public static void main(String[] args) {
        Model model = loadModel();
        System.out.println(predict(model, "Чудово")); // 1
        System.out.println(predict(model, "Погано")); // -1
        System.out.println(predict(model, "Зробив друге щеплення Pfizer.")); // 0
    }

    private static int predict(final Model model, final String text) {
        final Map<String, Object> predict = model.predict(Collections.singletonMap("Text", text));
        System.out.println(predict);
        return ((Long) predict.get("predicted_Label")).intValue();
    }

    private static Model loadModel() {
        return Model.fromFile(MODEL_PMML);
    }

I get

{predicted_Label=1, probability_-1=0.3333333333333333, probability_0=0.0, probability_1=0.6666666666666666, probability=0.6666666666666666}
1
{predicted_Label=-1, probability_-1=0.6666666666666666, probability_0=0.0, probability_1=0.3333333333333333, probability=0.6666666666666666}
-1
{predicted_Label=0, probability_-1=0.0, probability_0=0.6666666666666666, probability_1=0.3333333333333333, probability=0.6666666666666666}
0
vruusmann commented 3 years ago

I'm intrigued by this issue, but cannot look deeper into it unless I'm provided with a fully reproducible test case. Specifically, I need data to fit a pipeline locally. Maybe it's about the language (Ukrainian, non-Latin) breaking some regexes?

To triangulate the issue a bit more:

  1. Is the issue specific to SVC? Does it happen if you replace SVC with e.g. LogisticRegression?
  2. Does the issue happen if you don't use stop words?
  3. Does the issue happen if you keep the pipeline config exactly the same, but train on English-language text (not Ukrainian)?
rostIvan commented 3 years ago

Is the issue specific to SVC? Does it happen if you replace SVC with eg. LogisticRegression?

With LogisticRegression the issue persists

pipeline = Pipeline([
    ('vec', TfidfVectorizer(
        lowercase=True,
        # stop_words=stop_words_ua,
        ngram_range=(1, 2),
        norm=None,
    )),
    ("clf", LogisticRegression(multi_class="ovr"))
])

Does the issue happen if you don't use stop words?

It seems so, yes

pipeline = Pipeline([
    ('vec', TfidfVectorizer(
        lowercase=True,
        # stop_words=stop_words_ua,
        ngram_range=(1, 2),
        norm=None,
    )),
    ("clf", svm.SVC(kernel='linear', probability=True))
])

Does the issue happen if you keep the pipeline config exactly the same, but train using english language (not UA)?

Not sure, I haven't tested it

Maybe it's about language (UA, non-latin), which breaks some regexes?

Potentially it could also be something with emojis

I need data to fit a pipeline locally

I have provided the data via the email address listed in your GitHub profile

vruusmann commented 3 years ago

Potentially it could be also something with emojis

That's a very good suggestion!

PMML performs TF(-IDF) on tokens that have been stripped of leading and trailing punctuation characters. JPMML-Evaluator uses custom code for identifying punctuation characters, and it is possible that emojis are not handled by it.

If so, then the bug is somewhere here (the list should be extended with more character classes): https://github.com/jpmml/jpmml-model/blob/1.5.15/pmml-model/src/main/java/org/jpmml/model/TermUtil.java#L45-L61
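The general idea can be illustrated with a minimal Python sketch. This is not the actual TermUtil code; it merely mimics the trimming step using Unicode character categories and shows that emoji fall outside the punctuation categories:

import unicodedata

def trim_punctuation(token):
    # Strip leading and trailing characters whose Unicode category is
    # punctuation (P*) - a rough analogue of the PMML trimming step
    is_punct = lambda ch: unicodedata.category(ch).startswith("P")
    start, end = 0, len(token)
    while start < end and is_punct(token[start]):
        start += 1
    while end > start and is_punct(token[end - 1]):
        end -= 1
    return token[start:end]

# Emoji belong to the "Symbol, other" (So) category, not to punctuation,
# so they are not trimmed and stay glued to the token
print(unicodedata.category("🙂"))       # So
print(trim_punctuation("щеплення."))    # щеплення
print(trim_punctuation("добре🙂"))      # добре🙂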

vruusmann commented 3 years ago

@rostIvan I've been experimenting with the data files that you sent to me privately, and my conclusion is that this is a JPMML-SkLearn bug related to the encoding of text tokenization instructions (the TfidfVectorizer.tokenizer attribute).

This issue has got nothing to do with the PMML evaluation side. The JPMML-Evaluator library is "correctly" following incorrect tokenization instructions, and therefore fails. The PMML4S library is "incorrectly" using internal text tokenization instructions (ignoring the ones encoded in the PMML document) and, on the surface, appears to be giving more correct predictions, but is fundamentally off track.

I'll detail my investigative actions below.

vruusmann commented 3 years ago

My error diagnostics procedure:

Generating a clean test data file. The original data file contains sentences with "unexpected" comma characters, which confuse my naive CSV parser. These sentences will be removed:

$ grep -v '^"' data.csv > data-clean.csv
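For reference, a minimal Python equivalent of this cleanup (assuming the same data.csv / data-clean.csv file names; sentences containing commas are quoted by the CSV writer, so their lines start with a double-quote character):

# Drop the CSV lines that start with a double quote
with open("data.csv", encoding="utf-8") as src, \
     open("data-clean.csv", "w", encoding="utf-8") as dst:
    for line in src:
        if not line.startswith('"'):
            dst.write(line)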

Updating the Scikit-Learn pipeline to use a probabilistic classifier (e.g. LogisticRegression) instead of a non-probabilistic one (e.g. SVC). Checking PMML predicted probabilities against Scikit-Learn predicted probabilities will be the criterion for deciding whether the text tokenization happens correctly or not.

from pandas import DataFrame
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.feature_extraction.text import Matcher, Splitter
from sklearn2pmml.pipeline import PMMLPipeline

import pandas

df = pandas.read_csv("data-clean.csv")

X = df['Text']
y = df['Label']

with open("stopwords_ua.txt") as file:
    stop_words_ua = file.readlines()
    stop_words_ua = [stop_word_ua.strip() for stop_word_ua in stop_words_ua]

pipeline = PMMLPipeline([
    ('vec', TfidfVectorizer(
        lowercase=True,
        stop_words=stop_words_ua,
        ngram_range=(1, 2),
        norm=None
    )),
    ("clf", LogisticRegression())
])
pipeline.fit(X, y)

label = DataFrame(pipeline.predict(X), columns = ["Label"])
label_proba = DataFrame(pipeline.predict_proba(X), columns = ["probability(-1)", "probability(0)", "probability(1)"])
label = pandas.concat((label, label_proba), axis = 1)

label.to_csv("pipeline.csv", index = False)

sklearn2pmml(pipeline, "pipeline.pmml")

Finally, checking probabilities using the org.jpmml.evaluator.example.TestingExample command-line application:

$ java -cp pmml-evaluator-example-executable-1.5-SNAPSHOT.jar org.jpmml.evaluator.example.TestingExample --model pipeline.pmml --input data-clean.csv --expected-output pipeline.csv --separator ","
vruusmann commented 3 years ago

Current results (JPMML-SkLearn 1.6.28 plus JPMML-Evaluator 1.5.15):

Default tokenizer

Config:

tfidf = TfidfVectorizer(
    lowercase=True,
    stop_words=stop_words_ua,
    ngram_range=(1, 2),
    norm=None,
    tokenizer = None
)

There are 3450 conflicts (for a validation data set of 3720 data records).

Splitter-mode tokenizer

Config:

from sklearn2pmml.feature_extraction.text import Splitter

tfidf = TfidfVectorizer(
    lowercase=True,
    stop_words=stop_words_ua,
    ngram_range=(1, 2),
    norm=None,
    tokenizer = Splitter(),
    max_features=100
)

There are 14 conflicts. Many problematic sentences appear to contain Ukrainian-specific(?) whitespace character(s). For example, in "Pfizer і Moderna", the first whitespace is a space character, but the second one is something else (a sketch for locating such characters follows these results).

Matcher-mode tokenizer

Config:

from sklearn2pmml.feature_extraction.text import Matcher

tfidf = TfidfVectorizer(
    lowercase=True,
    stop_words=stop_words_ua,
    ngram_range=(1, 2),
    norm=None,
    tokenizer = Matcher(),
    max_features=100
)

There are 14 conflicts again.
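The records containing irregular whitespace can be located with a short diagnostic sketch (assuming the data-clean.csv file and the Text column from the scripts above):

import re
import unicodedata

import pandas

df = pandas.read_csv("data-clean.csv")

# Report any whitespace character that is not a plain ASCII space
# (e.g. NO-BREAK SPACE, U+00A0), together with the offending record
irregular_ws = re.compile(r"[^\S ]")

for text in df["Text"]:
    for ch in set(irregular_ws.findall(text)):
        print(hex(ord(ch)), unicodedata.name(ch, "UNKNOWN"), repr(text))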

vruusmann commented 3 years ago

@rostIvan TLDR: When working with Ukrainian text, you'd need to specify a custom text tokenizer (one of sklearn2pmml.feature_extraction.text.Splitter or sklearn2pmml.feature_extraction.text.Matcher), plus sanitize/standardize the whitespace.

Both custom text tokenizers allow you to override the regular expression. Please experiment; perhaps you can find a regular expression that captures Ukrainian whitespace characters as well. Maybe replacing the Latin-style regex Splitter(word_separator_re = "\s+") with the Unicode-style regex Splitter(word_separator_re = "(?u)\s+") will suffice?
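A possible pre-processing sketch along these lines; the normalize_whitespace helper is hypothetical (not part of SkLearn2PMML), and X and stop_words_ua refer to the earlier scripts:

import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn2pmml.feature_extraction.text import Splitter

def normalize_whitespace(text):
    # Hypothetical helper: collapse any run of Unicode whitespace
    # (including NO-BREAK SPACE) into a single ASCII space
    return re.sub(r"\s+", " ", text).strip()

X = X.apply(normalize_whitespace)  # X is the pandas Series of texts from above

tfidf = TfidfVectorizer(
    lowercase=True,
    stop_words=stop_words_ua,
    ngram_range=(1, 2),
    norm=None,
    tokenizer=Splitter(word_separator_re=r"(?u)\s+")
)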

Finally, the PMML4S library is not doing a correct job by any means - it simply ignores the regex pattern that is enclosed in the PMML document, and uses an internal/hard-coded one. The correct behaviour would be to make incorrect predictions :-)

rostIvan commented 2 years ago

@vruusmann, that's right: after adding tokenizer=Splitter(), it now works OK:

pipeline = Pipeline([
    ('vec', TfidfVectorizer(
        lowercase=True,
        tokenizer=Splitter(),
        stop_words=stop_words_ua,
        ngram_range=(1, 2),
        norm=None,
    )),
    ("clf", svm.SVC(kernel='linear', probability=True))
])
pipeline.fit(X, y)

The Java evaluation code is unchanged:

    public static void main(String[] args) throws Exception {
        final Evaluator evaluator = loadModel();
        System.out.println(predict(evaluator, "Чудово"));
        System.out.println(predict(evaluator, "Погано"));
        System.out.println(predict(evaluator, "Зробив друге щеплення Pfizer."));
    }

    private static Evaluator loadModel() throws Exception {
        Evaluator evaluator = new LoadingModelEvaluatorBuilder()
            .load(new FileInputStream(MODEL_PMML))
            .build();
        return evaluator.verify();
    }

    private static int predict(final Evaluator evaluator, final String text) {
        final Map<FieldName, ?> evaluate = evaluator.evaluate(
            Collections.singletonMap(
                FieldName.create("Text"),
                FieldValueUtil.create(text)
            )
        );
        System.out.println(evaluate);
        final Object value = EvaluatorUtil.decodeAll(evaluate).get("Label");
        return (int) value;
    }

Output:

{Label=VoteProbabilityDistribution{result=1, vote_entries=[-1=1.0, 1=2.0]}}
1
{Label=VoteProbabilityDistribution{result=-1, vote_entries=[-1=2.0, 1=1.0]}}
-1
{Label=VoteProbabilityDistribution{result=0, vote_entries=[0=2.0, 1=1.0]}}
0

Thank you for investigating this :)

vruusmann commented 2 years ago

that's right, when I added tokenizer=Splitter() now it works ok

If you want to build an integration test, then I'd suggest replacing SVC with LogisticRegression (or some other probabilistic classifier), and asserting that the predicted probabilities agree within 1e-13.

I did exactly this, and found that 14 data records out of 3720 are incorrect (with SVC they would likely appear to be OK).

I'm re-opening this issue, because I'd like to figure out how to make these 14 data records behave correctly. Around 10 of them suffer from an irregular whitespace character...
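Such a probability check could look roughly like this. This is only a sketch: pipeline.csv is the expected-output file generated above, while evaluator.csv stands in for a hypothetical file holding the PMML-side probabilities:

import numpy
import pandas

expected = pandas.read_csv("pipeline.csv")   # Scikit-Learn predictions written by the script above
actual = pandas.read_csv("evaluator.csv")    # hypothetical PMML evaluator output with the same columns

prob_columns = ["probability(-1)", "probability(0)", "probability(1)"]

# Assert that PMML-side probabilities reproduce the Scikit-Learn ones
assert numpy.allclose(expected[prob_columns].values,
                      actual[prob_columns].values,
                      rtol=0, atol=1e-13)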

vruusmann commented 2 years ago

What can be done right now:

  1. The converter should check whether the TF(-IDF) vocabulary consists of ISO-Latin characters or not. If non-ISO-Latin characters are present, then the converter should raise an exception if the tokenizer has not been properly configured (see the sketch after this list).
  2. The tokenization regexes need updating for non-ISO-Latin languages. Right now, Python and Java generate different word sets, but they should be generating the same ones. I believe that the Java side is doing the correct job (it is Unicode-aware).
  3. Add a non-ISO-Latin integration test!
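A minimal sketch of the vocabulary check proposed in point 1 (this is not existing converter behaviour, just an illustration):

import re

from sklearn2pmml.feature_extraction.text import Matcher, Splitter

LATIN = re.compile(r"^[\x00-\x7F]*$")  # plain ASCII here; widen to Latin-1 as needed

def check_vocabulary(tfidf):
    # Raise if a fitted TfidfVectorizer has non-Latin terms but no
    # PMML-compatible tokenizer configured
    non_latin = [term for term in tfidf.vocabulary_ if not LATIN.match(term)]
    if non_latin and not isinstance(tfidf.tokenizer, (Matcher, Splitter)):
        raise ValueError("Found {} non-Latin vocabulary terms; "
                         "please configure a Matcher or Splitter tokenizer".format(len(non_latin)))

# Usage: check_vocabulary(pipeline.named_steps["vec"])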
vruusmann commented 1 year ago

Closing as non-actionable.

Looks like an input issue (very extravagant Unicode whitespace characters, which don't match the "whitespace" RegEx pattern) rather than a SkLearn2PMML/Scikit-Learn level technical issue.