Add XML invalid chars filter instead punctuation interrupt

mrk-andreev commented 1 year ago

I am trying to export models that detect programming languages from input strings. That means I need special symbols such as { or :, but the current implementation does not allow them. Moreover, it lets me create invalid XML documents when the input contains invalid XML characters. I suggest replacing the current implementation, which refuses to export models whose terms contain punctuation, with a new one that filters out invalid XML characters.

package com.sun.org.apache.xml.internal.utils;

public class XMLChar {

    /**
     * Returns true if the specified character is valid. This method
     * also checks the surrogate character range from 0x10000 to 0x10FFFF.
     * <p>
     * If the program chooses to apply the mask directly to the
     * <code>CHARS</code> array, then they are responsible for checking
     * the surrogate character range.
     *
     * @param c The character to check.
     */
    public static boolean isValid(int c) {
        return (c < 0x10000 && (CHARS[c] & MASK_VALID) != 0) ||
               (0x10000 <= c && c <= 0x10FFFF);
    }
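The class above lives in an internal JDK package, so a filter would probably need its own check. Below is a minimal sketch of what such a filter could look like, assuming the XML 1.0 Char ranges (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF). The names XmlCharFilter, isValidXmlChar and stripInvalid are hypothetical and are not part of JPMML-SparkML or the JDK.

```java
/** Hypothetical helper for illustration; not part of JPMML-SparkML or the JDK. */
final class XmlCharFilter {

    private XmlCharFilter() {
    }

    /** Returns true if the code point is allowed by the XML 1.0 Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of the input with every invalid XML 1.0 code point removed. */
    static String stripInvalid(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        // codePoints() yields unpaired surrogates as-is, so they are dropped by the filter.
        s.codePoints()
            .filter(XmlCharFilter::isValidXmlChar)
            .forEach(sb::appendCodePoint);
        return sb.toString();
    }
}
```

With this sketch, a term such as "{" or ":" passes through unchanged, while a NUL character inside a term is dropped before it can reach the XML output.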

vruusmann commented 1 year ago

I need special symbols like { or : but current implementations don't allow this.

Can you describe the pipeline configuration that feeds into this CountVectorizer stage? Is there any text pre-processing going on, how is the text tokenized, etc.?

The point is that the JPMML-SparkML library is following the PMML specification when deciding what can and what cannot be allowed. Punctuation chars can be significant (depending on the text tokenization mode), so they cannot be enabled/disabled at will.
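To illustrate why the tokenization mode matters, here is a small sketch using plain Java regular expressions. It is only an analogy and does not reproduce how PMML or JPMML-SparkML tokenize text; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical demo; not the PMML or JPMML-SparkML tokenizer. */
final class TokenizationDemo {

    private TokenizationDemo() {
    }

    /** Splits on runs of whitespace, so punctuation stays inside the tokens. */
    static List<String> splitOnWhitespace(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    /** Splits on runs of non-word characters, so punctuation is discarded. */
    static List<String> splitOnNonWord(String text) {
        return Arrays.asList(text.split("\\W+"));
    }
}
```

For the input "x = {a: 1}", the whitespace split yields the tokens x, =, {a: and 1}, while the non-word split yields only x, a and 1. A vocabulary term containing { can therefore match under one mode and never match under the other.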

Moreover it allows me to create invalid xml documents when I have invalid XML Characters.

Care to provide an example about this behaviour?

It should be the case that JPMML-SparkML populates an org.dmg.pmml.PMML object with whatever strings it pleases, and in the end this object is marshalled to a PMML XML document using the standard JAXB technology.

You're basically claiming that the JAXB marshaller is somehow misbehaving (e.g. not encoding/escaping some characters).

mrk-andreev commented 1 year ago

In this example I use a patched CountVectorizerModelConverter that removes punctuation from the vocabulary (any term containing punctuation is replaced with "-"):

for(int i = 0; i < vocabulary.length; i++){
    String term = vocabulary[i];

    if(TermUtil.hasPunctuation(term)){
        result.add(new TermFeature(encoder, defineFunction, documentFeature, "-"));
    } else {
        result.add(new TermFeature(encoder, defineFunction, documentFeature, term));
    }
}

patched-pmml.zip

I put these JARs into the PySpark jars directory (venv/lib/python3.8/site-packages/pyspark/jars) and use pyspark2pmml for the model export:

from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel
from pyspark2pmml import PMMLBuilder

spark = SparkSession.builder.master('local[*]').getOrCreate()
pm = PipelineModel.load('./model.bin')
pmmlBuilder = PMMLBuilder(spark.sparkContext, spark.createDataFrame([('', '')], schema=['lang', 'content']), pm)
pmmlBuilder.buildFile('./model.pmml')

(Remove the .zip suffix from the .z01, .z02 parts; it was required for upload.) parts_model.bin.zip parts_model.bin.z01.zip parts_model.bin.z02.zip

The output PMML model will contain invalid XML:

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import java.io.File;

public class Main {
    public static void main(String[] args) throws Exception {
        DocumentBuilder parser = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        parser.parse(new File("./model.pmml"));
    }
}

[Fatal Error] model.pmml:5444:29: An invalid XML character (Unicode: 0x0) was found in the value of attribute "name" and element is "DerivedField".
Exception in thread "main" org.xml.sax.SAXParseException; systemId: file:./model.pmml; lineNumber: 5444; columnNumber: 29; An invalid XML character (Unicode: 0x0) was found in the value of attribute "name" and element is "DerivedField".
    at java.xml/com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:262)
    at java.xml/com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:342)
    at java.xml/javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:206)
    at ai.conundrum.Main.main(Main.java:10)
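The same failure can be reproduced without the attached model. The sketch below is a minimal, assumed reproduction: it feeds a one-element document with a NUL character in the name attribute to the standard JAXP parser. The class name NulAttributeCheck and the sample attribute values are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

/** Hypothetical reproduction helper; not part of JPMML-SparkML. */
final class NulAttributeCheck {

    private NulAttributeCheck() {
    }

    /** Returns true if the JAXP parser accepts the document, false if it rejects it. */
    static boolean parses(String xml) {
        try {
            DocumentBuilder parser = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            // Covers SAXParseException, which is what an invalid XML character triggers.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Calling parses with a DerivedField element whose name contains { and : should return true, while the same element with a NUL character in the name should return false, which matches the fatal error reported above.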

(Remove the .zip suffix from the .z01, .z02 parts; it was required for upload.) parts_model.pmml.z01.zip parts_model.pmml.z02.zip parts_model.pmml.z03.zip parts_model.pmml.zip

In vim, this part of the file (model.pmml:5444) looks like:

jpmml / jpmml-sparkml

Add XML invalid chars filter instead punctuation interrupt #136