Open mrk-andreev opened 1 year ago
I need special symbols like { or : but current implementations don't allow this.
Can you describe the pipeline configuration that feeds into this CountVectorizer stage? Is there any text pre-processing going on, how is the text tokenized, etc.?
The point is that the JPMML-SparkML library is following the PMML specification when deciding what can and what cannot be allowed. Punctuation chars can be significant (depending on the text tokenization mode), so they cannot be enabled/disabled at will.
Moreover it allows me to create invalid xml documents when I have invalid XML Characters.
Care to provide an example of this behaviour?
It should be the case that JPMML-SparkML is populating an org.dmg.pmml.PMML object with whatever strings it pleases, and in the end this object is marshalled to a PMML XML document using the standard JAXB technology.
You're basically claiming that the JAXB marshaller is somehow misbehaving (e.g. not encoding/escaping some characters).
In this example I use a patched CountVectorizerModelConverter that removes punctuation from the vocabulary (each punctuated term is replaced with "-"):
for(int i = 0; i < vocabulary.length; i++){
    String term = vocabulary[i];
    // Patched behaviour: substitute "-" for any term that contains punctuation
    if(TermUtil.hasPunctuation(term)){
        result.add(new TermFeature(encoder, defineFunction, documentFeature, "-"));
    } else {
        result.add(new TermFeature(encoder, defineFunction, documentFeature, term));
    }
}
I put these jars into the pyspark jars directory (venv/lib/python3.8/site-packages/pyspark/jars) and use pyspark2pmml for model export:
from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel
from pyspark2pmml import PMMLBuilder
spark = SparkSession.builder.master('local[*]').getOrCreate()
pm = PipelineModel.load('./model.bin')
pmmlBuilder = PMMLBuilder(spark.sparkContext, spark.createDataFrame([('', '')], schema=['lang', 'content']), pm)
pmmlBuilder.buildFile('./model.pmml')
Attachments (remove the .zip suffix from the .z01, .z02 parts; the suffix was required for upload): parts_model.bin.zip parts_model.bin.z01.zip parts_model.bin.z02.zip
The output PMML model will contain invalid XML:
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import java.io.File;

public class Main {
    public static void main(String[] args) throws Exception {
        DocumentBuilder parser = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        parser.parse(new File("./model.pmml"));
    }
}
[Fatal Error] model.pmml:5444:29: An invalid XML character (Unicode: 0x0) was found in the value of attribute "name" and element is "DerivedField".
Exception in thread "main" org.xml.sax.SAXParseException; systemId: file:./model.pmml; lineNumber: 5444; columnNumber: 29; An invalid XML character (Unicode: 0x0) was found in the value of attribute "name" and element is "DerivedField".
at java.xml/com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:262)
at java.xml/com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:342)
at java.xml/javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:206)
at ai.conundrum.Main.main(Main.java:10)
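The parser stops at the first offending character, so it may help to list every position that violates the XML 1.0 character rules before deciding how to fix the export. The following is only a minimal sketch: the class name InvalidXmlCharScanner and its scan method are hypothetical helpers, not part of JPMML-SparkML or the JDK, and the reported columns are counted in Java chars, which may not match the parser's numbering exactly.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper (not part of any library).
final class InvalidXmlCharScanner {

    /** Returns "line:column U+XXXX" for every char that XML 1.0 does not allow. */
    static List<String> scan(String text) {
        List<String> hits = new ArrayList<>();
        int line = 1;
        int column = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            column++;
            if (c == '\n') {
                line++;
                column = 0;
                continue;
            }
            // Simplification: surrogate code units are accepted here,
            // assuming they form well-formed pairs.
            boolean allowed = c == 0x9 || c == 0xD || (c >= 0x20 && c <= 0xFFFD);
            if (!allowed) {
                hits.add(String.format("%d:%d U+%04X", line, column, (int) c));
            }
        }
        return hits;
    }

    public static void main(String[] args) throws Exception {
        // Assumes the file is UTF-8 encoded; pass e.g. ./model.pmml as the argument.
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        scan(new String(bytes, StandardCharsets.UTF_8)).forEach(System.out::println);
    }
}
```

Running it against the exported file should show whether the NUL character is an isolated case or whether several vocabulary terms are affected.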
Attachments (remove the .zip suffix from the .z01, .z02 parts; the suffix was required for upload): parts_model.pmml.z01.zip parts_model.pmml.z02.zip parts_model.pmml.z03.zip parts_model.pmml.zip
In vim, this part of the file (model.pmml:5444) looks like:
I am trying to export models that detect programming languages from input strings. That means I need special symbols like { or : but the current implementation doesn't allow this. Moreover, it allows me to create invalid XML documents when the vocabulary contains invalid XML characters. I suggest replacing the current implementation, which doesn't allow exporting models with punctuation symbols, with a new implementation that filters invalid XML chars.
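To make the suggestion concrete, here is a rough sketch of what "filtering invalid XML chars" could mean for a single vocabulary term. It follows the Char production of the XML 1.0 specification; the class XmlCharFilter and its methods are hypothetical names chosen for illustration, not existing JPMML-SparkML API, and how such a filter would be wired into CountVectorizerModelConverter is left open.

```java
// Hypothetical helper (not part of JPMML-SparkML).
final class XmlCharFilter {

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Drops every code point that cannot appear in an XML 1.0 document. */
    static String stripInvalidXmlChars(String term) {
        StringBuilder sb = new StringBuilder(term.length());
        term.codePoints()
            .filter(XmlCharFilter::isValidXmlChar)
            .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Punctuation such as { and : is kept; the NUL character is removed.
        System.out.println(stripInvalidXmlChars("{\u0000:")); // prints {:
    }
}
```

Note that stripping characters can make two distinct vocabulary terms collapse into the same string (or into an empty one), so a real implementation would probably need to decide how to handle such collisions.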