Closed CarlaFernandez closed 2 years ago
One-hot encoding (OHE) is a "helper" transformation, not a proper "do something useful with data" transformation. The JPMML conversion libraries do their best to clean pipelines of no-op/helper transformations. So the OHE transformation is erased on purpose; it is safe to do so, because it does not change the behaviour of the pipeline in any way.
Technical explanation: Apache Spark transformers and estimators typically accept only all-numeric feature vectors; you can't have a feature vector that contains six numeric elements and one string element (i.e. you can't put a string into a floating-point array). Therefore, Apache Spark transforms the string column into a numeric column, simply by performing a dict mapping from unique string values to unique integer indexes.
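To make the dict mapping concrete, here is a minimal sketch in plain Python (no Spark required). The function names `fit_string_index` and `one_hot` and the sample `embarked` column are made up for illustration; they are not Spark or JPMML APIs. The sketch assumes frequency-descending ordering with alphabetical tie-breaking, and it keeps a slot for every category, whereas Spark's encoder can be configured to drop the last one.

```python
from collections import Counter


def fit_string_index(values):
    """Map each unique string to an integer index (most frequent label gets 0)."""
    counts = Counter(values)
    # Assumption: order by descending frequency, ties broken alphabetically.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


def one_hot(index, size):
    """Turn an integer index into an all-numeric vector with a single 1.0."""
    vector = [0.0] * size
    vector[index] = 1.0
    return vector


# Hypothetical sample of a Titanic-style "Embarked" column.
embarked = ["S", "C", "S", "Q", "S", "C"]

mapping = fit_string_index(embarked)
encoded = [one_hot(mapping[value], len(mapping)) for value in embarked]

print(mapping)
print(encoded[0])
```

After this step the column is purely numeric, which is all the downstream estimator needs; the original strings survive only in `mapping`.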
Now, JPMML converters are able to detect such dummy re-mappings from string to integer index and, during the conversion, perform a reverse mapping, replacing each integer index with its original string.
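The reverse mapping can be sketched in the same spirit. This is only an illustration of the idea, not JPMML-SparkML's actual code: the labels, the weights, and the helper names `relabel`, `score_encoded` and `score_relabeled` are all hypothetical. It shows that a model term addressed by one-hot slot can be re-expressed as a term addressed by the original string, with identical predictions.

```python
# Hypothetical fitted state: index -> label, and one weight per one-hot slot.
index_to_label = ["S", "C", "Q"]
coefficients = [0.25, -0.5, 1.0]


def relabel(weights, labels):
    """Replace positional one-hot slots with terms keyed by the original string."""
    return {label: weight for label, weight in zip(labels, weights)}


def score_encoded(value):
    """Score the way the indexed pipeline does: string -> index -> one-hot -> dot product."""
    index = index_to_label.index(value)
    vector = [1.0 if i == index else 0.0 for i in range(len(index_to_label))]
    return sum(w * x for w, x in zip(coefficients, vector))


def score_relabeled(terms, value):
    """Score directly on the string, with no index or one-hot step."""
    return terms[value]


terms = relabel(coefficients, index_to_label)
print(terms)

# Both routes give the same prediction for every category level.
for level in index_to_label:
    assert score_encoded(level) == score_relabeled(terms, level)
```

Because the two scoring routes agree for every level, dropping the index and one-hot steps loses nothing, and the relabeled terms are easier to read.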
If you open the PMML document in a text editor, you will see the category levels as their original strings. It's very easy for a human to see where particular category levels are used and what their contributions are. Would you really prefer to see those re-mapped OHE indices in their place?
TL;DR: JPMML-SparkML deletes OHE transformations on purpose, because they only obfuscate the pipeline/model. It's a feature, not a bug. The pipeline will still make correct predictions.
Hello @vruusmann, thank you for this amazing library. I was wondering how PMML handles the transformation of a Spark pipeline with a OneHotEncoder stage. I'm looking at a simple model built on the Titanic dataset and transformed using PMML, but I'm not seeing any specific reference to the encoding. Thanks