jpmml / jpmml-sparkml

Java library and command-line application for converting Apache Spark ML pipelines to PMML
GNU Affero General Public License v3.0

Why is One-Hot-Encoding not visible in PMML? #124

Closed CarlaFernandez closed 2 years ago

CarlaFernandez commented 2 years ago

Hello @vruusmann, thank you for this amazing library. I was wondering how PMML handles the conversion of a Spark pipeline with a OneHotEncoder stage. I'm looking at a simple model built on the titanic dataset and converted to PMML, but I'm not seeing any specific reference to the encoding. Thanks

vruusmann commented 2 years ago

> I'm looking at a simple model built on the titanic dataset and transformed using pmml, but I'm not seeing any specific reference to the encoding.

One-Hot-Encoding (OHE) is a "helper"-transformation, not a proper "do something useful with data"-transformation. The JPMML conversion libraries do their best to clean pipelines of no-op/helper transformations. So, the OHE transformation is erased on purpose; it's safe to do so, because it does not change the behaviour of the pipeline in any way.

Technical explanation: Apache Spark transformers and estimators typically accept only all-numeric feature vectors; you can't have a feature vector that contains six numeric elements and one string element (i.e. you can't put a string into a floating-point array). Therefore, Apache Spark transforms the string column into a numeric column by performing a dict mapping from unique string values to unique integer indexes.
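To make the string-to-index step concrete, here is a minimal plain-Python sketch (it does not use Spark). The column values and the helper names `build_index` and `one_hot` are made up for illustration, and indexes are assigned in order of first appearance for simplicity; Spark's own StringIndexer orders labels by frequency by default, so real indexes may differ.

```python
# Hypothetical categorical column; the values are illustrative only.
embarked = ["S", "C", "S", "Q"]


def build_index(values):
    """Map each unique string to an integer index (order of first appearance)."""
    index = {}
    for value in values:
        if value not in index:
            index[value] = len(index)
    return index


def one_hot(position, size):
    """Expand an integer index into a numeric vector with a single 1.0."""
    return [1.0 if i == position else 0.0 for i in range(size)]


index = build_index(embarked)            # string -> integer index
indexed = [index[v] for v in embarked]   # numeric column
encoded = [one_hot(i, len(index)) for i in indexed]  # all-numeric vectors
```

After these two steps the column contains only floats, which is what lets it sit in the same feature vector as the genuinely numeric columns.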

Now, JPMML converters are able to detect such dummy re-mappings from string to integer index and, during the conversion, perform a reverse mapping, replacing the integer indexes with the original strings.
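The idea of the reverse mapping can be sketched in plain Python. This is only an illustration of the concept, not how JPMML-SparkML is implemented; the index, the coefficient values, and the helper name `relabel` are all hypothetical.

```python
# Hypothetical string-to-index mapping learned by the pipeline.
index = {"male": 0, "female": 1}

# Hypothetical model coefficients, one per one-hot position.
coefficients = [-1.25, 1.25]


def relabel(index, coefficients):
    """Key each coefficient by its original string instead of its position."""
    labels = {position: label for label, position in index.items()}
    return {labels[i]: value for i, value in enumerate(coefficients)}


by_label = relabel(index, coefficients)


def contribution(category):
    """Look up a category's contribution directly, with no encoding step."""
    return by_label[category]
```

Once the coefficients are keyed by the original strings, the indexing and encoding steps carry no extra information, which is why they can be dropped without changing the result.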

If you open the PMML document in a text editor, you will see the category levels as the original strings. It's very easy for a human to see where particular category levels are used and what their contributions are. Would you really prefer to see those re-mapped OHE indices in their place?

TLDR: JPMML-SparkML deletes OHE transformations on purpose, because they only obfuscate the pipeline/model. It's a feature, not a bug. The pipeline will still make correct predictions.