jpmml / jpmml-sparkml

Java library and command-line application for converting Apache Spark ML pipelines to PMML
GNU Affero General Public License v3.0

Fix StringIndexerModel missing value issue #105

Closed: python279 closed this pull request 3 years ago

python279 commented 3 years ago

Add missingValueReplacement="__unknown" and missingValueTreatment="asValue" to the JPMML MiningField so that evaluation does not fail when the string field is missing.
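For readers unfamiliar with the two attributes, here is a minimal sketch of the MiningField element the proposal describes, built with Python's standard library. The field name "color" and the helper name mining_field are hypothetical and used only for illustration; the attribute names and the "asValue" treatment come from the proposal above. This is not the code in the PR.

```python
import xml.etree.ElementTree as ET


def mining_field(name, replacement="__unknown"):
    """Build a PMML MiningField element carrying the proposed attributes.

    Hypothetical helper: it only illustrates the shape of the markup,
    it is not how jpmml-sparkml generates its output.
    """
    return ET.Element(
        "MiningField",
        {
            "name": name,
            # Value substituted when the input field is missing.
            "missingValueReplacement": replacement,
            # Tells the evaluator that the replacement is a fixed value.
            "missingValueTreatment": "asValue",
        },
    )


# "color" is an assumed field name, not taken from the PR.
field = mining_field("color")
print(ET.tostring(field, encoding="unicode"))
```

With these attributes, a PMML evaluator would substitute "__unknown" for a missing input before the field reaches the rest of the model.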

vruusmann commented 3 years ago

Missing values and invalid values are two different concepts.

The suggested fix is not viable, because it would make the PMML model behave differently from Apache Spark ML in the same situation.

If your data contains missing values, then you need to use an imputer to convert them to non-missing values.
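As a rough illustration of what "use an imputer" means for a string field, the sketch below replaces missing values with the most frequent training label before the data reaches the model. It is plain Python, not the Spark ML API; the function names and sample values are assumptions made for this example.

```python
from collections import Counter


def fit_mode(values):
    """Return the most frequent non-missing value seen during training."""
    counts = Counter(v for v in values if v is not None)
    if not counts:
        raise ValueError("cannot fit an imputer on all-missing data")
    return counts.most_common(1)[0][0]


def impute(values, fill):
    """Replace every missing (None) entry with the fitted fill value."""
    return [fill if v is None else v for v in values]


# Assumed sample data: "green" is the most frequent training label.
training = ["green", "red", "green", None, "yellow"]
fill_value = fit_mode(training)

# After imputation the model only ever sees non-missing values.
scoring = impute(["red", None], fill_value)
```

The point of doing this outside the model is that the replacement becomes an explicit, documented preprocessing step rather than something the encoder does silently.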

python279 commented 3 years ago

Spark ML works fine even when the string field is missing, but the JPMML evaluator does not. Adding missingValueReplacement="__unknown" and missingValueTreatment="asValue" to the JPMML MiningField fixes the problem.

vruusmann commented 3 years ago

AFAIK, Apache Spark ML models do not accept missing values. You must handle missing values outside of the model, typically using an imputer.

Are you suggesting that StringIndexer is accepting missing values, and treats them the same as invalid values?

For example, consider a StringIndexer that has been trained on a three-value value space "green", "yellow" and "red". What is the output of this StringIndexer when it is asked to encode a "blue" value (invalid - was not present in the original training dataset) and a null value (missing)? Is it __unknown in both cases?

Also, does the identical behaviour persist when customizing the StringIndexer#handleInvalid attribute?
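To make the question concrete, here is a toy encoder that keeps the two concepts separate. It is a sketch for discussion only and does not claim to reproduce Apache Spark ML's behaviour; the class name, the nulls_are_invalid switch and the returned indices are all assumptions. The handle_invalid options mirror the "error", "skip" and "keep" settings of StringIndexer#handleInvalid.

```python
class ToyStringIndexer:
    """Toy model of a label encoder; NOT Spark's implementation."""

    def __init__(self, labels, handle_invalid="error", nulls_are_invalid=True):
        if handle_invalid not in ("error", "skip", "keep"):
            raise ValueError("unsupported handle_invalid: %r" % handle_invalid)
        self.index = {label: i for i, label in enumerate(labels)}
        self.handle_invalid = handle_invalid
        # Hypothetical switch: should a missing value (None) be routed
        # through the same policy as an unseen label?
        self.nulls_are_invalid = nulls_are_invalid

    def encode(self, value):
        if value is None and not self.nulls_are_invalid:
            # Missing is treated as its own failure mode.
            raise ValueError("missing value")
        if value in self.index:
            return self.index[value]
        # Reached for unseen labels, and for None when nulls count as invalid.
        if self.handle_invalid == "keep":
            return len(self.index)  # extra bucket after the known labels
        if self.handle_invalid == "skip":
            return None  # signals that the row would be dropped
        raise ValueError("invalid value: %r" % (value,))


labels = ["green", "yellow", "red"]

# Variant A: missing and invalid share one bucket.
lenient = ToyStringIndexer(labels, handle_invalid="keep")

# Variant B: invalid values get a bucket, missing values still fail.
strict = ToyStringIndexer(labels, handle_invalid="keep", nulls_are_invalid=False)
```

In variant A both "blue" and a null land in the same extra bucket, which is the behaviour the proposed MiningField attributes would emulate. In variant B only "blue" does, and the null still raises. Which variant matches a given Apache Spark ML version, and under which handleInvalid setting, is what needs to be confirmed.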

What's your target Apache Spark ML version? This PR was submitted against the 3.0.X version.