python279 closed 3 years ago
Missing values and invalid values are two different concepts.
The suggested fix is not viable, because it behaves differently than Apache Spark ML would behave in such a situation.
If your data contains missing values, then you need to use an imputer to convert them to non-missing values.
Spark ML works fine even when the string field is missing, but the JPMML evaluator doesn't. Setting missingValueReplacement="__unknown" and missingValueTreatment="asValue" on the JPMML MiningField fixes the problem.
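For reference, the proposed attributes would appear on a MiningField element roughly as in the fragment below. This is only a sketch: the field name "color" is a hypothetical placeholder, and the surrounding model elements are omitted.

```xml
<MiningSchema>
  <!-- "color" is a hypothetical field name used only for illustration -->
  <MiningField name="color"
               missingValueReplacement="__unknown"
               missingValueTreatment="asValue"/>
</MiningSchema>
```

With these attributes, a PMML evaluator substitutes "__unknown" whenever the input for that field is missing. Invalid values are governed separately, by the invalidValueTreatment attribute.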
AFAIK, Apache Spark ML models do not accept missing values. You must handle missing values outside of the model, typically using an imputer.
Are you suggesting that StringIndexer is accepting missing values, and treats them the same as invalid values?
For example, consider a StringIndexer that has been trained on a three-value value space "green", "yellow" and "red". What is the output of this StringIndexer when it is asked to encode a "blue" value (invalid - was not present in the original training dataset) and a null value (missing)? Is it __unknown in both cases?
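The distinction behind this question can be sketched in a few lines of plain Python. The classify helper below is hypothetical and is not part of Apache Spark ML or JPMML; it only illustrates the three categories for the "green"/"yellow"/"red" example and makes no claim about what StringIndexer actually returns.

```python
from typing import Optional, Sequence


def classify(value: Optional[str], labels: Sequence[str]) -> str:
    """Label a raw input as 'missing', 'invalid', or 'valid'.

    Hypothetical helper for illustration only.
    """
    if value is None:
        # No value was supplied at all.
        return "missing"
    if value not in labels:
        # A value was supplied, but it was not seen during training.
        return "invalid"
    return "valid"


labels = ["green", "yellow", "red"]
for raw in ["red", "blue", None]:
    print(repr(raw), "->", classify(raw, labels))
```

Whether a given encoder collapses the "missing" and "invalid" cases into one output is exactly what needs to be checked against the target Spark ML version.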
Also, does the identical behaviour persist when customizing the StringIndexer#handleInvalid attribute?
What's your target Apache Spark ML version? This PR was submitted against the 3.0.X version.
Add missingValueReplacement="__unknown" and missingValueTreatment="asValue" to the JPMML MiningField, so that evaluation does not fail when the string field is missing.