Open woodly0 opened 9 months ago
It's pretty unusual to start the pipeline with an ExpressionTransformer
object. There should be some "clarifications" in front of it.
The simplest way to make a clarification is to give a feature specification using one of the SkLearn2PMML decorators (eg. sklearn2pmml.decoration.CategoricalDomain, OrdinalDomain or ContinuousDomain).
You already have CategoricalDomain in place, but have it commented out. You probably didn't like that it captured the valid value space of your X dataset:
<DataDictionary>
<DataField name="colors" optype="categorical" dataType="string">
<Value value="BLACK"/>
<Value value="blue"/>
<Value value="green"/>
<Value value="red"/>
<Value value="yellow "/>
</DataField>
</DataDictionary>
Well, if you don't like the valid value space information, then simply disable it using the with_data = False flag:
color_transformers = [
# THIS!
CategoricalDomain(dtype = str, with_data = False),
ExpressionTransformer("X[0].lower()"),
MatchesTransformer("green"),
]
> It's pretty unusual to start the pipeline with an ExpressionTransformer object. There should be some "clarifications" in front of it.
The ExpressionTransformer can triangulate its position in the pipeline by observing if there are any wildcard features (ie. org.jpmml.converter.WildcardFeature objects) among the arguments.
If there are, then it should make an effort to rectify their types. For example, if there are string methods being called on a wildcard feature, then it's reasonable to assume that the type of this feature should be categorical+string (instead of continuous+double).
It is likely that such type rectification should happen during expression parsing phase, which means that the code change should land in the JPMML-Python library instead.
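To make the idea concrete, here is a rough Python sketch of such type rectification. It is only an illustration and not how JPMML-Python does (or would do) it: the infer_column_types helper and the STRING_METHODS set are made up for this example, and it assumes Python 3.9 or newer and an expression that refers to columns as X[i].

```python
import ast

# Hypothetical list of method names that only make sense on strings.
STRING_METHODS = {"lower", "upper", "strip", "startswith", "endswith", "replace"}


def _column_index(node):
    """Return i if the node is an X[i] reference, otherwise None."""
    if (isinstance(node, ast.Subscript)
            and isinstance(node.value, ast.Name)
            and node.value.id == "X"
            and isinstance(node.slice, ast.Constant)
            and isinstance(node.slice.value, int)):
        return node.slice.value
    return None


def infer_column_types(expression, default="continuous+double"):
    """Map every column index used in the expression to an inferred type."""
    types = {}
    tree = ast.parse(expression, mode="eval")
    for node in ast.walk(tree):
        # Every X[i] reference starts out with the wildcard default type.
        index = _column_index(node)
        if index is not None:
            types.setdefault(index, default)
        # A string method called on X[i] rectifies the type of column i.
        if isinstance(node, ast.Attribute) and node.attr in STRING_METHODS:
            index = _column_index(node.value)
            if index is not None:
                types[index] = "categorical+string"
    return types


print(infer_column_types("X[0].lower()"))
```

With the expression from this issue, column 0 comes out as categorical+string, whereas a column that is only used in arithmetic keeps the continuous+double default.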
> Well, if you don't like the valid value space information, then simply disable it using the with_data = False flag
This was exactly what I was looking for. Thank you!
> The ExpressionTransformer can triangulate its position in the pipeline by observing if there are any wildcard features (ie. org.jpmml.converter.WildcardFeature objects) among the arguments.
So you are saying that it is still OK to start the pipeline with an ExpressionTransformer, or should it be generally avoided?
> So you are saying that it is still OK to start the pipeline with an ExpressionTransformer or should it be generally avoided?
The ExpressionTransformer works best with non-wildcard features.
The easiest way to convert a wildcard feature (which has the continuous+double type) to a non-wildcard feature is to use SkLearn2PMML decorators.
Alternatively, the ExpressionTransformer should simply raise a value error when there are wildcard features among the arguments.
IMO, it's better to have the conversion fail, rather than to have it produce an invalid/incomplete PMML document.
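The fail-fast alternative could look roughly like the sketch below. It is a toy model only: the Feature class and the check_arguments function are hypothetical stand-ins for illustration, not actual SkLearn2PMML or JPMML classes, and the real check would live on the converter side.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    """Hypothetical stand-in for a converter-side feature."""
    name: str
    wildcard: bool  # True if no decorator has fixed the type yet


def check_arguments(features):
    """Refuse to convert an expression whose arguments are still wildcards."""
    wildcards = [feature.name for feature in features if feature.wildcard]
    if wildcards:
        raise ValueError(
            "Expression arguments have unspecified types: " + ", ".join(wildcards)
        )
    return features


# A decorated column passes the check unchanged.
check_arguments([Feature("colors", wildcard=False)])
```

Passing an undecorated column, eg. Feature("colors", wildcard=True), raises the ValueError instead of letting the conversion continue with a guessed type.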
Hello Villu,
it's been a while and I hope you're fine. I've come back with more questions. Let's start with some code:
The following pipeline doesn't make much sense from a machine learning point of view, but it shows the issue very well:
In Python, everything works as expected. The issue is in the generated output.pmml file, where you can find the following:
Knowing that the input has an infinite number of possible values, how can I set this data type to "string"?