Closed PowerToThePeople111 closed 1 year ago
The CountVectorizer does not allow strings to contain punctuation anymore.
What Apache Spark ML version are you talking about? If it's 3.3.X, then please append your complaint to https://github.com/jpmml/jpmml-sparkml/issues/129
I realised that the replace function is not supported yet in SQLTransformers.
The "replace" SQL function is fully supported. See https://github.com/jpmml/pyspark2pmml/issues/40
I am currently using Apache Spark 3.2.1, with Scala. I am unsure whether that matters, but since you mentioned pyspark2pmml I thought I should tell you.
And I got the message that replace is not supported when trying to export the pipeline. I would have to rerun the job to reproduce the exact error message, but if that would help you, I can try to do it by the end of next week at the latest.
For now I just replaced all non-alphanumeric characters in my words with a constant string that will not turn up otherwise, before training the pipeline, and did the same in the REST server. It seems to work.
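For anyone wanting to try the same pre-processing workaround, here is a minimal sketch in plain Python (not the Scala code used above). The placeholder value and the function names are hypothetical, chosen only for illustration. The key point is that the same function must run both before training and in the scoring service.

```python
import re

# Hypothetical placeholder: any alphanumeric string that never occurs in real words.
PLACEHOLDER = "XSEPX"

def sanitize_token(token: str, placeholder: str = PLACEHOLDER) -> str:
    """Replace every character outside A-Z, a-z, 0-9 with the placeholder.

    Note that this also rewrites "_" and non-ASCII letters.
    """
    return re.sub(r"[^A-Za-z0-9]", placeholder, token)

def sanitize_tokens(tokens):
    """Apply the same substitution to a whole list of words."""
    return [sanitize_token(t) for t in tokens]

# Use the identical call on training data and on scoring requests.
cleaned = sanitize_tokens(["node.js", "snake_case", "plain"])
print(cleaned)
```

Because the substitution is one-way, two different words can collapse into the same token (for example "a.b" and "a_b"), which may or may not matter for a given vocabulary.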
Began to doubt myself, so I went and checked the list of SQL functions here: https://spark.apache.org/docs/latest/api/sql/index.html
Looks like replace and regexp_replace are two different things:
https://spark.apache.org/docs/latest/api/sql/index.html#replace
https://spark.apache.org/docs/latest/api/sql/index.html#regexp_replace
The PMML built-in function replace is functionally equivalent to Apache Spark ML's regexp_replace SQL function.
The replace SQL function is currently unsupported.
The workaround is obvious - use the regexp_replace SQL function, and specify its regexp and rep arguments as literal strings (i.e. they should not contain any regexp meta-characters).
OK, reopening this issue, because the JPMML-SparkML library could/should be able to do this replace -> regexp_replace substitution automatically.
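To illustrate why the two functions are interchangeable only for literal arguments, here is a small sketch in plain Python. It uses Python's `str.replace` and `re.sub` as stand-ins for Spark SQL's replace and regexp_replace. Python's regex dialect is not identical to the Java one that Spark uses, so treat it as an analogy. The helper names, the column names in the SQL string, and the sample values are hypothetical, and the statement has not been tested against the converter.

```python
import re

def literal_replace(text: str, search: str, rep: str) -> str:
    """Stand-in for SQL replace: plain substring substitution."""
    return text.replace(search, rep)

def regex_replace(text: str, pattern: str, rep: str) -> str:
    """Stand-in for SQL regexp_replace: the pattern is a regular expression."""
    return re.sub(pattern, rep, text)

# With no meta-characters in the arguments, both give the same result.
same_a = literal_replace("foo-bar", "-", "_")
same_b = regex_replace("foo-bar", "-", "_")

# With a meta-character such as ".", they diverge unless the pattern is escaped.
lit = literal_replace("a.b.c", ".", "_")
raw = regex_replace("a.b.c", ".", "_")             # "." matches every character
esc = regex_replace("a.b.c", re.escape("."), "_")  # escaped, behaves literally

# Hypothetical SQLTransformer statement using only literal arguments;
# "text" and "text_clean" are made-up column names.
STATEMENT = "SELECT *, regexp_replace(text, 'foo', 'bar') AS text_clean FROM __THIS__"

print(same_a, same_b, lit, raw, esc)
```

The unescaped regex call turns every character into the replacement, which is the failure mode the "literal strings only" advice avoids.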
The CountVectorizer does not allow strings to contain punctuation anymore. Which is sad because the words in my use case contain dots.
@PowerToThePeople111 Could you please generate a reproducible test case, and open a new issue around this topic?
The PMML approach would be to tokenize using RegexTokenizer and then count using CountVectorizer. I wonder: if the RegexTokenizer is generating a "punctuated token", is CountVectorizer really rejecting it? When did this regression happen (e.g. some JIRA issue ref)?
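As a rough illustration of the tokenize-then-count idea, here is a plain-Python stand-in (it does not call Spark). It mimics what I understand to be the RegexTokenizer defaults (split on the `\s+` pattern, lowercase the input, minimum token length 1) and counts tokens with `collections.Counter`. The function names and the sample sentence are hypothetical. It only shows that whitespace splitting keeps punctuation inside tokens, and says nothing about what the PMML converter accepts.

```python
import re
from collections import Counter

def regex_tokenize(text: str, pattern: str = r"\s+",
                   to_lowercase: bool = True, min_token_length: int = 1):
    """Split text on the given gap pattern, dropping tokens that are too short."""
    if to_lowercase:
        text = text.lower()
    return [t for t in re.split(pattern, text) if len(t) >= min_token_length]

def count_tokens(tokens):
    """Count how often each token occurs, like a per-document term count."""
    return Counter(tokens)

tokens = regex_tokenize("Foo.bar baz_qux foo.bar")
counts = count_tokens(tokens)
print(tokens, dict(counts))
```

Here the dotted and underscored words survive tokenization intact, so any rejection would have to come from a later stage.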
Thank you for having a look into this! I will create a short example to reproduce this. But I am very busy at the moment, so it might take until the end of next week.
Hello,
I have been trying to update models that I used to put into production using openscoring. I have noticed two things, though:
Is there another way to get the exported pipeline working except for manually doing this replace within the openscoring service?
Edit: I just realized that "_" is also not allowed in words that are input for the CountVectorizer.