jpmml / jpmml-sparkml

Java library and command-line application for converting Apache Spark ML pipelines to PMML
GNU Affero General Public License v3.0

Support for `replace` SQL function #131

Closed PowerToThePeople111 closed 1 year ago

PowerToThePeople111 commented 1 year ago

Hello,

I have been trying to update models that I used to put into production using Openscoring. I have noticed two things, though:

  1. The CountVectorizer does not allow strings to contain punctuation anymore, which is sad because the words in my use case contain dots.
  2. When trying to just add an SQLTransformer that replaces "." with "_" in the words, I realised that the replace function is not supported yet in SQLTransformer. I checked whether the function might be missing from the PMML standard, but it is actually there.

Is there another way to get the exported pipeline working, other than manually doing this replacement within the Openscoring service?

Edit: I just realized that "_" is also not allowed in words that are input to the CountVectorizer.

vruusmann commented 1 year ago

> The CountVectorizer does not allow strings to contain punctuation anymore.

What Apache Spark ML version are you talking about? If it's 3.3.X, then please append your complaint to https://github.com/jpmml/jpmml-sparkml/issues/129

> I realised that the replace function is not supported yet in SQLTransformers.

The "replace" SQL function is fully supported. See https://github.com/jpmml/pyspark2pmml/issues/40

PowerToThePeople111 commented 1 year ago

I am currently using Apache Spark 3.2.1 with Scala. I am unsure whether that matters, but since you mentioned pyspark2pmml I thought I should tell you.

I got the message that replace is not supported when trying to export the pipeline. I would have to rerun the job to reproduce the exact error message; if that would help you, I can do it by the end of next week at the latest.

For now, I just replaced all non-alphanumeric characters in my words with a constant string that will not otherwise occur, before training the pipeline, and did the same in the REST server. It seems to work.
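For readers who want to apply the same stopgap, here is a minimal plain-Java sketch of such a normalization step. The class name `TokenNormalizer`, the placeholder `XPUNCTX`, and the ASCII-only definition of "alphanumeric" are assumptions for illustration; they are not taken from the actual job.

```java
// Sketch only: replaces every non-alphanumeric character with a fixed placeholder.
// The same function would have to run both before training and in the scoring service.
final class TokenNormalizer {

    // Hypothetical placeholder; it must never occur in real words.
    static final String PLACEHOLDER = "XPUNCTX";

    // Assumes "alphanumeric" means ASCII letters and digits only.
    static String normalize(String word) {
        return word.replaceAll("[^A-Za-z0-9]", PLACEHOLDER);
    }

    public static void main(String[] args) {
        System.out.println(normalize("foo.bar")); // fooXPUNCTXbar
    }
}
```

Keeping the logic in one shared function reduces the risk that training and scoring drift apart.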

vruusmann commented 1 year ago

Began to doubt myself, so I went and checked the list of SQL functions here: https://spark.apache.org/docs/latest/api/sql/index.html

Looks like replace and regexp_replace are two different things: https://spark.apache.org/docs/latest/api/sql/index.html#replace https://spark.apache.org/docs/latest/api/sql/index.html#regexp_replace

The PMML built-in function replace is functionally equivalent to Apache Spark ML's regexp_replace SQL function.

The replace SQL function is currently unsupported.

The workaround is obvious: use the regexp_replace SQL function, and specify its regexp and rep arguments as literal strings (i.e. they should not contain any unescaped regexp metacharacters).
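One caveat for the original use case: "." is itself a regexp metacharacter, so it has to be neutralized before it behaves like a literal. The plain-Java sketch below illustrates the difference; it relies on the fact that Spark's regexp_replace takes Java regular expressions. The SQL statement, the column names `word` and `word_clean`, and the class name are hypothetical, and whether the converter accepts the `[.]` form has not been verified here.

```java
// Sketch only: literal vs. regex replacement, mirroring replace vs. regexp_replace.
final class ReplaceSemantics {

    // Hypothetical SQLTransformer statement; column names are made up.
    // "[.]" matches a literal dot without needing backslash escapes.
    static final String STATEMENT =
        "SELECT *, regexp_replace(word, '[.]', '_') AS word_clean FROM __THIS__";

    // Literal search string, like the Spark SQL replace function.
    static String literalReplace(String input, String search, String replacement) {
        return input.replace(search, replacement);
    }

    // Regular-expression pattern, like regexp_replace.
    static String regexReplace(String input, String regex, String replacement) {
        return input.replaceAll(regex, replacement);
    }

    public static void main(String[] args) {
        System.out.println(literalReplace("a.b.c", ".", "_")); // a_b_c
        System.out.println(regexReplace("a.b.c", ".", "_"));   // _____
        System.out.println(regexReplace("a.b.c", "[.]", "_")); // a_b_c
    }
}
```

The unescaped "." matches every character, which is why the second call wipes out the whole string.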

vruusmann commented 1 year ago

> The workaround is obvious: use the regexp_replace SQL function, and specify its regexp and rep arguments as literal strings (i.e. they should not contain any unescaped regexp metacharacters).

OK, reopening this issue, because the JPMML-SparkML library could/should be able to do this replace -> regexp_replace substitution automatically.
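To make the idea concrete, a rough sketch of such a substitution in plain Java might look as follows. This is not the JPMML-SparkML implementation; the class and method names are hypothetical, and the set of escaped metacharacters is an assumption that may need adjusting for the regexp flavor used on the PMML side.

```java
import java.util.regex.Matcher;

// Sketch only: emulate a literal replace(str, search, rep) with a regex replace.
final class LiteralToRegex {

    // Assumed set of characters that need escaping.
    private static final String META = "\\.[]{}()*+-?^$|";

    // Backslash-escape every metacharacter so the pattern matches literally.
    static String escapeLiteral(String literal) {
        StringBuilder sb = new StringBuilder();
        for (char c : literal.toCharArray()) {
            if (META.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // quoteReplacement keeps "$" and "\" in the replacement from being
    // interpreted as group references or escapes.
    static String replaceViaRegex(String input, String search, String replacement) {
        return input.replaceAll(escapeLiteral(search), Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(replaceViaRegex("a.b.c", ".", "_")); // a_b_c
        System.out.println(replaceViaRegex("a+b", "+", "$1"));  // a$1b
    }
}
```

Explicit backslash escaping is used instead of `Pattern.quote`, because the `\Q...\E` form that `Pattern.quote` produces may not be understood by other regexp engines.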

vruusmann commented 1 year ago

> The CountVectorizer does not allow strings to contain punctuation anymore. Which is sad because the words in my use case contain dots.

@PowerToThePeople111 Could you please generate a reproducible test case, and open a new issue around this topic?

The PMML approach would be to tokenize using RegexTokenizer and then count using CountVectorizer. I wonder: if the RegexTokenizer is generating a "punctuated token", is CountVectorizer really rejecting it? When did this regression happen (e.g. some JIRA issue ref)?
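As a reference point for such a test case, the following plain-Java sketch mimics the tokenize-then-count idea without Spark. It assumes RegexTokenizer's default settings (split on the `\s+` pattern, lower-case the input); the class and method names are hypothetical, and it says nothing about how the converter itself treats punctuated tokens.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch only: whitespace tokenization followed by term counting.
final class TokenCounts {

    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        // Lower-case, then split on runs of whitespace.
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) { // skip the empty token caused by leading whitespace
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("Foo.bar baz foo.bar")); // {baz=1, foo.bar=2}
    }
}
```

Under these assumptions a token such as "foo.bar" survives tokenization intact, so any rejection would have to come from a later step.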

PowerToThePeople111 commented 1 year ago

Thank you for looking into this! I will create a short example to reproduce this. But I am very busy at the moment, so it might take until the end of next week.