jpmml / sklearn2pmml

Python library for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0
685 stars 113 forks source link

Replace function & capture groups #228

Open bindernet opened 4 years ago

bindernet commented 4 years ago

Hi Villu. I'm trying to do a replace with a regex, and use the captured groups in the substitution. Is this supported? It's not clear to me from the standard and I can't find any examples. It's currently outputing the substitute string verbatim to the next transformation.

I am using the replacement transformer from sklearn2pmml and it's generating PMML like this:

<DerivedField name="trxNorm2" optype="categorical" dataType="string">
    <Apply function="replace">
        <FieldRef field="normalizedDateTime"/>
        <Constant dataType="string">(\d{1,2})/(\d{1,2})/(\d{2,4})( .*?)(:\d\d\d)$</Constant>
        <Constant dataType="string">\3-\2-\1 \4</Constant>
    </Apply>
</DerivedField>

Am I missing something obvious? EDIT: I had a look at Functions.java which uses the Matcher replaceAll method According to the documentation this uses a $ for group substitution where as Python uses a \ Could that be the issue?

Thanks.

vruusmann commented 4 years ago

I'm trying to do a replace with a regex, and use the captured groups in the substitution. Is this supported?

The standard says that regexes must be processed as "Perl Compatible Regular Expressions" (PCREs).

Java's built-in implementation should be very similar to PCRE, but there are some minor differences listed nonetheless: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html

It's currently outputing the substitute string verbatim to the next transformation.

I have some unit tests here: https://github.com/jpmml/jpmml-evaluator/blob/1.5.1/pmml-evaluator/src/test/java/org/jpmml/evaluator/FunctionTest.java#L340-L363

Perhaps you'd need to reference the capturing group using $<index> instead? See this: https://github.com/jpmml/jpmml-evaluator/blob/1.5.1/pmml-evaluator/src/test/java/org/jpmml/evaluator/FunctionTest.java#L350

vruusmann commented 4 years ago

According to the documentation this uses a $ for group substitution where as Python uses a \ Could that be the issue?

We spotted the same issue/solution at the same time.

The golden reference for regex syntax is PCRE. Does it use \ or $?

bindernet commented 4 years ago

Replacement is only in PCRE2 Judging from here it's $: https://www.regular-expressions.info/pcre2.html

vruusmann commented 4 years ago

During the initialization of RegexTransformer Python class, did you see this warning printed? https://github.com/jpmml/sklearn2pmml/blob/master/sklearn2pmml/preprocessing/__init__.py#L18

It can be ignored oftentimes, but looks like when you're doing replacement with capturing groups, it should be taken seriously.

Right now I'm inclined to think that the Java side of the JPMML-SkLearn stack is correct - uses the $ symbol. Will need to figure out what to do with the Python side when there's no PCRE library available internally - perhaps the ReplaceTransformer should maintain some extra attribute (eg. pcre_compatible), and during conversion it would be possible to take some extra compatibility actions (for example, replace \d with $d in the replacement string).

vruusmann commented 4 years ago

Given this new information, I've transfered this issue to SkLearn2PMML/JPMML-SkLearn ecosystem.

Most likely, will need to tweak Python code.

bindernet commented 4 years ago

Believe it or not the python pcre lib uses another convention. {} by default. https://pypi.org/project/python-pcre/

It's a difficult one because I'm guessing that doing a regex on a regex is a route to madness 😀

vruusmann commented 4 years ago

the python pcre lib uses another convention

The python-pcre package looks quite outdated - last updated in Oct 2015.

Perhaps SkLearn2PMML package should include a native PCRE library itself, and configure it in full PCRE-compatibility mode. So that group references would all be $\d in Python, PMML and JPMML-Evaluator by default.

vruusmann commented 4 years ago

At minimum, the ReplaceTransformer could detect if the replacement string contains any backslash characters, and if true, raise a more meaningful error.