Hello,
here is an issue I'm facing when using RegexTokenizer:
When using RegexTokenizer in a Spark pipeline, jpmml-sparkml allows only two patterns:
"\s+" and "\W+".
When using "\W+" with gaps=True, it removes non-alphanumeric characters, but underscores ("_") are not removed, because "\W" does not match them.
However, when underscores appear in the text, the function toPMMLBytes returns an error related to the underscore.
So it looks like underscores cannot be removed by the tokenizer, but also cannot be left in the text.
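To illustrate the tokenization half of the problem, here is a minimal sketch in plain Python. It only emulates what I understand RegexTokenizer to do with gaps=True and its default lowercasing (split on the pattern, drop empty tokens). It uses Python's `re` module rather than Spark's Java regex engine, which should behave the same for ASCII input, and `tokenize_with_gaps` is a hypothetical helper name, not part of Spark or jpmml-sparkml.

```python
import re


def tokenize_with_gaps(text, pattern=r"\W+"):
    """Hypothetical stand-in for RegexTokenizer(gaps=True): the pattern
    matches the separators, and the pieces between matches are the tokens."""
    # Lowercase first (assumed to mirror the tokenizer's default), then split
    # on the pattern and drop empty strings left at the edges.
    return [tok for tok in re.split(pattern, text.lower()) if tok]


# "_" counts as a word character, so "\W+" never treats it as a separator.
tokens = tokenize_with_gaps("Hello, my_var is here!")
print(tokens)  # ['hello', 'my_var', 'is', 'here']
```

The punctuation and spaces are removed, but the underscore stays inside the token "my_var", which is the kind of token that then triggers the error during export.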
Thanks