jpmml / jpmml-sparkml

Java library and command-line application for converting Apache Spark ML pipelines to PMML
GNU Affero General Public License v3.0

Problem with underscore when using RegexTokenizer() #40

Open ajolles-kenshoo opened 6 years ago

ajolles-kenshoo commented 6 years ago

Hello, here is an issue I'm facing with RegexTokenizer. When RegexTokenizer is used in a Spark pipeline, jpmml-sparkml allows only two patterns: "\s+" and "\W+". With "\W+" and gaps=True, the tokenizer splits on non-alphanumeric characters, but underscores ("_") count as word characters, so they are not removed. However, when underscores appear in the text, the function toPMMLBytes returns an error which is related to the underscore. So it looks like underscores cannot be removed, but cannot be left in the text either.

Thanks
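
To illustrate the tokenization behaviour described above, here is a minimal sketch using Python's standard re module rather than Spark itself. It assumes that splitting on a pattern approximates RegexTokenizer with gaps=True; split_on_gaps and the sample string are hypothetical names made up for this example. The "[\W_]+" pattern is shown only for contrast: per the report above, jpmml-sparkml accepts only "\s+" and "\W+", so it is not a drop-in fix.

```python
import re


def split_on_gaps(text, pattern):
    """Split text wherever the pattern matches (similar in spirit to gaps=True)
    and drop empty tokens. Hypothetical helper, not part of Spark or jpmml-sparkml."""
    return [tok for tok in re.split(pattern, text.lower()) if tok]


sample = "foo_bar baz-qux"

# "\W" matches anything that is NOT a word character. The underscore IS a
# word character, so "foo_bar" stays in one piece.
tokens_w = split_on_gaps(sample, r"\W+")

# Adding "_" to the character class also splits on underscores.
tokens_w_underscore = split_on_gaps(sample, r"[\W_]+")

print(tokens_w)
print(tokens_w_underscore)
```

In this sketch the first pattern keeps the underscore inside the token, which matches the observation that "\W+" does not remove underscores. Whether the resulting token then triggers the toPMMLBytes error depends on the converter and is not shown here.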

vruusmann commented 6 years ago

"The function toPMMLBytes returns an error which is related to the underscore."

Can you paste this error here?