jpmml / jpmml-sparkml

Java library and command-line application for converting Apache Spark ML pipelines to PMML
GNU Affero General Public License v3.0
267 stars 80 forks

Request matching mode and setMinTokenLength support for RegexTokenizer #42

Open tong-zeng opened 6 years ago

tong-zeng commented 6 years ago

Hello Villu, thank you for this great package for exporting Spark ML models. However, it has not been easy to work with:

My input: a column named 'sentence'
My output: a column named 'prediction', produced by logistic regression on the 'sentence' column
My pipeline: RegexTokenizer -> NGram -> CountVectorizer -> IDF -> VectorAssembler -> LogisticRegression

Problem 1: my RegexTokenizer code is as below

tokenizer = feature.RegexTokenizer() \
  .setGaps(False) \
  .setPattern("\\b[a-zA-Z]{3,}\\b") \
  .setInputCol("sentence") \
  .setOutputCol("words")

But it throws an error:

IllegalArgumentException: 'Expected splitter mode, got token matching mode'
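For anyone hitting the same wall: the two RegexTokenizer modes correspond roughly to `re.findall` (token-matching mode, `gaps=False`) versus `re.split` (splitter mode, `gaps=True`). A plain-Python sketch of the difference (not jpmml-sparkml code, just an analogy):

```python
import re

sentence = "This is a sentence."

# Token-matching mode (gaps=False): the pattern describes the tokens
# themselves, analogous to re.findall.
matched = re.findall(r"\b[a-zA-Z]{3,}\b", sentence)
print(matched)  # ['This', 'sentence']

# Splitter mode (gaps=True): the pattern describes the separators between
# tokens, analogous to re.split.
split = re.split(r"\s+", sentence)
print(split)  # ['This', 'is', 'a', 'sentence.']
```

The converter apparently only accepts the second form, hence the error above.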

So, I thought I would implement the tokenizer myself and pass a column of token arrays as input, but then I got:

Problem 2:

IllegalArgumentException: Expected string, integral, double or boolean type, got vector type

After tracking some related issues, I understand that the vector type is not supported, so I had to build the pipeline starting from the tokenizer again. I changed my tokenizer to splitter mode:

tokenizer = feature.RegexTokenizer() \
  .setGaps(True) \
  .setPattern("\\s+") \
  .setInputCol("sentence") \
  .setOutputCol("words")

Then I got:

Problem 3:

 java.lang.IllegalArgumentException: .
    at org.jpmml.sparkml.feature.CountVectorizerModelConverter.encodeFeatures(CountVectorizerModelConverter.java:118)
    at org.jpmml.sparkml.FeatureConverter.registerFeatures(FeatureConverter.java:48)
    at org.jpmml.sparkml.ConverterUtil.toPMML(ConverterUtil.java:80)

After checking the source code at line 118, I see there is a requirement that a token cannot start or end with a punctuation character. But this happens a lot: for example, in "This is a sentence.", after splitting on whitespace, the last token ends with a period ( . ). In matching mode, the pattern "\b[a-zA-Z]{3,}\b" can extract 'clean' tokens easily, but since that mode is unavailable I had no choice but to continue hacking. I then tried splitting the sentence with the pattern \\b[^a-zA-Z]{0,}\\b, which splits the text on non-English letters, and filtering the tokens by setting the minimum token length to 3. This works fine in Spark, but when I exported the pipeline, I got another error:
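The splitter-plus-minimum-length workaround described above can be sketched in plain Python (using a simplified separator pattern `[^a-zA-Z]+` rather than the exact one quoted above); it produces the same clean tokens as matching mode would:

```python
import re

sentence = "This is a sentence."

# Splitter mode: split on runs of non-letter characters, then drop short
# tokens, emulating setMinTokenLength(3).
tokens = [t for t in re.split(r"[^a-zA-Z]+", sentence) if len(t) >= 3]
print(tokens)  # ['This', 'sentence']

# Same result as matching mode with the pattern \b[a-zA-Z]{3,}\b:
assert tokens == re.findall(r"\b[a-zA-Z]{3,}\b", sentence)
```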

Problem 4:

java.lang.IllegalArgumentException: Expected 1 as minimum token length, got 3 as minimum token length
    at org.jpmml.sparkml.feature.RegexTokenizerConverter.encodeFeatures(RegexTokenizerConverter.java:51)

As the message reads, a minTokenLength other than 1 is not supported in jpmml-sparkml.

I'm really frustrated, since this is a simple and typical task. I've tried different ways to overcome it, but all failed. Could you please point me in the right direction? Thank you.

tong-zeng commented 6 years ago

I found that PMML only supports splitter mode: according to the specification (v4.3), it uses wordSeparatorCharacterRE to pass a regular expression as the separator. It seems impossible to add matching mode unless the specification is updated.
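For reference, the relevant piece of the spec is the TextIndex element; a minimal hand-written fragment (attribute names from PMML 4.3, field names here are just illustrative) looks roughly like:

```xml
<TextIndex textField="sentence" wordSeparatorCharacterRE="\s+" tokenize="true">
  <!-- The inner expression yields the term to be counted in the text -->
  <Constant dataType="string">example</Constant>
</TextIndex>
```

Note that wordSeparatorCharacterRE describes the separators, not the tokens, which is why only splitter mode can be represented.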

tong-zeng commented 6 years ago

I also notice that the requirement that tokens cannot start or end with punctuation comes from the PMML standard itself.

Spark, however, doesn't require this. It would be good to document these differences, to help users take them into account when training their models.

tong-zeng commented 6 years ago

In the end, I adjusted my input data to cater to JPMML, re-trained the model in Spark, then exported it to PMML. Haha, it works, thank you.
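In case it helps someone else, the kind of pre-processing step this amounts to can be sketched as follows (a hypothetical `clean` helper, not part of jpmml-sparkml): strip non-letter characters up front, so that plain whitespace splitting afterwards yields tokens that satisfy the converter's no-leading/trailing-punctuation requirement.

```python
import re

def clean(sentence):
    # Replace every run of non-letter characters with a single space,
    # then trim, so later whitespace tokenization produces "clean" tokens.
    return re.sub(r"[^a-zA-Z]+", " ", sentence).strip()

print(clean("This is a sentence."))  # 'This is a sentence'
```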