Open tong-zeng opened 6 years ago
I found that PMML only supports splitter mode, using wordSeparatorCharacterRE to pass a regular expression as the separator character, according to its specification v4.3. It does not seem possible to add matching mode unless the specification is updated.
I also noticed that the PMML standard requires that tokens neither start nor end with punctuation, but Spark doesn't require this. It would be good to add these differences to the documentation, to help users take them into account when training the model.
In the end, I adjusted my input data to cater to jpmml, re-trained the model in Spark, then exported it to PMML. Haha, it works, thank you.
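For anyone reading later, one way such an input adjustment could look is sketched below. This is plain Python rather than Spark code, and it is only an illustration: the helper name `strip_token_punctuation` is hypothetical, and it assumes whitespace-separated tokens and ASCII punctuation only.

```python
import string


def strip_token_punctuation(sentence):
    """Strip leading/trailing ASCII punctuation from each whitespace-separated token."""
    # str.strip removes any of the given characters from both ends of a token,
    # leaving punctuation inside the token (e.g. a hyphen) untouched.
    cleaned = (token.strip(string.punctuation) for token in sentence.split())
    # Drop tokens that consisted of punctuation only and are now empty.
    return " ".join(token for token in cleaned if token)


print(strip_token_punctuation('This is a "sentence".'))
```

After this kind of cleaning, splitting on whitespace no longer yields tokens that start or end with punctuation, which is the condition described above.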
Hello Villu, thank you for this great package for exporting Spark ML models. But this package does not seem easy to work with:
My input: a column named 'sentence'
My output: a column named 'prediction' produced by logistic classification for the column 'sentence'
My pipeline: RegexTokenizer -> NGram -> CountVectorizer -> IDF -> VectorAssembler -> LogisticRegression
Problem 1: my RegexTokenizer code is as below
But it throws an error
So I thought of implementing the tokenizer myself and passing a column of token arrays as input. Then I got:
Problem 2:
After tracking down some issues, I understand that the vector type is not supported, so I have to consider building the pipeline from the tokenizer again. Then I changed my tokenizer to splitter mode:
Then I got:
Problem 3:
After checking the source code at line 118, I see there is a requirement that a token cannot start or end with punctuation. But this happens a lot. For example, in "This is a sentence.", after splitting by space, the last token ends with a period ("."). In this case, matching mode with the pattern "\b[a-zA-Z]{3,}\b" can extract 'clean' tokens easily. I had no choice but to continue hacking. Next, I tried to split the sentence by the pattern "\\b[^a-zA-Z]{0,}\\b", which splits the text on non-English letters, and then filtered the tokens by setting the minimum token length to 3. This works fine in Spark, but when I exported the pipeline, I got another error:
Problem 4:
As the message says, getMinTokenLength is not supported in jpmml-sparkml.
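To make the difference between the approaches above concrete, here is a small sketch in plain Python using the standard `re` module. It is not Spark or jpmml-sparkml code, and it makes one substitution that should be treated as an assumption: the non-letter split uses `[^a-zA-Z]+` instead of the zero-width-capable pattern quoted above, because Python and Java handle empty matches differently when splitting.

```python
import re

sentence = "This is a sentence."

# Splitter mode: the pattern describes the gaps between tokens.
# Splitting on whitespace leaves the trailing period attached to the last token.
split_tokens = re.split(r"\s+", sentence)

# Matching mode: the pattern describes the tokens themselves,
# so punctuation and words shorter than three letters never appear.
match_tokens = re.findall(r"\b[a-zA-Z]{3,}\b", sentence)

# Splitter mode on runs of non-letters, followed by a minimum-length filter.
# The filter plays the role of the minimum token length setting.
filtered_tokens = [t for t in re.split(r"[^a-zA-Z]+", sentence) if len(t) >= 3]

print(split_tokens)
print(match_tokens)
print(filtered_tokens)
```

In this sketch the last two approaches yield the same tokens, which is why the split-plus-filter route looked like a workable substitute for matching mode until the export step rejected the minimum token length setting.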
I'm really frustrated, since this is a simple and typical task. I've tried different means to overcome it, but all failed. Could you please point me in the right direction? Thank you.