I think we are mixing two concepts here.
`dpf_03` and `org.apache.spark.ml.feature.RegexTokenizer` are tokenization methods.
Regex extraction for tokens is different: it is a string-match-as-a-token based method, which should work like the following:
```
| makeresults count=1 | eval _raw="foo bar@biz baz" | rex4j max_match=0 "(?<words>([a-z ]+)+)"
```
and should produce the two tokens "foo bar" and "biz baz"; however, `max_match` is currently not supported.
In this small case, where the @ sign delimits the two tokens, it is possible to use both methods to get the same results.
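A minimal Scala sketch of the two methods on this small case (the object name and the local Spark session are illustrative only, not part of the actual implementation):

```scala
import org.apache.spark.ml.feature.RegexTokenizer
import org.apache.spark.sql.SparkSession

object TwoMethodsDemo {
  def main(args: Array[String]): Unit = {
    val raw = "foo bar@biz baz"

    // String-match-as-a-token: every regex match becomes a token.
    val matchTokens = "[a-z ]+".r.findAllIn(raw).toList
    println(matchTokens) // List(foo bar, biz baz)

    // Tokenization: RegexTokenizer splits on a delimiter pattern.
    val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
    import spark.implicits._
    val tokenizer = new RegexTokenizer()
      .setInputCol("_raw")
      .setOutputCol("tokens")
      .setPattern("@") // split on the @ delimiter
      .setGaps(true)   // the pattern marks the gaps between tokens
    tokenizer.transform(Seq(raw).toDF("_raw")).select("tokens").show(false)
    // [foo bar, biz baz] -- same result, but only because the data is this simple
    spark.stop()
  }
}
```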
However, in more complex scenarios string-match-to-token excels. To get single tokens from within the parentheses of the following string, only string-match-to-token can be used, at least with any sensible configuration:
```
biz baz boz data has no content today (very important though) but it would still have if one had a means to extract it from (here is something else important as well) the strange patterns called parentheses that it seems to have been put in.
```
This would happen with

```
\((.*?)\)
```

and would result in the tokens "very important though" and "here is something else important as well".
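A plain-Scala sketch of that extraction, using only the standard library (variable names are illustrative):

```scala
// String-match-to-token: take capture group 1 of every match of \((.*?)\).
val data = "biz baz boz data has no content today (very important though) " +
  "but it would still have if one had a means to extract it from " +
  "(here is something else important as well) the strange patterns " +
  "called parentheses that it seems to have been put in."

val tokens = """\((.*?)\)""".r.findAllMatchIn(data).map(_.group(1)).toList
println(tokens)
// List(very important though, here is something else important as well)
```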
These methods must be kept separate both at search time and in the database, because they cannot be mixed.
For example, a bloom filter generated for the first case cannot be used for the second, or the other way around.
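To illustrate why the filters are incompatible, here is a sketch using Guava's `BloomFilter`; the actual bloom filter implementation in the database may differ, so treat this as an assumption for illustration only:

```scala
import java.nio.charset.StandardCharsets
import com.google.common.hash.{BloomFilter, Funnels}

// Assumed illustration: a bloom filter populated from tokenizer output.
val tokenizerBloom = BloomFilter.create[CharSequence](
  Funnels.stringFunnel(StandardCharsets.UTF_8), 1000L)
"foo bar@biz baz".split("[\\s@]+").foreach(t => tokenizerBloom.put(t))

// A string-match-to-token query probes with multi-word matches,
// which the tokenizer-built filter has never seen.
println(tokenizerBloom.mightContain("foo"))     // true
println(tokenizerBloom.mightContain("foo bar")) // false (barring a false positive)
```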
As a conclusion, I recommend that we drop regex filtering from tokenizer-generated tokens and use regex only for string-match-to-token based bloom filtering. This way the database will contain either a full tokenizer-produced bloom filter and/or a regex-enabled string-match-to-token bloom filter, which are enabled depending on availability.
Availability is deemed for
To aid development without too many changes in this short timeframe, I recommend temporarily postponing the tokenizer-based support, refactoring the current implementation to support the string-match-to-token way, and enabling the tokenizer-based one later.
Will move this feature into a new command
```
| teragrep exec regexextract <options>
```
and remove any regex filtering from the tokenizer step.
@51-code please open a new issue if there is something that should still be fixed.
Add

```
| teragrep exec regexextract <regexPattern> <inputColumn> <outputColumn>
```

command.
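A minimal sketch of one possible Spark implementation of these semantics; the helper name `regexExtract` and the preference for capture group 1 are assumptions, and the real command behavior is defined by this issue, not by this sketch:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical helper: collect every match of regexPattern found in
// inputColumn into an array column named outputColumn. If the pattern
// has a capture group, the first group is taken as the token (assumption).
def regexExtract(df: DataFrame, regexPattern: String,
                 inputColumn: String, outputColumn: String): DataFrame = {
  val pattern = regexPattern.r // compile once, capture in the UDF closure
  val extractAll = udf { (s: String) =>
    if (s == null) Seq.empty[String]
    else pattern.findAllMatchIn(s)
      .map(m => if (m.groupCount >= 1) m.group(1) else m.matched)
      .toSeq
  }
  df.withColumn(outputColumn, extractAll(col(inputColumn)))
}
```

With the parentheses example above, `regexExtract(df, """\((.*?)\)""", "_raw", "words")` would yield the two tokens "very important though" and "here is something else important as well".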