I think we are mixing two concepts here.
`dpf_03` and `org.apache.spark.ml.feature.RegexTokenizer` are tokenization methods.
Regex extraction for tokens is different: it is a string-match-as-a-token based method, which should work like the following:
```
| makeresults count=1 | eval _raw="foo bar@biz baz" | rex4j max_match=0 "(?<words>([a-z ]+)+)"
```
and should produce the two tokens "foo bar" and "biz baz"; however, `max_match` is currently not supported.
In this small case, where the @ sign delimits the two tokens, it is possible to use both methods to get the same results.
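A minimal Scala sketch of the two methods on this small case (the object name and the local Spark session are illustrative only, not part of the actual implementation):

```scala
import org.apache.spark.ml.feature.RegexTokenizer
import org.apache.spark.sql.SparkSession

object TwoMethodsDemo {
  def main(args: Array[String]): Unit = {
    val raw = "foo bar@biz baz"

    // String-match-as-a-token: every regex match becomes a token.
    val matchTokens = "[a-z ]+".r.findAllIn(raw).toList
    println(matchTokens) // List(foo bar, biz baz)

    // Tokenization: RegexTokenizer splits on a delimiter pattern.
    val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
    import spark.implicits._
    val tokenizer = new RegexTokenizer()
      .setInputCol("_raw")
      .setOutputCol("tokens")
      .setPattern("@") // split on the @ delimiter
      .setGaps(true)   // the pattern marks the gaps between tokens
    tokenizer.transform(Seq(raw).toDF("_raw")).select("tokens").show(false)
    // [foo bar, biz baz] -- same result, but only because the data is this simple
    spark.stop()
  }
}
```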
However, in more complex scenarios string-match-to-token excels. To get single tokens from within the parentheses of the following string, only string-match-to-token can be used, at least with any sensible configuration:
```
biz baz boz data has no content today (very important though) but it would still have if one had a means to extract it from (here is something else important as well) the strange patterns called parentheses that it seems to have been put in.
```
This would happen with

```
\((.*?)\)
```

and would result in the tokens "very important though" and "here is something else important as well".
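A plain-Scala sketch of that extraction, using only the standard library (variable names are illustrative):

```scala
// String-match-to-token: take capture group 1 of every match of \((.*?)\).
val data = "biz baz boz data has no content today (very important though) " +
  "but it would still have if one had a means to extract it from " +
  "(here is something else important as well) the strange patterns " +
  "called parentheses that it seems to have been put in."

val tokens = """\((.*?)\)""".r.findAllMatchIn(data).map(_.group(1)).toList
println(tokens)
// List(very important though, here is something else important as well)
```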
These methods must be kept separate both at search time and in the database, because they cannot be mixed.
For example, a bloom filter generated for the first case cannot be used for the second, or the other way around.
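To illustrate why the filters are incompatible, here is a sketch using Guava's `BloomFilter`; the actual bloom filter implementation in the database may differ, so treat this as an assumption for illustration only:

```scala
import java.nio.charset.StandardCharsets
import com.google.common.hash.{BloomFilter, Funnels}

// Assumed illustration: a bloom filter populated from tokenizer output.
val tokenizerBloom = BloomFilter.create[CharSequence](
  Funnels.stringFunnel(StandardCharsets.UTF_8), 1000L)
"foo bar@biz baz".split("[\\s@]+").foreach(t => tokenizerBloom.put(t))

// A string-match-to-token query probes with multi-word matches,
// which the tokenizer-built filter has never seen.
println(tokenizerBloom.mightContain("foo"))     // true
println(tokenizerBloom.mightContain("foo bar")) // false (barring a false positive)
```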
As a conclusion, I recommend that we drop regex filtering from tokenizer-generated tokens and use regex only for string-match-to-token based bloom filtering. This way the database will contain either a full tokenizer-produced bloom filter and/or a regex-enabled string-match-to-token bloom filter, which are enabled depending on availability.
Availability is deemed for
To aid development without too many changes in this short timeframe, I recommend temporarily postponing the tokenizer-based support, refactoring the current implementation to support the string-match-to-token way, and enabling the tokenizer-based one later.
Will move this feature into a new command
```
| teragrep exec regexextract <options>
```
and remove any regex filtering from the tokenizer step.
@51-code please open a new issue if there is something that should still be fixed.
Add

```
| teragrep exec regexextract <regexPattern> <inputColumn> <outputColumn>
```

command.
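A minimal sketch of one possible Spark implementation of these semantics; the helper name `regexExtract` and the preference for capture group 1 are assumptions, and the real command behavior is defined by this issue, not by this sketch:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical helper: collect every match of regexPattern found in
// inputColumn into an array column named outputColumn. If the pattern
// has a capture group, the first group is taken as the token (assumption).
def regexExtract(df: DataFrame, regexPattern: String,
                 inputColumn: String, outputColumn: String): DataFrame = {
  val pattern = regexPattern.r // compile once, capture in the UDF closure
  val extractAll = udf { (s: String) =>
    if (s == null) Seq.empty[String]
    else pattern.findAllMatchIn(s)
      .map(m => if (m.groupCount >= 1) m.group(1) else m.matched)
      .toSeq
  }
  df.withColumn(outputColumn, extractAll(col(inputColumn)))
}
```

With the parentheses example above, `regexExtract(df, """\((.*?)\)""", "_raw", "words")` would yield the two tokens "very important though" and "here is something else important as well".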