Open Lewuathe opened 5 years ago
Just FYI, the characteristics is derived from their algorithms and illustrated on for example this http://lh3lh3.users.sourceforge.net/reb.shtml
Both libraries define the same set of functions, so they cannot be loaded together. It could be solved with https://github.com/prestosql/presto/issues/8.
However maybe we could have something like joni_regexp_like
and re2j_regexp_like
functions defined. This would pollute the function namespace, and it's usage would not be dynamic.
However maybe we could have something like joni_regexp_like and re2j_regexp_like functions defined. This would pollute the function namespace, and it's usage would not be dynamic.
Regarding the use cases we are considering, it's enough to define the different functions. But as you said, it's not good in terms of the namespace convention.
@martint Do you think it's acceptable to create aliases joni_regexp_like
, re2j_regexp_like
. Or #8 can be the solution for this kind of alias problem?
To be exact, joni defines something like regex_like(varchar, regex_type_for_joni)
and re2j defines something like regex_like(varchar, regex_type_for_re2j)
. User is passing passing two varchars. We cannot load them both because implicit cast would be ambiguous.
Or specifying the library as a session parameter might be another option.
This is problematic as Presto functions do not have access to system session properties.
However maybe we could have something like joni_regexp_like and re2j_regexp_like functions defined. This would pollute the function namespace, and it's usage would not be dynamic.
That would work.
What if we can cast the regex to JoniRegexpType
or Re2JRegexpType
and resolve it , then we can use similar function name right ?
In the short term, yes, #8 would be a way around it, either by defining the function in different namespaces and adding the desired on the the SQL PATH or by allowing functions to take session properties.
In the long term, we actually want to get rid of the JONI vs RE2J distinction if we can. We just haven't spent the time to make RE2J faster in every scenario. Adding the two implementation-specific aliases is problematic because it's harder to remove them in the future without breaking compatibility. If we were to expose different regex functions, it'd be based on the type of regex (POSIX, PCRE, etc), not based on their implementation library.
I agree that in the long term it should be specified by type.
Just FYI, why we want to use RE2J is to speed up a pattern which is for example matching "Presto|Hive|Spark|Big ?Query" with Web page title to query a visitor who are interested in query engines. (real example has a hundred of |
s) But JONI is not good at such patterns. (see also "Backtracking vs. state machines" section of Benchmark of Regex Libraries)
@Lewuathe @nurse Please note that https://github.com/trinodb/trino/pull/20619 is going to remove support for RE2J.
cc: @wendigo
We recently see a query pattern runs extremely slow with JONI. A query with multiple |
on long string (up to 65536) keeps entire cluster busy until reached query timeout limit (1h). So this is not only a perf issue, but also stability issue.
Changing to RE2J finishes within 20s.
Want to call this out that people hitting perf issue with JONI may still need a workaround.
Currently, we support two types of library
JONI
andRE2J
for regex function. It's statically decided at launch time. But we sometimes want to change the library dynamically because the performance characteristics of each library are different. Selecting the appropriate library that fits each use case is desirable.My suggestion is extending functions to support another field to specify the library as follows so that we can switch the library by calling.
Or specifying the library as a session parameter might be another option.