trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
10.32k stars 2.97k forks source link

Switch the library used in regex function by calling #768

Open Lewuathe opened 5 years ago

Lewuathe commented 5 years ago

Currently, we support two types of library JONI and RE2J for regex function. It's statically decided at launch time. But we sometimes want to change the library dynamically because the performance characteristics of each library are different. Selecting the appropriate library that fits each use case is desirable.

My suggestion is extending functions to support another field to specify the library as follows so that we can switch the library by calling.

regexp_like(col, '...', 'JONI')
regexp_like(col, '...', 'REF2J')

Or specifying the library as a session parameter might be another option.

nurse commented 5 years ago

Just FYI, the characteristics is derived from their algorithms and illustrated on for example this http://lh3lh3.users.sourceforge.net/reb.shtml

kokosing commented 5 years ago

Both libraries define the same set of functions, so they cannot be loaded together. It could be solved with https://github.com/prestosql/presto/issues/8.

However maybe we could have something like joni_regexp_like and re2j_regexp_like functions defined. This would pollute the function namespace, and it's usage would not be dynamic.

Lewuathe commented 5 years ago

However maybe we could have something like joni_regexp_like and re2j_regexp_like functions defined. This would pollute the function namespace, and it's usage would not be dynamic.

Regarding the use cases we are considering, it's enough to define the different functions. But as you said, it's not good in terms of the namespace convention.

@martint Do you think it's acceptable to create aliases joni_regexp_like, re2j_regexp_like. Or #8 can be the solution for this kind of alias problem?

kokosing commented 5 years ago

To be exact, joni defines something like regex_like(varchar, regex_type_for_joni) and re2j defines something like regex_like(varchar, regex_type_for_re2j). User is passing passing two varchars. We cannot load them both because implicit cast would be ambiguous.

sopel39 commented 5 years ago

Or specifying the library as a session parameter might be another option.

This is problematic as Presto functions do not have access to system session properties.

However maybe we could have something like joni_regexp_like and re2j_regexp_like functions defined. This would pollute the function namespace, and it's usage would not be dynamic.

That would work.

Praveen2112 commented 5 years ago

What if we can cast the regex to JoniRegexpType or Re2JRegexpType and resolve it , then we can use similar function name right ?

martint commented 5 years ago

In the short term, yes, #8 would be a way around it, either by defining the function in different namespaces and adding the desired on the the SQL PATH or by allowing functions to take session properties.

In the long term, we actually want to get rid of the JONI vs RE2J distinction if we can. We just haven't spent the time to make RE2J faster in every scenario. Adding the two implementation-specific aliases is problematic because it's harder to remove them in the future without breaking compatibility. If we were to expose different regex functions, it'd be based on the type of regex (POSIX, PCRE, etc), not based on their implementation library.

nurse commented 5 years ago

I agree that in the long term it should be specified by type.

Just FYI, why we want to use RE2J is to speed up a pattern which is for example matching "Presto|Hive|Spark|Big ?Query" with Web page title to query a visitor who are interested in query engines. (real example has a hundred of |s) But JONI is not good at such patterns. (see also "Backtracking vs. state machines" section of Benchmark of Regex Libraries)

ebyhr commented 8 months ago

@Lewuathe @nurse Please note that https://github.com/trinodb/trino/pull/20619 is going to remove support for RE2J.

cc: @wendigo

jinyangli34 commented 2 months ago

We recently see a query pattern runs extremely slow with JONI. A query with multiple | on long string (up to 65536) keeps entire cluster busy until reached query timeout limit (1h). So this is not only a perf issue, but also stability issue. Changing to RE2J finishes within 20s. Want to call this out that people hitting perf issue with JONI may still need a workaround.