teragrep / pth_10

Data Processing Language (DPL) translator for Apache Spark
GNU Affero General Public License v3.0

Wildcards in keyword slow down the search #246

Closed ronja-ui closed 4 months ago

ronja-ui commented 5 months ago

Describe the bug: Wildcards in the keyword slow down the search tremendously.

Expected behavior: The search time should remain tolerable even with wildcards.

How to reproduce: The following query takes forever:

%dpl
index=[dataset] earliest="10/10/2023:13:00:00" latest="10/10/2023:13:10:00" "*[keyword]*"

In contrast, this query is quite quick:

%dpl
index=[dataset] earliest="10/10/2023:13:00:00" latest="10/10/2023:13:10:00" "[keyword]"

Software version: pth_07 5.17.0

eemhu commented 4 months ago

If wildcards are used in search, bloom.withoutFilter is set to true on the datasource. Could this be the culprit?

eemhu commented 4 months ago

Also, these * wildcards are currently treated as normal characters, so the query searches for a literal *keyword*, which most likely won't return any results.
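
To illustrate the difference between the two interpretations, here is a minimal Python sketch. It is not pth_10 code: the function names and sample tokens are hypothetical, and shell-style globbing from the standard library stands in for whatever wildcard semantics DPL actually uses.

```python
from fnmatch import fnmatchcase


def literal_match(term: str, token: str) -> bool:
    # Treats '*' as an ordinary character: only an exact match counts.
    return term == token


def wildcard_match(pattern: str, token: str) -> bool:
    # Treats '*' as "any run of characters, including none" (shell-style glob).
    return fnmatchcase(token, pattern)


# Hypothetical tokens extracted from log events.
tokens = ["keyword", "mykeywords", "other"]

literal_hits = [t for t in tokens if literal_match("*keyword*", t)]
wildcard_hits = [t for t in tokens if wildcard_match("*keyword*", t)]

print(literal_hits)
print(wildcard_hits)
```

With the literal interpretation nothing matches, because no token contains actual asterisks. With the wildcard interpretation both "keyword" and "mykeywords" match.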

elliVM commented 4 months ago

@eemhu If the goal is to disable bloom filtering completely when a wildcard is used, set the bloom.enabled option to false instead of using bloom.withoutFilter. The withoutFilter option limits the search to archive files that don't have a generated bloom filter, so you could miss some wildcard matches. I assume that in a wildcard search the presence of a bloom filter should not have any effect on the search results.
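
The difference between the two options can be sketched as follows. This is a toy Python model of the behaviour described in this comment, not the actual datasource code: the file names, the `has_filter` flag and the `select_files` helper are all hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical archive files: the tokens they contain and whether a
# bloom filter has been generated for them.
archive_files = {
    "a.log": {"tokens": ["mykeyword", "foo"], "has_filter": True},
    "b.log": {"tokens": ["keyword2"], "has_filter": False},
    "c.log": {"tokens": ["bar"], "has_filter": True},
}


def select_files(pattern: str, files: dict, only_without_filter: bool) -> list:
    """Return names of files containing a token that matches the pattern.

    only_without_filter=True models the withoutFilter behaviour described
    above: files that have a bloom filter are skipped entirely.
    only_without_filter=False models bloom filtering being disabled:
    every file is scanned, whether or not it has a filter.
    """
    hits = []
    for name, info in files.items():
        if only_without_filter and info["has_filter"]:
            continue  # skipped without ever being scanned
        if any(fnmatchcase(tok, pattern) for tok in info["tokens"]):
            hits.append(name)
    return hits


without_filter_hits = select_files("*keyword*", archive_files, True)
bloom_disabled_hits = select_files("*keyword*", archive_files, False)

print(without_filter_hits)
print(bloom_disabled_hits)
```

In this model, restricting the search to files without a filter finds only b.log and misses the match in a.log, while scanning all files finds both.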

eemhu commented 4 months ago

Changed the config to use bloom.enabled in the pull request.