teragrep / pth_06

Teragrep Datasource for Apache Spark
GNU Affero General Public License v3.0

bloomfilter use reported to be very slow #26

Open kortemik opened 8 months ago

kortemik commented 8 months ago

Describe the bug: bloomfilter use decreases the performance of the search.

The environment has these filters and their counts, with fpp being 0.01:

+----------+------------------+----+
| count(*) | expectedElements | id |
+----------+------------------+----+
|      465 |              700 |  1 |
|     1472 |             5500 |  2 |
|      842 |            12500 |  3 |
|      485 |            36500 |  4 |
|     1182 |           115000 |  5 |
|     1026 |           230000 |  6 |
|     1192 |           375000 |  7 |
|     2103 |           610000 |  8 |
|     1993 |           905000 |  9 |
|     2180 |          1850000 | 10 |
|      714 |          2750000 | 11 |
|      569 |          3750000 | 12 |
|      406 |          5050000 | 13 |
|      198 |          6500000 | 14 |
|       80 |          8300000 | 15 |
|        8 |         11000000 | 16 |
+----------+------------------+----+

Expected behavior: bloomfilter use should increase the performance of the search.

How to reproduce: create the filters and have >400M rows in the logfile table.


Software version: 2.3.0



elliVM commented 8 months ago

Initial investigation shows ConditionBuilder to be slow with large indexes. Adding more logging.

elliVM commented 8 months ago
kortemik commented 8 months ago

How about the number of filter rows in the bloomdb.bloomfilter table?

elliVM commented 8 months ago

Example keyword query

%dpl index=search_bench earliest="1921-08-08T00:00:00" "5083157"

Without bloom the query took 2 min 16 sec; with bloom it took 35 sec.

elliVM commented 8 months ago

%dpl index=search_bench earliest="1921-08-08T00:00:00" "5083157"

After running bloom on another index (crud), the query slowed down from 35 sec to 3 minutes. At the time, the bloomdb.bloomfilter row count was 15822.

kortemik commented 8 months ago

the same query?

elliVM commented 8 months ago

Yes, the search_bench query.

elliVM commented 7 months ago

Indeed, the number of bloom filter rows in the database slows the query down.

Search times with bloom filters: index crud = 4 min 20 sec; crud + crud_mini = 5 min 30 sec.

elliVM commented 7 months ago

StreamDBClient joins the full bloomfilter table before the where condition is applied, which slows it down when there is a large number of filter rows:

from getArchivedObjects_filter_table
join journaldb.logfile
  on ( journaldb.logfile.host_id = getArchivedObjects_filter_table.host_id
       and journaldb.logfile.logtag = getArchivedObjects_filter_table.tag )
left outer join bloomdb.bloomfilter
  on journaldb.logfile.id = bloomdb.bloomfilter.partition_id
where ( journaldb.logfile.logdate = {d '1923-01-20'}

elliVM commented 7 months ago

The bloom filter table and the temp table are joined using join type ALL, which is a full table scan. Looking into creating the correct indexes to change that.
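As a rough illustration (not the exact statement used in the project), MariaDB's ANALYZE FORMAT=JSON shows the chosen access_type and per-table timings for this join; an access_type of ALL on bloomdb.bloomfilter or on the temp table confirms the full table scan. Table and column names follow the query fragment above; the date literal is only a placeholder.

-- Sketch: run the archive query through ANALYZE to see how the optimizer
-- joins the temp table with journaldb.logfile and bloomdb.bloomfilter.
ANALYZE FORMAT=JSON
SELECT journaldb.logfile.id
FROM getArchivedObjects_filter_table
JOIN journaldb.logfile
  ON journaldb.logfile.host_id = getArchivedObjects_filter_table.host_id
 AND journaldb.logfile.logtag = getArchivedObjects_filter_table.tag
LEFT OUTER JOIN bloomdb.bloomfilter
  ON journaldb.logfile.id = bloomdb.bloomfilter.partition_id
WHERE journaldb.logfile.logdate = DATE '1923-01-20';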

elliVM commented 7 months ago

partition_id was made unique in a later commit, but the change was not run through jOOQ code generation; running code gen with

partition_id bigint(20) unsigned NOT NULL UNIQUE

The unique constraint should help join performance by providing a better index.
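For reference, a minimal sketch of what adding that constraint could look like; the constraint name is made up here and the real DDL lives in the schema migrations:

-- Sketch only: a unique index on partition_id lets the left outer join use an
-- eq_ref lookup per logfile row instead of a wider scan.
ALTER TABLE bloomdb.bloomfilter
  ADD CONSTRAINT uq_bloomfilter_partition_id UNIQUE (partition_id);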

elliVM commented 7 months ago

Added an index for the bloomfilter temp table's type_id column.
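The index name ix_bloomType and its key parts (type_id, term_id) show up in the ANALYZE output later in this thread; the exact statement below is an assumption about how it is created:

-- Sketch: a composite index on the per-query temp table so the bloommatch
-- subquery can resolve each (type_id, term_id) pair with a ref lookup.
CREATE INDEX ix_bloomType ON bloomFiltersTable (type_id, term_id);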

elliVM commented 7 months ago

The bloommatch function is not present in some queries; testing that it works correctly.

Edit: worked in local tests

elliVM commented 7 months ago

Local performance shows two slow actions with the filters mentioned in this ticket:

1800 ms in the StreamDBClient constructor: this.journaldbCondition = walker.fromString(config.getQuery(), false);

2000 ms in the pullToSliceTable insert: int rows = insert.execute();

elliVM commented 7 months ago

Using 20 filters with a 50% match rate, sized between 700 and 10M: enabling bloom slowed a single pull to the slice table with 4000 logfiles from 0.2 seconds (bloom disabled) to 10 seconds.

elliVM commented 7 months ago

An increased number of large filters (over 1 million expected elements) slows the query down; the slowdown is caused by both the increased number of filters and the very large filter sizes. Using 3-6 filters with sizes of >100 000 was as fast as a query without bloom.

Filters used:

map.put(700, 0.01);
map.put(5500, 0.01);
map.put(12500, 0.01);
map.put(36500, 0.01);
map.put(115000, 0.03);
map.put(230000, 0.03);
map.put(375000, 0.01);
map.put(610000, 0.01);
map.put(905000, 0.01);
map.put(1850000, 0.01);
map.put(2750000, 0.01);
map.put(3750000, 0.01);
map.put(5050000, 0.01);
map.put(6500000, 0.01);

Pulling matches for a single day with 1000 logfiles

No bloomfilters

with bloomfilters 50% match

Full Query

select "journaldb"."logfile"."id" "id", "getArchivedObjects_filter_table"."directory" "directory", "getArchivedObjects_filter_table"."stream" "stream", UNIX_TIMESTAMP(STR_TO_DATE(SUBSTRING(REGEXP_SUBSTR("journaldb"."logfile"."path",'^\\d{4}\\/\\d{2}-\\d{2}\\/[\\w\\.-]+\\/([^\\p{Z}\\p{C}]+?)\\/([^\\p{Z}\\p{C}]+)(-@)?(\\d+|)-(\\d{4}\\d{2}\\d{2}\\d{2})'), -10, 10), '%Y%m%d%H')) "logtime" from "getArchivedObjects_filter_table" join "journaldb"."logfile" on ( "journaldb"."logfile"."host_id" = "getArchivedObjects_filter_table"."host_id" and "journaldb"."logfile"."logtag" = "getArchivedObjects_filter_table"."tag" ) left outer join "bloomdb"."bloomfilter" on "journaldb"."logfile"."id" = "bloomdb"."bloomfilter"."partition_id" where ( "journaldb"."logfile"."logdate" = date '2020-10-20' and "getArchivedObjects_filter_table"."directory" like 'storage' and "journaldb"."logfile"."logdate" >= date '2020-01-01' and (UNIX_TIMESTAMP(STR_TO_DATE(SUBSTRING(REGEXP_SUBSTR(path,'[0-9]+(\.log)?\.gz(\.[0-9]*)?$'), 1, 10), '%Y%m%d%H')) >= 1577829600) and ( bloommatch( ( select "bloomFiltersTable"."filter" from "bloomFiltersTable" where ( term_id = 0 and type_id = "bloomdb"."bloomfilter"."filter_type_id" ) ), "bloomdb"."bloomfilter"."filter" ) = true or "bloomdb"."bloomfilter"."filter" is null ) ) order by logtime, "journaldb"."logfile"."id" asc

Slow part of the analyze output:

\"table\": {\n \"table_name\": \"bloomfilter\",\n \"access_type\": \"eq_ref\",\n \"possible_keys\": [\"partition_id\"],\n \"key\": \"partition_id\",\n \"key_length\": \"8\",\n \"used_key_parts\": [\"partition_id\"],\n \"ref\": [\"archiver_journal_tyrael.logfile.id\"],\n \"r_loops\": 999,\n \"rows\": 1,\n \"r_rows\": 1,\n \"r_table_time_ms\": 11081.63296,\n \"r_other_time_ms\": 13.31159884,\n \"filtered\": 100,\n \"r_filtered\": 100,\n \"attached_condition\": \"trigcond(bloommatch((subquery#2) AS (select bloomFiltersTable.filter from bloomFiltersTable where (term_id = 0 and type_id = bloomdb.bloomfilter.filter_type_id)),bloomdb.bloomfilter.filter AS filter) = 1 or bloomdb.bloomfilter.filter is null)\"\n },\n \"subqueries\": [\n {\n \"expression_cache\": {\n \"state\": \"disabled\",\n \"r_loops\": 0,\n \"query_block\": {\n \"select_id\": 2,\n \"r_loops\": 999,\n \"r_total_time_ms\": 979.7890605,\n \"outer_ref_condition\": \"bloomdb.bloomfilter.filter_type_id is not null\",\n \"table\": {\n \"table_name\": \"bloomFiltersTable\",\n \"access_type\": \"ref\",\n \"possible_keys\": [\"ix_bloomType\"],\n \"key\": \"ix_bloomType\",\n \"key_length\": \"18\",\n \"used_key_parts\": [\"type_id\", \"term_id\"],\n \"ref\": [\"bloomdb.bloomfilter.filter_type_id\", \"const\"],\n \"r_loops\": 999,\n \"rows\": 1,\n \"r_rows\": 1,\n \"r_table_time_ms\": 588.4710006,\n \"r_other_time_ms\": 377.992746,\n \"filtered\": 100,\n \"r_filtered\": 100,\n \"index_condition\": \"bloomFiltersTable.type_id = bloomdb.bloomfilter.filter_type_id\"\n }\n }\n }\n }\n ]\n }\n }\n }\n}"]]

Possible causes?

elliVM commented 6 months ago

Tested possible causes. Running multiple small filters did not cause significant performance issues, but after adding a single large filter (over 10 million expected elements) the query time increased from milliseconds to multiple seconds. Looking into causes; most likely a database engine issue with a large BLOB.

elliVM commented 6 months ago

Tried moving the BLOB column into its own SQL table to help database engine planning, but no clear improvement was found.
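Roughly, the split that was tried looks like the sketch below; the filterdata table and the filter_data_id column match the ANALYZE output in the next comment, but the exact DDL here is an assumption:

-- Sketch: keep the large filter BLOB out of bloomdb.bloomfilter and reference
-- it through filter_data_id instead.
CREATE TABLE bloomdb.filterdata (
    id     BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    filter LONGBLOB        NOT NULL
);

ALTER TABLE bloomdb.bloomfilter
  ADD COLUMN filter_data_id BIGINT UNSIGNED,
  ADD CONSTRAINT fk_bloomfilter_filterdata
      FOREIGN KEY (filter_data_id) REFERENCES bloomdb.filterdata (id);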

elliVM commented 6 months ago

"subqueries": [ { "expression_cache": { "state": "disabled", "r_loops": 0, "query_block": { "select_id": 4, "r_loops": 1000, "r_total_time_ms": 565.1666425, "outer_ref_condition": "bloomdb.bloomfilter.filter_data_id is not null", "table": { "table_name": "filterdata", "access_type": "eq_ref", "possible_keys": ["PRIMARY"], "key": "PRIMARY", "key_length": "8", "used_key_parts": ["id"], "ref": ["bloomdb.bloomfilter.filter_data_id"], "r_loops": 1000, "rows": 1, "r_rows": 1, "r_table_time_ms": 545.0623699, "r_other_time_ms": 16.11708964, "filtered": 100, "r_filtered": 100 } } } }

The bloommatch subquery is slow, taking about 500 ms, while the subquery for the null checks took around 20 ms; row-by-row operations in the where clause slow the query down. Looking for alternatives using common table expressions.

Article about UDFs doing row-by-row checks: https://www.itprotoday.com/sql-server/beware-row-row-operations-udf-clothing
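A sketch of the kind of common table expression rewrite being considered; this is not the project's actual query, only an illustration of resolving the candidate filter once per filter type instead of re-running the subquery for every row:

-- Sketch: pull the term's candidate filters into a CTE, join on filter_type_id,
-- and keep the null check for logfiles that have no bloom filter at all.
WITH candidate_filters AS (
    SELECT type_id, filter
    FROM bloomFiltersTable
    WHERE term_id = 0
)
SELECT lf.id
FROM journaldb.logfile lf
LEFT OUTER JOIN bloomdb.bloomfilter bf ON lf.id = bf.partition_id
LEFT OUTER JOIN candidate_filters cf ON cf.type_id = bf.filter_type_id
WHERE bf.filter IS NULL
   OR bloommatch(cf.filter, bf.filter) = TRUE;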

elliVM commented 6 months ago

Tested different compression algorithms in MariaDB; the lz4 algorithm was faster than the default zlib.

zlib: 2889 ms, lz4: 1285 ms
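Assuming the test used InnoDB page compression (the thread does not spell out the mechanism), switching the algorithm is a server-level setting plus a table option in MariaDB; lz4 availability depends on how the server was built:

-- Sketch: use lz4 for InnoDB page compression and enable page compression on
-- the table holding the bloom filter BLOBs.
SET GLOBAL innodb_compression_algorithm = 'lz4';
ALTER TABLE bloomdb.bloomfilter PAGE_COMPRESSED = 1;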

elliVM commented 6 months ago

Changed the bloomfilter condition from an OR operation to a CASE switch to help the query optimizer with the null checks.
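Roughly, the shape of that change, shown here as an illustrative query rather than the generated jOOQ output: the OR between the bloommatch result and the null check becomes a CASE that handles the missing-filter case first and only then calls the UDF.

-- Sketch: evaluate the cheap null check before the bloommatch UDF.
SELECT lf.id
FROM journaldb.logfile lf
LEFT OUTER JOIN bloomdb.bloomfilter bf ON lf.id = bf.partition_id
WHERE CASE
          WHEN bf.filter IS NULL THEN TRUE
          ELSE bloommatch(
                   (SELECT f.filter
                      FROM bloomFiltersTable f
                     WHERE f.term_id = 0
                       AND f.type_id = bf.filter_type_id),
                   bf.filter)
      END;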

elliVM commented 1 month ago

Possibly solved with #30