teragrep / pth_06

Teragrep Datasource for Apache Spark
GNU Affero General Public License v3.0

bloomfilter use reported to be very slow #26

Open kortemik opened 8 months ago

kortemik commented 8 months ago

Describe the bug: bloomfilter use decreases the performance of the search.

The environment has these filters and their counts, with fpp being 0.01:

+----------+------------------+----+
| count(*) | expectedElements | id |
+----------+------------------+----+
|      465 |              700 |  1 |
|     1472 |             5500 |  2 |
|      842 |            12500 |  3 |
|      485 |            36500 |  4 |
|     1182 |           115000 |  5 |
|     1026 |           230000 |  6 |
|     1192 |           375000 |  7 |
|     2103 |           610000 |  8 |
|     1993 |           905000 |  9 |
|     2180 |          1850000 | 10 |
|      714 |          2750000 | 11 |
|      569 |          3750000 | 12 |
|      406 |          5050000 | 13 |
|      198 |          6500000 | 14 |
|       80 |          8300000 | 15 |
|        8 |         11000000 | 16 |
+----------+------------------+----+

Expected behavior: bloomfilter use should increase the performance of the search.

How to reproduce: create the filters and have >400M rows in the logfile table.


Software version: 2.3.0



elliVM commented 8 months ago

Initial investigation shows ConditionBuilder to be slow with large indexes. Adding more logging.

elliVM commented 8 months ago
kortemik commented 8 months ago

How about the number of filter rows in the bloomdb.bloomfilter table?

elliVM commented 8 months ago

Example keyword query

%dpl index=search_bench earliest="1921-08-08T00:00:00" "5083157"

Without bloom the query took 2 min 16 sec; with bloom it took 35 sec.

elliVM commented 8 months ago

%dpl index=search_bench earliest="1921-08-08T00:00:00" "5083157"

After running bloom on another index (crud), the query slowed down from 35 sec to 3 minutes. At the time, the bloomdb.bloomfilter row count was 15822.

kortemik commented 8 months ago

the same query?

elliVM commented 8 months ago

Yes, the search_bench query.

elliVM commented 7 months ago

Indeed, the number of bloom filter rows in the database slows the query down.

Search times with bloom filters: index crud = 4 min 20 sec; crud + crud_mini = 5 min 30 sec.

elliVM commented 7 months ago

StreamDBClient joins the full bloomfilter table before the where condition is applied, which slows it down when there is a large number of filter rows:

from getArchivedObjects_filter_table
join journaldb.logfile
  on ( journaldb.logfile.host_id = getArchivedObjects_filter_table.host_id
       and journaldb.logfile.logtag = getArchivedObjects_filter_table.tag )
left outer join bloomdb.bloomfilter
  on journaldb.logfile.id = bloomdb.bloomfilter.partition_id
where ( journaldb.logfile.logdate = {d '1923-01-20'}

elliVM commented 7 months ago

The bloom filter table and the temp table are joined using join type ALL, which is a full table scan. Looking into creating the correct indexes to change that.
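As a rough illustration (not the exact statement used in the project), MariaDB's ANALYZE FORMAT=JSON shows the chosen access_type and per-table timings for this join; an access_type of ALL on bloomdb.bloomfilter or on the temp table confirms the full table scan. Table and column names follow the query fragment above; the date literal is only a placeholder.

-- Sketch: run the archive query through ANALYZE to see how the optimizer
-- joins the temp table with journaldb.logfile and bloomdb.bloomfilter.
ANALYZE FORMAT=JSON
SELECT journaldb.logfile.id
FROM getArchivedObjects_filter_table
JOIN journaldb.logfile
  ON journaldb.logfile.host_id = getArchivedObjects_filter_table.host_id
 AND journaldb.logfile.logtag = getArchivedObjects_filter_table.tag
LEFT OUTER JOIN bloomdb.bloomfilter
  ON journaldb.logfile.id = bloomdb.bloomfilter.partition_id
WHERE journaldb.logfile.logdate = DATE '1923-01-20';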

elliVM commented 7 months ago

partition_id was made unique in a later commit, but the change was not run through jOOQ code generation; running code gen with

partition_id bigint(20) unsigned NOT NULL UNIQUE

The unique constraint should help join performance by providing a better index.
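For reference, a minimal sketch of what adding that constraint could look like; the constraint name is made up here and the real DDL lives in the schema migrations:

-- Sketch only: a unique index on partition_id lets the left outer join use an
-- eq_ref lookup per logfile row instead of a wider scan.
ALTER TABLE bloomdb.bloomfilter
  ADD CONSTRAINT uq_bloomfilter_partition_id UNIQUE (partition_id);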

elliVM commented 7 months ago

Added an index for the bloomfilter temp table's type_id column.
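The index name ix_bloomType and its key parts (type_id, term_id) show up in the ANALYZE output later in this thread; the exact statement below is an assumption about how it is created:

-- Sketch: a composite index on the per-query temp table so the bloommatch
-- subquery can resolve each (type_id, term_id) pair with a ref lookup.
CREATE INDEX ix_bloomType ON bloomFiltersTable (type_id, term_id);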

elliVM commented 7 months ago

The bloommatch function is not present in some queries; testing that it works correctly.

Edit: worked in local tests

elliVM commented 7 months ago

Local performance shows two slow actions with the filters mentioned in this ticket:

1800 ms in the StreamDBClient constructor: this.journaldbCondition = walker.fromString(config.getQuery(), false);

2000 ms in the pullToSliceTable insert: int rows = insert.execute();

elliVM commented 7 months ago

Using 20 filters with a 50% match rate, sized between 700 and 10M: enabling bloom slowed a single pull to the slice table with 4000 logfiles from 0.2 seconds (bloom disabled) to 10 seconds.

elliVM commented 7 months ago

An increased number of large filters (over 1 million expected elements) slows the query down; the slowdown is caused by both the increased number of filters and the very large filter sizes. Using 3-6 filters with sizes of >100 000 was as fast as a query without bloom.

Filters used:

map.put(700, 0.01);
map.put(5500, 0.01);
map.put(12500, 0.01);
map.put(36500, 0.01);
map.put(115000, 0.03);
map.put(230000, 0.03);
map.put(375000, 0.01);
map.put(610000, 0.01);
map.put(905000, 0.01);
map.put(1850000, 0.01);
map.put(2750000, 0.01);
map.put(3750000, 0.01);
map.put(5050000, 0.01);
map.put(6500000, 0.01);

Pulling matches for a single day with 1000 logfiles

No bloomfilters

with bloomfilters 50% match

Full Query

select "journaldb"."logfile"."id" "id", "getArchivedObjects_filter_table"."directory" "directory", "getArchivedObjects_filter_table"."stream" "stream", UNIX_TIMESTAMP(STR_TO_DATE(SUBSTRING(REGEXP_SUBSTR("journaldb"."logfile"."path",'^\\d{4}\\/\\d{2}-\\d{2}\\/[\\w\\.-]+\\/([^\\p{Z}\\p{C}]+?)\\/([^\\p{Z}\\p{C}]+)(-@)?(\\d+|)-(\\d{4}\\d{2}\\d{2}\\d{2})'), -10, 10), '%Y%m%d%H')) "logtime" from "getArchivedObjects_filter_table" join "journaldb"."logfile" on ( "journaldb"."logfile"."host_id" = "getArchivedObjects_filter_table"."host_id" and "journaldb"."logfile"."logtag" = "getArchivedObjects_filter_table"."tag" ) left outer join "bloomdb"."bloomfilter" on "journaldb"."logfile"."id" = "bloomdb"."bloomfilter"."partition_id" where ( "journaldb"."logfile"."logdate" = date '2020-10-20' and "getArchivedObjects_filter_table"."directory" like 'storage' and "journaldb"."logfile"."logdate" >= date '2020-01-01' and (UNIX_TIMESTAMP(STR_TO_DATE(SUBSTRING(REGEXP_SUBSTR(path,'[0-9]+(\.log)?\.gz(\.[0-9]*)?$'), 1, 10), '%Y%m%d%H')) >= 1577829600) and ( bloommatch( ( select "bloomFiltersTable"."filter" from "bloomFiltersTable" where ( term_id = 0 and type_id = "bloomdb"."bloomfilter"."filter_type_id" ) ), "bloomdb"."bloomfilter"."filter" ) = true or "bloomdb"."bloomfilter"."filter" is null ) ) order by logtime, "journaldb"."logfile"."id" asc

Slow part of the analyze output:

\"table\": {\n \"table_name\": \"bloomfilter\",\n \"access_type\": \"eq_ref\",\n \"possible_keys\": [\"partition_id\"],\n \"key\": \"partition_id\",\n \"key_length\": \"8\",\n \"used_key_parts\": [\"partition_id\"],\n \"ref\": [\"archiver_journal_tyrael.logfile.id\"],\n \"r_loops\": 999,\n \"rows\": 1,\n \"r_rows\": 1,\n \"r_table_time_ms\": 11081.63296,\n \"r_other_time_ms\": 13.31159884,\n \"filtered\": 100,\n \"r_filtered\": 100,\n \"attached_condition\": \"trigcond(bloommatch((subquery#2) AS (select bloomFiltersTable.filter from bloomFiltersTable where (term_id = 0 and type_id = bloomdb.bloomfilter.filter_type_id)),bloomdb.bloomfilter.filter AS filter) = 1 or bloomdb.bloomfilter.filter is null)\"\n },\n \"subqueries\": [\n {\n \"expression_cache\": {\n \"state\": \"disabled\",\n \"r_loops\": 0,\n \"query_block\": {\n \"select_id\": 2,\n \"r_loops\": 999,\n \"r_total_time_ms\": 979.7890605,\n \"outer_ref_condition\": \"bloomdb.bloomfilter.filter_type_id is not null\",\n \"table\": {\n \"table_name\": \"bloomFiltersTable\",\n \"access_type\": \"ref\",\n \"possible_keys\": [\"ix_bloomType\"],\n \"key\": \"ix_bloomType\",\n \"key_length\": \"18\",\n \"used_key_parts\": [\"type_id\", \"term_id\"],\n \"ref\": [\"bloomdb.bloomfilter.filter_type_id\", \"const\"],\n \"r_loops\": 999,\n \"rows\": 1,\n \"r_rows\": 1,\n \"r_table_time_ms\": 588.4710006,\n \"r_other_time_ms\": 377.992746,\n \"filtered\": 100,\n \"r_filtered\": 100,\n \"index_condition\": \"bloomFiltersTable.type_id = bloomdb.bloomfilter.filter_type_id\"\n }\n }\n }\n }\n ]\n }\n }\n }\n}"]]

Possible causes?

elliVM commented 6 months ago

Tested possible causes. Running multiple small filters did not cause significant performance issues, but after adding a single large filter (over 10 million expected elements) the query time increased from milliseconds to multiple seconds. Looking into causes; most likely a database engine issue with a large BLOB.

elliVM commented 6 months ago

Tried moving the BLOB column into its own SQL table to help database engine planning, but no clear improvement was found.
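Roughly, the split that was tried looks like the sketch below; the filterdata table and the filter_data_id column match the ANALYZE output in the next comment, but the exact DDL here is an assumption:

-- Sketch: keep the large filter BLOB out of bloomdb.bloomfilter and reference
-- it through filter_data_id instead.
CREATE TABLE bloomdb.filterdata (
    id     BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    filter LONGBLOB        NOT NULL
);

ALTER TABLE bloomdb.bloomfilter
  ADD COLUMN filter_data_id BIGINT UNSIGNED,
  ADD CONSTRAINT fk_bloomfilter_filterdata
      FOREIGN KEY (filter_data_id) REFERENCES bloomdb.filterdata (id);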

elliVM commented 6 months ago

"subqueries": [ { "expression_cache": { "state": "disabled", "r_loops": 0, "query_block": { "select_id": 4, "r_loops": 1000, "r_total_time_ms": 565.1666425, "outer_ref_condition": "bloomdb.bloomfilter.filter_data_id is not null", "table": { "table_name": "filterdata", "access_type": "eq_ref", "possible_keys": ["PRIMARY"], "key": "PRIMARY", "key_length": "8", "used_key_parts": ["id"], "ref": ["bloomdb.bloomfilter.filter_data_id"], "r_loops": 1000, "rows": 1, "r_rows": 1, "r_table_time_ms": 545.0623699, "r_other_time_ms": 16.11708964, "filtered": 100, "r_filtered": 100 } } } }

The bloommatch subquery is slow, taking about 500 ms, while the subquery for the null checks took around 20 ms; row-by-row operations in the where clause slow the query down. Looking for alternatives using common table expressions.

Article about UDFs doing row-by-row checks: https://www.itprotoday.com/sql-server/beware-row-row-operations-udf-clothing
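A sketch of the kind of common table expression rewrite being considered; this is not the project's actual query, only an illustration of resolving the candidate filter once per filter type instead of re-running the subquery for every row:

-- Sketch: pull the term's candidate filters into a CTE, join on filter_type_id,
-- and keep the null check for logfiles that have no bloom filter at all.
WITH candidate_filters AS (
    SELECT type_id, filter
    FROM bloomFiltersTable
    WHERE term_id = 0
)
SELECT lf.id
FROM journaldb.logfile lf
LEFT OUTER JOIN bloomdb.bloomfilter bf ON lf.id = bf.partition_id
LEFT OUTER JOIN candidate_filters cf ON cf.type_id = bf.filter_type_id
WHERE bf.filter IS NULL
   OR bloommatch(cf.filter, bf.filter) = TRUE;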

elliVM commented 6 months ago

Tested different compression algorithms in MariaDB; the lz4 algorithm was faster than the default zlib.

zlib: 2889 ms, lz4: 1285 ms
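Assuming the test used InnoDB page compression (the thread does not spell out the mechanism), switching the algorithm is a server-level setting plus a table option in MariaDB; lz4 availability depends on how the server was built:

-- Sketch: use lz4 for InnoDB page compression and enable page compression on
-- the table holding the bloom filter BLOBs.
SET GLOBAL innodb_compression_algorithm = 'lz4';
ALTER TABLE bloomdb.bloomfilter PAGE_COMPRESSED = 1;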

elliVM commented 6 months ago

Changed the bloomfilter condition from an OR operation to a CASE switch to help the query optimizer with the null checks.
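Roughly, the shape of that change, shown here as an illustrative query rather than the generated jOOQ output: the OR between the bloommatch result and the null check becomes a CASE that handles the missing-filter case first and only then calls the UDF.

-- Sketch: evaluate the cheap null check before the bloommatch UDF.
SELECT lf.id
FROM journaldb.logfile lf
LEFT OUTER JOIN bloomdb.bloomfilter bf ON lf.id = bf.partition_id
WHERE CASE
          WHEN bf.filter IS NULL THEN TRUE
          ELSE bloommatch(
                   (SELECT f.filter
                      FROM bloomFiltersTable f
                     WHERE f.term_id = 0
                       AND f.type_id = bf.filter_type_id),
                   bf.filter)
      END;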

elliVM commented 1 month ago

Possibly solved with #30