kortemik opened 8 months ago
Initial investigation shows ConditionBuilder to be slow with large indexes; adding more logging.
How about the number of filter rows in the bloomdb.bloomfilter table?
Example keyword query:

```
%dpl index=search_bench earliest="1921-08-08T00:00:00" "5083157"
```

Without bloom the query took 2 min 16 sec; with bloom it took 35 sec.
After running bloom on another index (crud) the query slowed down from 35 sec to 3 minutes. At the time the bloomdb.bloomfilter row count was 15822.
The same query?
Yes, the search_bench query.
Indeed, the number of rows in the bloomdb.bloomfilter table slows the query down.
Search times with bloom filters: index crud = 4 min 20 sec; crud + crud_mini = 5 min 30 sec.
StreamDBClient joins the full bloomfilter table before the where condition is applied, which slows the query down when the filter table has a large number of rows:
```sql
from getArchivedObjects_filter_table
join journaldb.logfile
  on (
    journaldb.logfile.host_id = getArchivedObjects_filter_table.host_id
    and journaldb.logfile.logtag = getArchivedObjects_filter_table.tag
  )
left outer join bloomdb.bloomfilter
  on journaldb.logfile.id = bloomdb.bloomfilter.partition_id
where (
  journaldb.logfile.logdate = {d '1923-01-20'}
```
The bloom filter table and the temp table are joined using join type ALL, i.e. a full table scan. Looking into adding the correct indexes to change that.
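The join type can be inspected with MariaDB's `ANALYZE` statement, which is also what produced the JSON plans quoted later in this thread. A minimal sketch (the actual query is the archive query under investigation):

```sql
-- Runs the statement and reports per-table access_type, r_rows and timings;
-- a table showing access_type "ALL" is being read with a full table scan.
ANALYZE FORMAT=JSON
SELECT ...;
```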
partition_id was made unique in a later commit but the change was not run through jOOQ code generation; running codegen with

```sql
partition_id bigint(20) unsigned NOT NULL UNIQUE
```

The unique constraint should help join performance by providing a better index.
Added an index on the bloomfilter temp table's type_id column.
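A sketch of the DDL these two changes correspond to (constraint and index names are assumptions; the `ix_bloomType` name and its `(type_id, term_id)` key parts appear in the ANALYZE output quoted below):

```sql
-- Unique index on partition_id enables an eq_ref lookup from logfile.id
-- instead of a full scan of the bloomfilter table
ALTER TABLE bloomdb.bloomfilter
    ADD CONSTRAINT uq_bloomfilter_partition_id UNIQUE (partition_id);

-- Composite index on the temp table serves the (type_id, term_id) lookup
-- made by the bloommatch subquery
CREATE INDEX ix_bloomType ON bloomFiltersTable (type_id, term_id);
```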
The bloommatch function is not present in some queries; testing that it works correctly.
Edit: worked in local tests.
Local performance: the two slow actions with filters mentioned in this ticket.

~1800 ms in the StreamDBClient constructor:

```java
this.journaldbCondition = walker.fromString(config.getQuery(), false);
```

~2000 ms in pullToSliceTable.insert:

```java
int rows = insert.execute();
```
Using 20 filters with a 50% match rate, sized between 700 and 10M expected items: enabling bloom slowed a single pull to the slice table with 4000 logfiles from 0.2 sec (bloom disabled) to 10 sec.
Increasing the number of large filters (+1 million expected items) slows the query further. The slowdown is caused by both the increased number of filters and very large filter sizes. Using 3-6 filters with sizes of >100 000 was as fast as the query without bloom.
Filters used:

```java
map.put(700, 0.01);
map.put(5500, 0.01);
map.put(12500, 0.01);
map.put(36500, 0.01);
map.put(115000, 0.03);
map.put(230000, 0.03);
map.put(375000, 0.01);
map.put(610000, 0.01);
map.put(905000, 0.01);
map.put(1850000, 0.01);
map.put(2750000, 0.01);
map.put(3750000, 0.01);
map.put(5050000, 0.01);
map.put(6500000, 0.01);
```
```sql
select
  "journaldb"."logfile"."id" "id",
  "getArchivedObjects_filter_table"."directory" "directory",
  "getArchivedObjects_filter_table"."stream" "stream",
  UNIX_TIMESTAMP(STR_TO_DATE(SUBSTRING(REGEXP_SUBSTR("journaldb"."logfile"."path",
    '^\\d{4}\\/\\d{2}-\\d{2}\\/[\\w\\.-]+\\/([^\\p{Z}\\p{C}]+?)\\/([^\\p{Z}\\p{C}]+)(-@)?(\\d+|)-(\\d{4}\\d{2}\\d{2}\\d{2})'),
    -10, 10), '%Y%m%d%H')) "logtime"
from "getArchivedObjects_filter_table"
join "journaldb"."logfile"
  on (
    "journaldb"."logfile"."host_id" = "getArchivedObjects_filter_table"."host_id"
    and "journaldb"."logfile"."logtag" = "getArchivedObjects_filter_table"."tag"
  )
left outer join "bloomdb"."bloomfilter"
  on "journaldb"."logfile"."id" = "bloomdb"."bloomfilter"."partition_id"
where (
  "journaldb"."logfile"."logdate" = date '2020-10-20'
  and "getArchivedObjects_filter_table"."directory" like 'storage'
  and "journaldb"."logfile"."logdate" >= date '2020-01-01'
  and (UNIX_TIMESTAMP(STR_TO_DATE(SUBSTRING(REGEXP_SUBSTR(path,
    '[0-9]+(\.log)?\.gz(\.[0-9]*)?$'), 1, 10), '%Y%m%d%H')) >= 1577829600)
  and (
    bloommatch(
      (
        select "bloomFiltersTable"."filter"
        from "bloomFiltersTable"
        where (term_id = 0 and type_id = "bloomdb"."bloomfilter"."filter_type_id")
      ),
      "bloomdb"."bloomfilter"."filter"
    ) = true
    or "bloomdb"."bloomfilter"."filter" is null
  )
)
order by logtime, "journaldb"."logfile"."id" asc
```
```json
"table": {
  "table_name": "bloomfilter",
  "access_type": "eq_ref",
  "possible_keys": ["partition_id"],
  "key": "partition_id",
  "key_length": "8",
  "used_key_parts": ["partition_id"],
  "ref": ["archiver_journal_tyrael.logfile.id"],
  "r_loops": 999,
  "rows": 1,
  "r_rows": 1,
  "r_table_time_ms": 11081.63296,
  "r_other_time_ms": 13.31159884,
  "filtered": 100,
  "r_filtered": 100,
  "attached_condition": "trigcond(bloommatch((subquery#2) AS (select bloomFiltersTable.filter from bloomFiltersTable where (term_id = 0 and type_id = bloomdb.bloomfilter.filter_type_id)),bloomdb.bloomfilter.filter AS filter) = 1 or bloomdb.bloomfilter.filter is null)"
},
"subqueries": [
  {
    "expression_cache": {
      "state": "disabled",
      "r_loops": 0,
      "query_block": {
        "select_id": 2,
        "r_loops": 999,
        "r_total_time_ms": 979.7890605,
        "outer_ref_condition": "bloomdb.bloomfilter.filter_type_id is not null",
        "table": {
          "table_name": "bloomFiltersTable",
          "access_type": "ref",
          "possible_keys": ["ix_bloomType"],
          "key": "ix_bloomType",
          "key_length": "18",
          "used_key_parts": ["type_id", "term_id"],
          "ref": ["bloomdb.bloomfilter.filter_type_id", "const"],
          "r_loops": 999,
          "rows": 1,
          "r_rows": 1,
          "r_table_time_ms": 588.4710006,
          "r_other_time_ms": 377.992746,
          "filtered": 100,
          "r_filtered": 100,
          "index_condition": "bloomFiltersTable.type_id = bloomdb.bloomfilter.filter_type_id"
        }
      }
    }
  }
]
```
Possible causes?
Tested possible causes. Running multiple small filters did not cause significant performance issues, but after adding a single large filter (+10 million expected items) the query time increased from milliseconds to multiple seconds. Looking into causes; most likely a database engine issue with a large BLOB.
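For scale, the standard bloom filter sizing formula m = -n * ln(p) / ln(2)^2 puts a filter with 10 million expected items at fpp 0.01 around 96 Mbit, i.e. a BLOB of roughly 11-12 MiB per row. A quick arithmetic check (plain SQL, no tables needed):

```sql
-- bits = -n * ln(p) / ln(2)^2 ; divide by 8 for bytes, then down to MiB
SELECT CEIL(-10000000 * LN(0.01) / POW(LN(2), 2) / 8 / 1024 / 1024) AS filter_mib;
-- returns 12
```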
Tried moving the BLOB column into its own SQL table to help the database engine's planning, but no clear improvement was found:
```json
"subqueries": [
  {
    "expression_cache": {
      "state": "disabled",
      "r_loops": 0,
      "query_block": {
        "select_id": 4,
        "r_loops": 1000,
        "r_total_time_ms": 565.1666425,
        "outer_ref_condition": "bloomdb.bloomfilter.filter_data_id is not null",
        "table": {
          "table_name": "filterdata",
          "access_type": "eq_ref",
          "possible_keys": ["PRIMARY"],
          "key": "PRIMARY",
          "key_length": "8",
          "used_key_parts": ["id"],
          "ref": ["bloomdb.bloomfilter.filter_data_id"],
          "r_loops": 1000,
          "rows": 1,
          "r_rows": 1,
          "r_table_time_ms": 545.0623699,
          "r_other_time_ms": 16.11708964,
          "filtered": 100,
          "r_filtered": 100
        }
      }
    }
  }
]
```
The bloommatch subquery is slow, taking ~500 ms, while the subquery for null checks took around 20 ms; row-by-row operations in the where clause slow the query down. Looking for alternatives using common table expressions.
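One possible shape for a CTE-based rewrite, sketched from the query quoted above (table and column names taken from that query; this is an illustration, not the committed change, and the argument order of bloommatch follows the original: search-term filter first, stored filter second):

```sql
-- Hoist the bloommatch call out of the outer where clause into a CTE that
-- collects the partition ids whose stored filter matches the search term,
-- then join the precomputed set instead of calling the UDF per output row.
WITH matched AS (
    SELECT bf.partition_id
    FROM bloomdb.bloomfilter bf
    JOIN bloomFiltersTable ft
      ON ft.type_id = bf.filter_type_id
     AND ft.term_id = 0
    WHERE bloommatch(ft.filter, bf.filter) = TRUE
)
SELECT lf.id
FROM journaldb.logfile lf
LEFT JOIN bloomdb.bloomfilter bf ON lf.id = bf.partition_id
LEFT JOIN matched m ON m.partition_id = lf.id
WHERE bf.partition_id IS NULL      -- no filter stored: keep the logfile
   OR m.partition_id IS NOT NULL;  -- filter stored and it matched
```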
Article about UDFs doing row-by-row checks: https://www.itprotoday.com/sql-server/beware-row-row-operations-udf-clothing
Tested different compression algorithms in MariaDB; the lz4 algorithm was faster than the default zlib.
zlib: 2889 ms, lz4: 1285 ms
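Assuming this refers to InnoDB page compression, the algorithm is switched server-wide and tables opt in individually (MariaDB syntax; lz4 requires the server to be built with lz4 support):

```sql
-- Select the page compression algorithm used for newly written pages
SET GLOBAL innodb_compression_algorithm = 'lz4';

-- Tables must opt in to page compression explicitly
ALTER TABLE bloomdb.bloomfilter PAGE_COMPRESSED = 1;
```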
Changed the bloomfilter condition from an OR operation to a CASE expression to help query optimization with the null checks.
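A sketch of what that OR-to-CASE change looks like in the where clause, based on the query quoted earlier in this thread (illustrative, not the exact committed code):

```sql
-- before: a disjunction that forces the UDF subquery to be considered per row
--   bloommatch((select ...), "bloomdb"."bloomfilter"."filter") = true
--     or "bloomdb"."bloomfilter"."filter" is null

-- after: short-circuit the cheap null check before calling the UDF
CASE
    WHEN "bloomdb"."bloomfilter"."filter" IS NULL THEN TRUE
    ELSE bloommatch(
        (SELECT "bloomFiltersTable"."filter"
         FROM "bloomFiltersTable"
         WHERE term_id = 0
           AND type_id = "bloomdb"."bloomfilter"."filter_type_id"),
        "bloomdb"."bloomfilter"."filter")
END
```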
Possibly solved with #30
**Describe the bug** bloomfilter use decreases the performance of the search.
The environment has these filters and their counts, fpp being 0.01:
**Expected behavior** bloomfilter use should increase the performance of the search.
**How to reproduce** create the filters, have >400M rows in the logfile table.
**Software version** 2.3.0