teragrep / pth_06

Teragrep Datasource for Apache Spark
GNU Affero General Public License v3.0

Term accelerated searches using bloomfilter #30

Open elliVM opened 5 months ago

elliVM commented 5 months ago

Allow searches for a string pattern to be accelerated without using a global bloom filter.

elliVM commented 5 months ago

Added a pattern table to bloomdb that can be used to select a specific filter. Each pattern is stored as its own bloom filter byte array; when a search term is found in the saved patterns (using the bloommatch UDF), bloom search is activated using the filter of that pattern.

For simplicity, when a filter is created it is assigned a single pattern that the search term is matched against. Later this can be changed to multiple patterns per filter and vice versa.
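A minimal sketch of the idea above, not the bloomdb implementation. It assumes that the bloommatch UDF behaves like a bitwise subset test between two equally sized filters; `make_filter`, `bloommatch`, the filter size and the hash scheme are all hypothetical stand-ins chosen for illustration.

```python
import hashlib

SIZE_BITS = 1024   # hypothetical filter size
NUM_HASHES = 3     # hypothetical number of hash functions


def make_filter(tokens):
    """Build a Bloom filter byte array containing the given tokens."""
    bits = bytearray(SIZE_BITS // 8)
    for token in tokens:
        for i in range(NUM_HASHES):
            # Salt a SHA-256 digest with the hash index to get a bit position.
            digest = hashlib.sha256(f"{i}:{token}".encode()).digest()
            pos = int.from_bytes(digest[:8], "big") % SIZE_BITS
            bits[pos // 8] |= 1 << (pos % 8)
    return bytes(bits)


def bloommatch(term_filter, stored_filter):
    """Assumed semantics: every bit set in term_filter is set in stored_filter."""
    if len(term_filter) != len(stored_filter):
        return False
    return all((t & s) == t for t, s in zip(term_filter, stored_filter))


# One pattern per filter: the saved patterns are themselves stored as a filter.
saved_patterns = make_filter(["c3468f80-4273-4867-9b66-3f470787c365"])


def use_bloom_search(search_term):
    """Activate bloom search only when the term may be among the saved patterns."""
    return bloommatch(make_filter([search_term]), saved_patterns)
```

A term that was saved always matches, because a Bloom filter has no false negatives; an unrelated term is usually rejected but can occasionally produce a false positive.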

elliVM commented 5 months ago

Changing to support multiple patterns per filter

elliVM commented 5 months ago

Implemented a schema with a pattern table and a junction table between patterns and filters. The condition walker selects only filters whose pattern matches the search term and runs the bloommatch UDF against temp table filters generated from the filter types and the search term.
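The many-to-many layout described above could look roughly like the following sketch. It uses an in-memory SQLite database purely for illustration; the table names (`bloom_filter`, `pattern`, `filter_pattern`), the column names and the sample regexes are assumptions, not the actual bloomdb schema.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite has no built-in REGEXP; "X REGEXP Y" is evaluated as regexp(Y, X).
conn.create_function(
    "regexp", 2, lambda pattern, value: re.search(pattern, value) is not None
)

conn.executescript("""
CREATE TABLE bloom_filter (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE pattern (id INTEGER PRIMARY KEY, regex TEXT NOT NULL UNIQUE);
-- Junction table: a filter may have many patterns and vice versa.
CREATE TABLE filter_pattern (
    filter_id  INTEGER NOT NULL REFERENCES bloom_filter(id),
    pattern_id INTEGER NOT NULL REFERENCES pattern(id),
    PRIMARY KEY (filter_id, pattern_id)
);
""")
conn.executemany(
    "INSERT INTO bloom_filter VALUES (?, ?)",
    [(1, "uuid_filter"), (2, "ip_filter")],
)
conn.executemany(
    "INSERT INTO pattern VALUES (?, ?)",
    [
        (1, r"^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$"),
        (2, r"^\d{1,3}(\.\d{1,3}){3}$"),
        (3, r"^[0-9A-F]{32}$"),
    ],
)
conn.executemany(
    "INSERT INTO filter_pattern VALUES (?, ?)", [(1, 1), (1, 3), (2, 2)]
)


def filters_for(term):
    """Return the names of filters with at least one pattern matching the term."""
    rows = conn.execute(
        """
        SELECT DISTINCT f.name
        FROM bloom_filter AS f
        JOIN filter_pattern AS fp ON fp.filter_id = f.id
        JOIN pattern AS p ON p.id = fp.pattern_id
        WHERE ? REGEXP p.regex
        ORDER BY f.name
        """,
        (term,),
    ).fetchall()
    return [name for (name,) in rows]
```

Only the filters returned by `filters_for` would then need the more expensive bloommatch check.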

Next:

elliVM commented 5 months ago

Changes to be made:

elliVM commented 5 months ago

Testing version with pattern matching against tokenized search terms

elliVM commented 4 months ago

New changes to be made

elliVM commented 4 months ago

Created a new walker that finds all dynamic bloomfilter tables that have a pattern match with the tokenized search term; it will be used to select the tables to join with the main query. (Combined with Condition Walker)
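As a rough illustration of the table selection step, the sketch below tokenizes a search term and keeps the tables whose pattern matches at least one token. The tokenizer is a simplified stand-in (splitting on whitespace and a few delimiters), and the table names and patterns are hypothetical.

```python
import re

# Hypothetical mapping from dynamic bloomfilter table name to its pattern.
TABLE_PATTERNS = {
    "pattern_uuid": r"[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}",
    "pattern_ipv4": r"\d{1,3}(\.\d{1,3}){3}",
}


def tokenize(search_term):
    """Split on whitespace and a few delimiters (assumed, simplified rule set)."""
    return [tok for tok in re.split(r"[\s,;=()\[\]]+", search_term) if tok]


def matching_tables(search_term):
    """Return the tables whose pattern fully matches any token of the term."""
    tokens = tokenize(search_term)
    return sorted(
        table
        for table, pattern in TABLE_PATTERNS.items()
        if any(re.fullmatch(pattern, tok) for tok in tokens)
    )
```

Tables that match no token are left out of the join entirely, so a search term without a matching pattern falls back to an unaccelerated query.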

elliVM commented 4 months ago

Created classes to hold dynamic tables and temp tables

elliVM commented 4 months ago

Internal PR

elliVM commented 3 months ago

Updates to the filtertype table: pattern varchar length increased to 2048 and pattern added to the unique composite index.
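A sketch of what adding the pattern to a composite unique index implies, using SQLite for illustration only. The other columns in the index (`expected_elements`, `target_fpp`) are assumptions about the filtertype table, and SQLite accepts `VARCHAR(2048)` as a type name without enforcing the length.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE filtertype (
    id INTEGER PRIMARY KEY,
    expected_elements INTEGER NOT NULL,   -- assumed column
    target_fpp REAL NOT NULL,             -- assumed column
    pattern VARCHAR(2048) NOT NULL,
    UNIQUE (expected_elements, target_fpp, pattern)
)
""")


def add_filtertype(expected_elements, target_fpp, pattern):
    """Insert a filter type; return False if it violates the unique index."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO filtertype (expected_elements, target_fpp, pattern) "
                "VALUES (?, ?, ?)",
                (expected_elements, target_fpp, pattern),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

With the pattern in the index, two filter types of the same size can coexist as long as their patterns differ, while an exact duplicate is rejected.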

elliVM commented 2 months ago

Fixed issues with filter size selection in temp tables generated for bloommatch condition. Limited tokenizers to use only major tokens to match with dpf_03.

Working in QA with working filtering (pth-07 5.3.0-22-gbd5da88a)

Test example: index=alert_examples earliest=-999d "c3468f80-4273-4867-9b66-3f470787c365"

Without bloom the query took 16-18 s; with bloom it took 3-6 s.

elliVM commented 1 month ago

Fixing an issue where table pattern match filtering from metadata was fetching the whole table into Java memory; limited the fetch to check only 1 row and only the PK field.
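The fix above amounts to an existence check. A minimal sketch follows, again with SQLite as a stand-in and with hypothetical table and column names: it selects only the primary key and stops at the first matching row, so the filter byte arrays are never transferred.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite has no built-in REGEXP; "X REGEXP Y" is evaluated as regexp(Y, X).
conn.create_function(
    "regexp", 2, lambda pattern, value: re.search(pattern, value) is not None
)
conn.execute(
    "CREATE TABLE pattern_uuid ("
    "id INTEGER PRIMARY KEY, pattern TEXT NOT NULL, filter_bytes BLOB NOT NULL)"
)
UUID_REGEX = r"^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$"
conn.executemany(
    "INSERT INTO pattern_uuid (pattern, filter_bytes) VALUES (?, ?)",
    [(UUID_REGEX, b"\x00" * 1024)] * 3,
)


def table_matches(table, term):
    """Check for a pattern match by fetching at most one primary key value."""
    # The table name cannot be bound as a parameter, so it must come from
    # trusted metadata rather than from user input.
    row = conn.execute(
        f'SELECT id FROM "{table}" WHERE ? REGEXP pattern LIMIT 1', (term,)
    ).fetchone()
    return row is not None
```

The result is a single boolean per table, regardless of how many rows or how large the stored filters are.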

elliVM commented 1 month ago

Duplicate rows appear on multiple pattern matches when multiple tables are joined; testing a fix using group by logfile.id.

Update: group by is too slow. The false positives may be caused by a null-on-null bloommatch check when a joined pattern match table has no matching logfiles for the index.

elliVM commented 1 month ago

A null check after the bloommatch condition for bloom filters fixed the duplicate issues and sped up queries with multiple joined tables.

elliVM commented 1 month ago

Fixed bug with multiple search terms, tested and working in QA.

elliVM commented 1 month ago
elliVM commented 1 month ago

Refactoring: move all tokenization to the PatternMatch class and all bloommatch condition generation steps to the BloomFilterTempTable class.

elliVM commented 1 month ago
elliVM commented 1 week ago

Will split the refactoring into another PR and implement the changes requested in review.