Extend query support in leveled

martinsumner commented 7 months ago

Currently in Riak/leveled query support is based on range queries with the application of regular expressions to allow filtering of additional attributes which have been concatenated to the term.

The following extensions are proposed:

Allow for a capture_filter where a conditions can be applied to a captured output, in particular >, >=, <, <= comparisons. These are very tricky to achieve in regular expressions - so instead capture in the regular expression and apply an additional capture filter to the captured output if it exists.
Extend the return options, so in the future there could be return_terms, return_count, return_count_unique and return_group_count. The return_count would return the count of matches (without de-duplication on key), whereas return_count_unique would return the count of unique key matches. The return_group_count would return the count by unique items in a given set of captured outputs (i.e. the regular expression could capture YearOfBirth and Status - and return the count by Status and YearOfBirth).
Allow for a pre-query, where the results returned from the main query are only those that have a key that is returned from the pre-query, and where the pre-query and main query are run on the same snapshot.

The dependency on regular expressions raises potential issues with regards to performance. Regular expression queries should be added to perf_SUITE. There is a proposal to change the erlang re library to google re2 in OTP 28, with some potentially significant performance benefits. It may be worth experimenting with a RE2 NIF in preparation.

martinsumner commented 7 months ago

https://github.com/martinsumner/leveled/pull/434 - extends performance testing to understand the impact of potential changes.

For regex queries, within this test the time to apply the regular expression dominates:

Profile regex_query:

FUNCTION                                          CALLS        %      TIME  [uS / CALLS]
--------                                          -----  -------      ----  [----------]
prim_file:pread_nif/3                              5938     0.51    109603  [     18.46]
leveled_penciller:'-find_nextkeys/6-fun-1-'/1   5111790     0.89    190579  [      0.04]
leveled_codec:'-accumulate_index/2-fun-2-'/6    5276501     1.77    380060  [      0.07]
leveled_codec:maybe_accumulate/5                5441603     1.87    401815  [      0.07]
maps:update_with/3                              5276892     2.09    449656  [      0.09]
zstd:quick_decompress/1                          105536     4.66   1002559  [      9.50]
erlang:binary_to_term/1                          105536     8.25   1775847  [     16.83]
leveled_penciller:find_nextkeys/6              22701169     9.57   2059858  [      0.09]
re:run/2                                        5276501    67.90  14619748  [      2.77]
---------------------------------------------  --------  -------  --------  [----------]
Total:                                         52969030  100.00%  21531323  [      0.41]

The regex query test will filter out and return 1:170 results. On M1 Mac:

Fetch of 125073 index entries by regex in 423 queries took 31764 ms

Thats an average of 75ms per query to return about 300 results having scanned 50K entries.

martinsumner commented 6 months ago

Comparison if re2 is plugged into leveled:

Profile regex_query:

FUNCTION                                          CALLS        %      TIME  [uS / CALLS]
--------                                          -----  -------      ----  [----------]
leveled_penciller:'-find_nextkeys/6-fun-1-'/1   4986634     1.96    196044  [      0.04]
leveled_codec:'-accumulate_index/2-fun-2-'/6    5147292     3.70    370345  [      0.07]
leveled_codec:maybe_accumulate/5                5308330     4.04    404005  [      0.08]
maps:update_with/3                              5147672     4.25    425160  [      0.08]
zstd:quick_decompress/1                          102708    10.27   1027339  [     10.00]
erlang:binary_to_term/1                          102708    17.00   1700697  [     16.56]
leveled_penciller:find_nextkeys/6              20524284    18.39   1839717  [      0.09]
re2:run/2                                       5147292    33.67   3368657  [      0.65]
---------------------------------------------  --------  -------  --------  [----------]
Total:                                         49960989  100.00%  10003893  [      0.20]

So re2 at 0.65 microseconds per call is 75% faster than PCRE regex in OTP 26.

Fetch of 125252 index entries by regex in 428 queries took 20453 ms

Thats an average of 48ms per query to return about 300 results having scanned 50K entries.

martinsumner commented 6 months ago

https://github.com/martinsumner/leveled/pull/435 - switches to use a NIF of. the re2 library taken from https://github.com/dukesoferl/re2

martinsumner / leveled

Extend query support in leveled #433