PQ Optimization: Common Resultset

AbstractiveNord commented 1 year ago

Is your feature request related to a problem? Please describe. With percolate search, a common pattern is a large number of identical or highly similar full-text queries that differ only in their attribute filters. Currently, ManticoreSearch does not reuse an already calculated full-text result set, which wastes compute resources and hurts usability, especially if your system uses routing based on doc IDs or tags.

Describe the solution you'd like: Implement a common result set cache so that data can be reused by other percolate queries when percolate rules have an identical full-text query, or when one rule's full-text query is a subset of another's.

Describe alternatives you've considered: Inefficient use of compute resources in the highlighted cases.

Additional context Cases:

Identical full-text queries that differ only in the attribute search.
50% of the queries share an identical full-text query, and the other queries could use the already filtered data. Usually this is the case when the main (50%) queries filter on one text field, and the other queries filter on that field plus another one.

Related discussion

AbstractiveNord commented 1 year ago

Addendum: ManticoreSearch could probably do some preprocessing of the queries, such as merging identical full-text queries into a single one and reordering queries by their full-text operators (queries with fewer full-text filters could produce data that the other queries reuse instead of recalculating it).
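A minimal sketch of the "merge identical full-text queries" idea, written as a toy model in Python. It is not ManticoreSearch code: the rule tuples, the `full_text_match` helper (a plain all-terms-present check), and the `percolate` function are hypothetical names introduced only for illustration.

```python
from collections import defaultdict

# Hypothetical percolate rules: (rule_id, full_text_query, attribute_filter).
RULES = [
    (1, "error timeout", lambda doc: doc["tag"] == "billing"),
    (2, "error timeout", lambda doc: doc["tag"] == "auth"),
    (3, "error timeout", lambda doc: doc["priority"] > 5),
    (4, "disk full", lambda doc: doc["tag"] == "billing"),
]


def full_text_match(query, text):
    """Toy matcher (assumption): every query term must occur in the text."""
    words = set(text.lower().split())
    return all(term in words for term in query.lower().split())


def percolate(doc, rules, counter=None):
    """Return ids of matching rules, evaluating each distinct full-text query once."""
    # Group rules by their full-text part so identical queries share one evaluation.
    groups = defaultdict(list)
    for rule_id, query, attr_filter in rules:
        groups[query].append((rule_id, attr_filter))

    matched = []
    for query, members in groups.items():
        if counter is not None:
            counter["full_text_evals"] += 1
        if not full_text_match(query, doc["text"]):
            continue  # every rule sharing this full-text query is rejected together
        # Only the cheap attribute filters run per rule.
        matched.extend(rule_id for rule_id, attr_filter in members if attr_filter(doc))
    return sorted(matched)


doc = {"text": "Error in gateway timeout handler", "tag": "auth", "priority": 7}
stats = {"full_text_evals": 0}
result = percolate(doc, RULES, stats)
print(result, stats)
```

In this toy setup, four rules need only two full-text evaluations, because three of them share the same full-text part.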

tomatolog commented 1 year ago

That could work, i.e. create some structure to speed up query processing like Luwak does. However, that structure needs to be rebuilt if the user changes the PQ often.

It is also worth making Common subtree optimization work in the PQ.

AbstractiveNord commented 1 year ago

Would that work include the Common Query Optimization technique?

tomatolog commented 1 year ago

No, PQ execution of CALL PQ does not use any of these optimizations.

AbstractiveNord commented 1 year ago

I don't get it: will common query optimization and common subtree optimization be implemented?

tomatolog commented 1 year ago

These optimizations do not work for the PQ index and the CALL PQ statement, but it could be easier to add them to the PQ index than to implement your feature, as they are already implemented for regular indexes and the code should work in general.

I just posted my suggestions for a possible implementation of your request.

tomatolog commented 9 months ago

@AbstractiveNord could you share your data that has multiple similar full-text queries that could benefit from the full-text optimization but currently perform slowly?

AbstractiveNord commented 9 months ago

In general, I have hundreds of queries whose full-text part is absolutely identical; only the filters differ. I will prepare some data and upload it to your S3 bucket.

sanikolaev commented 9 months ago

@AbstractiveNord

Pls ping us here when you are done with it.

AbstractiveNord commented 9 months ago

I plan to prepare the data this Saturday.

AbstractiveNord commented 9 months ago

Data is uploaded. Please report any problems or missing data.

tomatolog commented 9 months ago

Checking the cases you provided, I see that a query cache could be added without a lot of code changes. However, it will not work for your case, as it can speed up only queries whose full-text part matches between queries and whose filters either match or are a subset of the query that was already cached. That is not your case, as you have different filters and some cases have different parts of the full-text queries.

Common subtree optimization needs a large refactoring, as it needs batching of the queries to capture the common part and to reuse the result of sibling matching within the batch, but the code currently processes queries separately from the queue.

I'd estimate the change needed for Common subtree optimization at 20 to 40 hours, but it is still not clear whether it is applicable in the general case. As there could be different types of queries in a single batch, Common subtree optimization could have no effect there.

That could also be fixed by sorting the queries when a new query is inserted, to keep similar queries together. However, keeping the query list sorted for every insert and delete operation could also slow down data population into the PQ index.
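A rough illustration of keeping stored queries sorted on insert and delete, using Python's standard `bisect` module. This is a toy model, not the PQ index implementation: the `SortedRuleList` class and the `normalize` sort key (lower-cased terms in sorted order, which ignores term order and operators) are assumptions made only for this sketch.

```python
import bisect


def normalize(query):
    """Hypothetical sort key: lower-cased query terms in sorted order."""
    return tuple(sorted(query.lower().split()))


class SortedRuleList:
    """Keeps (key, rule_id, query) entries ordered so similar queries are adjacent."""

    def __init__(self):
        self._rules = []

    def insert(self, rule_id, query):
        # Binary search finds the slot, but the list insert itself shifts elements (O(n)).
        bisect.insort(self._rules, (normalize(query), rule_id, query))

    def delete(self, rule_id, query):
        entry = (normalize(query), rule_id, query)
        i = bisect.bisect_left(self._rules, entry)
        if i < len(self._rules) and self._rules[i] == entry:
            del self._rules[i]
            return True
        return False

    def queries(self):
        return [query for _, _, query in self._rules]


rules = SortedRuleList()
rules.insert(1, "timeout error")
rules.insert(2, "disk full")
rules.insert(3, "error timeout")
rules.insert(4, "full disk")
print(rules.queries())
```

Queries with the same terms end up next to each other regardless of insertion order. The per-operation shifting cost in the list is the kind of overhead on insert and delete that the comment above refers to.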

AbstractiveNord commented 9 months ago

It would also be nice to speed up full-text queries without attribute filtering, though. Meanwhile, a low insertion speed for PQ rules is not a problem at all. Thanks for the detailed answer.

tomatolog commented 9 months ago

Another approach is to use the query cache but to run the query the first time without any filters, to collect only the matched doclist, and then use this cached query for all stored PQ queries with different filter settings. However, that also adds a dry run for most queries and needs an analyzer to make sure full-text passes are not run for queries that have no parts in common with the other stored queries.

The analyzer could

check sequential queries and make a full-text dry run only if siblings have a common part
reorder queries on insert to make sure similar full-text queries are grouped together
add some background similarity check
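The first analyzer step could be sketched roughly as follows. This is a toy model under stated assumptions, not ManticoreSearch code: stored queries are represented as hypothetical `(rule_id, full_text)` pairs, "common part" is simplified to an identical full-text string, and `plan_dry_runs` is an illustrative name.

```python
from itertools import groupby


def plan_dry_runs(queries):
    """For each run of adjacent queries with the same full-text part, return
    (full_text, rule_ids, use_dry_run). A dry run only pays off for runs of 2+."""
    plan = []
    # groupby only merges *consecutive* items, mirroring a sequential scan of the queue.
    for full_text, run in groupby(queries, key=lambda q: q[1]):
        ids = [rule_id for rule_id, _ in run]
        plan.append((full_text, ids, len(ids) > 1))
    return plan


stored = [
    (1, "error timeout"),
    (2, "error timeout"),
    (3, "disk full"),
    (4, "error timeout"),
]
plan = plan_dry_runs(stored)
for step in plan:
    print(step)
```

Rule 4 shares its full-text part with rules 1 and 2 but is not adjacent to them, so a purely sequential check misses it. That is the situation the reordering-on-insert and background similarity check items are meant to cover.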

manticoresoftware / manticoresearch

PQ Optimization: Common Resultset #1568