Set variable to result of query

carlokok commented 2 years ago

Is your feature request related to a problem? Please describe.
There are quite a few cases, especially with facets where it would be really good to store the result of a query in a (session) variable to query on. For example when first narrowing down to groups and then doing facets (since facets currently run on the original query)

Describe the solution you'd like

Something like:

set sessionVar = select id from a2  where {somecondition} group 1 by resortId limit 5000  
select id from a2  where id in (@sessionVar) limit 0 facet amenities.regions

Describe alternatives you've considered

Running facets on the result of a subquery (currently the facet runs on the inside of the query):
select id from (select id from a2 where {somecondition} group 1 by resortId limit 5000) facet amenities.regions
This returns the wrong counts because it runs on the inside, not the outside.

Additional context
I'm currently doing:
select id from a2 where {somecondition} group 1 by resortId limit 5000
(create a new query with the ids from above query): ->
select id from a2 where id in (1514968144571714605,1514968144568776334,...) limit 0 facet amenities.regions

This runs in .2 sec vs 3 sec for:

select id from a2  where {somecondition} limit 0 facet amenities.regions distinct resortId

with 6 million records.

barryhunter commented 2 years ago

If you adding 'limit 0' on the inigal query, seem you JUST want the results of the "facet" part anyway. So why not just issue it all in a single GROUP BY ? What will probably be faster than using FACET (as still computes most of the first query, even with LIMIT 0)

SELECT GROUPBY(), id, COUNT(DISTINCT resortId) FROM a2 WHERE {somecondition} GROUP BY amenities.regions

That I think should get you same data as the output of the FACET would.

carlokok commented 2 years ago

That gives:

SQL Fout (1064): index a2: group-count-distinct attribute 'resortId' not found

The big thing with my original query seems to be that I have no way to tell manticore that "amenities.regions" is the same for all the records with the same resortId, and it has to go over all 6 million records (guessing here). Getting the right result out of manticore is not a problem of course.

barryhunter commented 2 years ago

That weird, you were able to do "group 1 by resortId" in the first query, so it should be an attribute.

But otherwise, ok yes, see what you mean. Using the first query to just get 'one row per resort' and then grouping that.

Either way seems like would still be going through all 6M rows, just split over two queries. I guess the first query might be relatively fast because max_matches means it doesn't actually need to store the entire resultset.

If you still looking for a workaround until/if Manticore has a way to putting ids into a uservar, would suggest creating a 'pre grouped' index, have an index that does just have one row per resort, so can group directly on it.

Otherwise, potentially, I created a UDF that might help https://github.com/geograph-project/sphinx-plugins/ IN theory

 SELECT GROUPBY(), id, COUNT(*), WITHINFIRSTX(resortId,1) AS f FROM a2 WHERE {somecondition} AND f=1 GROUP BY amenities.regions

Originally created for sphinx, but work with manticore too! The WITHINFIRSTX function would only return 1 for exactly one row, so prefilters the group by. Although looks like

githubmanticore commented 2 years ago

➤ Aleksey N. Vinogradov commented:

btw, look at test 66, it might be the feature you looking for.

barryhunter commented 2 years ago

Hmm, are you meaning option store='@foo' ?

Doesnt seem to work on SELECT queries

RT.production>select id from os_gaz limit 10 option store='@foo';
ERROR 1064 (42000): unknown option 'store'

Otherwise set global @foo=(5,6); doesnt really seem to help. Still have to transfer the resultset back to the client, to then resend the ids back to searchd. Just splitting it into two parts.

githubmanticore commented 2 years ago

➤ Aleksey N. Vinogradov commented:

yes, it is about 'option store'. It's kind of experimental and works only with 'delete' (which btw effectively excludes all kind of filters other than 'by id'). With the option there is no actual 'delete', but prepared set of documents is stored 'by id' into a variable. Then you can perform your second queries filtering by the variable (look again onto test 66).

The gray areas here is 1-st, a bit unusual syntax (and this is because sphinxql parser became quite complex, and it is quite often not immediately possible to implement any desired syntax without a lot of side effects. Using 'option' was the most plain way to do so). And 2-nd, you need to terminate the variable, as their lifetime were never specially managed (that is, if filter gives you 2 billions of id's, and you've finished with them - they're still in memory and wastes space).

githubmanticore commented 2 years ago

➤ Stan commented:

maybe worth to allow such syntax only in multi-query and lifetime will be the session time or only during handling of multi-query

barryhunter commented 2 years ago

ok so suggesting

delete from a2 where {somecondition} option store="@foo"

which will store the IDs in @foo, which not actually perform the delete (!?!) then

select id from a2 where id in (@foo) limit 0 facet amenities.regions

as the real query. Finally

set global @foo = (0);

The 'unset' all the values. Couldnt set to empty, had to provide one value. Which is a slow memory "leak"!

Seems to work as such, BUT, in the context of the original question, the initial select has a GROUP BY (to get distinct rows per resortId attribute) ... not going to be able to use this fake DELETE for that as doesnt support GROUP BY (AFAIK), nor expressions to be able to use a UDF?

carlokok commented 2 years ago

Maybe I'm misunderstanding, but wouldn't adding support for something like:

select id from a2 where {somecondition} group 1 by resortId limit 5000 option store="sessionvar"

a relatively simple option inline with existing features?

lenar commented 4 days ago

What's the status of this? Really interested to avoid unneccessary roundtrips of (sometimes big too) data packets between client and searchd we currently have.

sanikolaev commented 1 day ago

This works:

mysql> drop table if exists t; create table t(f text); insert into t values(1, 'a'),(2, 'a'),(3, 'c'); delete from t where match('a') option store='@ids'; select * from t where id in @ids;
--------------
drop table if exists t
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
create table t(f text)
--------------

Query OK, 0 rows affected (0.01 sec)

--------------
insert into t values(1, 'a'),(2, 'a'),(3, 'c')
--------------

Query OK, 3 rows affected (0.00 sec)

--------------
delete from t where match('a') option store='@ids'
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
select * from t where id in @ids
--------------

+------+------+
| id   | f    |
+------+------+
|    1 | a    |
|    2 | a    |
+------+------+
2 rows in set (0.00 sec)
--- 2 out of 2 results in 0ms ---

It looks ugly that you use delete when nothing is actually being deleted. Unfortunately, this task has never been a priority. PRs are welcome!

manticoresoftware / manticoresearch

Set variable to result of query #875