Open carlokok opened 2 years ago
If you adding 'limit 0' on the inigal query, seem you JUST want the results of the "facet" part anyway. So why not just issue it all in a single GROUP BY ? What will probably be faster than using FACET (as still computes most of the first query, even with LIMIT 0)
SELECT GROUPBY(), id, COUNT(DISTINCT resortId) FROM a2 WHERE {somecondition} GROUP BY amenities.regions
That I think should get you same data as the output of the FACET would.
That gives:
SQL Fout (1064): index a2: group-count-distinct attribute 'resortId' not found
The big thing with my original query seems to be that I have no way to tell manticore that "amenities.regions" is the same for all the records with the same resortId, and it has to go over all 6 million records (guessing here). Getting the right result out of manticore is not a problem of course.
That weird, you were able to do "group 1 by resortId" in the first query, so it should be an attribute.
But otherwise, ok yes, see what you mean. Using the first query to just get 'one row per resort' and then grouping that.
Either way seems like would still be going through all 6M rows, just split over two queries. I guess the first query might be relatively fast because max_matches means it doesn't actually need to store the entire resultset.
If you still looking for a workaround until/if Manticore has a way to putting ids into a uservar, would suggest creating a 'pre grouped' index, have an index that does just have one row per resort, so can group directly on it.
Otherwise, potentially, I created a UDF that might help https://github.com/geograph-project/sphinx-plugins/ IN theory
SELECT GROUPBY(), id, COUNT(*), WITHINFIRSTX(resortId,1) AS f FROM a2 WHERE {somecondition} AND f=1 GROUP BY amenities.regions
Originally created for sphinx, but work with manticore too! The WITHINFIRSTX function would only return 1 for exactly one row, so prefilters the group by. Although looks like
➤ Aleksey N. Vinogradov commented:
btw, look at test 66, it might be the feature you looking for.
Hmm, are you meaning option store='@foo'
?
Doesnt seem to work on SELECT queries
RT.production>select id from os_gaz limit 10 option store='@foo';
ERROR 1064 (42000): unknown option 'store'
Otherwise set global @foo=(5,6);
doesnt really seem to help. Still have to transfer the resultset back to the client, to then resend the ids back to searchd. Just splitting it into two parts.
➤ Aleksey N. Vinogradov commented:
yes, it is about 'option store'. It's kind of experimental and works only with 'delete' (which btw effectively excludes all kind of filters other than 'by id'). With the option there is no actual 'delete', but prepared set of documents is stored 'by id' into a variable. Then you can perform your second queries filtering by the variable (look again onto test 66).
The gray areas here is 1-st, a bit unusual syntax (and this is because sphinxql parser became quite complex, and it is quite often not immediately possible to implement any desired syntax without a lot of side effects. Using 'option' was the most plain way to do so). And 2-nd, you need to terminate the variable, as their lifetime were never specially managed (that is, if filter gives you 2 billions of id's, and you've finished with them - they're still in memory and wastes space).
➤ Stan commented:
maybe worth to allow such syntax only in multi-query and lifetime will be the session time or only during handling of multi-query
ok so suggesting
delete from a2 where {somecondition} option store="@foo"
which will store the IDs in @foo, which not actually perform the delete (!?!) then
select id from a2 where id in (@foo) limit 0 facet amenities.regions
as the real query. Finally
set global @foo = (0);
The 'unset' all the values. Couldnt set to empty, had to provide one value. Which is a slow memory "leak"!
Seems to work as such, BUT, in the context of the original question, the initial select has a GROUP BY (to get distinct rows per resortId attribute) ... not going to be able to use this fake DELETE for that as doesnt support GROUP BY (AFAIK), nor expressions to be able to use a UDF?
Maybe I'm misunderstanding, but wouldn't adding support for something like:
select id from a2 where {somecondition} group 1 by resortId limit 5000 option store="sessionvar"
a relatively simple option inline with existing features?
What's the status of this? Really interested to avoid unneccessary roundtrips of (sometimes big too) data packets between client and searchd we currently have.
This works:
mysql> drop table if exists t; create table t(f text); insert into t values(1, 'a'),(2, 'a'),(3, 'c'); delete from t where match('a') option store='@ids'; select * from t where id in @ids;
--------------
drop table if exists t
--------------
Query OK, 0 rows affected (0.00 sec)
--------------
create table t(f text)
--------------
Query OK, 0 rows affected (0.01 sec)
--------------
insert into t values(1, 'a'),(2, 'a'),(3, 'c')
--------------
Query OK, 3 rows affected (0.00 sec)
--------------
delete from t where match('a') option store='@ids'
--------------
Query OK, 0 rows affected (0.00 sec)
--------------
select * from t where id in @ids
--------------
+------+------+
| id | f |
+------+------+
| 1 | a |
| 2 | a |
+------+------+
2 rows in set (0.00 sec)
--- 2 out of 2 results in 0ms ---
It looks ugly that you use delete
when nothing is actually being deleted. Unfortunately, this task has never been a priority. PRs are welcome!
Is your feature request related to a problem? Please describe.
There are quite a few cases, especially with facets where it would be really good to store the result of a query in a (session) variable to query on. For example when first narrowing down to groups and then doing facets (since facets currently run on the original query)
Describe the solution you'd like
Something like:
Describe alternatives you've considered
Running facets on the result of a subquery (currently the facet runs on the inside of the query):
select id from (select id from a2 where {somecondition} group 1 by resortId limit 5000) facet amenities.regions
This returns the wrong counts because it runs on the inside, not the outside.
Additional context
I'm currently doing:
select id from a2 where {somecondition} group 1 by resortId limit 5000
(create a new query with the ids from above query): ->
select id from a2 where id in (1514968144571714605,1514968144568776334,...) limit 0 facet amenities.regions
This runs in .2 sec vs 3 sec for:
with 6 million records.