Closed wabzqem closed 3 years ago
One other observation: if I change the matches part of the WHERE clause to match the single group_doc_id value where I expect more hits to be shown, I do get more hits (3, as max is 3, and it tells me the count() is 4, which is accurate). The relevance values for those hits are, in order, 36.27, 0.25966, 0.2242.
Here's a search result with tracelevel 5: https://gist.github.com/wabzqem/3fec158f1ed8267d3fdbb683b1a8e6c1
I'd expect the very first group here to have 2 hits in it, but only 1 is returned and it has a count() of 1. The first has relevance value of 46.2 (which shows up) and second (if it showed) has relevance 8.12. Only the first currently shows. If I alter the query to match the exact group_doc_id value and leave the grouping as is, i.e.:
{
  "ranking": {
    "profile": "top_news"
  },
  "hits": 0,
  "select": "all(group(group_doc_id) max(5) order(-avg(relevance())) each(output(count()) max(3) each(output(summary()))))",
  "firstpubtime": "1615242362",
  "yql": "select * from sources newsarticle WHERE firstpubtime > @firstpubtime and group_doc_id matches \"^id:newsarticle:newsarticle::ac9710d4d85865dbf870bc14684463ebd2c809010c87ba76413f05da6a0957f7\";"
}
Then two articles show up in this group (as expected).
I deployed my application to vespa cloud (source: https://github.com/ausnews/ausnews-search/tree/vespa-cloud/search-engine-app) after removing the document processor, and fed it all the data from my production vespa install (JSON data, approx. 30k docs, can be found here and fed directly with vespa-http-client-jar-with-dependencies.jar: http://www.whatsbeef.net/wabz/data.tar.gz).
curl --cert ./data-plane-public-cert.pem --key ./data-plane-private-key.pem -s "https://text-search.wabzqem.ausnews.ausnews.aws-us-east-1c.dev.public.vespa.oath.cloud/search/?ranking.profile=top_news&hits=0&firstpubtime=1615360444&select=all(group(group_doc_id)+max(15)+order(-avg(relevance()))+each(output(count())+max(3)+each(output(summary()))))&yql=select+*+from+sources+newsarticle+WHERE+firstpubtime+%3E+%40firstpubtime+and+group_doc_id+matches+%22%5Eid%22%3B" | jq '.' | less
The groups here have results as expected (as with docker), I don't see the dropped group hits that I'm seeing from my vespa install.
I might just try blowing it all away and starting with a fresh cluster and feeding it up to date data to see if that fixes it, but it would be nice to understand what I might have done here to make things go awry. Could low memory cause this?
In your private instance, could you increase the timeout and see if you reproduce? &timeout=5s
Thanks - I've just tried with the timeout parameter, it still reproduces. With presentation.timing=true, searchTime is about twice as fast as my docker instance, and the same as what I see on vespa cloud ("searchtime": 0.015).
Bizarrely, if I remove the firstpubtime > @firstpubtime from the WHERE clause, I get all the hits in each group that I'd expect. This actually works out okay, because the rank profile has a very low freshness.maxAge which pretty much does the same thing. I'd love to understand what is going on here, it's been doing my head in 🤯 :) The hits that are removed definitely do have a firstpubtime attribute value greater than the query parameter given.
Sorry, another update. The above definitely improved things, but I then still found instances of groups which didn't have as many hits as I expected. However, I think I have found that having more than a single content node causes this - I'm able to reproduce on vespa cloud. I've fed the same dataset to a dev (single content node) and prod (2 content nodes) instance, and I get different group count()s on the same query. Ordering is slightly different too, as ranking is affected:
export QUERY="presentation.timing=true&ranking.profile=top_news&hits=0&select=all(group(group_doc_id)+max(15)+order(-avg(relevance()))+each(output(count())+max(3)+each(output(summary()))))&yql=select+headline+from+sources+newsarticle+WHERE+group_doc_id+matches+%22%5Eid%22+AND+firstpubtime+%3E+1615591675%3B"
rnelson@wabz vespa-cloud % curl --cert ./data-plane-public-cert.pem --key ./data-plane-private-key.pem -s "https://text-search.ausnews.ausnews.aws-us-east-1c.public.vespa.oath.cloud/search/?$QUERY" | jq '.root.children[].children[].children[].fields["count()"]'
2
2
2
2
1
2
11
3
3
2
3
2
1
1
2
rnelson@wabz vespa-cloud % curl --cert ./data-plane-public-cert.pem --key ./data-plane-private-key.pem -s "https://text-search.wabzqem.ausnews.ausnews.aws-us-east-1c.dev.public.vespa.oath.cloud/search/?$QUERY" | jq '.root.children[].children[].children[].fields["count()"]'
2
2
3
2
4
3
2
2
5
2
1
28
2
20
2
Then, I stopped services on one of my kubernetes content nodes (I now have redundancy=2), waited a bit for indexing, and I got the results that I expect. The single content node appears to give more, and accurate group hit counts.
Is this expected behaviour?
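For anyone wanting to repeat the comparison above without eyeballing two jq listings, here is a minimal Python sketch. It walks the same path as the jq filter (root.children[].children[].children[].fields["count()"]) and keys each count by group so that the differing group order between the two instances does not matter. The "value" key used to identify a group and the two sample responses are assumptions made up for illustration, not output from the real clusters.

```python
from typing import Any, Dict, Tuple


def counts_by_group(response: Dict[str, Any]) -> Dict[str, int]:
    """Collect count() per group, following the path used by the jq filter.

    Unlike jq, missing "children" keys are skipped instead of raising.
    The "value" key as group identifier is an assumption of this sketch.
    """
    counts: Dict[str, int] = {}
    for group_root in response["root"].get("children", []):
        for group_list in group_root.get("children", []):
            for group in group_list.get("children", []):
                counts[group["value"]] = group["fields"]["count()"]
    return counts


def count_mismatches(a: Dict[str, Any], b: Dict[str, Any]) -> Dict[str, Tuple[int, int]]:
    """Return {group: (count_in_a, count_in_b)} for groups present in both
    responses whose count() differs."""
    ca, cb = counts_by_group(a), counts_by_group(b)
    return {g: (ca[g], cb[g]) for g in ca if g in cb and ca[g] != cb[g]}


def _response(groups: Dict[str, int]) -> Dict[str, Any]:
    """Build a tiny hand-made response with the nesting the jq path implies."""
    children = [{"value": g, "fields": {"count()": c}} for g, c in groups.items()]
    return {"root": {"children": [{"children": [{"children": children}]}]}}


# Hypothetical data: the same two groups as seen by two deployments.
two_nodes = _response({"a": 2, "b": 1})
single_node = _response({"a": 2, "b": 4})

print(count_mismatches(two_nodes, single_node))
```

In practice the two dictionaries would come from parsing the JSON bodies returned by the two curl commands above.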
No, this is not expected behaviour. I suspect this has to do with avg in relation with relevancy. When using relevancy in the order clause it will trigger an optimisation. However I suspect that this optimisation can not be done when using anything but min or max aggregator. I will verify that when I get back to work tomorrow and get back to you. Thanks a lot for investigating this.
Any update on this one? Right now I'm just running on one content node so as to get accurate results. Are there any other potential workarounds? Thanks!
It seems I dropped this one on the floor. Will pick it up again now.
Hello, appreciate this ranking issue may not be high in the priority order. But I'm left wondering if I should think of a different approach here or if there might be a fix at some point?
Again, thank you for open sourcing vespa, it's clearly an amazing piece of work.
I am starting with this one now. While I investigate the group merging code could you try to see what happens if you add precision(2-5 x of what you set for max()). See https://docs.vespa.ai/en/grouping.html#ordering-and-limiting-groups
Initial pass of the code did not confirm my suspicion that we incorrectly allowed an optimisation. I will continue to make a system test to reproduce. You should also try adding 'hint(singlepass)' to your grouping expression. Like "all(group(group_doc_id) hint(singlepass) max(5) order(-avg(relevance())).....". The latter is undocumented.
I’ve made this change: https://github.com/ausnews/ausnews-search/commit/dd50405d3a8fdc3f505c440bad5778eedbbfd729 to add precision, and ran vespa-start-services on the content node I had previously stopped. I’m getting the same good results now as prior to this (where I had only one content node)
Hi @baldersheim, were you able to reproduce this in the end? All my code is open-source so I should be able to provide a simple app & data to reproduce.
Unfortunately, the above didn't fix it. I ran vespa-services-start on the other content node, but didn't realise that distribution to it wasn't working, so it still wasn't being used. I've since fixed that, and group hits have gone bad again. I've tried both hint(singlepass) and precision(15), but they don't help sadly.
Please let me know if there's anything further here I can assist with.
@wabzqem We discovered after further debugging that this behaviour is actually not a bug, but instead an expected side-effect of a performance optimization in the grouping engine. It's briefly documented in https://docs.vespa.ai/en/grouping.html#ordering-and-limiting-groups.
You may no longer get the globally best groups when restricting the maximum number of groups and using a non-default ordering expression. This can be somewhat compensated by specifying the precision operator - see link above for details. There is also a single pass optimization that causes some groups to miss documents when using non-default group ordering.
As a side note: using avg in the group ordering expression is not recommended. It will only produce correct results for content clusters where all nodes have the full corpus (e.g. a single node setup). See @baldersheim comment in https://github.com/vespa-engine/vespa/issues/16861#issuecomment-798957938.
The default group ordering - max(relevance()) - will produce correct results in all situations and might suffice for your use case.
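To make the effect described above concrete, here is a toy Python model. It is only a sketch under stated assumptions, not Vespa's actual grouping code: each node is assumed to group its own documents, order the groups by average relevance, keep only the first `keep` groups (standing in for max or precision), and send those to be merged. The group names and most relevance values are made up; 46.2 and 8.12 are the two relevance values reported earlier in this thread for the group that lost a hit.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Doc = Tuple[str, float]          # (group id, relevance)
Partial = Dict[str, List[float]]  # group id -> [count, relevance sum]


def local_groups(docs: Iterable[Doc], keep: int) -> Partial:
    """Group one node's documents and keep its `keep` best groups by local avg."""
    stats: Partial = defaultdict(lambda: [0, 0.0])
    for group, relevance in docs:
        stats[group][0] += 1
        stats[group][1] += relevance
    ranked = sorted(stats.items(), key=lambda kv: -(kv[1][1] / kv[1][0]))
    return dict(ranked[:keep])


def merged_counts(partials: Iterable[Partial]) -> Dict[str, int]:
    """Merge the truncated per-node lists and return count() per group."""
    counts: Dict[str, int] = defaultdict(int)
    for partial in partials:
        for group, (count, _relevance_sum) in partial.items():
            counts[group] += int(count)
    return dict(counts)


# Hypothetical split of one group ("g1") across two content nodes.
node_a = [("g1", 46.2), ("g2", 30.0), ("g3", 20.0)]
node_b = [("g1", 8.12), ("g4", 25.0), ("g5", 15.0)]

# With keep=2, node_b ranks g1 last locally and drops it before merging,
# so the merged g1 only contains the document from node_a.
truncated = merged_counts([local_groups(node_a, 2), local_groups(node_b, 2)])

# With a larger per-node cut-off, node_b also reports g1 and the count is right.
wider = merged_counts([local_groups(node_a, 3), local_groups(node_b, 3)])

print(truncated["g1"], wider["g1"])
```

In this model a single node holding all six documents would always see both g1 documents, which matches the observation that the one-node setup gave accurate counts.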
Summary: To evaluate this expression perfectly accurately, all the node partitions need to see all the data. The framework decides against that and does its best in a faster way.
Somewhat unexpected as a default perhaps, but the realistic alternative is timeout.
I think we could improve this by some more work with data sketches, but I'm not sure increasing precision with average is important enough to prioritize.
I agree that average might not be important enough, as it will either be too costly to solve correctly, or not be accurate. However, we do have a similar issue for other aggregators like max/min/count when used in the order clause. We could estimate precision in a similar way as we estimate the number of hits needed from each node. I think that the law of large numbers, many partitions and random distribution would make that good enough.
Note that this case was also extreme, as all hits had zero relevance, giving no order at all. But for most use cases, estimating precision should be fine.
Thanks for all of this info. I've managed to improve things greatly by using -max(relevance()) along with precision(100), and get very good grouped results. Without precision, I'd still get missing hits, with 2 content nodes and redundancy 2. Happy for this ticket to be closed.
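For later readers, here is a rough Python sketch of how a query with that workaround could be assembled. The exact grouping expression that ended up in use is not shown in this thread, so the placement of precision(100) (right after max(), as in the form used in the grouping documentation linked above) is an assumption, as are the helper names. The field names, rank profile and YQL are taken from the queries earlier in the thread.

```python
from urllib.parse import parse_qs, urlencode


def build_select(max_groups: int = 15, precision: int = 100, hits_per_group: int = 3) -> str:
    """Grouping expression ordered by -max(relevance()) with an explicit precision.

    The position of precision(...) is an assumption based on the documented form.
    """
    return (
        f"all(group(group_doc_id) max({max_groups}) precision({precision}) "
        f"order(-max(relevance())) "
        f"each(output(count()) max({hits_per_group}) each(output(summary()))))"
    )


def build_query_string(firstpubtime: int) -> str:
    """URL-encode the request parameters used for the grouped search."""
    params = {
        "ranking.profile": "top_news",
        "hits": "0",
        "select": build_select(),
        "yql": (
            "select headline from sources newsarticle "
            'WHERE group_doc_id matches "^id" '
            f"AND firstpubtime > {firstpubtime};"
        ),
    }
    return urlencode(params)


query_string = build_query_string(1615591675)
print(query_string)
```

The resulting string can be appended after /search/? on the endpoint, in the same way as $QUERY in the curl commands earlier in the thread.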
Given the following query:
This is giving a result as follows:
On a kubernetes deployment with 2 content nodes (one is also the config & admin server), I am seeing a count() of 1 or 2 and no continuations where I would expect to see the max, 3. For example, here are two documents that I think should be shown in the group results, where only the first one is. I've removed some irrelevant fields with [snip] for brevity:
These results are using the same query profile ("top_news"), so the relevance should be the same as calculated for the groups - I don't think it should be low enough to disappear.
When I dump all the documents and feed them into vespa running on docker on my local machine, the results appear as I would expect. Both had about 30,000 documents, and only approx 200 documents match the where clause (have a group_doc_id starting with "id:" and firstpubtime in the last ~day). Here's some metadata from the results:
services.xml: https://gist.github.com/wabzqem/98faf22c21019bf781eba457d2a1f402
hosts.xml: https://gist.github.com/wabzqem/10b5e9661f2d11af499db6cc0d1de3f1
Please let me know what other information I can provide here. All the source for this is public at https://github.com/ausnews/ausnews-search/ .