vespa-engine / vespa

AI + Data, online. https://vespa.ai
https://vespa.ai
Apache License 2.0
5.85k stars 606 forks source link

Expected hits in groups aren't appearing #16861

Closed wabzqem closed 3 years ago

wabzqem commented 3 years ago

Given the following query:

{
    "ranking": {
        "profile": "top_news"
    },
    "hits": 0,
    "select": "all(group(group_doc_id) max(15) order(-avg(relevance())) each(output(count()) max(3) each(output(summary()))))",
    "firstpubtime": "1615122508",
    "yql": "select * from sources newsarticle WHERE firstpubtime > @firstpubtime and group_doc_id matches \"^id\";"
}

This is giving a result as follows:

 "children": [
                            {
                                "id": "group:string:id:newsarticle:newsarticle::3cc6ff10bd9fc1d47ab54fc30ccecb983bfa0e0bb531c543983447d479ed699a",
                                "relevance": 1,
                                "value": "id:newsarticle:newsarticle::3cc6ff10bd9fc1d47ab54fc30ccecb983bfa0e0bb531c543983447d479ed699a",
                                "fields": {
                                    "count()": 1
                                },
                                "children": [
                                    {
                                        "id": "hitlist:hits",
                                        "relevance": 1,
                                        "label": "hits",
                                        "children": [
                                            {
                                                "id": "id:newsarticle:newsarticle::bb97333cf8dd4c4a9940758176688f5a876519958d8712d2e7b415ea512cd2a2",
                                                "relevance": 35.54693287037036,
                                                "source": "newsarticle",
                                                "fields": {
                                                    "sddocname": "newsarticle",
-- rest of single hit in group ---

On a kubernetes deployment with 2 content nodes (one is also the config & admin server), I am seeing a count() of 1 or 2 and no continuationswhere I would expect to see the max, 3. For example, here are two documents that I think should be shown in the group results, where only the first one is. I've removed some irrelevant fields with [snip] for brevity:

        {
            "id": "id:newsarticle:newsarticle::bb97333cf8dd4c4a9940758176688f5a876519958d8712d2e7b415ea512cd2a2",
            "relevance": 42.17184,
            "source": "newsarticle",
            "fields": {
                "sddocname": "newsarticle",
                "bodytext": "[snip]",
                "documentid": "id:newsarticle:newsarticle::bb97333cf8dd4c4a9940758176688f5a876519958d8712d2e7b415ea512cd2a2",
                "headline": "Burger King deletes 'tone deaf' International Women's Day campaign tweet",
                "url": "https://www.abc.net.au/news/2021-03-09/burger-king-says-women-belong-in-the-kitchen-on-iwd/13229022",
                "keywords": [
                    [snip]
                ],
                "firstpubtime": 1615242227,
                "modtime": 1615257511,
                "sentiment": 0.11060606,
                "wordcount": 448,
                "abstract": "[snip]",
                "bylines": [
                    "ABC News"
                ],
                "source": "ABC News",
                "twitter_retweet_count": 19,
                "twitter_favourite_count": 93,
                "twitter_link": "https://twitter.com/abcnews/status/1369051839082110976",
                "group_doc_id": "id:newsarticle:newsarticle::3cc6ff10bd9fc1d47ab54fc30ccecb983bfa0e0bb531c543983447d479ed699a",
                "summaryfeatures": {
                    "freshness(firstpubtime)": 0.4534606481481481,
                    "vespa.summaryFeatures.cached": 0
                },
                "group_doc_id_relevance": 0.15066065
            }
        },
        {
            "id": "id:newsarticle:newsarticle::3cc6ff10bd9fc1d47ab54fc30ccecb983bfa0e0bb531c543983447d479ed699a",
            "relevance": 1.1572223,
            "source": "newsarticle",
            "fields": {
                "sddocname": "newsarticle",
                "bodytext": "[snip]",
                "documentid": "id:newsarticle:newsarticle::3cc6ff10bd9fc1d47ab54fc30ccecb983bfa0e0bb531c543983447d479ed699a",
                "headline": "Burger King slammed over sexist tweet on International Women’s Day",
                "url": "https://www.news.com.au/finance/business/retail/burger-king-slammed-over-sexist-tweet-on-international-womens-day/news-story/9fd6f9b99871cd96bf099842a7c06c6f",
                "keywords": [
 [snip]
                ],
                "firstpubtime": 1615253040,
                "modtime": 1615257041,
                "sentiment": 0.06629633,
                "wordcount": 414,
                "abstract": "An international marketing campaign by Burger King – Hungry Jacks in Australia – has gone viral for the wrong reasons.",
                "bylines": [
                    "Kathy Skantzos"
                ],
                "source": "news.com.au",
                "twitter_retweet_count": 0,
                "twitter_favourite_count": 2,
                "twitter_link": "https://twitter.com/newscomauHQ/status/1369135858755002372",
                "group_doc_id": "id:newsarticle:newsarticle::3cc6ff10bd9fc1d47ab54fc30ccecb983bfa0e0bb531c543983447d479ed699a",
                "summaryfeatures": {
                    "freshness(firstpubtime)": 0.5786111111111112,
                    "vespa.summaryFeatures.cached": 0
                },
                "group_doc_id_relevance": 0.15066065
            }
        }

These results are using the same query profile ("top_news"), so the relevance should be the same as calculated for the groups - I don't think it should be low enough to disappear.

When I dump all the documents and feed into vespa running on docker on my local machine, the results appear as I would expect. Both had about 30,000 documents, and only approx 200 documents match the where clause (have a doc_group_id starting with "id:" and firstpubtime in the last ~day). Here's some metadata from the results:

        "fields": {
            "totalCount": 260
        },
        "coverage": {
            "coverage": 100,
            "documents": 29884,
            "full": true,
            "nodes": 2,
            "results": 1,
            "resultsFull": 1
        },

services.xml: https://gist.github.com/wabzqem/98faf22c21019bf781eba457d2a1f402 hosts.xml: https://gist.github.com/wabzqem/10b5e9661f2d11af499db6cc0d1de3f1

# vespa-model-inspect service searchnode
searchnode @ vespa-0.vespa-internal.default.svc.cluster.local : search
newsarticle/search/cluster.newsarticle/0
    tcp/vespa-0.vespa-internal.default.svc.cluster.local:19106 (STATUS ADMIN RTC RPC)
    tcp/vespa-0.vespa-internal.default.svc.cluster.local:19107 (UNUSED)
    tcp/vespa-0.vespa-internal.default.svc.cluster.local:19108 (UNUSED)
    tcp/vespa-0.vespa-internal.default.svc.cluster.local:19109 (UNUSED)
    tcp/vespa-0.vespa-internal.default.svc.cluster.local:19110 (STATE HEALTH JSON HTTP)
searchnode @ vespa-content-0.vespa-internal.default.svc.cluster.local : search
newsarticle/search/cluster.newsarticle/1
    tcp/vespa-content-0.vespa-internal.default.svc.cluster.local:19103 (STATUS ADMIN RTC RPC)
    tcp/vespa-content-0.vespa-internal.default.svc.cluster.local:19104 (UNUSED)
    tcp/vespa-content-0.vespa-internal.default.svc.cluster.local:19105 (UNUSED)
    tcp/vespa-content-0.vespa-internal.default.svc.cluster.local:19106 (UNUSED)
    tcp/vespa-content-0.vespa-internal.default.svc.cluster.local:19107 (STATE HEALTH JSON HTTP)

Please let me know what other information I can provide here. All the source for this is public at https://github.com/ausnews/ausnews-search/ .

wabzqem commented 3 years ago

One other observation:

wabzqem commented 3 years ago

Here's a search result with tracelevel 5: https://gist.github.com/wabzqem/3fec158f1ed8267d3fdbb683b1a8e6c1

I'd expect the very first group here to have 2 hits in it, but only 1 is returned and it has a count() of 1. The first has relevance value of 46.2 (which shows up) and second (if it showed) has relevance 8.12. Only the first currently shows. If I alter the query to match the exact group_doc_id value and leave the grouping as is, i.e.:

{
    "ranking": {
        "profile": "top_news"
    },
    "hits": 0,
    "select": "all(group(group_doc_id) max(5) order(-avg(relevance())) each(output(count()) max(3) each(output(summary()))))",
    "firstpubtime": "1615242362",
    "yql": "select * from sources newsarticle WHERE firstpubtime > @firstpubtime and group_doc_id matches \"^id:newsarticle:newsarticle::ac9710d4d85865dbf870bc14684463ebd2c809010c87ba76413f05da6a0957f7\";"
}

Then two articles show up in this group (as expected).

wabzqem commented 3 years ago

I deployed my application to vespa cloud (source: https://github.com/ausnews/ausnews-search/tree/vespa-cloud/search-engine-app) after removing the document processor, and fed it all the data from my production vespa install (json data can be found here, approx ~30k docs, can be fed directly with vespa-http-client-jar-with-dependencies.jar): http://www.whatsbeef.net/wabz/data.tar.gz).

curl --cert ./data-plane-public-cert.pem --key ./data-plane-private-key.pem -s "https://text-search.wabzqem.ausnews.ausnews.aws-us-east-1c.dev.public.vespa.oath.cloud/search/?ranking.profile=top_news&hits=0&firstpubtime=1615360444&select=all(group(group_doc_id)+max(15)+order(-avg(relevance()))+each(output(count())+max(3)+each(output(summary()))))&yql=select+*+from+sources+newsarticle+WHERE+firstpubtime+%3E+%40firstpubtime+and+group_doc_id+matches+%22%5Eid%22%3B" | jq '.' | less

The groups here have results as expected (as with docker), I don't see the dropped group hits that I'm seeing from my vespa install.

I might just try blowing it all away and starting with a fresh cluster and feeding it up to date data to see if that fixes it, but it would be nice to understand what I might have done here to make things go awry. Could low memory cause this?

jobergum commented 3 years ago

In your private instance, could you increase the timeout and see if you reproduce? &timeout=5s

wabzqem commented 3 years ago

Thanks - I've just tried with the timeout parameter, it still reproduces. With presentation.timing=true, searchTime is about twice as fast as my docker instance, and the same as what I see on vespa cloud ("searchtime": 0.015).

wabzqem commented 3 years ago

Bizarrely, if I remove the firstpubtime > @firstpubtime from the WHERE clause, I get all the hits in each group that I'd expect. This actually works out okay, because the rank profile has a very low freshness.maxAge which pretty much does the same thing. I'd love to understand what is going on here, it's been doing my head in 🤯 :) The hits that are removed definitely do have a firstpubtime attribute value greater than the query parameter given.

wabzqem commented 3 years ago

Sorry, another update. The above definitely improved things, but I then still found instances of groups which didn't have as many hits as I expected. However, I think I have found that having more than a single content node causes this - I'm able to reproduce on vespa cloud. I've fed the same dataset to a dev (single content node) and prod (2 content nodes) instance, and I get different group count()s on the same query. Ordering is slightly different too, as ranking is affected:

export QUERY="presentation.timing=true&ranking.profile=top_news&hits=0&select=all(group(group_doc_id)+max(15)+order(-avg(relevance()))+each(output(count())+max(3)+each(output(summary()))))&yql=select+headline+from+sources+newsarticle+WHERE+group_doc_id+matches+%22%5Eid%22+AND+firstpubtime+%3E+1615591675%3B"

rnelson@wabz vespa-cloud % curl --cert ./data-plane-public-cert.pem --key ./data-plane-private-key.pem -s "https://text-search.ausnews.ausnews.aws-us-east-1c.public.vespa.oath.cloud/search/?$QUERY" | jq '.root.children[].children[].children[].fields["count()"]'
2
2
2
2
1
2
11
3
3
2
3
2
1
1
2
rnelson@wabz vespa-cloud % curl --cert ./data-plane-public-cert.pem --key ./data-plane-private-key.pem -s "https://text-search.wabzqem.ausnews.ausnews.aws-us-east-1c.dev.public.vespa.oath.cloud/search/?$QUERY" | jq '.root.children[].children[].children[].fields["count()"]'
2
2
3
2
4
3
2
2
5
2
1
28
2
20
2

Then, I stopped services on one of my kubernetes content nodes (I now have redundancy=2), waited a bit for indexing, and I got the results that I expect. The single content node appears to give more, and accurate group hit counts.

Is this expected behaviour?

baldersheim commented 3 years ago

No, this is not expected behaviour. I suspect this has to do with avg in relation with relevancy. When using relevancy in the order clause it will trigger an optimisation. However I suspect that this optimisation can not be done when using anything but min or max aggregator. I will verify that when I get back to work tomorrow and get back to you. Thanks a lot for investigating this.

wabzqem commented 3 years ago

Any update on this one? Right now I'm just running on one content node so as to get accurate results. Are there any other potential workarounds? Thanks!

baldersheim commented 3 years ago

It seems I dropped this one on the floor. Will pick it up again now.

wabzqem commented 3 years ago

Hello, appreciate this ranking issue may not be high in the priority order. But I'm left wondering if I should think of a different approach here or if there might be a fix at some point?

Again, thank you for open sourcing vespa, it's clearly an amazing piece of work.

baldersheim commented 3 years ago

I am starting with this one now. While I investigate the group merging code could you try to see what happens if you add precision(2-5 x of what you set for max()). See https://docs.vespa.ai/en/grouping.html#ordering-and-limiting-groups

baldersheim commented 3 years ago

Initial pass of the code did not confirm my suspicion that we incorrectly allowed an optimisation. I will continue to make a system test to reproduce. You should also try adding 'hint(singlepass)' to you grouping expression. Like "all(group(group_doc_id) hint(singlepass) max(5) order(-avg(relevance())).....". The latter is undocumented.

wabzqem commented 3 years ago

I’ve made this change: https://github.com/ausnews/ausnews-search/commit/dd50405d3a8fdc3f505c440bad5778eedbbfd729 to add precision, and ran vespa-start-services on the content node I had previously stopped. I’m getting the same good results now as prior to this (where I had only one content node)

wabzqem commented 3 years ago

Hi @baldersheim, were you able to reproduce this in the end? All my code is open-source so I should be able to provide a simple app & data to reproduce.

Unfortunately, the above didn't fix it. I ran vespa-services-start on the other content node, but didn't realise that distribution to it wasn't working, so it still wasn't being used. I've since fixed that, and group hits have gone bad again. I've tried both hint(singplepass) and precision(15), but they don't help sadly.

Please let me know if there's anything further here I can assist with.

bjorncs commented 3 years ago

@wabzqem We discovered after further debugging that this behaviour is actually not a bug, but instead an expected side-effect of a performance optimization in the grouping engine. It's briefly documented in https://docs.vespa.ai/en/grouping.html#ordering-and-limiting-groups.

You may no longer get the globally best groups when restricting the maximum number of groups and using a non-default ordering expression. This can be somewhat compensated by specifying the precision operator - see link above for details. There is also a single pass optimization that causes some groups to miss documents when using non-default group ordering.

As a side note; using avg in the group ordering expression is not recommended. It will only produce correct result for content clusters were all nodes have the full corpus (e.g single node setup). See @baldersheim comment in https://github.com/vespa-engine/vespa/issues/16861#issuecomment-798957938.

The default group ordering - max(relevance()) - will produce correct result in all situations and might suffice for your use-case.

bratseth commented 3 years ago

Summary: To evaluate this expression perfectly accurately, all the node partitions need to see all the data. The framework decides against that and does it's best in a faster way.

Somewhat unexpected as a default perhaps, but the realistic alternative is timeout.

I think we could improve this by some more work with data sketches, but I'm not sure increase precision with average is important enough to prioritize.

baldersheim commented 3 years ago

I agree that average might not be important enough, as it will either be too costly to solve correctly, or not be accurate. However we do have a similar issue for other aggregators like max/min/count when used in order clause. We could estimate precision in a similar way as we estimate number of hits needed from each node. I think that the rule of large numbers, many partitions and random distribution would make that good enough.

Note that this case was also extreme, as all hits had zero relevance giving no order at all. But most use cases estimating precision should be fine.

wabzqem commented 3 years ago

Thanks for all of this info. I've managed to improve things greatly by using -max(relevance()) along with precision(100), and get very good grouped results. Without precision, I'd still get missing hits, with 2 content nodes and redundancy 2. Happy for this ticket to be closed.