inspirehep / inspire-next

The INSPIRE repo.
https://inspirehep.net
GNU General Public License v3.0

OR query giving wrong results #2829

Closed jmartinm closed 6 years ago

jmartinm commented 6 years ago

Performing the following query: control_number:1497201 OR control_number:1498589 generates:

{
    "query": {
        "bool": {
            "filter": [
                {
                    "bool": {
                        "must_not": [
                            {
                                "match": {
                                    "_collections": "HERMES Internal Notes"
                                }
                            }
                        ],
                        "must": [
                            {
                                "match": {
                                    "_collections": "literature"
                                }
                            }
                        ]
                    }
                }
            ],
            "should": [
                {
                    "match": {
                        "control_number": "1497201"
                    }
                },
                {
                    "match": {
                        "control_number": "1498589"
                    }
                }
            ]
        }
    },
    "from": 0,
    "size": 25
}

which is returning all results (instead of the expected 2).

I think the problem is the use of should together with filter. From the bool query documentation (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html#query-dsl-bool-query):

The clause (query) should appear in the matching document. If the bool query is in a query context and has a must or filter clause then a document will match the bool query even if none of the should queries match.
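To make the quoted rule concrete, here is a minimal pure-Python sketch of it. This is a toy model, not Elasticsearch code: `bool_matches` and `term` are hypothetical helpers, `_collections` is simplified to a plain string, and the third document (control number 9999999) is made up for illustration.

```python
def term(field, value):
    """Hypothetical stand-in for a match clause: exact equality on one field."""
    return lambda doc: doc.get(field) == value


def bool_matches(doc, filter_clauses=(), should_clauses=(), minimum_should_match=None):
    """Toy model of a bool query evaluated in query context."""
    # Every filter clause has to match.
    if not all(clause(doc) for clause in filter_clauses):
        return False
    # Without should clauses there is nothing more to check.
    if not should_clauses:
        return True
    if minimum_should_match is None:
        # Documented default: should is optional when a filter (or must)
        # clause is present, otherwise at least one should clause must match.
        minimum_should_match = 0 if filter_clauses else 1
    matched = sum(1 for clause in should_clauses if clause(doc))
    return matched >= minimum_should_match


docs = [
    {"control_number": "1497201", "_collections": "literature"},
    {"control_number": "1498589", "_collections": "literature"},
    {"control_number": "9999999", "_collections": "literature"},  # made-up record
]
filters = [term("_collections", "literature")]
shoulds = [term("control_number", "1497201"), term("control_number", "1498589")]

# Default behaviour: the filter alone decides, so every literature record matches.
default_hits = [d["control_number"] for d in docs if bool_matches(d, filters, shoulds)]

# With minimum_should_match=1 only the two requested records match.
strict_hits = [
    d["control_number"]
    for d in docs
    if bool_matches(d, filters, shoulds, minimum_should_match=1)
]

print(default_hits)
print(strict_hits)
```

In this model the default call returns all three records, which mirrors the "all results" symptom described above, while the strict call returns only the two requested ones.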

jmartinm commented 6 years ago

Actually, I just realised that this issue has already been reported: https://github.com/inspirehep/inspire-next/issues/1792

jmartinm commented 6 years ago

@chris-asl Is this something you can have a look at?

jacquerie commented 6 years ago

While it's true that OR queries are broken, the query being run here is the product of insufficient API design: we should never need to reparse queries that we generate internally!

chris-asl commented 6 years ago

According to the Elasticsearch bool query documentation:

The clause (query) should appear in the matching document. If the bool query is in a query context and has a must or filter clause then a document will match the bool query even if none of the should queries match. In this case these clauses are only used to influence the score. If the bool query is a filter context or has neither must or filter then at least one of the should queries must match a document for it to match the bool query. This behavior may be explicitly controlled by settings the minimum_should_match parameter.

So we currently have a bool query in a query context with a filter clause, which puts us in the case where "a document will match the bool query even if none of the should queries match". This means that when an Elasticsearch query object is created with a filter, we should add "minimum_should_match": 1, as @jmartinm also suggested.
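As a rough sketch of that idea on a plain query body (this is not the invenio-search change itself; `require_should_match` is a hypothetical helper, and the body below is a shortened version of the query from the issue description):

```python
import copy


def require_should_match(query_body, minimum=1):
    """Return a copy of query_body whose top-level bool query needs should matches.

    Hypothetical helper: it only touches a bool query that combines should
    with filter or must, and it keeps any minimum_should_match already set.
    """
    patched = copy.deepcopy(query_body)
    bool_query = patched.get("query", {}).get("bool")
    if (
        bool_query
        and bool_query.get("should")
        and (bool_query.get("filter") or bool_query.get("must"))
    ):
        bool_query.setdefault("minimum_should_match", minimum)
    return patched


body = {
    "query": {
        "bool": {
            "filter": [{"match": {"_collections": "literature"}}],
            "should": [
                {"match": {"control_number": "1497201"}},
                {"match": {"control_number": "1498589"}},
            ],
        }
    },
    "from": 0,
    "size": 25,
}

patched = require_should_match(body)
print(patched["query"]["bool"]["minimum_should_match"])
```

The helper works on a deep copy so the caller's original body is left unchanged, and `setdefault` avoids overriding a value that was set deliberately elsewhere.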

I will create a PR on invenio-search.

chris-asl commented 6 years ago

The issue is resolved since https://github.com/inveniosoftware/invenio-search/pull/105 has been merged to upstream. Currently title holography or bosons returns three results, as it should.