inveniosoftware / invenio-vocabularies

Invenio module for managing vocabularies.
https://invenio-vocabularies.readthedocs.io
MIT License
2 stars 40 forks source link

OpenSearch 2.x #227

Closed max-moser closed 1 year ago

max-moser commented 1 year ago

There's still an issue going on with the suggestions for contrib vocabularies... The following errors only happen with OSv2, but neither OSv1 nor ES7:

> /home/mmoser/rdm/opensearch2/invenio-vocabularies/tests/contrib/affiliations/test_affiliations_resource.py(130)test_affiliations_suggest_sort()
    128     # Should show 2 results, but id=cern as first due to acronym/name
    129     res = client.get(f"{prefix}?suggest=CERN", headers=h)
--> 130     assert res.status_code == 200
    131     assert res.json["hits"]["total"] == 2
    132     assert res.json["hits"]["hits"][0]["id"] == "cern"

ipdb> res.text
'{"status": 400, "message": "Invalid query string syntax."}'

Without the ?suggest=CERN part, it works:

ipdb> client.get(f"{prefix}", headers=h).text
'{"hits": {"hits": [{"acronym": "CERN", "name": "CERN", "created": "2022-09-29T16:31:47.286619+00:00", "links": {"self": "https://127.0.0.1:5000/api/affiliations/cern"}, "updated": "2022-09-29T16:31:47.307105+00:00", "id": "cern", "title": {"en": "European Organization for Nuclear Research", "fr": "Conseil Europ\\u00e9en pour la Recherche Nucl\\u00e9aire"}, "revision_id": 1}, {"acronym": "CERT", "name": "CERT", "created": "2022-09-29T16:31:47.462047+00:00", "links": {"self": "https://127.0.0.1:5000/api/affiliations/cert"}, "updated": "2022-09-29T16:31:47.473284+00:00", "id": "cert", "title": {"en": "Computer Emergency Response Team", "fr": "\\u00c9quipe d\'Intervention d\'Urgence Informatique"}, "revision_id": 1}, {"acronym": "NU", "name": "Northwestern University", "created": "2022-09-29T16:31:47.506629+00:00", "links": {"self": "https://127.0.0.1:5000/api/affiliations/nu"}, "updated": "2022-09-29T16:31:47.515084+00:00", "id": "nu", "title": {"en": "Northwestern University"}, "revision_id": 1}, {"acronym": "OTHER", "name": "OTHER", "created": "2022-09-29T16:31:47.405280+00:00", "links": {"self": "https://127.0.0.1:5000/api/affiliations/other"}, "updated": "2022-09-29T16:31:47.425857+00:00", "id": "other", "title": {"en": "CERN"}, "revision_id": 1}], "total": 4}, "sortBy": "name", "links": {"self": "https://127.0.0.1:5000/api/affiliations?page=1&size=25&sort=name"}}'

In fact, it even works with ?suggest= but no value:

ipdb> client.get(f"{prefix}?suggest=", headers=h).text
'{"hits": {"hits": [{"acronym": "CERN", "name": "CERN", "created": "2022-09-29T16:31:47.286619+00:00", "links": {"self": "https://127.0.0.1:5000/api/affiliations/cern"}, "updated": "2022-09-29T16:31:47.307105+00:00", "id": "cern", "title": {"en": "European Organization for Nuclear Research", "fr": "Conseil Europ\\u00e9en pour la Recherche Nucl\\u00e9aire"}, "revision_id": 1}, {"acronym": "CERT", "name": "CERT", "created": "2022-09-29T16:31:47.462047+00:00", "links": {"self": "https://127.0.0.1:5000/api/affiliations/cert"}, "updated": "2022-09-29T16:31:47.473284+00:00", "id": "cert", "title": {"en": "Computer Emergency Response Team", "fr": "\\u00c9quipe d\'Intervention d\'Urgence Informatique"}, "revision_id": 1}, {"acronym": "NU", "name": "Northwestern University", "created": "2022-09-29T16:31:47.506629+00:00", "links": {"self": "https://127.0.0.1:5000/api/affiliations/nu"}, "updated": "2022-09-29T16:31:47.515084+00:00", "id": "nu", "title": {"en": "Northwestern University"}, "revision_id": 1}, {"acronym": "OTHER", "name": "OTHER", "created": "2022-09-29T16:31:47.405280+00:00", "links": {"self": "https://127.0.0.1:5000/api/affiliations/other"}, "updated": "2022-09-29T16:31:47.425857+00:00", "id": "other", "title": {"en": "CERN"}, "revision_id": 1}], "total": 4}, "sortBy": "name", "links": {"self": "https://127.0.0.1:5000/api/affiliations?page=1&size=25&sort=name&suggest="}}'
ppanero commented 1 year ago

Digging a bit, seems that the mappings have some incompatibility. Most likely due to the autodetect of field type inside title since the keyword is inside a multi field:

"title": {
    "dynamic": "true",
    "properties": {
        "en": {
            "type": "search_as_you_type",
            "doc_values": false,
            "max_shingle_size": 3
        },
        "fr": {
            "type": "text",
            "fields": {
                "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                }
            }
        }
    }
},

The generated query is:

{
    "query": {
        "bool": {
            "minimum_should_match": "0<1",
            "filter": [{
                "match_all": {}
            }],
            "must": [{
                "multi_match": {
                    "query": "CERN",
                    "fields": ["name^100", "title.*^5"],
                    "type": "phrase_prefix"
                }
            }]
        }
    },
    "sort": ["_score"],
    "track_total_hits": true,
    "from": 0,
    "size": 25
}

Resulting in:

opensearchpy.exceptions.RequestError: RequestError(400, 'search_phase_execution_exception', 'failed to create query: Can only use phrase prefix queries on text fields - not on [title.fr.keyword] which is of type [keyword]')

I think this can be fixed using dynamic templates (on it)

ppanero commented 1 year ago

After a bit more digging, there are two problems:

1- Some fields are autodetected. Easily fixed with the dynamic templates. 2- The query is not being properly built. It should be (this is the query being built for ES7)

"multi_match": {
    "query": "CERN",
    "fields": ["name^100", "title.*^5", "title.*._2gram", "title.*._3gram"],
    "type": "bool_prefix"
}

This should be working tho, see query parser

ppanero commented 1 year ago

Pushed a new commit to fix the mappings. Once records-resources is released, particularly this commit tests will pass. They do locally:

Screenshot 2022-10-04 at 14 30 31
ntarocco commented 1 year ago

Thanks everyone for your help here! 🚀