opensanctions / yente

API for OpenSanctions with support for entity search and bulk matching of data collections. Supports Reconciliation API spec.
https://www.opensanctions.org/docs/yente/
MIT License

Fuzzy matching not working #120

Closed skrafft closed 2 years ago

skrafft commented 2 years ago

Hi,

The fuzzy matching parameter has no effect:

I tried https://api.opensanctions.org/search/default?q=Barrrack%20Obama&fuzzy=true, which should return a result since only one letter differs.

I checked the code at https://github.com/opensanctions/yente/blob/main/yente/search/queries.py#L85 and the Elasticsearch documentation. It should work, but in practice it does not.

Googling the problem turns up reports that blame an incorrect mapping, but I could not find any problem in the ES mapping either. I ended up updating the text_query function to this:

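def text_query(
    dataset: Dataset,
    schema: Schema,
    query: str,
    filters: FilterDict = {},
    fuzzy: bool = False,
):
    if not query.strip():
        # Empty query: return everything that passes the filters.
        should = {"match_all": {}}
    elif fuzzy and query.find('~') == -1:
        # Fuzzy search requested and no explicit ~ operator in the query:
        # use a match query, which applies fuzziness to every term.
        should = {
            "match": {
                "text": {
                    "query": query,
                    "fuzziness": "AUTO",
                    "lenient": True,
                    "operator": "AND",
                }
            }
        }
    else:
        # Otherwise keep the query_string behaviour, where fuzziness
        # only applies to terms the user marks with ~.
        should = {
            "query_string": {
                "query": query,
                "fields": ["names^3", "text"],
                "default_operator": "and",
            }
        }
    return filter_query([should], dataset=dataset, schema=schema, filters=filters)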

The condition fuzzy and query.find('~') == -1 is there to avoid mixing the fuzziness setting with the ~ operator. If the query contains ~, the fuzzy parameter is simply ignored and the query_string branch is used.

@pudo any comments on this?

I can open a pull request if needed.

pudo commented 2 years ago

Hi. I'm just working on an integration test harness for the API, so this comes in useful. Here is what I think is happening: "fuzziness": "AUTO" brings in a Levenshtein tolerance of only 1 for a string the length of "Barrrack Obama". Adding two extra "r"s exceeds that threshold. So I guess the best option would be to make fuzziness default to something other than AUTO, e.g. 2. I don't want to do this on the public API that we operate, since it's a massive performance penalty, but we could introduce an environment setting?

skrafft commented 2 years ago

Hi,

I don't think this is related to the AUTO value. I've tested multiple combinations directly on Elasticsearch with fuzziness set to AUTO, 1, or 2, and it does not change the results. As a matter of fact, the query https://api.opensanctions.org/search/default?q=Barrack%20Obama returns 1 result and https://api.opensanctions.org/search/default?q=Barrock%20Obama%fuzzy=true (changing one "a" to one "o") does not return anything.

I think there's something wrong with the mapping, but I could not figure out what, so I ended up rewriting the query.

pudo commented 2 years ago

Just to be clear: the guy is called Barack Obama (https://en.wikipedia.org/wiki/Barack_Obama). Barrack Obama is fuzziness=1, Barrock Obama is fuzziness=2. Am I totally confused here?
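To make the distances in this thread easy to check, here is a minimal sketch of plain Levenshtein distance in Python. It is only an illustration: the function name is made up for this example, and Elasticsearch's own fuzzy matching works per term and by default also counts a transposition of two adjacent characters as a single edit, which does not matter for the spellings discussed here.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute (or keep) the character
                )
            )
        previous = current
    return previous[-1]


# Compare the canonical spelling with the variants mentioned in this issue.
for variant in ("barrack", "barrock", "barrrack", "barock"):
    print(variant, levenshtein("barack", variant))
```

Measured against "barack", this gives 1 for "barrack" and "barock", and 2 for "barrock" and "barrrack".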

skrafft commented 2 years ago

That's true, but he also has aliases like Barrack Obama in the data, so Barrack Obama is a perfect match according to Elasticsearch (which makes the distance 1 when you replace an "a" with an "o"). Anyway, searching https://api.opensanctions.org/search/default?q=Barock%20Obama does not return any result either.

AndreiD commented 2 years ago

So for it to return a result for Barock%20Obama, is there something that can be configured or added?

pudo commented 2 years ago

OK, so I've solved this question, but the answer is less than amazing. Basically: Elasticsearch never does fuzzy search on all the terms in a query_string query. That's something you have to actively indicate by adding a tilde to the fuzzy term: barock~ obama gives a result.
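For anyone who wants fuzzy behaviour from the search endpoint anyway, a client can add the tildes itself before sending the query. The helper below is a rough sketch and not part of yente: the name fuzzify_query is hypothetical, and it deliberately ignores quoted phrases (where ~ means proximity rather than fuzziness), field prefixes, and parentheses.

```python
# Upper-case boolean operators of the query_string syntax; they must not get a tilde.
BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}


def fuzzify_query(query: str) -> str:
    """Append ~ to every plain term so query_string treats it as fuzzy."""
    terms = []
    for term in query.split():
        if term in BOOLEAN_OPERATORS or "~" in term:
            # Leave operators and terms that already carry a tilde untouched.
            terms.append(term)
        else:
            terms.append(term + "~")
    return " ".join(terms)


print(fuzzify_query("barock obama"))
```

The result can then be passed as the q parameter of the search endpoint. Keep in mind the performance concern raised earlier in this thread: making every term fuzzy is more expensive than an exact search.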

My take-away: probably a good idea to use /match in yente most of the time if you're trying to match entities. The search API is just that: a way for people to search on the web site...
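As a rough illustration of the /match route, the sketch below only builds the request body and prints it. The body shape (a "queries" object keyed by an arbitrary ID, each entry with a "schema" and a "properties" dict of value lists) is written from memory of the yente documentation and should be treated as an assumption to verify there; build_match_payload and the "q1" key are made up for this example.

```python
import json


def build_match_payload(name: str, schema: str = "Person") -> dict:
    """Build a single-entity request body for the match endpoint (assumed shape)."""
    return {
        "queries": {
            # "q1" is an arbitrary, caller-chosen key used to pair results with queries.
            "q1": {
                "schema": schema,
                "properties": {"name": [name]},
            }
        }
    }


# Print the JSON that would be sent in the body of a POST request.
print(json.dumps(build_match_payload("Barock Obama"), indent=2))
```

The printed JSON would be sent as the body of a POST request to the match endpoint for a dataset, for example /match/default.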

cf. https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html