opensanctions / yente

API for OpenSanctions with support for entity search and bulk matching of data collections. Supports Reconciliation API spec.
MIT License

Fuzzy matching not working #120

Closed skrafft closed 2 years ago

skrafft commented 2 years ago


The fuzzy matching parameter has no effect:

I tried to return results for a query with a single letter changed, and it should return a result since only one letter differs.

I checked the code and the Elasticsearch documentation: it should work, but in practice it does not.

A Google search suggests this kind of problem usually comes from a wrong mapping, but I could not find any problem in the ES mapping either. I ended up updating the text_query function to this:

def text_query(
    dataset: Dataset,
    schema: Schema,
    query: str,
    filters: FilterDict = {},
    fuzzy: bool = False,
):
    if not len(query.strip()):
        # Empty query: match every document.
        should = {"match_all": {}}
    elif fuzzy and query.find('~') == -1:
        # Fuzzy requested and no explicit ~ operator: use a match query,
        # which applies fuzziness to every term.
        should = {
            "match": {
                "text": {
                    "query": query,
                    "fuzziness": "AUTO",
                    "lenient": True,
                }
            }
        }
    else:
        # Otherwise keep the query_string behaviour.
        should = {
            "query_string": {
                "query": query,
                "fields": ["names^3", "text"],
                "default_operator": "and",
            }
        }
    return filter_query([should], dataset=dataset, schema=schema, filters=filters)

The reason for the condition fuzzy and query.find('~') == -1 is to avoid mixing the fuzziness setting with the ~ operator: if the query contains ~, the fuzzy parameter is simply ignored.

@pudo any comment on this?

I can open a pull request if needed

pudo commented 2 years ago

Hi. I'm just working on an integration test harness for the API, so this comes in useful. What I think is going on: "fuzziness": "AUTO" brings in a Levenshtein tolerance of only 1 for a string the length of "Barrrack Obama". Adding two extra "r" exceeds that threshold. So I guess the best option would be to make fuzziness default to something other than AUTO, e.g. 2. I don't want to do this on the public API that we operate, since it's a massive performance penalty, but we could introduce an environment setting?

skrafft commented 2 years ago


I don't think this is related to the AUTO value. I've tested multiple combinations directly on Elasticsearch with fuzziness set to AUTO, 1, or 2, and it does not change the results. As a matter of fact, the original query returns 1 result, while the variant (changing one "a" to one "o") does not return anything.

I think there's something wrong with the mapping, but I could not figure out what, so I ended up rewriting the query.

pudo commented 2 years ago

Just to be clear: the guy is called Barack Obama (Barrack Obama is fuzziness=1, Barrock Obama is fuzziness=2). Am I totally confused here?
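The edit distances quoted above can be checked with a short sketch. This is a plain Levenshtein implementation written for illustration, not code from yente or Elasticsearch; Elasticsearch additionally counts a transposition of two adjacent characters as a single edit by default, which makes no difference for these particular spellings. The function name is an assumption made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # previous[j] holds the distance between a[:i-1] and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


for variant in ("Barrack", "Barrock", "Barock"):
    print(variant, levenshtein("Barack", variant))
# Barrack 1
# Barrock 2
# Barock 1
```

Fuzziness is evaluated per term, so only the first name matters here; "Obama" is identical in every variant.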

skrafft commented 2 years ago

That's true, but he also has aliases like Barrack Obama in the data, so Barrack Obama is a perfect match according to Elasticsearch (which brings the fuzziness down to 1 when you replace the a with an o). Anyway, searching does not return any result either.

AndreiD commented 2 years ago

So, for it to return a result for Barock%20Obama, is there something that can be configured or added?

pudo commented 2 years ago

OK, so I've solved this question, but the answer is less than amazing. Basically: Elasticsearch never does fuzzy search on all the terms in a query_string query. That's something you have to actively indicate by adding a tilde to the fuzzy term: barock~ obama gives a result.
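For anyone who wants every term of a search to be fuzzy, one option is to add the tilde client-side before sending the query. The following is only a minimal sketch under that assumption: the helper name add_fuzzy_operator is hypothetical and not part of yente, and it deliberately leaves alone anything that is not a plain word (boolean operators, quoted phrases, field prefixes, terms that already carry a ~).

```python
import re

# Boolean operators of the query_string syntax; these must not get a tilde.
OPERATORS = {"AND", "OR", "NOT"}
# Only tokens made purely of word characters are treated as plain terms.
PLAIN_TERM = re.compile(r"^\w+$")


def add_fuzzy_operator(query: str) -> str:
    """Append ~ to each plain term of a query_string query."""
    tokens = []
    for token in query.split():
        if token not in OPERATORS and PLAIN_TERM.match(token):
            token += "~"
        tokens.append(token)
    return " ".join(tokens)


print(add_fuzzy_operator("barock obama"))      # barock~ obama~
print(add_fuzzy_operator("barock~ obama"))     # barock~ obama~
print(add_fuzzy_operator("barock AND obama"))  # barock~ AND obama~
```

Making every term fuzzy comes at a query-time cost, which is the performance concern raised earlier in this thread.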

My take-away: probably a good idea to use /match in yente most of the time if you're trying to match entities. The search API is just that: a way for people to search on the web site...
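As a rough illustration of the /match route, the sketch below only builds the request body and does not call any server. The payload shape (a "queries" object keyed by an arbitrary id, each entry holding a schema and a properties map) is my understanding of the yente matching API and should be checked against the API documentation; the helper name, the query id "q1", and the dataset name "default" are assumptions made for this example.

```python
import json


def build_match_request(name: str, schema: str = "Person") -> dict:
    """Build a body for POST /match/{dataset} with a single name query."""
    return {
        "queries": {
            # "q1" is an arbitrary key chosen by the caller.
            "q1": {
                "schema": schema,
                "properties": {"name": [name]},
            }
        }
    }


dataset = "default"  # assumed dataset name
path = f"/match/{dataset}"
payload = build_match_request("Barock Obama")
print(path)
print(json.dumps(payload, indent=2))
```

The body would be sent as JSON in a POST request to that path on a running yente instance.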
