Closed skrafft closed 2 years ago
Hi. I'm just working on an integration test harness for the API, so this comes in useful. What I think is happening: "fuzziness": "AUTO" brings in a Levenshtein tolerance of only 1 for a string the length of "Barrrack Obama", and adding two extra "r"s exceeds that threshold. So I guess the best option would be to make fuzziness default to something other than AUTO, e.g. 2. I don't want to do this on the public API that we operate, since it carries a massive performance penalty, but we could introduce an environment setting?
Hi,
I don't think this is related to the AUTO value. I've tested multiple combinations directly on Elasticsearch with fuzziness set to AUTO, 1, or 2, and it does not change the results. As a matter of fact, the query https://api.opensanctions.org/search/default?q=Barrack%20Obama returns 1 result and https://api.opensanctions.org/search/default?q=Barrock%20Obama%fuzzy=true (changing one "a" to an "o") does not return anything.
I think there's something wrong with the mapping, but I could not figure out what, so I ended up rewriting the query.
Just to be clear: the guy is called Barack Obama (https://en.wikipedia.org/wiki/Barack_Obama). Barrack Obama is fuzziness=1, Barrock Obama is fuzziness=2. Am I totally confused here?
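To make the edit distances in this exchange concrete, here is a minimal sketch that computes plain Levenshtein distance between the canonical spelling and the variants mentioned in the thread. It is an illustration only: the function name is made up for this example, and Elasticsearch applies its own fuzzy matching per term rather than on the whole string.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance: insertions, deletions, substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute (or keep) the character
                )
            )
        previous = current
    return previous[-1]


if __name__ == "__main__":
    canonical = "Barack Obama"
    for variant in ("Barrack Obama", "Barrock Obama", "Barrrack Obama", "Barock Obama"):
        print(variant, levenshtein(canonical, variant))
```

Measured against "Barack Obama", "Barrack Obama" and "Barock Obama" are one edit away, while "Barrock Obama" and "Barrrack Obama" are two edits away.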
That's true, but he also has aliases like Barrack Obama in the data, so Barrack Obama is a perfect match according to Elasticsearch (which makes the fuzziness 1 when you replace an "a" with an "o"). Anyway, searching https://api.opensanctions.org/search/default?q=Barock%20Obama does not return any result either.
So for it to return a result for Barock%20Obama, is there something that can be configured or added?
OK, so I've solved this question, but the answer is less than amazing. Basically: Elasticsearch never does fuzzy search on all the terms in a query_string query - that's something you have to actively indicate by adding a tilde to the fuzzy term: barock~ obama gives a result.
My take-away: it's probably a good idea to use /match in yente most of the time if you're trying to match entities. The search API is just that: a way for people to search on the web site...
cf. https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html
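As a rough illustration of the tilde behaviour described above, a client could append the ~ operator to each term before sending the query. This is a naive sketch, not part of yente: the function name is hypothetical, and it deliberately ignores quoted phrases, field prefixes, and other query_string syntax beyond the AND/OR/NOT keywords.

```python
BOOLEAN_KEYWORDS = {"AND", "OR", "NOT"}


def add_fuzzy_operator(query: str) -> str:
    """Append '~' to every whitespace-separated term that lacks one."""
    terms = []
    for term in query.split():
        if term in BOOLEAN_KEYWORDS or "~" in term:
            # Leave boolean keywords and already-fuzzy terms untouched
            terms.append(term)
        else:
            terms.append(f"{term}~")
    return " ".join(terms)


if __name__ == "__main__":
    print(add_fuzzy_operator("barock obama"))
```

With this, "barock obama" becomes "barock~ obama~", which marks both terms as fuzzy in a query_string query.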
Hi,
The fuzzy matching parameter has no effect:
I tried to return results for https://api.opensanctions.org/search/default?q=Barrrack%20Obama&fuzzy=true and it should return a result since there's only 1 letter changing.
I checked the code at https://github.com/opensanctions/yente/blob/main/yente/search/queries.py#L85 and the Elasticsearch documentation; it should work but, as a matter of fact, it does not.
Searching on Google returns results that point to a wrong mapping, but I could not find any problem in the ES mapping either. I ended up updating the text_query function to this:
The reason for this line, fuzzy and query.find('~') == -1, is to not mix fuzziness and the ~ operator. If the query contains ~, the fuzzy parameter is just ignored.
@pudo any comment on this?
I can open a pull request if needed
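The updated function itself is not reproduced in this thread, so the following is only a sketch of how the guard condition described above could be used, not the code from the post. The function name, the "names" field, and the choice of a match query for the fuzzy branch are all assumptions made for illustration.

```python
from typing import Any, Dict


def build_text_query(query: str, fuzzy: bool = False) -> Dict[str, Any]:
    """Return an Elasticsearch query body, applying fuzziness only when safe."""
    if fuzzy and query.find("~") == -1:
        # No explicit ~ operator in the query: apply fuzziness to every term.
        # "names" is a hypothetical field name used only for this sketch.
        return {
            "match": {
                "names": {
                    "query": query,
                    "fuzziness": "AUTO",
                    "operator": "and",
                }
            }
        }
    # The query already uses ~ (or fuzzy is off): pass it through unchanged
    # so the user's own operators are respected.
    return {"query_string": {"query": query, "default_operator": "and"}}


if __name__ == "__main__":
    print(build_text_query("Barrrack Obama", fuzzy=True))
    print(build_text_query("barock~ obama", fuzzy=True))
```

The point of the guard is that a query which already carries its own ~ operators is left alone, so the two mechanisms never apply at the same time.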