seperman / fast-autocomplete

Fast Autocomplete: When Elasticsearch suggestions are not fast and flexible enough
MIT License
262 stars 40 forks

Adding an underscore character to valid characters ignores underscore #37

Open lazzarello opened 2 years ago

lazzarello commented 2 years ago

Describe the bug

Adding a special character (an underscore) to valid_chars_for_string does not exclude results that do not contain that character; they drop out only after two misses.

To Reproduce

Initialized with

import string
from fast_autocomplete import AutoComplete

_valid_chars = '_' + string.ascii_lowercase
words = {'i_love_code': {'count': 5}, 'island': {'count': 2}, 'ironman': {'count': 2}, 'i_love_coding': {'count': 2}, 'i_love_machine_learning': {'count': 3}}
autocomplete = AutoComplete(words=words, synonyms={}, valid_chars_for_string=_valid_chars)
search_string = 'iron_m'  # one of the simulated inputs listed below
autocomplete.search(word=search_string, max_cost=1, size=10)

Formatted output with simulated input:

Valid Characters: _abcdefghijklmnopqrstuvwxyz
Search Input 'i' : i_love_code, i_love_machine_learning, island, ironman, i_love_coding
Search Input 'ir' : ironman
Search Input 'iro' : ironman
Search Input 'iron' : ironman
Search Input 'iron_' : ironman
Search Input 'iron_m' : ironman
Search Input 'iron_ma' : 
Search Input 'iron_mai' : 
Search Input 'iron_maid' : 
Search Input 'iron_maide' : 
Search Input 'iron_maiden' : 

Search Input 'i' : i_love_code, i_love_machine_learning, island, ironman, i_love_coding, iron_maiden
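
Formatted output like the listing above can be produced by searching on every prefix of a longer input. The following is a minimal sketch of such a loop, not the reporter's actual script: the helper name progressive_search_report and the stand-in fake_search are hypothetical, and the helper assumes the search callable returns a list of word lists.

```python
from typing import Callable, Iterable, List


def progressive_search_report(
    search: Callable[[str], Iterable[Iterable[str]]],
    text: str,
) -> List[str]:
    """Run `search` on every prefix of `text` and format one line per prefix."""
    lines = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        # Flatten the nested result lists into one comma-separated string.
        matches = [word for group in search(prefix) for word in group]
        lines.append(f"Search Input '{prefix}' : {', '.join(matches)}")
    return lines


# Hypothetical stand-in for autocomplete.search: plain prefix matching only,
# with no fuzzy matching, so it does not reproduce the reported behavior.
vocabulary = ["i_love_code", "island", "ironman"]


def fake_search(prefix: str) -> List[List[str]]:
    return [[word] for word in vocabulary if word.startswith(prefix)]


print("\n".join(progressive_search_report(fake_search, "iron_")))
```

To run it against the library, one could pass something like `lambda w: autocomplete.search(word=w, max_cost=1, size=10)` in place of fake_search, assuming the autocomplete object from the reproduction steps.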

Expected behavior

Just as the input 'ir' excludes 'i_love_code', I would expect 'iron_' to exclude 'ironman', and so forth. From this output, it looks like 'ironman' is only excluded once the input reaches 'iron_ma'.

Additional context

This seems to have something to do with the max_cost parameter. If I raise it above 2, it returns even more unexpected matches.

seperman commented 1 year ago

Hi @lazzarello, the fuzzy matching logic still sees enough similarity between the input and 'ironman' to include it in the results. You are right that the underscore character is treated differently: internally, we convert all spaces into underscores. Maybe we should switch from the underscore to a rarely used Unicode character for that purpose.
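
To illustrate the collision described here, the following is a minimal sketch and not the library's internal code. The normalize helper and both placeholder constants are hypothetical, and the private-use code point U+E000 is only one possible choice of a rarely used character.

```python
SPACE_PLACEHOLDER_CURRENT = "_"        # the conversion described in this comment
SPACE_PLACEHOLDER_PROPOSED = "\ue000"  # hypothetical: a private-use code point


def normalize(text: str, placeholder: str) -> str:
    """Replace spaces with `placeholder`, mimicking the described conversion."""
    return text.replace(" ", placeholder)


# With "_" as the placeholder, a typed space and a literal underscore collide.
assert normalize("iron man", SPACE_PLACEHOLDER_CURRENT) == normalize(
    "iron_man", SPACE_PLACEHOLDER_CURRENT
)

# With a rarely used character, the two inputs stay distinct.
assert normalize("iron man", SPACE_PLACEHOLDER_PROPOSED) != normalize(
    "iron_man", SPACE_PLACEHOLDER_PROPOSED
)
```

Under this assumption, an underscore added through valid_chars_for_string would no longer be indistinguishable from a converted space.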