seperman / fast-autocomplete

Fast Autocomplete: When Elasticsearch suggestions are not fast and flexible enough
MIT License

How to search not only from the first word? #10

Open eiva opened 4 years ago

eiva commented 4 years ago

Hello, is there any way to get suggestions based not only on the first word?

    from fast_autocomplete import AutoComplete

    words = {
        'acura zdx': {},
        'acura abc': {},
        'bmw test': {},
        'bmw coupe': {},
    }
    synonyms = {}
    autocomplete = AutoComplete(words=words, synonyms=synonyms)
    print(autocomplete.search(word='test', size=2))

Prints an empty list, so it is not searching by the second word...

seperman commented 4 years ago

Hi @eiva. What I recommend is adding these variations of the words to the words dictionary: acura rlx and test acura rlx. We already have code that does all of this automatically, but it is not explicitly added to fast-autocomplete yet.

So for example your words dictionary is going to be:

{
  "acura rlx": [
    {
      "model": "rlx",
      "make": "acura"
    },
    "Acura RLX",
    3132
  ],
  "test acura rlx": [
    {
      "model": "rlx",
      "make": "acura"
    },
    "Acura RLX",
    3132
  ],
  "rlx": [
    {
      "model": "rlx",
      "make": "acura"
    },
    "Acura RLX",
    3132
  ],
  "acura": [
    {
      "make": "acura"
    },
    "Acura",
    130123
  ],
  ...
}

And then you can use the factory function as described in: https://github.com/wearefair/fast-autocomplete#sorting

You will see a similar example on that page.
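
To make this concrete with the exact data from the question: a small helper can generate those extra keys automatically instead of hand-writing each variation. This is only a sketch of the workaround described above, not library functionality; the expand_words helper and the 'original' context key are made up for illustration, and only the AutoComplete / search calls already shown in this thread are real API.

    from fast_autocomplete import AutoComplete

    def expand_words(words):
        # Hypothetical helper: for every multi-word key, also index each
        # word-suffix of the phrase so a search can start at any word.
        expanded = dict(words)
        for key in words:
            parts = key.split()
            for i in range(1, len(parts)):
                suffix = ' '.join(parts[i:])
                # remember the original key so results can be mapped back
                expanded.setdefault(suffix, {'original': key})
        return expanded

    words = {'acura zdx': {}, 'acura abc': {}, 'bmw test': {}, 'bmw coupe': {}}
    autocomplete = AutoComplete(words=expand_words(words), synonyms={})
    print(autocomplete.search(word='test', size=2))  # the added 'test' key now produces a match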

jayaddison commented 4 years ago

A small +1 on this issue, to agree that in-built support for this would be a useful extra (even if optional / disabled-by-default).

jayaddison commented 4 years ago

(afterthought: if it's challenging to implement within the library for algorithmic and/or performance reasons, perhaps including a usage example like the one above in the repo itself would be a reasonable alternative)

seperman commented 4 years ago

Hi @jayaddison I have not had a chance to implement it. But basically the solution is called GADDAG: https://en.wikipedia.org/wiki/GADDAG I agree with you that it is easier to leave some examples in the readme for now! I will try to add some examples soon. If you have a chance to open a PR and add some examples, that would be great too, until GADDAG is implemented. Thanks!

tomerav0 commented 4 years ago

I "sovled" it by kind of brute force. I took each phrase I want to use and created all the combination for that phrase. I used another array to map each combination to the original value. For each combination I attached UUID that later I parse at the results.

original_word = row["name"]
          #words[original_word] = [{}, original_word, row["count"]]
          parts = original_word.split()

          if len(parts) > 1 and len(parts) <= self._combo_words_limit:   
            for words_combo in itertools.permutations(parts, len(parts)):
                pharse, id = self.buildPharse(words_combo)
                pharses_map.append({id : original_word})
                words.append({pharse : [{}, original_word, row["count"]]})
          else:
                pharse, id = self.buildPharse(original_word)
                pharses_map.append({id : original_word})
                words.append({pharse : [{}, original_word, row["count"]]})
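
Here is a rough, self-contained sketch of that permutation idea for anyone who wants to run it end to end. It is not the code above verbatim: the sample rows, the COMBO_WORDS_LIMIT constant, and the uuid-based mapping are assumptions filled in for illustration.

    import itertools
    import uuid

    from fast_autocomplete import AutoComplete

    COMBO_WORDS_LIMIT = 4  # cap permutations so long phrases don't explode

    rows = [
        {'name': 'acura zdx', 'count': 10},
        {'name': 'bmw test coupe', 'count': 5},
    ]

    words = {}
    phrases_map = {}  # generated id -> original phrase

    for row in rows:
        original = row['name']
        parts = original.split()
        if 1 < len(parts) <= COMBO_WORDS_LIMIT:
            combos = itertools.permutations(parts)
        else:
            combos = [parts]
        for combo in combos:
            phrase = ' '.join(combo)
            phrase_id = str(uuid.uuid4())
            phrases_map[phrase_id] = original
            words[phrase] = {'id': phrase_id, 'original': original, 'count': row['count']}

    autocomplete = AutoComplete(words=words, synonyms={})
    results = autocomplete.search(word='test', size=3)
    print(results)
    # matched strings should correspond to keys of `words`, so map them back
    for result in results:
        for matched in result:
            print(matched, '->', words.get(matched, {}).get('original'))
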
seperman commented 4 years ago

@tomerav0 Yeah that works! In fact I have used something similar to your solution before too.

Ronserruya commented 3 years ago

I "sovled" it by kind of brute force. I took each phrase I want to use and created all the combination for that phrase. I used another array to map each combination to the original value. For each combination I attached UUID that later I parse at the results.

original_word = row["name"]
          #words[original_word] = [{}, original_word, row["count"]]
          parts = original_word.split()

          if len(parts) > 1 and len(parts) <= self._combo_words_limit:   
            for words_combo in itertools.permutations(parts, len(parts)):
                pharse, id = self.buildPharse(words_combo)
                pharses_map.append({id : original_word})
                words.append({pharse : [{}, original_word, row["count"]]})
          else:
                pharse, id = self.buildPharse(original_word)
                pharses_map.append({id : original_word})
                words.append({pharse : [{}, original_word, row["count"]]})

Would you mind posting a more complete working example? Maybe a gist?

tomerav0 commented 3 years ago

I made it work but finally gave up on this because it's extremely RAM-consuming and not feasible in any sense of cost. I'm talking about 60GB of RAM needed just to load the model. This lib is good for small datasets, but if you go big it's a mess.

seperman commented 3 years ago

Haha, yeah it is not designed for big data sets at all. I will need to rewrite it maybe in Go or something other than Python to be memory efficient.


angrygoats commented 3 years ago

Related to @tomerav0's memory issue:

@seperman I think there is a solution here you could do with Cython. Cython supports extension types. These extension types ("cdef classes") operate using C structs, which should have far less overhead.

I glanced through the code, and the AutoComplete class has some very memory-intensive operations in it. These could be cleaned up with Cython as well.

@tomerav0 if the data isn't proprietary (or if you could scrub it) it would be useful to have in order to profile the code before making any changes.

@seperman once you have some test data you can profile using:

python -m cProfile -o memory_profile.profile <test_script>.py <any_args_here>

and then you can inspect it via:

python -m pstats memory_profile.profile

and then check calls using

sort time
stats 10

to show the top 10 biggest call sites. You could also look into memory_profiler. I haven't had a need for it, but it seems popular.
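
For completeness, the same inspection can be done programmatically with the standard-library pstats module instead of the interactive prompt (the file name below is the one from the cProfile command above; note that cProfile reports call counts and time, so memory_profiler is the tool for the memory side):

    import pstats

    # load the profile written by: python -m cProfile -o memory_profile.profile <test_script>.py
    stats = pstats.Stats('memory_profile.profile')

    # equivalent of typing `sort time` and then `stats 10` at the pstats prompt
    stats.sort_stats('time').print_stats(10)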