Open eiva opened 4 years ago
Hi @eiva
What I recommend is to add these variations of the words to the words dictionary: acura rlx and test acura rlx. We already have code that does all this automatically, but it is not explicitly added to fast-autocomplete yet.
So for example your words dictionary is going to be:
{
    "acura rlx": [
        {
            "model": "rlx",
            "make": "acura"
        },
        "Acura RLX",
        3132
    ],
    "test acura rlx": [
        {
            "model": "rlx",
            "make": "acura"
        },
        "Acura RLX",
        3132
    ],
    "rlx": [
        {
            "model": "rlx",
            "make": "acura"
        },
        "Acura RLX",
        3132
    ],
    "acura": [
        {
            "make": "acura"
        },
        "Acura",
        130123
    ],
    ...
}
And then you can use the factory function as described in: https://github.com/wearefair/fast-autocomplete#sorting
You will see a similar example on that page.
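Generating these variations by hand gets tedious quickly. A minimal sketch of automating it (hypothetical helper name; not part of fast-autocomplete's API) using itertools.permutations:

```python
import itertools


def expand_variations(phrase, context, display, count, max_parts=4):
    """Return a fast-autocomplete-style words dict with one entry per
    word-order permutation of `phrase` (capped to avoid factorial blow-up)."""
    parts = phrase.split()
    variations = {}
    if 1 < len(parts) <= max_parts:
        for combo in itertools.permutations(parts):
            variations[" ".join(combo)] = [context, display, count]
    else:
        variations[phrase] = [context, display, count]
    return variations


words = expand_variations(
    "acura rlx", {"model": "rlx", "make": "acura"}, "Acura RLX", 3132
)
# keys: "acura rlx" and "rlx acura"
```

Note this only covers reorderings; entries like "test acura rlx" above add extra words and would still need to be listed separately.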
A small +1 on this issue, to agree that built-in support for this would be a useful extra (even if optional / disabled by default).
(afterthought: if it's challenging to implement within the library for algorithmic and/or performance reasons, perhaps including a usage example like the one above in the repo itself would be a reasonable alternative)
Hi @jayaddison I have not had a chance to implement it. But basically the solution is called a GADDAG: https://en.wikipedia.org/wiki/GADDAG I agree with you that it is easier to leave some examples in the readme for now! I will try to leave some examples soon. If you have a chance to open a PR and add some examples, that would be great too, until the GADDAG is implemented. Thanks!
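For context, the core GADDAG idea is to index each entry once per starting position, so a lookup can begin from any interior element. A GADDAG normally works at the letter level; this sketch (my own adaptation, not from the library) applies the same trick at the word level: for each split point, store the reversed prefix, a marker, then the suffix:

```python
def gaddag_keys(phrase):
    """For each split point i, emit reversed(prefix) + '+' + suffix.
    A search starting at any interior word walks the reversed prefix
    first, crosses the '+' marker, then continues into the suffix."""
    words = phrase.split()
    keys = []
    for i in range(1, len(words) + 1):
        prefix, suffix = words[:i], words[i:]
        keys.append(" ".join(reversed(prefix)) + " + " + " ".join(suffix))
    return keys


gaddag_keys("test acura rlx")
# -> ['test + acura rlx', 'acura test + rlx', 'rlx acura test + ']
```

This stores n keys for an n-word phrase instead of the n! keys a full permutation approach needs, which is why GADDAG scales better.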
I "solved" it by kind of brute force. I took each phrase I want to use and created all the combinations for that phrase. I used another array to map each combination to the original value. To each combination I attached a UUID that I later parse out of the results.
original_word = row["name"]
# words[original_word] = [{}, original_word, row["count"]]
parts = original_word.split()
if len(parts) > 1 and len(parts) <= self._combo_words_limit:
    for words_combo in itertools.permutations(parts, len(parts)):
        pharse, id = self.buildPharse(words_combo)
        pharses_map.append({id: original_word})
        words.append({pharse: [{}, original_word, row["count"]]})
else:
    pharse, id = self.buildPharse(original_word)
    pharses_map.append({id: original_word})
    words.append({pharse: [{}, original_word, row["count"]]})
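A self-contained sketch of the same brute-force approach (the function and field names here are hypothetical stand-ins for the snippet's `buildPharse`, `pharses_map`, etc., which are not shown in full):

```python
import itertools
import uuid


def build_words(rows, combo_words_limit=4):
    """Build a fast-autocomplete-style words dict plus a phrase-id -> original
    mapping. Each permutation of a multi-word name gets a UUID so matched
    results can be mapped back to the original value."""
    words, phrase_map = {}, {}

    def add(parts, row):
        phrase = " ".join(parts)
        phrase_id = uuid.uuid4().hex
        phrase_map[phrase_id] = row["name"]
        words[phrase] = [{"id": phrase_id}, row["name"], row["count"]]

    for row in rows:
        parts = row["name"].split()
        if 1 < len(parts) <= combo_words_limit:
            # index every word-order permutation of the phrase
            for combo in itertools.permutations(parts):
                add(combo, row)
        else:
            add(parts, row)
    return words, phrase_map


words, phrase_map = build_words([{"name": "acura rlx", "count": 3132}])
# words has keys "acura rlx" and "rlx acura"
```

Note the `combo_words_limit` cap: the number of permutations grows factorially, which is part of why this approach becomes memory-heavy on large datasets.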
@tomerav0 Yeah that works! In fact I have used something similar to your solution before too.
Would you mind posting a more complete working example? maybe a gist?
I made it work but finally gave up on this because it is extremely RAM-consuming, not feasible in any sense of cost. I'm talking about 60GB of RAM needed just to load the model. This lib is good for small datasets, but if you go big it's a mess.
Haha, yeah it is not designed for big data sets at all. I will need to rewrite it maybe in Go or something other than Python to be memory efficient.
Sep Dehpour
On Oct 28, 2020, at 10:39 AM, tomerav notifications@github.com wrote: […]
Related to @tomerav0's memory issue:
@seperman I think there is a solution here you could do with Cython. Cython supports extension types. These extension types ("cdef classes") operate using C structs, which should have far less overhead.
I glanced through the code and the Autocomplete class has some very memory-intense operations in it. These could be cleaned up with Cython as well.
@tomerav0 if the data isn't proprietary (or if you could scrub it) it would be useful to have in order to profile the code before making any changes.
@seperman once you have some test data you can profile using:
python -m cProfile -o memory_profile.profile <test_script>.py <any_args_here>
and then you can inspect it via:
python -m pstats memory_profile.profile
and then check calls using:
sort time
stats 10
to show the top 10 biggest call sites. You could also look into memory_profiler. I haven't had a need for it but it seems popular.
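The same inspection can also be scripted with the pstats API instead of the interactive prompt (the profiled statement here is just a stand-in workload):

```python
import cProfile
import pstats

# profile a statement and write the results to a file,
# equivalent to: python -m cProfile -o memory_profile.profile <script>
cProfile.run("sum(i * i for i in range(100000))", "memory_profile.profile")

# equivalent to the interactive `sort time` + `stats 10` commands
stats = pstats.Stats("memory_profile.profile")
stats.sort_stats("time").print_stats(10)  # top 10 call sites by internal time
```

One caveat: cProfile measures call counts and CPU time, not memory, so memory_profiler (or tracemalloc from the standard library) is the better fit for the RAM issue discussed above.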
Hello, is there any way to get autosuggestions based on something other than the first word only?
It prints an empty array, so it does not search by the second word...