Open talentoscope opened 6 years ago
Hi @talentoscope, I'd be happy to accept contributions for the MinHash algorithm. In the meantime, I'd recommend Levenshtein distance for faster comparisons than Jaccard similarity (https://chatterbot.readthedocs.io/en/stable/comparisons.html#chatterbot.comparisons.LevenshteinDistance).
ChatterBot's performance with large sets of data is an issue that I am well aware of. I think it's worth mentioning that the major bottleneck in speed at the moment is not the speed of the comparisons, but the fact that ChatterBot currently needs to compare every statement in the database in order to search for a match to the input for every response it generates. This is a problem that I am working to address and hopefully optimize in the 1.0 release.
I'll see what I can do.
By the way, is there a design reason why every statement is compared? Just thinking a simple solution might be to use NLTK to lemmatise and remove stopwords from every statement and response as it goes into the database, storing the result alongside the full text. The same can then be done with the user's input, with a database search done at runtime.
i.e. for each keyword, search and return IDs for comparison, since "How are you today?" doesn't need to be checked against statements like "I am a unicorn." The number of comparisons drops sharply once the search space is clipped this way, since the DB searches themselves take fractions of a second.
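A toy sketch of that idea (a hand-rolled stopword list and in-memory index stand in for NLTK and the database; none of this is ChatterBot's actual API):

```python
# Sketch: index statements by their non-stopword keywords so a query is only
# compared against statements sharing at least one keyword. A real version
# would lemmatise with NLTK and store the keywords in the database.
import re
from collections import defaultdict

STOPWORDS = {"how", "are", "you", "i", "am", "a", "an", "the", "is", "was"}

def keywords(text):
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS}

statements = {
    1: "How are you doing today?",
    2: "I am a unicorn.",
    3: "Today was a good day.",
}

index = defaultdict(set)  # keyword -> ids of statements containing it
for sid, text in statements.items():
    for kw in keywords(text):
        index[kw].add(sid)

def candidates(query):
    """Only these ids need a full (expensive) text comparison."""
    ids = set()
    for kw in keywords(query):
        ids |= index.get(kw, set())
    return ids

print(candidates("How are you today?"))  # {1, 3} — the unicorn is skipped
```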
Might be taking this all wrong of course :)
Something similar to lemmatization is being added in 1.0 (using partial matches to reduce the size of the search set). The current reason that all statements are compared is the need to run a text comparison when searching. Most similar statements are not exact matches, so an efficient database query can't be constructed and iterative comparisons are required.
Have you looked into searching in SQLite by adding the spellfix1 extension? I'm not sure how (or if) it fits into this context, but it looks like it supports some distance functions and ranking of words/text.
Not that it would fix the issue for MongoDB users though..
Too slow, only usable as a toy...
@gunthercox @talentoscope, have you ever thought about a clustering algorithm to reduce the number of comparisons? For example, if we have 100,000 dialogues (sentence pairs) in the db, then for each new query the current version needs 100,000 comparisons (distance evaluations). But if we first perform an offline clustering and end up with, say, 1,000 clusters of about 100 dialogues each, then we need only approximately 1000 + 100 = 1100 comparisons: 1,000 comparisons against the cluster medians, and 100 comparisons against the members of the best cluster.
@azarezade Clustering does sound like an effective strategy. Are there any criteria you would suggest for creating each of the dialog clusters?
@gunthercox For example, someone could use a clustering method like DBSCAN, which can cluster using a precomputed distance (useful in our case with Levenshtein distance):

```python
DBSCAN(eps=eps, min_samples=min_samples, metric='precomputed').fit(X)
```

`min_samples` and `eps` are the criteria. How to choose them to get good clusters depends on the nature of the data and the similarity between sentences. Everyone can tune their own clusters by changing these parameters. I would also suggest using a normalized distance when precomputing the similarity matrix `X`, like

```
X(i, j) := levenshtein_distance(data[i], data[j]) / max(len(data[i]), len(data[j]), 1)
```

so that `eps` becomes a number like 0.3, which lies in [0, 1].
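A rough end-to-end sketch of this (the Levenshtein function is hand-rolled and the sentences and parameters are made up for illustration; a real setup would build the matrix from the statement database):

```python
# Sketch: cluster sentences with DBSCAN over a precomputed, normalized
# Levenshtein distance matrix, as described above.
import numpy as np
from sklearn.cluster import DBSCAN

def levenshtein(a, b):
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

data = ["how are you", "how are you?", "how are you now", "i am a unicorn"]
n = len(data)
X = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        X[i, j] = levenshtein(data[i], data[j]) / max(len(data[i]), len(data[j]), 1)

labels = DBSCAN(eps=0.3, min_samples=2, metric='precomputed').fit(X).labels_
# The three "how are you" variants land in one cluster; the unicorn is noise (-1).
```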
I'll have to do a bit of reading. DBSCAN sounds very promising, and it looks like sklearn has a number of clustering algorithms available.
BK-trees could be used to index statements under the Levenshtein (or any other metric) distance, so that near matches can be found without comparing against every entry.
I've done quite a bit of googling on this recently, as at only 20 MB of data it is becoming really slow, especially when using Jaccard similarity.
My proposed solution is to use something like the MinHash algorithm as a replacement. It is roughly ten times faster when comparing sets of documents, so I think it would be a great addition to, or replacement for, the Jaccard similarity code, as it approximates the same measure.
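A toy sketch of the idea (a hand-rolled MinHash over word sets, using Python's built-in `hash()` in place of a real hash family; not the datasketch library or a drop-in for ChatterBot's comparator):

```python
# Sketch: MinHash approximates Jaccard similarity without comparing the full
# token sets. The fraction of signature positions where two sets agree is an
# unbiased estimate of their true Jaccard similarity.

def minhash_signature(tokens, num_hashes=256):
    # One salted minimum over the token set per signature position.
    return [min(hash((seed, t)) for t in tokens) for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = set("minhash gives a fast estimate of jaccard similarity".split())
b = set("minhash gives a quick estimate of jaccard overlap".split())

true_j = len(a & b) / len(a | b)  # 6 shared words / 10 total = 0.6
est_j = estimated_jaccard(minhash_signature(a), minhash_signature(b))
print(round(true_j, 2), round(est_j, 2))  # the estimate lands near 0.6
```

In practice the signatures would be computed once per statement at insert time, so each lookup only compares short integer lists instead of re-tokenizing text.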