Open talentoscope opened 6 years ago
Hi @talentoscope, I'd be happy to accept contributions for the MinHash algorithm. In the meantime, I'd recommend Levenshtein distance for faster comparisons than Jaccard similarity (https://chatterbot.readthedocs.io/en/stable/comparisons.html#chatterbot.comparisons.LevenshteinDistance).
ChatterBot's performance with large sets of data is an issue that I am well aware of. I think it's worth mentioning that the major bottleneck in speed at the moment is not the speed of the comparisons, but the fact that ChatterBot currently needs to compare every statement in the database in order to search for a match to the input for every response it generates. This is a problem that I am working to address and hopefully optimize in the 1.0 release.
I'll see what I can do.
By the way, is there a design reason why every statement is compared? Just thinking a simple solution might be to use NLTK to lemmatise and remove stopwords from every statement and response as it goes into the database, storing the result alongside the full text. The same can then be done with the user's input, with a database search done at runtime.
i.e. for each keyword, search and return IDs for comparison, since "How are you today?" doesn't need to be checked against statements like "I am a unicorn." The number of comparisons drops sharply once the search space is clipped this way, since the DB searches themselves take fractions of a second.
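A toy sketch of that idea (a hand-rolled stopword list and in-memory index stand in for NLTK and the database; none of this is ChatterBot's actual API):

```python
# Sketch: index statements by their non-stopword keywords so a query is only
# compared against statements sharing at least one keyword. A real version
# would lemmatise with NLTK and store the keywords in the database.
import re
from collections import defaultdict

STOPWORDS = {"how", "are", "you", "i", "am", "a", "an", "the", "is", "was"}

def keywords(text):
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS}

statements = {
    1: "How are you doing today?",
    2: "I am a unicorn.",
    3: "Today was a good day.",
}

index = defaultdict(set)  # keyword -> ids of statements containing it
for sid, text in statements.items():
    for kw in keywords(text):
        index[kw].add(sid)

def candidates(query):
    """Only these ids need a full (expensive) text comparison."""
    ids = set()
    for kw in keywords(query):
        ids |= index.get(kw, set())
    return ids

print(candidates("How are you today?"))  # {1, 3} — the unicorn is skipped
```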
Might be taking this all wrong of course :)
Something similar to lemmatization is being added in 1.0 (using partial matches to reduce the size of the search set). The current reason that all statements are compared is the need to run a text comparison when searching. Most similar statements are not exact matches, so an efficient database query can't be constructed and iterative comparisons are required.
Have you looked into searching in SQLite by adding the spellfix1 extension? I'm not sure how (or if) it fits into this context, but it looks like it supports some distance functions and ranking of words/text.
Not that it would fix the issue for MongoDB users though..
Too slow, only usable as a toy...
@gunthercox @talentoscope, have you ever thought about a clustering algorithm to reduce the number of comparisons? For example, if we have 100,000 dialogues (sentence pairs) in the db, then for each new query the current version needs 100,000 comparisons (distance evaluations). But if we first perform an offline clustering and end up with, say, 1,000 clusters of about 100 dialogues each, then we need only approximately 1000 + 100 = 1100 comparisons: 1,000 comparisons against the cluster medians, and 100 comparisons against the members of the best cluster.
@azarezade Clustering does sound like an effective strategy. Are there any criteria you would suggest for creating each of the dialog clusters?
@gunthercox For example, someone could use a clustering method like DBSCAN, which can cluster using a precomputed distance (useful in our case with Levenshtein distance):

```python
DBSCAN(eps=eps, min_samples=min_samples, metric='precomputed').fit(X)
```

`min_samples` and `eps` are the criteria. How to choose them to get good clusters depends on the nature of the data and the similarity between sentences. Everyone can tune their own clusters by changing these parameters. I would also suggest using a normalized distance when precomputing the similarity matrix `X`, like

```
X(i, j) := levenshtein_distance(data[i], data[j]) / max(len(data[i]), len(data[j]), 1)
```

so that `eps` becomes a number like 0.3, which lies in [0, 1].
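A rough end-to-end sketch of this (the Levenshtein function is hand-rolled and the sentences and parameters are made up for illustration; a real setup would build the matrix from the statement database):

```python
# Sketch: cluster sentences with DBSCAN over a precomputed, normalized
# Levenshtein distance matrix, as described above.
import numpy as np
from sklearn.cluster import DBSCAN

def levenshtein(a, b):
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

data = ["how are you", "how are you?", "how are you now", "i am a unicorn"]
n = len(data)
X = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        X[i, j] = levenshtein(data[i], data[j]) / max(len(data[i]), len(data[j]), 1)

labels = DBSCAN(eps=0.3, min_samples=2, metric='precomputed').fit(X).labels_
# The three "how are you" variants land in one cluster; the unicorn is noise (-1).
```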
I'll have to do a bit of reading. DBSCAN sounds very promising, and it looks like sklearn has a number of clustering algorithms available.
BK-trees could be used to index statements under the Levenshtein (or any other metric) distance, so that near matches can be found without comparing against every entry.
I've done quite a bit of googling on this recently, as at only 20 MB of data it is becoming really slow, especially when using Jaccard similarity.
My proposed solution is to use something like the MinHash algorithm as a replacement. It is roughly ten times faster when comparing sets of documents, so I think it would be a great addition to, or replacement for, the Jaccard similarity code, as it approximates the same measure.
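A toy sketch of the idea (a hand-rolled MinHash over word sets, using Python's built-in `hash()` in place of a real hash family; not the datasketch library or a drop-in for ChatterBot's comparator):

```python
# Sketch: MinHash approximates Jaccard similarity without comparing the full
# token sets. The fraction of signature positions where two sets agree is an
# unbiased estimate of their true Jaccard similarity.

def minhash_signature(tokens, num_hashes=256):
    # One salted minimum over the token set per signature position.
    return [min(hash((seed, t)) for t in tokens) for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = set("minhash gives a fast estimate of jaccard similarity".split())
b = set("minhash gives a quick estimate of jaccard overlap".split())

true_j = len(a & b) / len(a | b)  # 6 shared words / 10 total = 0.6
est_j = estimated_jaccard(minhash_signature(a), minhash_signature(b))
print(round(true_j, 2), round(est_j, 2))  # the estimate lands near 0.6
```

In practice the signatures would be computed once per statement at insert time, so each lookup only compares short integer lists instead of re-tokenizing text.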