Open timotheecour opened 6 years ago
I'm not sure what you are asking for here? Did the result set contain the items you were looking for? It seems like you are asking for more magic, not less. This is how fuzzy matching works. Those results are because there are enough substring matches to give them high enough scores. I could tweak the scoring mechanism to punish some results (which I have already done in some cases) but there is no general solution to this. The search has no way to determine a "relevant entry" for any arbitrary result, at least not without doing ML type stuff or something far more advanced / magical than what we are already doing.
IMO, this is far better than the simple exact string search that the original search box algorithm was doing. Which, especially for beginners, would not show any results for commonly searched for terms.
I have thought about adding a feature to allow you to disable the fuzzy match and use exact string match instead, for the usecase where you know exactly what you want. This could be exposed either through a checkbox under the search box, or by accepting a special command in the search text such as
exact: "my_exact_search_term"
@rayman22201 Just use the fuzzy match and an exact match?
Sure. That would work too. It's a matter of taste I suppose. But, @timotheecour seems to want less results to show up in the search, not more lol
well there's also a limit of 19 results so fuzzy and exact may be nonideal... plus, we should let the user choose what he wants => see https://github.com/nim-lang/Nim/issues/9431
EDIT still though, comandlines
is very far from the search results I'm seeing (eg colors: colMaroon
) so indeed some parameter tweeking may be needed (regardless of above point)
* are results sorted by edit distance (or fuzzy search score akin to a similarity) or someting else?
It's not "edit-distance" in the levenshtein sense of the word. The algorithm finds the approximate set of "longest matching substrings", and then applies an arbitrary scoring mechanism to those substrings based on things such as distance to the start of the string or beginning of a word boundary.
It's based on the sublime text fuzzy match algorithm as described here: https://blog.forrestthewoods.com/reverse-engineering-sublime-text-s-fuzzy-match-4cffeed33fdb
The algorithm finds the approximate set of "longest matching substrings", and then applies an arbitrary scoring mechanism to those substrings based on things such as distance to the start of the string or beginning of a word boundary.
But the scoring seems worse than what my initial algorithm did which had some hand-crafted priorities, though I don't remember the details.
@Araq This was your algorithm at the time of my PR: https://github.com/nim-lang/Nim/pull/8260/files#diff-80c921d6019e3f5509432f61bbf3380fL91
I don't see any hand-crafted priorities. It was a pretty simple regex Unless I'm missing something?
And in my experience, and based on forum and IRC posts, people were not happy with the results that algorithm was showing either, particularly new users.
But the scoring seems worse than what my initial algorithm
Does it though? For many of my searches it shows better scoring. I wouldn't have submitted the PR if I didn't think it was better.
I also don't think irrelevant results are a problem as long as it narrows it down enough for you to find the result you want, especially if you don't know exactly what you want (the case for many new users). @timotheecour's point about the artificial limit on search results is a good point though, and should be fixed. That would alleviate some of these issues.
I would argue that the new fuzzy algorithm is still better even if you don't like the current scoring, because it's far easier to adjust the scoring mechanism because it's explicit, as opposed to a regex that requires you to pray to the perl gods and tweak your regex grammar endlessly to get reasonable results. regex is not good for general case search imo.
the naive algorithm I was thinking of was (in pseudo-code)
for candidate in index:
let dist = editdistance(query, candidate)
if dist < maxDist(query.len): # eg: maxDist = 0 if query.len <=2 (exact match required); then an increasing function of query.len (to be tweaked a bit)
candidates.add (dist:dist, candidate:candidate)
showResults(candidates sorted by dist)
Well the code used to do
if c.containsWord(key):
matches.add((db[i], -(30_000 - c.len)))
elif c.contains(key):
matches.add((db[i], c.len))
and that affected the sort order tremendously.
I keep hitting this, we simply need to use the same edit distance algorithm as used in https://github.com/nim-lang/Nim/pull/16067. eg:
shrkink
doesn't find shrink
(from deques.shrink):
and another example showing a different aspect:
I don't think we should match the arguments in docsearch, at least by default. instead it should only match the symbol name. But with a github-like search interface, we could also match by arguments.
comandlines retrieves:
colors: colMaroon
seems like there's too much magic going on; at the same time we have the opposite type of errors: https://github.com/nim-lang/Nim/issues/9198
links
EDIT