search function of html documentations has too many false positives/negatives

timotheecour commented 6 years ago

comandlines retrieves: colors: colMaroon

seems like there's too much magic going on; at the same time we have the opposite type of errors: https://github.com/nim-lang/Nim/issues/9198

links

EDIT

see past relevant PR's : https://github.com/nim-lang/Nim/pull/8260/files#diff-80c921d6019e3f5509432f61bbf3380fL91

rayman22201 commented 6 years ago

I'm not sure what you are asking for here? Did the result set contain the items you were looking for? It seems like you are asking for more magic, not less. This is how fuzzy matching works. Those results are because there are enough substring matches to give them high enough scores. I could tweak the scoring mechanism to punish some results (which I have already done in some cases) but there is no general solution to this. The search has no way to determine a "relevant entry" for any arbitrary result, at least not without doing ML type stuff or something far more advanced / magical than what we are already doing.

IMO, this is far better than the simple exact string search that the original search box algorithm was doing. Which, especially for beginners, would not show any results for commonly searched for terms.

rayman22201 commented 6 years ago

I have thought about adding a feature to allow you to disable the fuzzy match and use exact string match instead, for the usecase where you know exactly what you want. This could be exposed either through a checkbox under the search box, or by accepting a special command in the search text such as exact: "my_exact_search_term"

Araq commented 6 years ago

@rayman22201 Just use the fuzzy match and an exact match?

rayman22201 commented 6 years ago

Sure. That would work too. It's a matter of taste I suppose. But, @timotheecour seems to want less results to show up in the search, not more lol

timotheecour commented 6 years ago

well there's also a limit of 19 results so fuzzy and exact may be nonideal... plus, we should let the user choose what he wants => see https://github.com/nim-lang/Nim/issues/9431

EDIT still though, comandlines is very far from the search results I'm seeing (eg colors: colMaroon) so indeed some parameter tweeking may be needed (regardless of above point)

are results sorted by edit distance (or fuzzy search score akin to a similarity) or someting else?

rayman22201 commented 6 years ago

* are results sorted by edit distance (or fuzzy search score akin to a similarity) or someting else?
It's not "edit-distance" in the levenshtein sense of the word. The algorithm finds the approximate set of "longest matching substrings", and then applies an arbitrary scoring mechanism to those substrings based on things such as distance to the start of the string or beginning of a word boundary.

It's based on the sublime text fuzzy match algorithm as described here: https://blog.forrestthewoods.com/reverse-engineering-sublime-text-s-fuzzy-match-4cffeed33fdb

Araq commented 6 years ago

The algorithm finds the approximate set of "longest matching substrings", and then applies an arbitrary scoring mechanism to those substrings based on things such as distance to the start of the string or beginning of a word boundary.

But the scoring seems worse than what my initial algorithm did which had some hand-crafted priorities, though I don't remember the details.

rayman22201 commented 6 years ago

@Araq This was your algorithm at the time of my PR: https://github.com/nim-lang/Nim/pull/8260/files#diff-80c921d6019e3f5509432f61bbf3380fL91

I don't see any hand-crafted priorities. It was a pretty simple regex Unless I'm missing something?

And in my experience, and based on forum and IRC posts, people were not happy with the results that algorithm was showing either, particularly new users.

But the scoring seems worse than what my initial algorithm

Does it though? For many of my searches it shows better scoring. I wouldn't have submitted the PR if I didn't think it was better.

I also don't think irrelevant results are a problem as long as it narrows it down enough for you to find the result you want, especially if you don't know exactly what you want (the case for many new users). @timotheecour's point about the artificial limit on search results is a good point though, and should be fixed. That would alleviate some of these issues.

I would argue that the new fuzzy algorithm is still better even if you don't like the current scoring, because it's far easier to adjust the scoring mechanism because it's explicit, as opposed to a regex that requires you to pray to the perl gods and tweak your regex grammar endlessly to get reasonable results. regex is not good for general case search imo.

timotheecour commented 6 years ago

the naive algorithm I was thinking of was (in pseudo-code)

for candidate in index:
  let dist = editdistance(query, candidate)
  if dist < maxDist(query.len): # eg: maxDist = 0 if query.len <=2 (exact match required); then an increasing function of query.len (to be tweaked a bit)
    candidates.add (dist:dist, candidate:candidate)
showResults(candidates sorted by dist)

Araq commented 6 years ago

Well the code used to do

 if c.containsWord(key):
      matches.add((db[i], -(30_000 - c.len)))
    elif c.contains(key):
      matches.add((db[i], c.len))

and that affected the sort order tremendously.

timotheecour commented 3 years ago

I keep hitting this, we simply need to use the same edit distance algorithm as used in https://github.com/nim-lang/Nim/pull/16067. eg:

shrkink doesn't find shrink (from deques.shrink):

timotheecour commented 3 years ago

and another example showing a different aspect:

I don't think we should match the arguments in docsearch, at least by default. instead it should only match the symbol name. But with a github-like search interface, we could also match by arguments.

nim-lang / Nim

search function of html documentations has too many false positives/negatives #9406

links