Revisit contiguous matches

TylerLeonhardt commented 3 years ago

I've added the "quotes" support in #131292 and am planning on releasing it in stable as "experimental" because I'm not fully convinced it's the correct UI for this. I want to try 2 other options:

- A toggle button. We have these all over the place in VS Code (like in the search and find widgets). I want to experiment with having one of those toggles for switching between fuzzy and non-fuzzy search. This means we could have a toggle button for other things.
- (Maybe) a different Quick Access provider, since that would allow the new syntax to be in the ? menu.

alefragnani commented 3 years ago

Hi @TylerLeonhardt ,

Is it really necessary to use “quotes” to indicate that I prefer a contiguous match, even for single-word searches? I mean, if I type a single word, I expect any item that contains that exact match to have a higher rank; otherwise, the fuzzy search could do its magic. And, because you said you are not fully convinced about the UI, I would say, please don’t 😬 .

You are probably aware of some of the issues related to the fuzzy search, but I remember one in particular, originally created by @kentcdodds on my Project Manager extension, which I asked him to file here (#14879). He suggested using one of his libs, https://github.com/kentcdodds/match-sorter, which seemed to work really well. I’m not saying this lib is the solution (I didn’t look at all the lib’s details or at how it would fit other parts of VS Code), but because the issue was later closed in favor of another one, this history could be missed.

Hope this helps

TylerLeonhardt commented 3 years ago

I mean, if I type a single word, I expect any item that contains that exact match to have a higher rank

This is pretty much the experience today. The only scenario where I've seen exact matches carry less weight is when both of these are true:

- The query is only 2 characters
- The first character of the query matches the first letter of the file

For example, if I use a repo for a bunch of markdown files that are my notes for that day:

However once I add a 3rd character (the .) then the contiguous results jump to the top:

The quotes will make sure the contiguous results are the only ones left.

Quite an edge case...

The other scenario for quotes is when you don't want to see the clutter of the fuzzy search results (which I know some users don't always like to see).

Ranking won't really change in that case, you will just see less options.
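To make the difference concrete, here is a rough TypeScript sketch of how a quoted query could switch a filter from fuzzy (subsequence) matching to contiguous (substring) matching. It is an illustration only, not the code from #131292; `parseQuery`, `isSubsequence`, and `filterFiles` are made-up names, and the file names are invented.

```typescript
// Hypothetical sketch; this is NOT VS Code's actual scorer.
interface ParsedQuery {
  text: string;        // the query with surrounding quotes removed, lowercased
  contiguous: boolean; // true when the user wrapped the query in quotes
}

function parseQuery(raw: string): ParsedQuery {
  const trimmed = raw.trim();
  const quoted =
    trimmed.length >= 2 && trimmed.startsWith('"') && trimmed.endsWith('"');
  return quoted
    ? { text: trimmed.slice(1, -1).toLowerCase(), contiguous: true }
    : { text: trimmed.toLowerCase(), contiguous: false };
}

// Fuzzy match: the query characters appear in order, possibly with gaps.
function isSubsequence(query: string, target: string): boolean {
  let qi = 0;
  for (let ti = 0; ti < target.length && qi < query.length; ti++) {
    if (target[ti] === query[qi]) {
      qi++;
    }
  }
  return qi === query.length;
}

// Quoted queries keep only contiguous matches; unquoted ones stay fuzzy.
function filterFiles(raw: string, files: string[]): string[] {
  const { text, contiguous } = parseQuery(raw);
  return files.filter((file) => {
    const name = file.toLowerCase();
    return contiguous ? name.includes(text) : isSubsequence(text, name);
  });
}
```

With the invented files `2021-5-26.md`, `2021-6-02.md`, and `notes-26.md`, the unquoted query `26` keeps all three, while `"26"` drops `2021-6-02.md` because its `2` and `6` are not adjacent. The relative order of the remaining items is unchanged, which matches the point that ranking stays the same and only the number of options shrinks.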

TylerLeonhardt commented 3 years ago

Notes from standup:

- Folks like the quotes: very intuitive (thanks to search engines already using this concept)
- Folks were against having toggle buttons, full stop. However, having an "advanced search" quick access provider was interesting to folks. That provider could support several toggles, for example:
  - toggle fuzzy
  - allow file search through excluded files (node_modules)

I will take this to the UX sync.

alefragnani commented 3 years ago

However once I add a 3rd character (the .) then the contiguous results jump to the top:

In this case, I expected the highlight to be on 26. in 2021-5-26.md instead, but yes, an edge case because . could be considered a word separator

Sorry but, personally, I feel the VS Code fuzzy search really needs improvement. Most of the time I rely on the recent items (when available) instead, because the search rank doesn’t work for me. I type almost the perfect match to find what I want, but still, sometimes I have to scroll down to select the expected result. I have memories of great results in Sublime (the first time I used a Command Palette) and mixed results in Atom, but honestly I haven’t used either for a long time, so I can’t give real comparisons.

Notes from standup:

I’m one of the folks who doesn’t like “quotes”, except for phrase (full content) searches. Maybe that comes from other search engines I have used in the past. So, an “advanced search” option to toggle on/off would be a good alternative.

Great to see you are working on this. Improvements are welcome 😁

Thank you

TylerLeonhardt commented 3 years ago

In this case, I expected the highlight to be on 26. in 2021-5-26.md instead, but yes, an edge case because . could be considered a word separator

The highlighting is wrong but the order is what I was trying to demonstrate. Also the . in this case isn't a word separator. It's just any other character.

alefragnani commented 3 years ago

That's great! Eager to try this one.

bpasero commented 3 years ago

Sorry but, personally, I feel the VS Code fuzzy search really needs improvement. Most of the time I rely on the recent items (when available) instead, because the search rank doesn’t work for me. I type almost the perfect match to find what I want, but still, sometimes I have to scroll down to select the expected result.

Would be good to collect these cases, it is always useful to have a collection of less than ideal ranking so that we can understand why that is and if we make a fix see if our test cases are still good. Over time, whenever I made changes to the scorer and ranking, I tried to write a test case for that scenario so that when we make changes we see what other cases fall apart (if any).

Btw, unfortunately there is not just 1 fuzzy ranking/scorer algorithm used, depending on what quick open you are talking about:

- editor history (list of recently opened editors, LCS fuzzy search NOT allowing non-contiguous matches)
- file search (LCS fuzzy search allowing non-contiguous matches, unless Tyler's new quotes support is used)
- command palette (some very old simple matcher)
- picker (that e.g. extensions can use): another simple fuzzy matcher

I think a first good step would be to align the various pickers to use the same fuzzy scoring, except maybe command palette: we try to preserve muscle memory of users and not break it. We put commands in alphabetic order when showing results to group commands that logically belong together close to each other. Some fuzzy ranker might break this muscle memory easily.

alefragnani commented 3 years ago

Would be good to collect these cases, it is always useful to have a collection of less than ideal ranking so that we can understand why that is and if we make a fix see if our test cases are still good

Totally agree. I'll try to replicate/identify those cases, and if I don't find an already open issue to comment on, I'll create a new one with the details.

Just to complement my previous comment about comparisons with other tools I used before: I wasn't saying Atom (with mixed results) had better results than VS Code. Only Sublime was better back then. On the other hand, also talking about other tools, I have JetBrains Rider available at the company I work for, and I would say VS Code fuzzy search is way better. I find myself using VS Code instead quite often, because I feel it is much easier to search/navigate the source. Congrats 👏 .

Thank you

TylerLeonhardt commented 3 years ago

Btw, unfortunately there is not just 1 fuzzy ranking/scorer algorithm used, depending on what quick open you are talking about

@bpasero isn't there a different implementation for symbols too? Or is that covered in your list somewhere?

ssigwart commented 3 years ago

The other scenario for quotes is when you don't want to see the clutter of the fuzzy search results (which I know some users don't always like to see).

Yeah. I'm one of those users. My workflow is that I tend to type 3 (sometimes 4) characters of distinct words in the files I'm looking for. Here's an example where what I really want is Database/Users.class.php.

Sorry the screenshots are going to be limited. I don't want to show too much of my directory structure. As you can see, the 7th item on the list is what I really want. I assume matches in the filename are preferred over directory path, but does finding a "u", "s", and "e" or "d", "at", and "a" randomly in the filename really help that much? That's a really cool thing VSCode does in autocompletion, but not so cool to me in the file open dialog. I could see maybe allowing a one character difference in case of a typo.

If I type the "r" in "user", the situation is better, but still not great:

So for me, allowing quotes was a way to get around those issues without breaking anyone else's workflow (e.g. by completely disabling it). However, if there was a setting that said file opening would default to exact word matches, I'd turn that on instead. I still would want it so the word order doesn't matter. That way "data user" and "user data" would both work.

On a related topic from #128924/#128923, I found a concrete example where exclusions would be really helpful. See the heavily redacted screenshot below. Assuming I don't really know the full names of the bottom 3 files, but I know they have "Fraud" and "Config" in them, it's not easy to find them. Notice that I already scrolled through a lot of results, all from the private/ directory. I know the file I want won't be in that directory, so it would be great to be able to just add -private and get rid of them. I've never seen an editor do this, but I think it would be a great little feature. As was mentioned though, it's not really discoverable, but for people who know about it, it can be powerful.
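As a rough illustration of the two ideas above (word order not mattering, and a `-private` style exclusion), here is a small TypeScript sketch. The `-term` syntax is only the proposal from this comment, not an existing VS Code feature, and `parseFileQuery` and `matchesFileQuery` are hypothetical names.

```typescript
// Hypothetical sketch of the proposed syntax; not an existing VS Code feature.
interface FileQuery {
  include: string[]; // words that must all appear somewhere in the path
  exclude: string[]; // words that must not appear anywhere in the path
}

function parseFileQuery(raw: string): FileQuery {
  const include: string[] = [];
  const exclude: string[] = [];
  const tokens = raw.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
  for (const token of tokens) {
    if (token.startsWith("-") && token.length > 1) {
      exclude.push(token.slice(1)); // "-private" excludes "private"
    } else {
      include.push(token);
    }
  }
  return { include, exclude };
}

// Exact word matching in any order, minus the excluded words.
function matchesFileQuery(path: string, query: FileQuery): boolean {
  const lower = path.toLowerCase();
  return (
    query.include.every((word) => lower.includes(word)) &&
    !query.exclude.some((word) => lower.includes(word))
  );
}
```

Because every included word is checked independently, `data user` and `user data` select the same files, and adding `-private` removes anything whose path contains `private`.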

bpasero commented 3 years ago

@TylerLeonhardt very good point, I forgot about symbols: we use yet another fuzzy matcher implemented by Jo, which is the same one used to filter down IntelliSense results:

https://github.com/microsoft/vscode/blob/77905c850e170eab00cac5ca190c7b1fe5ad43ba/src/vs/base/common/filters.ts#L546-L546

It is another LCS variant as far as I know.

The one I wrote is heavily optimised for file paths, that is also why it is not used for symbols actually. And this explains some of the seemingly bad results from @ssigwart. To explain what is going on in that screenshot:

- if you separate words by space each word is becoming a query on its own and scores are just added up (in other words a query of foo bar becomes 2 queries of foo and bar and both scores are added to compute the final score)
- we ALWAYS rank a match on the file name higher than a match on the parent folders, which explains why the top results in that query are the ones where matches exist only in the file name (see also https://github.com/microsoft/vscode/issues/25925)
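The two rules above can be sketched roughly as follows. This is a toy model in TypeScript, not the real scorer: the per-piece scoring in `scorePiece` (1 point per matched character, 2 when it directly follows the previous match) is an invented stand-in, and `scoreItem` and `rank` are hypothetical names.

```typescript
// Toy model of the two rules; the scoring numbers are invented.
function scorePiece(piece: string, target: string): number {
  let score = 0;
  let qi = 0;
  let prev = -2; // index of the previous matched character
  for (let ti = 0; ti < target.length && qi < piece.length; ti++) {
    if (target[ti] === piece[qi]) {
      score += ti === prev + 1 ? 2 : 1; // small bonus for consecutive hits
      prev = ti;
      qi++;
    }
  }
  return qi === piece.length ? score : 0; // 0 means the piece did not match
}

interface ItemScore {
  score: number;
  inFileName: boolean; // true when every piece matched the file name alone
}

function sumPieces(pieces: string[], target: string): number {
  let total = 0;
  for (const piece of pieces) {
    const s = scorePiece(piece, target);
    if (s === 0) {
      return 0; // every piece has to match
    }
    total += s; // rule 1: per-piece scores are simply added up
  }
  return total;
}

function scoreItem(query: string, path: string): ItemScore {
  const pieces = query.toLowerCase().split(/\s+/).filter((p) => p.length > 0);
  const lower = path.toLowerCase();
  const fileName = lower.slice(lower.lastIndexOf("/") + 1);
  const nameScore = sumPieces(pieces, fileName);
  if (nameScore > 0) {
    return { score: nameScore, inFileName: true };
  }
  return { score: sumPieces(pieces, lower), inFileName: false };
}

function rank(query: string, paths: string[]): string[] {
  return paths
    .map((path) => ({ path, result: scoreItem(query, path) }))
    .filter((entry) => entry.result.score > 0)
    .sort((a, b) => {
      // rule 2: a file-name match always beats a parent-folder match
      if (a.result.inFileName !== b.result.inFileName) {
        return a.result.inFileName ? -1 : 1;
      }
      return b.result.score - a.result.score;
    })
    .map((entry) => entry.path);
}
```

In this toy model, the query `dat use` scores `Database/Users.class.php` at 10 (matched via the full path) and a made-up `lib/UpdateUserDataset.php` at 9 (matched in the file name), yet the second one is ranked first because of rule 2. That is the kind of ordering described for the screenshot.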

I think that latter behaviour is probably the explanation for many results that are less than ideal. But changing that rule would also mean that, for any query, you might see results in the top ranks where the matches come only from parent folders, not from the file itself. I think this is a matter of improving the ranking to better account for that.

ssigwart commented 3 years ago

Thanks for the explanation, @bpasero. It makes sense that a match on the filename is preferred over a directory. However, I feel it would produce better results if the top results were always the ones that include an exact match of the separated words. Then the remainder would be sorted by closeness to the full word (e.g. a1b1c would score better than a12b1c for a search of abc). I don't know all the rules currently in VSCode, but a simplified version of how I think scoring would work is:

1. Filenames that match all words exactly.
2. Filenames or directories that match all words exactly.
3. Filenames that match all characters in each word.
4. Filenames or directories that match all characters in each word.
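A minimal TypeScript sketch of that four-level ordering, assuming "match exactly" means a case-insensitive substring and "match all characters" means an in-order subsequence. `tier` and `rankByTier` are hypothetical names, and this is only one reading of the proposal, not existing VS Code behavior.

```typescript
// Hypothetical sketch of the proposed ordering; lower tier = better rank.
function isSubsequence(word: string, target: string): boolean {
  let wi = 0;
  for (let ti = 0; ti < target.length && wi < word.length; ti++) {
    if (target[ti] === word[wi]) {
      wi++;
    }
  }
  return wi === word.length;
}

// Returns 1..4 for the four levels above, or 0 when nothing matches.
function tier(query: string, path: string): number {
  const words = query.toLowerCase().split(/\s+/).filter((w) => w.length > 0);
  if (words.length === 0) {
    return 0;
  }
  const full = path.toLowerCase();
  const name = full.slice(full.lastIndexOf("/") + 1);
  if (words.every((w) => name.includes(w))) return 1;
  if (words.every((w) => full.includes(w))) return 2;
  if (words.every((w) => isSubsequence(w, name))) return 3;
  if (words.every((w) => isSubsequence(w, full))) return 4;
  return 0;
}

function rankByTier(query: string, paths: string[]): string[] {
  return paths
    .filter((path) => tier(query, path) > 0)
    .sort((a, b) => tier(query, a) - tier(query, b));
}
```

With the query `dat use`, `Database/Users.class.php` lands in tier 2 and an invented `src/UnusedDraftTable.php` (where the letters are only scattered through the file name) lands in tier 3, so the exact-word match comes first.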

Another thought I had is that maybe the length could be valuable for scoring too. So if you search for abc, it will score abc.txt better than abcdef.txt. That would make it easier to find short filenames that are substrings of other file names. For example, if you're looking for User.php, but UserFile1.php, UserFile2.php, etc. are ranked before, it's hard to filter down any more to get to User.php. However, if you really wanted UserFile1.php and it didn't show up first, you still have the option to add more to the query (e.g. file1) to filter down to the actual file.
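The length idea could look something like this in TypeScript. It is a deliberately simplified sketch that only handles contiguous matches and uses file name length as the sole ordering key; `rankShortestFirst` is a hypothetical name.

```typescript
// Hypothetical sketch: among names containing the query, shorter names win.
function rankShortestFirst(query: string, fileNames: string[]): string[] {
  const q = query.toLowerCase();
  return fileNames
    .filter((name) => name.toLowerCase().includes(q))
    .sort((a, b) => a.length - b.length); // stable sort keeps ties in input order
}
```

For the query `user`, this puts `User.php` ahead of `UserFile1.php` and `UserFile2.php`, while typing `file1` would still narrow the list down to `UserFile1.php`.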

If you want, I'm willing to try to work on an update if given direction.

marblewraith commented 3 years ago

My 2 cents.

I agree the functionality should precisely match those of web search engines (e.g. google, bing, yahoo), all of which default to wrapping contiguous (exact) searches with "quotes". However...

The scope of implementation and result set in this case is naturally more limited i.e. filename search, in a specific workspace, (probably) with files.exclude, files.watcherExclude, search.exclude, configured.

Given this, while I approve of the change, it's more important not to break DevX (expectations / muscle memory).

To achieve that, I request that, no matter what's done here with fuzzy search algorithmically, the quick open dialog opened with ctrl+p be rendered by default with the cursor between a pair of quotes "|", and/or that there be an option in settings to toggle this off.

AndrewRayCode commented 1 year ago

I filed https://github.com/microsoft/vscode/issues/164352 but i can't tell if it dupes this ticket. I think it's more specifically about this issue:

if you separate words by space each word is becoming a query on its own and scores are just added up

Which means filenames with spaces are never exactly matched, and the files aren't highlighted properly (ignore the first two files in the below screenshot, they are recent matches)

[screenshot]

this screenshot shows the same thing, an exact file match is ignored:

[screenshot]

I don't understand the reasoning behind having two words start two separate matches. Maybe it would make more sense if the file list let you select multiple files to open at once. But you can only open one file, so starting a second search is counterintuitive.

The intuitive searching behavior I'm used to: Always fuzzy matching, on the whole search (not broken up into searches by spaces), and sorted by Levenshtein distance (or similar)
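For reference, sorting by Levenshtein distance on the whole query (spaces included) could be sketched like this in TypeScript. `levenshtein` is the standard dynamic-programming edit distance; `sortByDistance` is a hypothetical helper, and nothing here reflects how VS Code or any other editor actually ranks results.

```typescript
// Standard edit distance: minimum number of single-character insertions,
// deletions, and substitutions needed to turn a into b.
function levenshtein(a: string, b: string): number {
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(
        Math.min(
          prev[j] + 1,       // deletion
          curr[j - 1] + 1,   // insertion
          prev[j - 1] + cost // substitution (or match)
        )
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical helper: the whole query is compared to each name, so a space
// in the query is just another character rather than a separator.
function sortByDistance(query: string, names: string[]): string[] {
  const q = query.toLowerCase();
  return [...names].sort(
    (x, y) => levenshtein(q, x.toLowerCase()) - levenshtein(q, y.toLowerCase())
  );
}
```

Under this model a file literally named `my file.txt` has distance 0 from the query `my file.txt`, so it sorts first. Note that plain edit distance penalizes long names heavily, so a real ranker would probably need something more forgiving for short queries.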

In other fuzzy search tools, if there are ambiguous file name matches, I'm used to typing a letter or two from each subdirectory I know the file is in, so the fuzzy matcher will match that letter against the directory.

Whatever Ctrl-P for Vim uses is so intuitive that I've never had an issue with the fuzzy matching nor really even thought about it, because it matches my mental model.

VSCode's matching doesn't surface expected results, and I have to think about it a lot.

Maybe instead of a toggle button, there could be a setting (or an API we can plug into to override it) to allow for more "natural" fuzzy matching?

gpeal commented 1 year ago

I recently switched from the IntelliJ world to VSCode and the fuzzy match algorithm of VSCode is slowly driving me insane. In addition to the cases here, one thing that IntelliJ gets really right is case sensitive matching of partial words and priority for local symbols.

Check out this example. I am trying to match externalReport which is defined on the line immediately above. ext correctly matches. Typing R should increase the weight of the already-top result but instead, it drops off the list entirely.

The result of this is that I frequently wind up with random variables and imports as a result of the correct result disappearing after I type more letters that should only increase its weight.

You can see a similar result play out here: [screenshot] vs WebStorm:

bpasero commented 1 year ago

@gpeal note that this issue is about the "quick open" picker component while the suggest widget uses a different implementation, so I think a separate issue would probably be warranted to improve suggest scoring.

gpeal commented 1 year ago

@bpasero What is the "quick open", specifically?

bpasero commented 1 year ago

The picker, see https://github.com/microsoft/vscode/issues/131431#issuecomment-904164436

gpeal commented 1 year ago

@bpasero I encounter the exact same issue with quick open. From what I can tell, they use the same fuzzy match algorithm. I just used variable IntelliSense as an example, but the same principle holds for quick open.

bpasero commented 1 year ago

Maybe similar algorithms (LCS) but entirely different implementations at least for file search vs suggest for symbols.

gpeal commented 1 year ago

FYI, these are the IntelliJ settings (and the default values are selected). Notably, it defaults to "first letter only" and also case sensitivity.

Case sensitivity alone would go a long way to improve VSCode's algorithm here. People rarely capitalize a letter by mistake so matching a capital letter should have much higher weight than a lowercase letter in the middle of a word when a capital letter is typed.
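One simple way to approximate this is a "smart case" rule: a capital letter in the query may only match a capital letter in the candidate, while lowercase letters match either case. The TypeScript sketch below uses a strict filter rather than a weight, which is a simplification of the suggestion above; `smartCaseMatch` is a hypothetical name and this is neither IntelliJ's nor VS Code's algorithm.

```typescript
// Hypothetical "smart case" subsequence match.
// Typed uppercase letters must hit uppercase letters in the target;
// typed lowercase letters match case-insensitively.
function smartCaseMatch(query: string, target: string): boolean {
  let qi = 0;
  for (let ti = 0; ti < target.length && qi < query.length; ti++) {
    const q = query[qi];
    const t = target[ti];
    const typedUpper = q !== q.toLowerCase();
    const hit = typedUpper ? t === q : t.toLowerCase() === q;
    if (hit) {
      qi++;
    }
  }
  return qi === query.length;
}
```

With this rule, `extR` still matches `externalReport` (the `R` lands on the capital in `Report`) but no longer matches a candidate such as `extractor`, so typing the capital letter narrows the list instead of dropping the intended result.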

microsoft / vscode

Revisit contiguous matches #131431