microsoft / vscode

Visual Studio Code
https://code.visualstudio.com
MIT License

Explore using a native third-party search tool such as ripgrep or Silver Searcher #19983

Closed d-akara closed 7 years ago

d-akara commented 7 years ago

I am very impressed with the performance of the new parallel search; however, there is an opportunity to take search speed to the absolute limit by optionally allowing a user to configure ripgrep as the search provider.

Even with the new search speed, ripgrep is still an order of magnitude faster. Ripgrep is actually an order of magnitude faster than pretty much anything.

http://blog.burntsushi.net/ripgrep/

roblourens commented 7 years ago

I was going to look at The Silver Searcher this month; I had skipped ripgrep because it doesn't support multiline search. I know ripgrep can be faster though. Do you think there's enough difference between different search tools that it would be worth allowing people to hook up arbitrary ones to vscode?

d-akara commented 7 years ago

Yes, I was just thinking that maybe there should be an API that allows extension authors to hook up whatever they want. However, I'm not sure there is anything that interesting beyond ripgrep and Silver Searcher. When I think of awesome search tools, those are what come to mind.

So, I just did some comparisons on my project. Searching for a particular term, I get these results:

If we can get back to 2 sec performance again, then there is less difference between vscode and ag, but ripgrep is still significantly faster. However, as you said, you do lose multiline searches. If vscode implements multiline searching, then I would be happy with doing multiline with vscode's builtin search and using ripgrep for most of my other searching as that would be most common.

roblourens commented 7 years ago

Ripgrep would be perfect, but I really want multiline search support. The problem with Silver Searcher right now is that it can't handle ignoring ** patterns, which is problematic for supporting gitignore files and our ignore glob patterns. https://github.com/ggreer/the_silver_searcher/issues/530

The ripgrep blog post you posted lists other tools, but I eliminate them on various other grounds, like lacking Windows support or perf that apparently breaks down.

There are also some specialist tools I've found, like ICgrep or Hyperscan, that focus on advanced unicode or regex features.

Considering all this, we should either

Still leaning towards ag even though working around the ** issue would be very annoying.

d-akara commented 7 years ago

My thoughts are something like this in order of preference:

  1. use ripgrep. However, with these conditions
    1. We are able to restore the original search speed of 1.8.1
    2. There are plans to support multiline searching in the future for vscode builtin search
  2. use Silver Searcher: If both conditions above for ripgrep are not true, then for me it seems Silver Searcher would be best choice.
  3. Extension API: If for whatever reason we can't make a confident decision for 1 or 2

roblourens commented 7 years ago

For 1., if we were using ripgrep, it would replace vscode search entirely, so it would be much faster than 1.8.1, and there would be no multiline search.

d-akara commented 7 years ago

Ahh, I didn't realize you were considering it as a replacement. So you would then bundle ripgrep or Silver Searcher as part of vscode? If the internal search wouldn't be supported any longer, I suppose I would then prefer Silver Searcher.

roblourens commented 7 years ago

Yeah that's the idea, to have it drive the search viewlet behind the scenes. Possibly could also be involved in driving quick open.

BurntSushi commented 7 years ago

(ripgrep author here.) What do you folks use multiline search for? I've long considered it something I'd be unlikely to add support for, but I've been known to bend if there's strong demand for it. Alternatively, maybe there's a compromise that can be reached.

Note that ** isn't the only thing that ag doesn't support in gitignore files. ripgrep's support for gitignore matching is pretty dang close to 100% and remains fast. e.g., If you have lots of gitignore files or a single giant one, then ag slows down quite a bit compared to ripgrep.

Are there other things you folks care about? What about Unicode support? Support for searching UTF-16 (planned, not actually available yet)?

BurntSushi commented 7 years ago

I'm also in the process of moving a lot of code in ripgrep out into distinct Rust libraries, which would give you a lot more control over how search operates. But you'd need to build out a C FFI for it, which wouldn't be especially hard, but it wouldn't be something someone could bang out in a day either.

d-akara commented 7 years ago

@BurntSushi here are some sample use cases of multiline search.

The most common case is simply that code statements don't always exist as single lines. A call written across several lines, with `someFunctionCall(`, `arg1,`, `arg2`, and `)` each on its own line, can also be written like this:

`someFunctionCall(arg1, arg2)`

  1. I often am interested in terms that appear near each other or in the same file. Questions like...
    1. Which classes make use of x
    2. Where do we query for type X using join with Y. Likely these terms will be near each other, but not on same line
  2. Where are empty try catch blocks where exceptions were not handled.

BurntSushi commented 7 years ago

@dakaraphi If ripgrep asked you to use two distinct regexes, would that suffice? Or do you want to use one regex?

d-akara commented 7 years ago

@BurntSushi If I follow what you imply, then that would only help answer if 2 different terms exist in the same file. However, a sample regex might look like this where I want to find something that is near: termA(.|\n){0,200}termB or termA(.*\n.*){0,3}termB

Or, for example, searching for an XML tag with a given id: <extension(.|\n)*?id="A"
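A minimal sketch of the two proximity patterns above, assuming Python's `re` module as a stand-in engine (it is not one of the tools under discussion) and a made-up three-line sample string. The helper name `find_spans` is hypothetical.

```python
import re

# Made-up sample: termA and termB sit two lines apart.
text = "x = termA(1)\ny = 2\nz = termB(3)\n"

# Character window: at most 200 characters (newlines included) between the terms.
near_chars = re.compile(r"termA(.|\n){0,200}termB")

# Line window: at most 3 line breaks between the terms,
# because "." does not match "\n" by default.
near_lines = re.compile(r"termA(.*\n.*){0,3}termB")


def find_spans(pattern, s):
    """Return (start, end) offsets for every match of pattern in s."""
    return [(m.start(), m.end()) for m in pattern.finditer(s)]


if __name__ == "__main__":
    print(find_spans(near_chars, text))
    print(find_spans(near_lines, text))
```

Both patterns only work when the engine sees the whole file as one buffer. A tool that searches line by line never gets the chance to match across the `\n`.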

roblourens commented 7 years ago

I'll do a writeup for this investigation next week, but for vscode's purposes, we're interested in multiline search, UTF-16 support, and also I like a search that returns results in sorted order by path, which ripgrep doesn't do right now.

BurntSushi commented 7 years ago

I don't think any search tool with parallelism returns results in sorted order. ripgrep does have the --sort-files option which I think will do what you want, but it disables parallelism.

roblourens commented 7 years ago

Silver Searcher does actually - I see why it could be a perf hit to order the results though.

BurntSushi commented 7 years ago

The silver searcher does not. I just tried. I can easily observe non-deterministic ordering of output by running the same command a few times.

Also, UTF-16 support is on my roadmap for ripgrep. :-)

On Feb 19, 2017 21:04, "Rob Lourens" notifications@github.com wrote:

Silver Searcher does actually - I see why it could be a perf hit to order the results though.


roblourens commented 7 years ago

You're right - looking at it more closely, SS tends to be closer to being in order, and often is in order when I run it in my vscode workspace, but not always.

roblourens commented 7 years ago

And I'm glad to hear that UTF-16 support is on the roadmap. Any idea in what timeframe you'd expect to look at it?

BurntSushi commented 7 years ago

> Any idea in what timeframe you'd expect to look at it?

No, sorry. "Within the next year" is probably the best I can do. Hopefully sooner.

d-akara commented 7 years ago

Given there isn't an ideal match around features you wish to provide out of the box and uncertainty around the timing of the availability of those features, does it make sense to reevaluate the option of simply providing an API that extension authors can use to integrate external search providers?

It would probably also be useful if the API could invoke VS Code's builtin search as a fallback: if the extension detects a type of regular expression that the external tool might not support, it can pass the search on to VS Code.

roblourens commented 7 years ago

@BurntSushi I could also fall back to VS Code's builtin search for UTF-16 files, but don't want to duplicate the file tree walking work that ripgrep does. An easy compromise would be if ripgrep prints a message each time it encounters a UTF-16 (or binary) file. I imagine this hidden behind an option but it would also be useful for CLI users who are missing matches because they don't realize a file is an unsupported encoding.

@dakaraphi Thought about it, but it seems like overkill. Creating an extension API is a lot of work and there are probably only a few search providers in the world that anyone would want to use. I want to focus on the out of box experience.

d-akara commented 7 years ago

Using ripgrep as primary and VS Code builtin as fallback seems like a good solution.

I would actually be very happy with that compromise. Especially if VS Code could implement multiline search, then multiline regular expressions could just be passed on to VS Code. VS Code's builtin is fast enough that it wouldn't be a bad solution given that multiline searching will be less common.
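A rough sketch of how such routing could look, assuming the primary engine has no multiline mode and no lookaround. The function name `choose_engine` and the token list are hypothetical, and the check is a heuristic rather than a real regex parser (for instance, it would misfire on an escaped backslash followed by `n`).

```python
import re

# Tokens that a line-oriented, automata-based engine would reject
# (assumption: no multiline matching and no lookaround in the primary engine).
_UNSUPPORTED = re.compile(r"\\n|\(\?=|\(\?!|\(\?<=|\(\?<!")


def choose_engine(pattern: str) -> str:
    """Return 'fallback' when the regex appears to need multiline or lookaround."""
    return "fallback" if _UNSUPPORTED.search(pattern) else "primary"


if __name__ == "__main__":
    for p in [r"foo\d+", r"termA(.|\n){0,200}termB", r"(?<!g)rip"]:
        print(p, "->", choose_engine(p))
```

The appeal of this design is that the common case never pays for the slower engine. Only queries that the fast engine cannot express are handed over.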

roblourens commented 7 years ago

Here's a brain dump to sum it up…

Why would we want to do this?

What do we need from search?

There are a handful of native search tools out there. Originally I experimented with grep and findstr but they weren't fast enough or right for the job. The ripgrep introductory blog post (which is highly recommended reading!) discusses several others. I eliminate some of the others for various reasons (license, lack of Windows support, missing features like gitignore). For completeness, I'll also mention ICGrep and Hyperscan, but these are specialized tools focusing on advanced regex and unicode search use cases. So the tools that are good candidates for us are Silver Searcher and Ripgrep.

Numbers

Here are some very rough benchmarks to see what kind of improvement we're looking at.

Searching a simple regex in a Chromium enlistment:

| Tool | Time |
| --- | --- |
| VS Code 1.10 | 23.0s |
| ag | 7.4s |
| rg | 5.7s |

Searching for a literal in the vscode repo:

| Tool | Time |
| --- | --- |
| VS Code 1.10, cold | 1.8s (starting worker processes => annoying pause) |
| ag | .3s |
| rg | .1s |

Of course this isn't a totally fair comparison since vscode does more work to display results, but it should get the idea across.
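To put the rough numbers in proportion, here is a small sketch that turns the timings above into speedup ratios. The helper name `speedups` is hypothetical; the timings are copied from the two benchmarks and nothing else is measured.

```python
# Timings in seconds, copied from the two rough benchmarks above.
chromium = {"VS Code 1.10": 23.0, "ag": 7.4, "rg": 5.7}
vscode_repo = {"VS Code 1.10, cold": 1.8, "ag": 0.3, "rg": 0.1}


def speedups(timings, baseline):
    """Ratio of the baseline time to each other tool's time, to one decimal."""
    base = timings[baseline]
    return {
        tool: round(base / seconds, 1)
        for tool, seconds in timings.items()
        if tool != baseline
    }


if __name__ == "__main__":
    print(speedups(chromium, "VS Code 1.10"))
    print(speedups(vscode_repo, "VS Code 1.10, cold"))
```

On these figures, ag and rg come out roughly 3x and 4x faster on the Chromium search, and about 6x and 18x faster on the vscode repo search, where the cold start of the worker processes dominates.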

Silver Searcher (ag)

To build for Windows, ag requires msys/mingw which is an annoying hurdle but probably not a blocker. It also needs a win32 implementation of pthreads, which is under the LGPL. This is probably OK but we need to check with LCA. There is a port that gets it to build natively using MSVC but is quite out of date with the upstream.

There are some gaps in its .gitignore support, like not matching **, but also several others. Also some details here. This would impact the search.exclude patterns that we already use too. If we use ag, we'll probably want to implement the .gitignore support ourself and filter the results as they come back. That's not really so bad - we may want to do that for the File Explorer anyway. But, you may find yourself gitignoring a large number of files and noticing that search doesn't get any faster.

It does support multiline search. We don’t support this yet but it is a popular request and a typical feature in other editors.

It doesn't understand UTF-16 files. More on this below.

Since it searches in parallel, it can return results in random order. In practice, results are usually close to in order, or even often in order for repos the size of vscode. This is good because we sort results on the frontend, and it looks nicer for the list to not be reordering itself every time a result comes in. Also, our tree code doesn't handle these random insertions very well at scale, and is the main reason for the ~2000 result limit.
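A minimal sketch of sorting on the receiving side, assuming results arrive one file path at a time in arbitrary order. The class `SortedResults` is hypothetical and is not how the vscode tree code works; it only illustrates why an out-of-order arrival forces an insertion in the middle of the visible list.

```python
import bisect


class SortedResults:
    """Keeps file paths sorted while results arrive in arbitrary order."""

    def __init__(self):
        self.paths = []

    def add(self, path: str) -> int:
        """Insert path in sorted position and return the index where it sits."""
        index = bisect.bisect_left(self.paths, path)
        if index < len(self.paths) and self.paths[index] == path:
            return index  # same file reported again; keep a single row
        self.paths.insert(index, path)
        return index


if __name__ == "__main__":
    results = SortedResults()
    for p in ["src/b.ts", "src/a.ts", "test/z.ts"]:
        print(p, "-> row", results.add(p))
```

When results arrive nearly in order, most insertions land at the end of the list and nothing already on screen moves. Randomly ordered results shift existing rows on almost every insertion.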

Ripgrep (rg)

Rust's regex engine uses finite automata instead of backtracking, which is why it isn't susceptible to catastrophic backtracking, and why it doesn't support lookaround. I think we can live without lookaround - someone will complain, but I don't think it should be a priority. It may be worrying that 'find' (cmd+f) will have a different regex feature set than 'search' though.

It completely supports .gitignore patterns and the glob patterns that we already use.

It doesn't support multiline search at the moment, see some discussion of that with the rg author in this thread.

It also doesn't support searching UTF-16 files.

It tends to return results in much more random order than ag. Same comments as above apply here. I don’t necessarily see it as a problem - loading results in order is more useful when the search is slow. So if we are speeding up search by an order of magnitude, this is less important.

UTF-16 support

We have users with UTF-16 files, and they noticed when we broke it, so we need a solution for this. In rg, UTF-16 support is planned, see discussion above. Ag has an issue for it but little traction.

On either side, short of getting it implemented in the tool itself, we need to run UTF-16 files through vscode's existing search code. To do this, we need to either walk the file tree separately, detect files with a UTF-16 BOM, and search them, or, to save time, get an option into the tools to have them print a path when they encounter a UTF-16 file. I think both are realistic options. But of course then we're supporting two search tools. We could have feature set mismatches in multiline search, gitignore support, and regex features. Implementing multiline search in our current search code isn't out of the question.
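The BOM check in the first option could be sketched as follows. The function name `has_utf16_bom` is hypothetical, and the sketch only recognizes files that actually start with a byte order mark; UTF-16 files without a BOM would need a different heuristic.

```python
import codecs


def has_utf16_bom(path: str) -> bool:
    """Return True if the file starts with a UTF-16 byte order mark."""
    with open(path, "rb") as f:
        head = f.read(4)
    # A UTF-32 LE BOM is FF FE 00 00, which begins with the UTF-16 LE BOM,
    # so rule it out first. (A UTF-16 LE file whose first character is
    # U+0000 would be misclassified here; that case is assumed to be rare.)
    if head.startswith(codecs.BOM_UTF32_LE):
        return False
    return head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
```

Reading four bytes per file is cheap, but it still means a second walk over the tree, which is the duplicated work that the second option avoids.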

Etc

This branch implements a very quick and dirty proof of concept using ag on your path, and this one is the same using rg. At some point, someone could write a native Node wrapper for rg, but this doesn't look currently possible for ag.

Another note is that we lose the progress bar with these tools - they spit out data until they're done, and that's all the information we have. But it was fairly useless anyway. We can switch to an infinite progress bar.

BurntSushi commented 7 years ago

@roblourens Great write up!

> we'll probably want to implement the .gitignore support ourself and filter the results as they come back. That's not really so bad - we may want to do that for the File Explorer anyway.

Implementing good and fast .gitignore support was actually one of the more challenging aspects of ripgrep (second only to the regex engine itself). I refactored the gitignore support in ripgrep out into a separate ignore library that you may find useful (either directly or as something to port). Alternatively, finding a way to reuse git to do this work for you may be a quicker path, e.g., by using git ls-files or git check-ignore.

The difficulty in getting it right is why ag struggles with it so much. In order to make it fast, I had to roll my own multi-glob matcher.
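To give a feel for the `**` handling discussed in this thread, here is a deliberately simplified glob matcher. It is not ripgrep's matcher and does not implement gitignore semantics: the names `glob_to_regex` and `is_ignored` are hypothetical, patterns are matched against the whole relative path, and negation, character classes, and directory-only rules are left out.

```python
import re


def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a simplified glob into a regex that must match the full path.

    Supported tokens: '**/', '**', '*', '?'. Everything else is literal.
    """
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append(r"(?:.*/)?")  # zero or more whole directories
            i += 3
        elif pattern.startswith("**", i):
            out.append(r".*")  # anything, including '/'
            i += 2
        elif pattern[i] == "*":
            out.append(r"[^/]*")  # anything within one path segment
            i += 1
        elif pattern[i] == "?":
            out.append(r"[^/]")  # one character within a path segment
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out) + r"\Z")


def is_ignored(path: str, patterns) -> bool:
    """True if any pattern matches the whole relative path."""
    return any(glob_to_regex(p).match(path) for p in patterns)
```

Compiling one regex per pattern and trying them in turn is the slow, obvious approach. With many patterns or many ignore files, that per-path cost is what a combined multi-glob matcher is meant to avoid.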

roblourens commented 7 years ago

Thanks for the tips @BurntSushi. I know we already support ignoring glob patterns in the file explorer, which makes me think we could explore sending gitignore patterns down the same path, but I don't know how fast it is, and we don't currently support negative patterns. So maybe we'd end up using a different strategy.

cristim commented 7 years ago

@roblourens were you aware of this one? It's a similar implementation written in golang: https://github.com/monochromegane/the_platinum_searcher

realgeek commented 7 years ago

PCRE supports lookarounds, like this negative lookbehind assertion (?<!g)rip (that is, rip not preceded by a g). Unfortunately, recent OS versions of libpcre (perl 5.10) no longer allow variable-length negative lookbehind assertions, where we used to do things like (?<!(s.|g))rip to match rip but not grip, scrip, strip, etc.

This doesn't prevent catastrophic backtracking, and I don't know if there's any way to set a max recursion limit at run time. Actually, Apache's mod_security allows one to set it in a config file. One could set a relatively low default value for it, and if it's reached then indicate to the user that their regex sucks.
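The lookbehind behaviour described above can be tried out with Python's `re` module, which is used here only as a convenient backtracking engine and is not PCRE. Like the libpcre versions mentioned, it rejects variable-length lookbehind. The helper names and the sample string are made up for illustration.

```python
import re

text = "grip rip strip scrip"

# Fixed-width negative lookbehind: "rip" not preceded by "g".
not_after_g = re.compile(r"(?<!g)rip")


def variable_width_supported() -> bool:
    """Try the variable-length form; Python's re rejects it at compile time."""
    try:
        re.compile(r"(?<!(s.|g))rip")
        return True
    except re.error:
        return False


# Workaround: chain one fixed-width lookbehind per alternative.
chained = re.compile(r"(?<!s.)(?<!g)rip")


def starts(pattern, s):
    """Return the start offset of every match of pattern in s."""
    return [m.start() for m in pattern.finditer(s)]


if __name__ == "__main__":
    print(starts(not_after_g, text))
    print(variable_width_supported())
    print(starts(chained, text))
```

The chained form keeps each assertion at a fixed width, which is why it compiles where the single variable-length assertion does not.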

coder543 commented 7 years ago

@cristim what are the pros/cons of the platinum searcher over ripgrep?

These benchmarks indicate that the platinum searcher can be dramatically slow sometimes, while other times it is only a little slower than ripgrep. Perhaps those benchmarks are suspect since they're provided by the author of ripgrep, but I think he lays out his methodology pretty well.

cristim commented 7 years ago

@coder543 I wasn't aware of those results.

From those benchmarks it indeed looks like ripgrep is a much better choice than the platinum searcher.

bstrie commented 7 years ago

> And I'm glad to hear that UTF-16 support is on the roadmap. Any idea in what timeframe you'd expect to look at it?

@roblourens UTF-16 support appears to be featured in the most recent release of ripgrep: https://github.com/BurntSushi/ripgrep/releases/tag/0.5.0

roblourens commented 7 years ago

This landed in today's Insiders! Set "search.useRipgrep": true to try it out. There will likely be bugs...

d-akara commented 7 years ago

Very nice! I just tried this and seems to be working well. Very fast :-) Awesome!

Going forward, what is being considered for multiline and look around assertions etc? Will the internal search remain as a mechanism to support such features?

If so, will there be a quick toggle to be able to choose the search engine or would you try to automatically pass on the regex to the supported engine?

roblourens commented 7 years ago

Great!

I actually don't plan to keep the internal search. It's a little unfortunate to lose lookaround regexes, but I'm much more interested in search speed than advanced features.

If ripgrep implements any other features like multiline search, we will try to pick them up and support them. But it's not a priority for me right now, I won't try to implement it on my own in the internal search or anything like that.

By the way, huge thanks to @BurntSushi for his work and support in bringing it to VS Code!

d-akara commented 7 years ago

Speed is awesome, but I will have to strongly disagree here. Not being able to do something at all is a fairly significant disadvantage. I can certainly understand that there is no desire to support another engine that you have to implement yourself.

I would actually suggest considering a secondary external engine, such as Silver Searcher or Platinum Searcher, as a fallback. If Platinum Searcher is easier to integrate, it doesn't have to be the fastest, but feature completeness could then be provided without having to support your own implementation.

BurntSushi commented 7 years ago

@dakaraphi I don't think the platinum searcher has any functionality that ripgrep doesn't have at this point. It doesn't appear to have multiline search and its regex engine is FSM based like ripgrep's. The only real choice available to you if you want multiline search and PCRE is the silver searcher.

d-akara commented 7 years ago

@BurntSushi ahh ok thanks. Yes, the most important feature for me would be multiline. That's a big one. I do use it somewhat often. PCRE would be nice, but I could live without it.

BurntSushi commented 7 years ago

@dakaraphi I've thought about multiline search for a long time. You folks aren't the only ones who have requested it. I re-opened the issue on ripgrep's tracker and left some thoughts: https://github.com/BurntSushi/ripgrep/issues/176#issuecomment-287240086

d-akara commented 7 years ago

@roblourens @BurntSushi Thanks for making this happen and bringing it to VS Code in such a short time! It truly is a pleasure to use.

lnicola commented 7 years ago

Does any of this apply to searches in a file that's being edited? That is, the "Find" command, not "Find in Files".

roblourens commented 7 years ago

No

d-akara commented 7 years ago

So I have given some additional thought to the feature gap around things like lookarounds. Typically, additional regex features are about further constraining the results in some way.

I think some of the feature gap would be mitigated if the search results could be easily sent to a new editor. Then you could further search the results using the more feature-rich in-editor regex engine. Potentially it would also be useful to have a way to send results directly to an editor and have a much higher result limit cap.

There is already a request for this for other use cases. So this would just be an additional benefit. See #17920

octref commented 7 years ago

Was looking for how I can use ag to search faster in VSCode and found this issue. Tried it out and search takes milliseconds in my fairly large web project. Huge improvement on my workflow. Thanks @roblourens and @BurntSushi!

Ethan-VisualVocal commented 7 years ago

@dakaraphi This is a feature of Sublime Text that I miss in VSCode -- ST just automatically dumps everything into a special, searchable "Find Results" tab that also doesn't auto-clear between searches.

(Relying on this feels a bit like a crutch, like maybe I could have gotten ideal results if I'd composed my original search filters + regex better, but I end up using it often anyway because I want to keep my brain on the original task at hand.)

d-akara commented 7 years ago

@Ethan-VisualVocal having the ability to dump the results to a document tab opens up some very useful possibilities; however, I currently strongly prefer VSCode's default implementation for finding and navigating code. I find it much better at browsing the files from the results. I just want the option of capturing the results in a document, as there are times when it is very useful, and not just as a potential workaround for the feature gap in ripgrep's regular expression support.

ThunderEX commented 7 years ago

Can we just use git grep to replace original "find in files" feature? git grep is just a built-in command of git. Compared to ripgrep:

sophiajt commented 7 years ago

@ThunderEX - git grep doesn't work for non-git directories, unless it's some option I haven't seen.

ThunderEX commented 7 years ago

@jonathandturner you can check config grep.fallbackToNoIndex

roblourens commented 7 years ago

I didn't realize that git grep works on non-git dirs. But we're now shipping with ripgrep for the March release so I'm closing this issue.

d-akara commented 7 years ago

git grep doesn't have multiline. However ripgrep is now investigating adding that feature. That will be a greater win. BurntSushi/ripgrep#176

BurntSushi commented 7 years ago

I don't think it supports UTF-16 either.