Search provider API feedback

chrmarti commented 6 years ago

Testing #50497

provideFileSearchResults comes without query argument, that will prevent implementations from optimizing the amount of data they need to provide. E.g., a large remote filesystem might prefer doing the filtering based on the user's input on the remote side. Or, e.g., a local search engine might prefer to run in a separate process (for performance or runtime reason) and benefit from doing the filtering there. Also include a maxResults: number option to further help limit the amount of data being passed around (we currently only show 500 entries in QuickOpen, IIRC).
previewOptions is on SearchOptions, should it be on TextSearchOptions?
There was some concern around using strings instead of URIs. Maybe the performance impact needs to be better understood. Although with large filesystems and a long workspace folder path, repeating the workspace folder path can have a significant memory overhead. /cc @jrieken
I introduced an exists option at one point to optimize the workspaceContains trigger, but that could also be passed as maxResults: 1 (See 1.).
includes and excludes are a specific glob mini-language. Maybe we should have type Glob = string; with the documentation that is currently on GlobPattern?
Should provideFileSearchResults be mandatory? Listing files seems almost like a subset of search for text in the same files.

Documentation will go a long way in making this API understood. We should take a fresh look when that is in place.

jrieken commented 6 years ago

There was some concern around using strings instead of URIs.

URIs must be used because that's our way to knowing what's a file path. I understand the concern but I measure (without easy optimisations we have discussed) that toJSON-fying 10000 uri takes ~40ms. Hardly a blocker, but you might have measure something else.

roblourens commented 6 years ago

I think it makes sense to omit 'query' so filtering will be consistent across different providers. We could have one provider doing a fuzzy search, another just doing a substring match, another doing something else, and it would make quickopen seem broken. And this lets us just invoke the provider once, so we can own and optimize the cache filtering logic to make quickopen fast. I agree about maxResults, wrote notes about my plans here: https://github.com/Microsoft/vscode/issues/47058#issuecomment-392282689
Yes...
I'll revisit this. Besides the toString issue, there was also slowness around the unnecessary string operations from constructing, deconstructing, then reconstructing URIs for every result. File search builds a tree of relative path fragments to do its searching and matching, and it was simpler to just work in relative paths.
Yes
Good idea. I could also reuse GlobPattern although the 'base' part is redundant.
You could have separate providers for text/file search. I don't want to limit anyone.

chrmarti commented 6 years ago

I agree. Still, requiring the provider to send a complete listing of files will severely limit the performance for remote filesystems. Can we foresee a second method for file search that would take a query parameter? We can then add that later in case there is real demand for that.

roblourens commented 6 years ago

I'm open to it, we will get a chance to validate this with the live share extension.

hansonw commented 6 years ago

+1 to #1. We've been testing the remote filesystem API at Facebook with filesystems that have more than 1 million files, so returning the full list of files is a non-starter unfortunately :( We'd appreciate going back to the older API that includes the query string. cc @siegebell @bolinfest

roblourens commented 6 years ago

I prototyped this today in https://github.com/Microsoft/vscode/tree/roblou/cachedSearchProvider

provideFileSearchResults also gets a query, typed like

export interface FileSearchQuery {
    pattern: string;
    cacheKey?: string;
}

where cacheKey is different between quickopen instances. The provider could ignore this, but this lets the provider know whether it can be filtering down the same set of results, or must start a fresh search.

To get the caching, fuzzy matching, and sorting code into the extension, I had to copy several hundred lines of code from other places in vscode into the extension. This code should stay in sync with changes on the vscode side. Since we don't have another way to share code between vscode and builtin extensions, we should consider copying that shared code into another repo and publishing it as an npm module.

Then since the implementation is complicated, we should publish this code as a library that search providers can use if they want to use the simpler API (just list out all files in the workspace) and be guaranteed to fuzzy filter in a consistent way. It will also be useful for what live share needs to do - they can use this library on the host side, hook it up to findFiles... and that's it.

Implementing sibling clauses gets complicated here. We decided that they should be hidden from the SearchProvider API and implemented by vscode. Currently we get files from the provider once and run them through the list of include/exclude patterns, including sibling clause patterns. This is slow but only happens once. With this new model, we will have to run all results through glob.ts on every keypress in quickopen which will be slightly slower.

roblourens commented 6 years ago

@hansonw, we are still discussing what the API should look like and I would like to get a better feel for the needs of implementors. Can you tell me more about how you would implement search against a remote filesystem? Is there a preexisting API that you would like to invoke on every keypress, and is it able to handle fuzzy matching and include/excludes by glob pattern?

Feel free to email or DM me if you'd prefer to discuss offline.

hansonw commented 6 years ago

@roblourens Yup, that's correct! You can take a look at what we've done with Atom + Nuclide: here is the "queryFuzzyFile" API that we've implemented right now (which works with remote filesystems):

https://github.com/facebook/nuclide/blob/master/pkg/nuclide-fuzzy-file-search-rpc/lib/FuzzyFileSearchService.js#L71

We've built a custom native library (https://github.com/facebook-atom/nuclide-prebuilt-libs/tree/master/fuzzy-native) to store the 1M+ files in memory and do fuzzy matching with a similar API to the above.

roblourens commented 6 years ago

Thanks, I see that you have some sort of cache for results in FileSearchProcess.js, what causes that cache to be invalidated? We will probably include a cacheKey in the call to a search provider which is the same while the user is typing in the same quickopen instance.

Also I see that you have a list of ignoredFiles, are those just filenames or can they be glob patterns or regexes? Our API has negative patterns (excludes) and positive patterns (includes), would you support both of those? We can either leave checking these patterns entirely up to the provider, or we can "double-check" all the results that come back against all the patterns, which will ensure that they're applied consistently.

eamodio commented 6 years ago

Fyi here are a couple of other search provider implementations: https://github.com/eamodio/vscode-remotehub/blob/feature/search/src/gitHubSearchProvider.ts

https://github.com/eamodio/vscode-remotehub/blob/feature/search/src/sourcegraphSearchProvider.ts

One thing, I'm currently not dealing with in either of those implementations, is paging -- since these api, won't always return all the results. It is expected that paging of results would be something that would have to be inside the provider and all results would have to wait for the looping through pages to get them all?

roblourens commented 6 years ago

I'll close this in favor of the master issue https://github.com/Microsoft/vscode/issues/47058. I've implemented almost everything brought up here - file search now gets the query, both APIs are URI-based, added some docs. There's still a TODO with maxResults and hitLimit but there's a note in the other issue about that.

roblourens commented 6 years ago

@hansonw I'm trying to make sure I understand one more thing about your needs for search. It looks like you create a cache of file paths in initFileSearchForDirectory which you then search inside. When exactly is this cache created/invalidated?

I ask because internally vscode has used a similar cache that is created when quickopen is opened, then we search in the cache as the user types, and is destroyed when quickopen is closed. I'm trying to understand whether this needs to be reflected in the API. In other words, do you need to know about this search "session" to manage a cache, or would it fit your needs to have a stateless API that has no "cache key" or state, and is invoked on each keypress during quickopen?

hansonw commented 6 years ago

Hi @roblourens, apologies for the delayed response! That's actually not really a cache, per se - we start up a process for each project root, which holds the list of files in memory (in a native Node modules, which allows very fast multi-threaded searching over all files.) It wouldn't really make sense to tear down & restart this process for every unique session, so it manages caching internally in its own way (for example, we re-filter the results of the previous query if it was a prefix of the current query).

Only if the user changes the include/exclude flags do we restart the process. We expect these config changes to be very infrequent so it's easiest to reflect these flags by having a different file index, rather than paying the cost of applying the filter at query time.

To answer your question more directly: I don't think we really need to know about the session & cache key. We'd be fine with a stateless API that is just invoked on each keypress.

roblourens commented 6 years ago

Makes sense, thanks for the explanation.

microsoft / vscode

Search provider API feedback #50788