Closed chrmarti closed 6 years ago
There was some concern around using strings instead of URIs.
URIs must be used because that's our way to knowing what's a file path. I understand the concern but I measure (without easy optimisations we have discussed) that toJSON-fying 10000 uri takes ~40ms. Hardly a blocker, but you might have measure something else.
I think it makes sense to omit 'query' so filtering will be consistent across different providers. We could have one provider doing a fuzzy search, another just doing a substring match, another doing something else, and it would make quickopen seem broken. And this lets us just invoke the provider once, so we can own and optimize the cache filtering logic to make quickopen fast. I agree about maxResults, wrote notes about my plans here: https://github.com/Microsoft/vscode/issues/47058#issuecomment-392282689
Yes...
I'll revisit this. Besides the toString issue, there was also slowness around the unnecessary string operations from constructing, deconstructing, then reconstructing URIs for every result. File search builds a tree of relative path fragments to do its searching and matching, and it was simpler to just work in relative paths.
Yes
Good idea. I could also reuse GlobPattern although the 'base' part is redundant.
You could have separate providers for text/file search. I don't want to limit anyone.
+1 to #1. We've been testing the remote filesystem API at Facebook with filesystems that have more than 1 million files, so returning the full list of files is a non-starter unfortunately :( We'd appreciate going back to the older API that includes the query string. cc @siegebell @bolinfest
I prototyped this today in https://github.com/Microsoft/vscode/tree/roblou/cachedSearchProvider
provideFileSearchResults
also gets a query, typed like
export interface FileSearchQuery {
pattern: string;
cacheKey?: string;
}
where cacheKey
is different between quickopen instances. The provider could ignore this, but this lets the provider know whether it can be filtering down the same set of results, or must start a fresh search.
To get the caching, fuzzy matching, and sorting code into the extension, I had to copy several hundred lines of code from other places in vscode into the extension. This code should stay in sync with changes on the vscode side. Since we don't have another way to share code between vscode and builtin extensions, we should consider copying that shared code into another repo and publishing it as an npm module.
Then since the implementation is complicated, we should publish this code as a library that search providers can use if they want to use the simpler API (just list out all files in the workspace) and be guaranteed to fuzzy filter in a consistent way. It will also be useful for what live share needs to do - they can use this library on the host side, hook it up to findFiles
... and that's it.
Implementing sibling clauses gets complicated here. We decided that they should be hidden from the SearchProvider API and implemented by vscode. Currently we get files from the provider once and run them through the list of include/exclude patterns, including sibling clause patterns. This is slow but only happens once. With this new model, we will have to run all results through glob.ts on every keypress in quickopen which will be slightly slower.
@hansonw, we are still discussing what the API should look like and I would like to get a better feel for the needs of implementors. Can you tell me more about how you would implement search against a remote filesystem? Is there a preexisting API that you would like to invoke on every keypress, and is it able to handle fuzzy matching and include/excludes by glob pattern?
Feel free to email or DM me if you'd prefer to discuss offline.
@roblourens Yup, that's correct! You can take a look at what we've done with Atom + Nuclide: here is the "queryFuzzyFile" API that we've implemented right now (which works with remote filesystems):
We've built a custom native library (https://github.com/facebook-atom/nuclide-prebuilt-libs/tree/master/fuzzy-native) to store the 1M+ files in memory and do fuzzy matching with a similar API to the above.
Thanks, I see that you have some sort of cache for results in FileSearchProcess.js, what causes that cache to be invalidated? We will probably include a cacheKey
in the call to a search provider which is the same while the user is typing in the same quickopen instance.
Also I see that you have a list of ignoredFiles, are those just filenames or can they be glob patterns or regexes? Our API has negative patterns (excludes) and positive patterns (includes), would you support both of those? We can either leave checking these patterns entirely up to the provider, or we can "double-check" all the results that come back against all the patterns, which will ensure that they're applied consistently.
Fyi here are a couple of other search provider implementations: https://github.com/eamodio/vscode-remotehub/blob/feature/search/src/gitHubSearchProvider.ts
https://github.com/eamodio/vscode-remotehub/blob/feature/search/src/sourcegraphSearchProvider.ts
One thing, I'm currently not dealing with in either of those implementations, is paging -- since these api, won't always return all the results. It is expected that paging of results would be something that would have to be inside the provider and all results would have to wait for the looping through pages to get them all?
I'll close this in favor of the master issue https://github.com/Microsoft/vscode/issues/47058. I've implemented almost everything brought up here - file search now gets the query, both APIs are URI-based, added some docs. There's still a TODO with maxResults and hitLimit but there's a note in the other issue about that.
@hansonw I'm trying to make sure I understand one more thing about your needs for search. It looks like you create a cache of file paths in initFileSearchForDirectory
which you then search inside. When exactly is this cache created/invalidated?
I ask because internally vscode has used a similar cache that is created when quickopen is opened, then we search in the cache as the user types, and is destroyed when quickopen is closed. I'm trying to understand whether this needs to be reflected in the API. In other words, do you need to know about this search "session" to manage a cache, or would it fit your needs to have a stateless API that has no "cache key" or state, and is invoked on each keypress during quickopen?
Hi @roblourens, apologies for the delayed response! That's actually not really a cache, per se - we start up a process for each project root, which holds the list of files in memory (in a native Node modules, which allows very fast multi-threaded searching over all files.) It wouldn't really make sense to tear down & restart this process for every unique session, so it manages caching internally in its own way (for example, we re-filter the results of the previous query if it was a prefix of the current query).
Only if the user changes the include/exclude flags do we restart the process. We expect these config changes to be very infrequent so it's easiest to reflect these flags by having a different file index, rather than paying the cost of applying the filter at query time.
To answer your question more directly: I don't think we really need to know about the session & cache key. We'd be fine with a stateless API that is just invoked on each keypress.
Makes sense, thanks for the explanation.
Testing #50497
provideFileSearchResults
comes withoutquery
argument, that will prevent implementations from optimizing the amount of data they need to provide. E.g., a large remote filesystem might prefer doing the filtering based on the user's input on the remote side. Or, e.g., a local search engine might prefer to run in a separate process (for performance or runtime reason) and benefit from doing the filtering there. Also include amaxResults: number
option to further help limit the amount of data being passed around (we currently only show 500 entries in QuickOpen, IIRC).previewOptions
is onSearchOptions
, should it be onTextSearchOptions
?There was some concern around using strings instead of URIs. Maybe the performance impact needs to be better understood. Although with large filesystems and a long workspace folder path, repeating the workspace folder path can have a significant memory overhead. /cc @jrieken
I introduced an
exists
option at one point to optimize theworkspaceContains
trigger, but that could also be passed asmaxResults: 1
(See 1.).includes
andexcludes
are a specific glob mini-language. Maybe we should havetype Glob = string;
with the documentation that is currently onGlobPattern
?Should
provideFileSearchResults
be mandatory? Listing files seems almost like a subset of search for text in the same files.Documentation will go a long way in making this API understood. We should take a fresh look when that is in place.