keegancsmith opened this issue 6 years ago
I think you are essentially asking for JOIN queries.
This is hard to make performant: currently, we go over the documents in a shard in order, and for each doc we can evaluate the query. With your proposed syntax, this won't be true anymore; e.g. consider the shard
doc1: name="alphabet"   content="xyz"
doc2: name="dockerfile" content="xyz"
now, to process a query like
(type:repo f:dockerfile) f:alphabet
you'd have to go over the shard twice.
Maybe going over it twice isn't so bad; you'd have to write some sort of query planner that takes the query apart into subqueries that can be evaluated this way.
hmm.
This will require reorganization, because of the sharding of large repositories. What looks like a unit ("type:repo") from the outside is actually a collection of shards, and they currently don't have specialized processing. This means you have to restructure the code such that all the shards of a single repo are processed together. E.g. consider
shard1: doc1: name="alphabet"   content="xyz"
shard2: doc1: name="dockerfile" content="xyz"
The query
(type:repo f:dockerfile) f:alphabet
can only be answered in some place that considers all shards at the same time.
Grouping shards of the same repo together is probably a sensible idea anyway (e.g. to ensure that you never search half the shards of a repo), so it could be a start towards this idea.
I actually took a stab at implementing this yesterday afternoon after filing this, and came to the same realisation. The `type:file` filter is easy to do. The `type:repo` filter is not as straightforward due to large repos and sharding. It does require your planner to be aware of all the shards for a repo. I am close to an implementation which just looks at a single shard, and I want to use what I learn from that to write the "repo" planner.
"This means you have to restructure the code such that all the shards of a single repo are processed together"
Is this an alright change to make? Right now the architecture allows you to spread the shards for a repo across nodes and make it distributed (which must be nice for very large monorepos). I know this isn't supported right now, but doing the above would make it harder to do in the future.
Thinking along those lines, maybe a layer above `indexData` takes a `query.Q` and extracts the subtrees for each `type:repo`, then evaluates each of those to replace them with a `query.Const` for each repo. Once we have managed to replace all `type:repo` with `query.Const`, we use the rewritten query to evaluate results. I'm trying to think of a clean architecture, but not really sure. Will just try to make it work and see what makes sense.
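The rewrite described above can be sketched as a tree walk. The types below (`Q`, `And`, `TypeRepo`, `Const`, `Substring`) are hypothetical stand-ins for zoekt's query package, and `evalRepoSubtree` is a stub for the per-repo evaluation; only the shape of the `type:repo` → `Const` replacement is meant to be illustrative.

```go
package main

import "fmt"

// Minimal stand-ins for zoekt's query types (hypothetical shapes,
// not the real query package).
type Q interface{}
type And struct{ Children []Q }
type TypeRepo struct{ Child Q }
type Const struct{ Value bool }
type Substring struct{ Pattern string }

// mapQuery walks the tree bottom-up and applies f to every node,
// in the spirit of query.Map.
func mapQuery(q Q, f func(Q) Q) Q {
	if a, ok := q.(And); ok {
		kids := make([]Q, len(a.Children))
		for i, c := range a.Children {
			kids[i] = mapQuery(c, f)
		}
		q = And{Children: kids}
	}
	return f(q)
}

// evalRepoSubtree stands in for evaluating a type:repo subtree against
// one repo's shards; here it is a stub that always reports true.
func evalRepoSubtree(q Q) bool { return true }

// replaceTypeRepo rewrites every TypeRepo subtree into a Const holding
// the result of evaluating that subtree for the repo.
func replaceTypeRepo(q Q) Q {
	return mapQuery(q, func(n Q) Q {
		if tr, ok := n.(TypeRepo); ok {
			return Const{Value: evalRepoSubtree(tr.Child)}
		}
		return n
	})
}

func main() {
	// (type:repo f:dockerfile) f:alphabet
	q := And{Children: []Q{
		TypeRepo{Child: Substring{Pattern: "dockerfile"}},
		Substring{Pattern: "alphabet"},
	}}
	fmt.Printf("%+v\n", replaceTypeRepo(q))
}
```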
"Maybe going over it twice isn't so bad."
Yeah, I think that is totally fine, as long as it's linear w.r.t. the number of `type:repo` nodes in the query.
BTW the code is really nice. I'd read the eval and matchiter code before, but it was only by actually changing it that I really grokked it for the first time. Thanks :)
"Is this an alright change to make?"
Yes, I think it's a good idea. The tricky bit is that you still want to search num-CPU shards in parallel, but you want all the shards for the same repo to be in a separate `ParallelSearcher` (this one would also do the per-repo query => const evaluation). So there should be a global CPU throttler.
Also, it would be nice to load/unload whole repos "atomically", i.e. the loader should be aware of which repo each new shard belongs to, and it should wait a bit to be sure no new shards are appearing for that repo (alternatively: wait until you see the final shard of a repo, but if the writer somehow crashed, you'd be stuck waiting forever).
Anyway, this all seems doable, but it's substantial work, i.e. not something I have time to undertake. But I'm happy to review changes :)
`shardedSearcher.getShards()` exists, so we can group shards by repo in that function. How do we actually tell if two shards belong to the same repo? Just the `Repository.Name`?
As for atomic loading, that could also be done in `getShards`: it would only return the shards for a repo once the newest shard is old enough. What do you think is a long enough time to wait?
By just changing `getShards`, the change to support grouping search by repo ends up being pretty minimal. I'll see what it actually looks like and send a CR for further discussion which just does the grouping by repo (and not the `type:repo` support).
You can tell by `Repository.Name`, yes.
But that doesn't look like the right approach: `shardedSearcher.getShards` is too late. You should arrange the construction so that shards from different repos never get into the same `shardedSearcher`; i.e. you need something like a `shardedRepoSearcher` that is both a `shardedSearcher` and knows which repo it is for.
Sorry, I should have made that clearer. I am making `getShards()` return a slice of `repoShards` instead of a slice of shards. I will rename it to `getRepoShards()`.
oh, that sounds good.
Sorry for not updating in a while. I went a different approach here after attempting to group shards by repo. Instead I created a `zoekt.Searcher` which wraps a searcher (specifically the `shardedSearcher`). When evaluating a query, it first does a `query.Map` which converts every `query.TypeRepo` into a `query.RepoSet` via a `List` call (see my CR for making `List` work on any query). `query.RepoSet` contains a set of repos, so it is like `query.Repo` except it does a set-membership check instead of a regex check.
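The difference between the two atoms can be sketched as follows; these are simplified stand-ins mirroring the described behaviour, not the real `query.Repo`/`query.RepoSet` implementations.

```go
package main

import (
	"fmt"
	"regexp"
)

// repoRegexpMatch mimics what query.Repo does: match a repo name
// against a regular expression.
func repoRegexpMatch(pattern, name string) bool {
	return regexp.MustCompile(pattern).MatchString(name)
}

// RepoSet mimics the described query.RepoSet: a concrete set of repo
// names produced by evaluating a type:repo subtree via List, checked
// by membership instead of regex matching.
type RepoSet map[string]bool

func (r RepoSet) Matches(name string) bool { return r[name] }

func main() {
	// Regex check: pattern applied to every candidate name.
	fmt.Println(repoRegexpMatch("zoekt$", "github.com/google/zoekt"))

	// Set-membership check: O(1) lookup against a precomputed set.
	set := RepoSet{"github.com/google/zoekt": true}
	fmt.Println(set.Matches("github.com/google/zoekt"),
		set.Matches("github.com/other/repo"))
}
```

The advantage of the set form is that the expensive part (evaluating the `type:repo` subtree) happens once up front, and the per-document check during search degrades to a cheap lookup.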
I need to clean this up a bit before sending it out for review, but it's working quite well and is quite powerful.
SGTM - looking forward to review.
Hi folks, sorry to bother, but my org would also be interested in this change. Do you have any updates you could share, or at least a concrete vision of where this will go ("won't do" is also good enough)? Just want to make sure my expectations are aligned with this project's roadmap :)
It would be great to support a "type" operator. It would be used to influence the type of results returned: `filematch|filename|repo`. Currently we always return filematch and filename results, unless we only have `repo:` operators. Combining this with the hierarchical nature of the queries can, I think, be very powerful. Three motivating examples:

- `type:repo abc` - a list of repositories which contain matches for `abc`.
- `(type:repo abc) def` - matches for `def`, but only in repos which contain matches for `abc`.
- `(type:repo f:abc) def` - matches for `def`, but only in repos which contain file paths that match `abc`.

More concretely, you can imagine running queries over all repositories containing a Dockerfile.

I'm wondering what you believe the complexity of this feature would be, and what you think of the proposed name and syntax. Another possibility I considered is something like UNIX pipes, where you would write `type:repo abc | def`.