keegancsmith opened this issue 6 years ago
I think you are essentially asking for JOIN queries.
This is hard to make performant: currently, we go over the documents in a shard in order, and for each doc we can evaluate the query. With your proposed syntax, this won't be true anymore; e.g. consider the shard
doc1: name="alphabet"   content="xyz"
doc2: name="dockerfile" content="xyz"
now, to process a query like
(type:repo f:dockerfile) f:alphabet
you'd have to go over the shard twice.
Maybe going over it twice isn't so bad; you'd have to write some sort of query planner that takes the query apart into subqueries that can be evaluated this way.
hmm.
This will require reorganization, because of the sharding of large repositories. What looks like a unit ("type:repo") from the outside is actually a collection of shards, and they currently don't have specialized processing. This means you have to restructure the code such that all the shards of a single repo are processed together. E.g. consider
shard1: doc1: name="alphabet"   content="xyz"
shard2: doc1: name="dockerfile" content="xyz"
The query
(type:repo f:dockerfile) f:alphabet
can only be answered in some place that considers all shards at the same time.
Grouping shards of the same repo together is probably a sensible idea anyway (e.g. to ensure that you never search half the shards of a repo), so it could be a start towards this idea.
I actually took a stab at implementing this yesterday afternoon after filing this, and came to the same realisation. The `type:file` filter is easy to do. The `type:repo` filter is not as straightforward due to large repos and sharding. It does require your planner to be aware of all the shards for a repo. I am close to an implementation which just looks at a single shard, and I want to use what I learn from that to write the "repo" planner.
"This means you have to restructure the code such that all the shards of a single repo are processed together"
Is this an alright change to make? Right now the architecture allows you to spread the shards for a repo across nodes and make it distributed (which must be nice for very large monorepos). I know this isn't supported right now, but doing the above would make it harder to do in the future.
Thinking along those lines, maybe a layer above `indexData` takes a `query.Q` and extracts the subtrees for each `type:repo`, then evaluates each of those to replace them with a `query.Const` for each repo. Once we have managed to replace all `type:repo` with `query.Const`, we use the rewritten query to evaluate results. I'm trying to think of a clean architecture, but not really sure. Will just try to make it work and see what makes sense.
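The rewrite described above can be sketched as a tree walk. The types below (`Q`, `And`, `TypeRepo`, `Const`, `Substring`) are hypothetical stand-ins for zoekt's query package, and `evalRepoSubtree` is a stub for the per-repo evaluation; only the shape of the `type:repo` → `Const` replacement is meant to be illustrative.

```go
package main

import "fmt"

// Minimal stand-ins for zoekt's query types (hypothetical shapes,
// not the real query package).
type Q interface{}
type And struct{ Children []Q }
type TypeRepo struct{ Child Q }
type Const struct{ Value bool }
type Substring struct{ Pattern string }

// mapQuery walks the tree bottom-up and applies f to every node,
// in the spirit of query.Map.
func mapQuery(q Q, f func(Q) Q) Q {
	if a, ok := q.(And); ok {
		kids := make([]Q, len(a.Children))
		for i, c := range a.Children {
			kids[i] = mapQuery(c, f)
		}
		q = And{Children: kids}
	}
	return f(q)
}

// evalRepoSubtree stands in for evaluating a type:repo subtree against
// one repo's shards; here it is a stub that always reports true.
func evalRepoSubtree(q Q) bool { return true }

// replaceTypeRepo rewrites every TypeRepo subtree into a Const holding
// the result of evaluating that subtree for the repo.
func replaceTypeRepo(q Q) Q {
	return mapQuery(q, func(n Q) Q {
		if tr, ok := n.(TypeRepo); ok {
			return Const{Value: evalRepoSubtree(tr.Child)}
		}
		return n
	})
}

func main() {
	// (type:repo f:dockerfile) f:alphabet
	q := And{Children: []Q{
		TypeRepo{Child: Substring{Pattern: "dockerfile"}},
		Substring{Pattern: "alphabet"},
	}}
	fmt.Printf("%+v\n", replaceTypeRepo(q))
}
```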
"Maybe going over it twice isn't so bad."
Yeah, I think that is totally fine, as long as it's linear w.r.t. the number of `type:repo` nodes in the query.
BTW the code is really nice. I'd read the eval and matchiter code before, but it was only by actually changing it that I really grokked it for the first time. Thanks :)
"Is this an alright change to make?"
Yes, I think it's a good idea. The tricky bit is that you still want to search num-CPU shards in parallel, but you want all the shards for the same repo to be in a separate `ParallelSearcher` (this one would also do the per-repo query => const evaluation). So there should be a global CPU throttler.
Also, it would be nice to load/unload whole repos "atomically", i.e. the loader should be aware of which repo each new shard belongs to, and it should wait a bit to be sure no new shards are appearing for that repo (alternatively: wait until you see the final shard of a repo, but if the writer somehow crashed, you'd be stuck waiting forever).
Anyway, this all seems doable, but it's substantial work, i.e. not something I have time to undertake. But I'm happy to review changes :)
`shardedSearcher.getShards()` exists, so we can group shards by repo in that function. How do we actually tell if two shards belong to the same repo? Just the `Repository.Name`?
As for atomic loading, that could also be done in `getShards`: it would only return the shards for a repo once the newest shard is old enough. What do you think is a long enough time to wait?
By just changing `getShards`, the change to support grouping search by repo ends up being pretty minimal. I'll see what it actually looks like and send a CR for further discussion which just does the grouping by repo (and not the `type:repo` support).
You can tell by `Repository.Name`, yes.
But that doesn't look like the right approach: `shardedSearcher.getShards` is too late. You should arrange the construction so that shards from different repos never get into the same `shardedSearcher`; i.e. you need something like a `shardedRepoSearcher` that is both a `shardedSearcher` and knows which repo it is for.
Sorry, I should have made that clearer. I am making `getShards()` return a slice of `repoShards` instead of a slice of shards. I will rename it to `getRepoShards()`.
oh, that sounds good.
Sorry for not updating in a while. I went a different approach here after attempting to group shards by repo. Instead I created a `zoekt.Searcher` which wraps a searcher (specifically the `shardedSearcher`). When evaluating a query, it first does a `query.Map` which converts every `query.TypeRepo` into a `query.RepoSet` via a `List` call (see my CR for making `List` work on any query). `query.RepoSet` contains a set of repos, so it is like `query.Repo` except it does a set-membership check instead of a regex check.
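The difference between the two atoms can be sketched as follows; these are simplified stand-ins mirroring the described behaviour, not the real `query.Repo`/`query.RepoSet` implementations.

```go
package main

import (
	"fmt"
	"regexp"
)

// repoRegexpMatch mimics what query.Repo does: match a repo name
// against a regular expression.
func repoRegexpMatch(pattern, name string) bool {
	return regexp.MustCompile(pattern).MatchString(name)
}

// RepoSet mimics the described query.RepoSet: a concrete set of repo
// names produced by evaluating a type:repo subtree via List, checked
// by membership instead of regex matching.
type RepoSet map[string]bool

func (r RepoSet) Matches(name string) bool { return r[name] }

func main() {
	// Regex check: pattern applied to every candidate name.
	fmt.Println(repoRegexpMatch("zoekt$", "github.com/google/zoekt"))

	// Set-membership check: O(1) lookup against a precomputed set.
	set := RepoSet{"github.com/google/zoekt": true}
	fmt.Println(set.Matches("github.com/google/zoekt"),
		set.Matches("github.com/other/repo"))
}
```

The advantage of the set form is that the expensive part (evaluating the `type:repo` subtree) happens once up front, and the per-document check during search degrades to a cheap lookup.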
I need to clean this up a bit before sending it out for review, but it's working quite well and is quite powerful.
SGTM - looking forward to review.
Hi folks, sorry to bother, but my org would also be interested in this change. Do you have any updates you could share, or at least a concrete vision of where this will go ("won't do" is also good enough)? Just want to make sure my expectations are aligned with this project's roadmap :)
It would be great to support a "type" operator. It would be used to influence the type of results returned: `filematch|filename|repo`. Currently we always return filematch and filename results, unless we only have `repo:` operators. Combining this with the hierarchical nature of the queries can, I think, be very powerful. Three motivating examples:

- `type:repo abc` - a list of repositories which contain matches for `abc`.
- `(type:repo abc) def` - matches for `def`, but only in repos which contain matches for `abc`.
- `(type:repo f:abc) def` - matches for `def`, but only in repos which contain file paths that match `abc`.

More concretely, you can imagine running queries over all repositories containing a Dockerfile.

I'm wondering what you believe the complexity of this feature would be, and what you think of the proposed name and syntax. Another possibility I considered is something like UNIX pipes, where you would write `type:repo abc | def`.