philipp-spiess opened this issue 2 years ago
Heads up @muratsu @jjinnii @ryankscott - the "team/integrations" label was applied to this issue.
I've been working on a hacky PR based on what we included in the Code Ownership RFC (cc @muratsu @ryanslade @jjinnii). More specifically, I looked into how the predicate and compute APIs could be helpful for us.
I now have some technical questions for people who are more familiar with the search infrastructure (cc @rvantonder @camdencheek)
I initially looked into a pre-search filtering approach, but noticed that in order to have correct results we would have to collect a lot of data in advance, so this felt like a slow solution. I also looked into a post-search approach, and potentially a way of combining the two (which might be too much for a v0).
For the pre-search approach, the idea was to use the `VisitTypedPredicate` API to get at the information inside the predicate in `internal/search/zoekt/query.go` when building the zoekt query. My idea there is that we could have a list of pathing rules that we can pass into zoekt directly.
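As an illustration of that idea, here is a minimal sketch of turning the path patterns an owner is responsible for into filename filters that could be ANDed onto the Zoekt query. All names are invented for illustration; this is not the actual sourcegraph/zoekt query API, and real CODEOWNERS matching handles far more cases than this does.

```go
// Hypothetical sketch: convert CODEOWNERS-style patterns into anchored
// filename regexps that could become file-name filters on the Zoekt query.
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// ownedPathsToRegexps handles only the simple prefix and extension forms to
// keep the sketch short; leading "/", "**" in the middle, bare directory
// names, etc. are ignored here.
func ownedPathsToRegexps(patterns []string) ([]*regexp.Regexp, error) {
	var out []*regexp.Regexp
	for _, p := range patterns {
		var expr string
		switch {
		case strings.HasSuffix(p, "/**"):
			expr = "^" + regexp.QuoteMeta(strings.TrimSuffix(p, "**"))
		case strings.HasPrefix(p, "*."):
			expr = regexp.QuoteMeta(strings.TrimPrefix(p, "*")) + "$"
		default:
			expr = "^" + regexp.QuoteMeta(p) + "$"
		}
		re, err := regexp.Compile(expr)
		if err != nil {
			return nil, err
		}
		out = append(out, re)
	}
	return out, nil
}

func main() {
	res, _ := ownedPathsToRegexps([]string{"docs/**", "*.go", "README.md"})
	for _, re := range res {
		// Each of these would become a filename filter on the Zoekt query.
		fmt.Println(re.String())
	}
}
```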
I'll look at the PR more deeply a bit later, but generally: doing the filtering in a `Job` like you did seems right. Passing things to Zoekt has the problems that (a) it's only Zoekt (so anything not indexed doesn't work with your new feature) and (b) passing information into Zoekt (and depending on how custom it gets) is difficult and questionable. As much as possible we should treat it as a black box until there are perf issues to address more deeply. Treating it as a black box right now seems appropriate for the scope of this project :-)

Having a job for your logic is a good start. Implementing some initial caching there seems appropriate (though I would not rush to start caching things?). There's no existing stuff to just cache results that I know of, and maybe there are examples elsewhere in the code, but I don't know of them. But again, I wouldn't rush into this.
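For illustration, a minimal sketch of what such a post-search filtering job could look like, assuming a simplified streaming interface. `FileMatch`, `Job`, and the ownership resolver below are invented stand-ins, not the real `internal/search/job` package.

```go
// A rough sketch of the post-search filtering job under a hypothetical
// streaming Run(ctx, send) interface.
package main

import (
	"context"
	"fmt"
)

// FileMatch is a simplified stand-in for a search file match.
type FileMatch struct {
	Repo string
	Path string
}

// Job is a simplified stand-in for a search job: it streams matches to send.
type Job interface {
	Run(ctx context.Context, send func(FileMatch)) error
}

// ownersFunc resolves the owners of a path, e.g. from a CODEOWNERS file.
type ownersFunc func(ctx context.Context, repo, path string) ([]string, error)

// fileHasOwnerJob wraps the underlying search and only forwards matches
// owned by the requested owner.
type fileHasOwnerJob struct {
	child  Job
	owner  string
	owners ownersFunc
}

func (j *fileHasOwnerJob) Run(ctx context.Context, send func(FileMatch)) error {
	return j.child.Run(ctx, func(m FileMatch) {
		fileOwners, err := j.owners(ctx, m.Repo, m.Path)
		if err != nil {
			return // a real job would surface this as a search alert
		}
		for _, o := range fileOwners {
			if o == j.owner {
				send(m)
				return
			}
		}
	})
}

// fakeChild stands in for the real Zoekt/searcher jobs in this sketch.
type fakeChild struct{ matches []FileMatch }

func (f fakeChild) Run(_ context.Context, send func(FileMatch)) error {
	for _, m := range f.matches {
		send(m)
	}
	return nil
}

func main() {
	job := &fileHasOwnerJob{
		child: fakeChild{matches: []FileMatch{
			{Repo: "r", Path: "docs/index.md"},
			{Repo: "r", Path: "cmd/main.go"},
		}},
		owner: "@docs-team",
		owners: func(_ context.Context, _, path string) ([]string, error) {
			if path == "docs/index.md" {
				return []string{"@docs-team"}, nil
			}
			return nil, nil
		},
	}
	_ = job.Run(context.Background(), func(m FileMatch) { fmt.Println(m.Path) })
	// Output: docs/index.md
}
```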
Can you outline what the inputs and outputs are so we can think about whether this is information worth propagating into search backends? To your other question: there's no system to do this, your job will contain the logic to drive it. At some point maybe we'll abstract this out and have some kind of layer, but it's really fuzzy what that would look like right now. I can't say if this is something we will/should improve until we know more.
I think I need to understand the inputs/outputs here again to answer this better :-) It sounds like you want to emit a list of owners for each path. The way `output` works in compute right now is to emit basically one logical result (whatever that result may be). It's really up to the consumer (client) to interpret those results; the client in the notebook is a really simple one and might not work "out of the box" for a list of data you want to process: it just tallies the result count when it sees a result repeated.
So concretely, there are a couple of ways to think about how to expose this. To start:
- `$owners` emits a string that's a comma-separated list of owners (or a JSON array, or whatever data format you care for)
- you write a client that understands that data and renders it (parse the commas, parse the JSON, whatnot)

Currently, like I said, the compute client just consumes all results and tallies any result with the same content. I've worked on an experimental client that changes this idea and instead consumes data to render a graph (for example). So I'm just making clear that this client isn't necessarily predefined to work for the kind of data model you're building. At this stage of the game I hesitate to define an exact schema that says "thou shalt emit JSON" or "thou shalt emit CSV"; it's pretty barebones and up to you. For example, as part of Stephen's codestat.dev project he just hacked on the client to interpret some `@@` in the query string on the client side to render links around results. This is all very loose, and intentionally so, so that things like "ah, Philipp wants arrays/JSON in Compute, but someone else really doesn't want that and still wants a way to process and expose some arbitrary result without dealing with a JSON parser in the client" remain possible. At this stage and maturity of compute, such arbitrary things should be possible, until we have a more definite idea of what we want. So in short, it's unlikely that this `output` command will or should ever understand the concept of "array" output; it's kind of up to you to define the contract between client and results.
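As a tiny illustration of the "you define the contract" point: if `$owners` emitted a JSON array per match, a client could decode and tally it as below. Go is used only for illustration (the real client is the webapp), and the emitted shape is just one possible contract, not an existing compute schema.

```go
// Minimal sketch: decode a JSON-array $owners value per result and tally
// owners; a CSV contract would use strings.Split instead of json.Unmarshal.
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// What an output(... -> $owners) style template might emit per match.
	results := []string{
		`["@alice","@docs-team"]`,
		`["@alice"]`,
	}

	tally := map[string]int{}
	for _, r := range results {
		var owners []string
		if err := json.Unmarshal([]byte(r), &owners); err != nil {
			continue // skip results that don't follow the agreed contract
		}
		for _, o := range owners {
			tally[o]++
		}
	}
	fmt.Println(tally) // map[@alice:2 @docs-team:1]
}
```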
The idea that an `$owners` result is mostly associated with file paths (or maybe lines of code), and that you might want to define a tighter client <-> data model format in Compute (i.e., a concept of "array"), means you might consider building your own command into compute (instead of `output`), something like `content:owner(...)`. You can basically define any input format in the `...` part and any result type with whatever constraints you want to impose (if you decide "this command shall always output JSON", that's fine and you are free to choose it).
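A hypothetical sketch of that direction, assuming we chose "always emit JSON": none of these types exist in the compute package, they only show what defining the input format and the result contract yourself could look like.

```go
// Hypothetical shape for a dedicated content:owner(<selector>) command that
// always emits one JSON object per file.
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// ownerCommand is the parsed form of content:owner(<selector>).
type ownerCommand struct {
	Selector string // whatever input format we decide the ... part accepts
}

// ownerResult is the output contract: one JSON object per file.
type ownerResult struct {
	Path   string   `json:"path"`
	Owners []string `json:"owners"`
}

func parseOwnerCommand(arg string) ownerCommand {
	return ownerCommand{Selector: strings.TrimSpace(arg)}
}

func (c ownerCommand) run(path string, owners []string) (string, error) {
	b, err := json.Marshal(ownerResult{Path: path, Owners: owners})
	return string(b), err
}

func main() {
	cmd := parseOwnerCommand("file") // e.g. content:owner(file)
	out, _ := cmd.run("docs/index.md", []string{"@docs-team"})
	fmt.Println(out) // {"path":"docs/index.md","owners":["@docs-team"]}
}
```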
So my leading question for you is: do you have a sense of what you would prefer for the above and what you're trying to solve? It might be worth thinking through the tradeoffs of these ^. My sense right now is that maybe something like a comma-separated value exposed in `$owners` via the `output` command is workable, and building logic in a client that understands that (our webapp or your own for MVP), but really you're in the best position to think through the possibilities/constraints about that.
Also, you can punt on thinking about how exactly this works with Compute if you want to start off by focusing just on the search predicate.
This is extremely helpful, @rvantonder, thank you!
> Implementing some initial caching there seems appropriate (though I would not rush to start caching things?)
Just want to add here that what I had in mind is nothing fancy: just that when we need to fetch some mappings from the database, we could keep them in memory and avoid fetching the same data again. Agreed that we'll do that on-demand rather than "for the sake of it".
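Concretely, something as small as the sketch below is enough: a memo keyed by (repo, path), where the fetch function stands in for the database lookup. The names and key shape are placeholders, not existing Sourcegraph code.

```go
// Tiny in-memory memoization sketch for owner mappings fetched from the
// database, scoped to a single search request.
package main

import (
	"context"
	"fmt"
	"sync"
)

type ownerCache struct {
	mu    sync.Mutex
	cache map[string][]string
	fetch func(ctx context.Context, repo, path string) ([]string, error)
}

func (c *ownerCache) Owners(ctx context.Context, repo, path string) ([]string, error) {
	key := repo + ":" + path

	c.mu.Lock()
	if owners, ok := c.cache[key]; ok {
		c.mu.Unlock()
		return owners, nil
	}
	c.mu.Unlock()

	owners, err := c.fetch(ctx, repo, path) // the database round trip
	if err != nil {
		return nil, err
	}

	c.mu.Lock()
	c.cache[key] = owners
	c.mu.Unlock()
	return owners, nil
}

func main() {
	calls := 0
	c := &ownerCache{
		cache: map[string][]string{},
		fetch: func(ctx context.Context, repo, path string) ([]string, error) {
			calls++
			return []string{"@alice"}, nil
		},
	}
	c.Owners(context.Background(), "repo", "docs/index.md")
	c.Owners(context.Background(), "repo", "docs/index.md")
	fmt.Println(calls) // 1: the second lookup is served from memory
}
```

Two concurrent lookups for the same key could still both hit the database in this sketch; that's acceptable for a best-effort memo, and a singleflight-style guard could be added later if it ever matters.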
> Can you outline what the inputs and outputs are so we can think about whether this is information worth propagating into search backends?
We'll get more clarity when we reconcile our customer feedback, but for now I don't think we need to pass anything into the search backends either. Having a job that can do post-search filtering should be more than enough to get me started.
> do you have a sense of what you would prefer for the above and what you're trying to solve? It might be worth thinking through the tradeoffs of these ^. My sense right now is that maybe something like a comma-separated value exposed in `$owners` via the `output` command is workable, and building logic in a client that understands that (our webapp or your own for MVP), but really you're in the best position to think through the possibilities/constraints about that.
YES! What you said makes so much sense. The client right now seems to do a simple grouping by the resulting content, but we could teach it a different way to group data. I agree this is not something we need to worry about in the current version.
I only have one more question at the moment: to prepare the code to be pushed into main, would you say a simple feature flag is enough to gate this for now, or do we have other patterns for this type of change?
I think we should do this: prepare a PR for realsies that introduces your `owners` predicate. Only the search predicate to start, no compute things. From my perspective, between @camdencheek and myself reviewing something like that, I think we'd be comfortable merging it as-is, with no need for a feature flag. This is because (a) search predicates are syntactically low-profile (it's rather unlikely that a user will search for something that collides with your predicate, and if it happens to, it's easy enough to escape) and (b) we don't need to publish availability or release it if we add it (or later remove it). I mean, if you want it to be feature-flagged that's fine, but from my perspective we should try to get it to a state where it is just "available by default without a feature flag". The hacky PR is a good start. For review and discussion, let's break up the PRs in this order:

1. the search predicate (I think `has.owner(...)` is the best fit right now)
2. we can look into the compute things afterward
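Purely to illustrate the shape of the proposed syntax (this is not how the real query parser handles predicates, and the field the predicate attaches to plus its argument format are still open), a `has.owner(...)` filter could be recognized along these lines:

```go
// Hypothetical illustration of spotting a has.owner(...) predicate in a query
// string with a regexp; the production parser would do this generically.
package main

import (
	"fmt"
	"regexp"
)

var hasOwnerRe = regexp.MustCompile(`has\.owner\(([^)]*)\)`)

func main() {
	q := `repo:sourcegraph file:\.md$ has.owner(@docs-team) TODO`
	if m := hasOwnerRe.FindStringSubmatch(q); m != nil {
		fmt.Println("owner filter:", m[1]) // owner filter: @docs-team
	}
}
```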
I think the wrong Ryan was tagged above. cc @ryankscott
This is part of the Code Ownership proposal. We need further research into where best to add these new extension points into our system, so that we eventually allow extensions to integrate further with Sourcegraph's services.