sourcegraph / sourcegraph-public-snapshot

Code AI platform with Code Search & Cody
https://sourcegraph.com

Extensions: Figure out a minimum viable solution for bringing extensions into our search core #38148

Open philipp-spiess opened 2 years ago

philipp-spiess commented 2 years ago

This is part of the Code Ownership proposal. We need further research into where we best add these new extension points into our system such that we eventually allow extensions to further integrate into Sourcegraph's services.

sourcegraph-bot-2 commented 2 years ago

Heads up @muratsu @jjinnii @ryankscott - the "team/integrations" label was applied to this issue.

philipp-spiess commented 2 years ago

I've been working on a hacky PR based on what we included in the Code Ownership RFC (cc @muratsu @ryanslade @jjinnii). More specifically, I looked into how the predicate and compute APIs could be helpful for us.

I now have some technical questions for people who are more familiar with the search infrastructure (cc @rvantonder @camdencheek)

Pre-search vs. post-search filtering

I initially looked into a pre-search filtering approach, but noticed that to get correct results we would have to collect a lot of data in advance of the search, so this felt like a slow solution. I also looked into a post-search approach, and potentially a way of combining the two (which might be too much for a v0).
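
To make the post-search approach concrete, here is a minimal sketch of the filtering step in isolation. Everything here is illustrative (`FileResult`, `OwnerResolver`, and `filterByOwner` are not Sourcegraph APIs): given results from the backend and a way to resolve owners per file, keep only results owned by a given owner.

```go
package main

// Hypothetical sketch of post-search ownership filtering. The names below
// (FileResult, OwnerResolver, filterByOwner) are illustrative stand-ins,
// not real Sourcegraph types.

// FileResult stands in for a single file match coming out of the search backend.
type FileResult struct {
	Repo string
	Path string
}

// OwnerResolver maps a file to its owners; a real implementation would read
// CODEOWNERS-style data, possibly cached per repository.
type OwnerResolver func(repo, path string) []string

// filterByOwner drops results whose owner set does not include owner.
func filterByOwner(results []FileResult, resolve OwnerResolver, owner string) []FileResult {
	var kept []FileResult
	for _, r := range results {
		for _, o := range resolve(r.Repo, r.Path) {
			if o == owner {
				kept = append(kept, r)
				break
			}
		}
	}
	return kept
}
```

The cost profile this implies is exactly the concern above: the resolver is called once per result, after the search, so slow ownership lookups directly slow down result streaming.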

My questions

  1. What are your thoughts on the way I hooked into the search system: does using a custom Job look good to you? How expensive is it to add a new Job to a search operation? Do I understand correctly that the Job would also be a good place to implement some caching (e.g. caching pre-processed code ownership data)?
  2. If we have to resort to some pre-processing for performance: how do we usually determine whether something needs special performance considerations? Do we have a better system for post-search processing that does not require custom plumbing into Zoekt and/or the other searchers (there's one for non-indexed files that will need similar work, I believe)? And if we don't: is this something we will/should improve?
  3. If we want to use compute to also expose code ownership data: I noticed that when we have a file result and use something like output((.|\n)* -> $owners), we don't seem to have a good way of handling $owners being an array. If I try to plot a chart of who owns the most files, for example, it might be useful if compute could somehow duplicate results, one per owner. Is that something you have thought about? The example I have in mind is summed up in this screenshot, where it would be much more helpful if we could show that both users own 4 files:

[screenshot: chart of file counts per owner]

rvantonder commented 2 years ago

I'll look at the PR more deeply a bit later, but generally:

  1. Having a job for your logic is a good start, and implementing some initial caching there seems appropriate (though I would not rush into caching). As far as I know there's no existing machinery to just cache results; there may be examples elsewhere in the code, but I'm not aware of them. Again, I wouldn't rush into this.

  2. Can you outline what the inputs and outputs are, so we can think about whether this is information worth propagating into the search backends? To your other question: there's no system to do this; your job will contain the logic to drive it. At some point maybe we'll abstract this out into some kind of layer, but it's really fuzzy what that would look like right now. I can't say whether this is something we will/should improve until we know more.

  3. I think I need to understand the inputs/outputs here again to answer this better :-) It sounds like you want to emit a list of owners for each path. The way output works in compute right now is to emit basically one logical result (whatever that result may be). It's really up to the consumer (client) to interpret those results; the client in the notebook is a really simple one and might not work out of the box for a list of data you want to process, since it just tallies result counts when it sees a repeated result.

So concretely, there are a couple of ways to think about how to expose this. To start:

Since an $owners result is mostly associated with file paths (or maybe lines of code), and since you might want to define a tighter client<->data-model format in compute (i.e., a concept of "array"), you might consider building your own command into compute: instead of output, something like content:owner(...), where you can define any input format in the ... part and any result type, with whatever constraints you want to impose (if you decide "this command shall always output JSON", that's fine and you are free to choose it).
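
As a sketch of the "this command shall always output JSON" option: if a hypothetical content:owner(...)-style command emitted one JSON object per logical result, the array question goes away, because the client can decode the owners list directly. The command name and struct below are illustrative, not the real compute API.

```go
package main

import (
	"encoding/json"
)

// OwnerResult is a hypothetical result shape for an ownership compute
// command: one file path plus its full list of owners. Emitting JSON means
// the client never has to guess how a multi-owner value is delimited.
type OwnerResult struct {
	Path   string   `json:"path"`
	Owners []string `json:"owners"`
}

// encodeOwnerResult serializes one logical result as a JSON string,
// e.g. one line of a JSON-lines stream.
func encodeOwnerResult(r OwnerResult) (string, error) {
	b, err := json.Marshal(r)
	return string(b), err
}
```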

So my leading question for you is: do you have a sense of what you would prefer here, and what you're trying to solve? It might be worth thinking through the tradeoffs of these ^. My sense right now is that something like a comma-separated value exposed in $owners via the output command is workable, together with logic in a client that understands it (our webapp, or your own for an MVP), but really you're in the best position to think through the possibilities and constraints.
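
The comma-separated route needs only a small amount of client-side logic. A minimal sketch, assuming $owners arrives as a string like "alice,bob" per file result (the format and function name are assumptions for illustration): split the value and count each owner once per file, so two owners of the same files both show the full file count, as the screenshot above asks for.

```go
package main

import "strings"

// tallyOwners counts files per owner, given one comma-separated owners
// string per file result. The "alice,bob" format is an assumed convention,
// not something compute produces today.
func tallyOwners(ownerValues []string) map[string]int {
	counts := make(map[string]int)
	for _, v := range ownerValues {
		for _, owner := range strings.Split(v, ",") {
			owner = strings.TrimSpace(owner)
			if owner != "" {
				counts[owner]++
			}
		}
	}
	return counts
}
```

This is the "grouping the client understands" idea: the tally happens after splitting, instead of grouping on the raw result string.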

Also, you can punt on thinking about how exactly this works with compute if you just want to start by focusing on the search predicate.

philipp-spiess commented 2 years ago

This is extremely helpful @rvantonder , thank you!

Implementing some initial caching there seems appropriate (though I would not rush to start caching things?)

Just to add here: what I had in mind is nothing fancy, just that when we need to fetch some mappings from the database, we could keep them in memory and avoid fetching the same data again. Agreed that we'll do that on demand rather than "for the sake of it".
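
That kind of "nothing fancy" memoization can be sketched as follows (all names are illustrative; fetch stands in for the database call): a mutex-guarded map that fetches a repo's mapping at most once and reuses it afterwards.

```go
package main

import "sync"

// ownerCache memoizes a per-repo ownership lookup in memory. This is an
// illustrative sketch, not a Sourcegraph type; fetch stands in for a
// database query.
type ownerCache struct {
	mu    sync.Mutex
	data  map[string][]string
	fetch func(repo string) []string
}

func newOwnerCache(fetch func(string) []string) *ownerCache {
	return &ownerCache{data: make(map[string][]string), fetch: fetch}
}

// Owners returns the mapping for repo, calling fetch at most once per repo.
func (c *ownerCache) Owners(repo string) []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if owners, ok := c.data[repo]; ok {
		return owners
	}
	owners := c.fetch(repo)
	c.data[repo] = owners
	return owners
}
```

Scoping the cache to a single search (e.g. owned by the job) would sidestep invalidation questions entirely, since it dies with the request.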

Can you outline what the inputs and outputs are so we can think about whether this is information worth propagating into search backends?

We'll get more clarity when we reconcile our customer feedback, but for now I don't think we need to pass anything into the search backends either. Having a job that can do post-search filtering should be more than enough to get me started.

do you have a sense of what you would prefer for the above and what you're trying to solve? It might be worth thinking through the tradeoffs of these ^. My sense right now is that maybe something like a comma-separated value exposed in $owners via the output command is workable, and building logic in a client that understands that (our webapp or your own for MVP), but really you're in the best position to think through the possibilities/constraints about that.

YES! What you said makes a lot of sense. The client right now seems to do a simple grouping by the resulting content, but we could teach it a different way to group data. I agree that this is probably not something we need to worry about in the current version.


I only have one more question at the moment: to prepare this code to be pushed into main, would you say a simple feature flag is enough to gate it for now, or do we have other patterns for this type of change?
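
For the simple-flag option, the gating itself is tiny. A hedged sketch (Sourcegraph has its own feature-flag machinery; the flags type and the "search-ownership" name here are illustrative stand-ins): when the flag is off, the original pipeline runs untouched.

```go
package main

// flags is an illustrative stand-in for a feature-flag store, e.g. one
// populated from site config or per-user flags.
type flags map[string]bool

// enabled reports whether the named flag is on, defaulting to off.
func (f flags) enabled(name string) bool { return f[name] }

// maybeWrapOwnershipJob runs the ownership-filtered pipeline only when the
// (hypothetical) "search-ownership" flag is set; otherwise behavior is
// unchanged.
func maybeWrapOwnershipJob(f flags, runFiltered, runPlain func() string) string {
	if f.enabled("search-ownership") {
		return runFiltered()
	}
	return runPlain()
}
```

Defaulting to off is what makes this safe to merge into main before the feature is ready.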

rvantonder commented 2 years ago

I think we should do this:

ryanslade commented 2 years ago

Think the wrong Ryan was tagged above. cc @ryankscott