sourcegraph / sourcegraph-public-snapshot


Code Insights: change the behavior of drilldown #60739

Closed camdencheek closed 3 months ago

camdencheek commented 8 months ago

When drilling down into a code insights data point, we generate a diff search for the difference between that data point and the point before it. This is not ideal for many reasons:

1) The semantics are somewhat confusing. You don't see what the data point represents, you only see the diff.
2) Many code insights searches cannot actually be represented as a diff search.
3) The query rewriting has been the source of a bunch of different bugs.
4) Diff search is slow.

Instead, we want to change the behavior to drill down into a search at the commit that data point represents.
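
To make the difference concrete, here is a rough sketch of the two query shapes for a single repo. It is not the actual Code Insights query rewriting: the function names, the example repo, and the `TODO` pattern are hypothetical, and the only Sourcegraph syntax assumed is the `type:diff`, `after:`, `before:` filters and the `repo:<pattern>@<rev>` revision form.

```python
import re


def diff_drilldown_query(base_query: str, repo: str, after: str, before: str) -> str:
    """Current behaviour as described above: a diff search bounded by the
    previous data point (after) and the clicked data point (before)."""
    return f"repo:^{re.escape(repo)}$ type:diff after:{after} before:{before} {base_query}"


def commit_drilldown_query(base_query: str, repo: str, commit: str) -> str:
    """Proposed behaviour: run the original search at the commit the data
    point represents, so the results are what the point actually counted."""
    return f"repo:^{re.escape(repo)}$@{commit} {base_query}"


if __name__ == "__main__":
    # Hypothetical repo, dates, and commit, for illustration only.
    print(diff_drilldown_query("TODO", "github.com/example/repo", "2024-01-01", "2024-02-01"))
    print(commit_drilldown_query("TODO", "github.com/example/repo", "abc123"))
```

The second form needs no rewriting of the user's query beyond pinning the repo to a revision, which is where the diff-based approach tends to go wrong.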

### Related issues
- [ ] https://github.com/sourcegraph/sourcegraph/issues/60686
- [ ] https://github.com/sourcegraph/sourcegraph/issues/58906

AJKemps commented 6 months ago

This customer, who originally reported the issue in this ticket, is asking for it to be prioritized as it's "basically made Code Insights completely unusable for our users and we have many internal customers asking us for an update." This is their only hard ask from us at this time.

camdencheek commented 6 months ago

I started looking into this today because I was hopeful it was going to be a relatively easy change.

Narrator: It will not be a relatively easy change.

Attempting to capture what I've found so far below:

So, to implement this change, we would need to:

1) Fetch data from the database in unaggregated form (we currently aggregate by the timestamp, which I expect provides significantly better performance).
2) For each repo represented in each point, look up the commit at that timestamp.
3) Generate a query with one OR entry per repo queried for each data point.
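
The three steps above could look roughly like the following. This is a sketch under assumptions, not the insights implementation: the `(timestamp, repo)` row shape, the `drilldown_queries` function, and the `resolve_commit` callback (standing in for a gitserver lookup) are all hypothetical, and the parenthesized `OR` of `repo:` filters is an assumed query shape.

```python
import re
from collections import defaultdict
from datetime import datetime
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical unaggregated row: one entry per (data point, repo).
Row = Tuple[datetime, str]


def drilldown_queries(
    rows: Iterable[Row],
    base_query: str,
    resolve_commit: Callable[[str, datetime], str],
) -> Dict[datetime, str]:
    """Build one drilldown query per data point.

    resolve_commit is a stand-in for asking gitserver which commit a repo
    was at for a given timestamp; it is called once per (repo, point).
    """
    # Step 1: keep the rows unaggregated, grouped only by data point.
    repos_by_point: Dict[datetime, List[str]] = defaultdict(list)
    for timestamp, repo in rows:
        repos_by_point[timestamp].append(repo)

    queries: Dict[datetime, str] = {}
    for timestamp, repos in sorted(repos_by_point.items()):
        # Step 2: look up the commit for every repo at this point.
        filters = [
            f"repo:^{re.escape(repo)}$@{resolve_commit(repo, timestamp)}"
            for repo in repos
        ]
        # Step 3: one OR entry per repo, combined with the original query.
        queries[timestamp] = f"({' OR '.join(filters)}) {base_query}"
    return queries
```

Each data point ends up with its own query, and each query carries one revision-pinned `repo:` filter per repo.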

My biggest concern here is I expect the non-aggregated data to be large. For N data points and M repos, we'll have N*M unique commits for a series. That means we'll have to make N*M requests to gitserver to get the commit at that timestamp, we'll need to send back queries with N*M repo:repo1@commit filters, and we'll need to make M independent subqueries at search time, which will be somewhat slow as well. Since insights already struggles at enterprise scale, I do not think it would be a good idea to implement it as described.
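
As a back-of-the-envelope illustration of that scaling, the counts can be written out directly. The function name and the example sizes below are made up for illustration; only the N*M and M relationships come from the estimate above.

```python
def drilldown_cost(num_points: int, num_repos: int) -> dict:
    """Count the work implied by N data points and M repos."""
    commits = num_points * num_repos
    return {
        "gitserver_lookups": commits,     # one commit lookup per (point, repo)
        "repo_filters": commits,          # one repo:...@commit filter per lookup
        "search_subqueries": num_repos,   # M independent subqueries at search time
    }


if __name__ == "__main__":
    # Hypothetical series: 12 data points across 1,000 repos.
    print(drilldown_cost(12, 1000))
```

For that hypothetical series, N*M comes to 12,000 commit lookups and the same number of filters.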

Some options to reduce cost: