Code Insights: change the behavior of drilldown

camdencheek commented 8 months ago

When drilling down into a code insights data point, we generate a diff search for the difference between that data point and the point before it. This is not ideal for many reasons:

1. It is somewhat confusing semantics. You don't see what the data point represents, you only see the diff
2. Many code insights searches cannot actually be represented as a diff search
3. The query rewriting has been the source of a bunch of different bugs
4. Diff search is slow

Instead, we want to change the behavior to drill down into a search at the commit that data point represents.

### Related issues
- [ ] https://github.com/sourcegraph/sourcegraph/issues/60686
- [ ] https://github.com/sourcegraph/sourcegraph/issues/58906

AJKemps commented 6 months ago

This customer, who originally reported the issue in this ticket, is asking for it to be prioritized as it's "basically made Code Insights completely unusable for our users and we have many internal customers asking us for an update." This is their only hard ask from us at this time.

camdencheek commented 6 months ago

I started looking into this today because I was hopeful it was going to be a relatively easy change.

Narrator: It will not be a relatively easy change.

Attempting to capture what I've found so far below:

- The goal is to make clicking on an insights data point generate a search query that yields the set of results that correspond to the data point that is clicked. So if the data point shows 30, our search should have 30 results.
- This is different from the current behavior, which attempts to generate a diff search that yields the difference in the results between a data point and its previous data point.
- In order to generate a single query that yields all the results represented, we would need to generate a query that looks like `(repo:repo1@abcde OR repo:repo2@bcdef) needle` because we need to query each repo in the set at a specific revision corresponding to that point in time, and we do not support anything like `rev:head.at(1996-06-28)` (though that would be really cool!)
- We generate the diff query for each point on the backend with `PointDiffQuery`
- The data we use to generate the diff query ultimately comes from the `series_points` table, which stores points at timestamps partitioned by repo, but does not store the commit that was searched along with it.
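As a rough illustration of the query shape described above, here is a minimal sketch in Go. The `RepoRevision` type and `BuildDrilldownQuery` function are hypothetical names made up for this example, not existing code, and the sketch assumes repo names and commits can be pasted into the `repo:` filter as-is (a real implementation would likely need to quote or anchor the repo name).

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"strings"
)

// RepoRevision pairs a repo with the commit that was searched for one data
// point. Hypothetical type, for illustration only.
type RepoRevision struct {
	Repo   string
	Commit string
}

// BuildDrilldownQuery returns a query of the form
// (repo:r1@c1 OR repo:r2@c2) <pattern>, with one OR entry per repo.
// Assumption: repo names need no escaping inside the repo: filter.
func BuildDrilldownQuery(revs []RepoRevision, pattern string) (string, error) {
	if len(revs) == 0 {
		// Without any repo filter the query would search everything,
		// which is not what the data point represents.
		return "", errors.New("no repo revisions for this data point")
	}
	// Copy and sort so the generated query is deterministic.
	sorted := append([]RepoRevision(nil), revs...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Repo < sorted[j].Repo })

	filters := make([]string, 0, len(sorted))
	for _, r := range sorted {
		filters = append(filters, fmt.Sprintf("repo:%s@%s", r.Repo, r.Commit))
	}
	return "(" + strings.Join(filters, " OR ") + ") " + pattern, nil
}

func main() {
	q, err := BuildDrilldownQuery([]RepoRevision{
		{Repo: "repo2", Commit: "bcdef"},
		{Repo: "repo1", Commit: "abcde"},
	}, "needle")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(q) // (repo:repo1@abcde OR repo:repo2@bcdef) needle
}
```

The query grows by one `repo:` filter per repo in the data point, which is where the size concern comes from.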

So, to implement this change, we would need to:

1. Fetch data from the database in unaggregated form (we currently aggregate by the timestamp, which I expect provides significantly better performance)
2. For each repo represented in each point, we would need to look up the commit at that timestamp
3. We would generate a query with one OR entry per repo queried for each data point

My biggest concern here is I expect the non-aggregated data to be large. For N data points and M repos, we'll have up to N*M unique commits for a series. That means we'll have to make N*M requests to gitserver to get the commit at each timestamp, we'll need to send back queries with N*M `repo:repo1@commit` filters in total, and we'll need to make M independent subqueries at search time, which will be somewhat slow as well. Since Code Insights already struggles at enterprise scale, I do not think it would be a good idea to implement it as described.
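To make the scaling concrete, here is a small back-of-the-envelope sketch in Go. `DrilldownCost` and `EstimateCost` are hypothetical names, and the 12 points and 5,000 repos in `main` are made-up inputs, not measurements from any real instance.

```go
package main

import "fmt"

// DrilldownCost holds the counts from the estimate above for one series.
// Hypothetical type, for illustration only.
type DrilldownCost struct {
	GitserverLookups   int // commit lookups needed when loading the series
	RepoFilters        int // repo:...@commit filters across all point queries
	SubqueriesPerClick int // independent subqueries when one point is clicked
}

// EstimateCost applies the N*M reasoning: every (point, repo) pair needs its
// own commit, and each clicked point fans out to one subquery per repo.
func EstimateCost(points, repos int) DrilldownCost {
	return DrilldownCost{
		GitserverLookups:   points * repos,
		RepoFilters:        points * repos,
		SubqueriesPerClick: repos,
	}
}

func main() {
	// Assumed example: 12 data points over 5,000 repos.
	c := EstimateCost(12, 5000)
	fmt.Println(c.GitserverLookups, c.RepoFilters, c.SubqueriesPerClick) // 60000 60000 5000
}
```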

Some options to reduce cost:

- Add the commit ID to the table so we don't need to look up the commit at series fetch time. That saves M*N gitserver operations. We probably already have to get this information when backfilling, so it should already be readily available. The downside is this would either require an expensive OOB migration or some lazy-populating logic.
- Add a special operator to search like `repo:_at.insight.timestamp(<series_id>, 1996-06-28)`. This would allow us to push the commit resolution to search time so we don't pay that ahead of time, which keeps our series payloads and search queries smaller.
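For the first option, the lazy-populating variant could look roughly like the sketch below. This is only an illustration under assumptions: the in-memory map stands in for a new commit column on `series_points`, `ResolveFunc` stands in for a gitserver lookup, and every name here (`CommitCache`, `CommitFor`, `countingResolver`, `examplePoint`) is hypothetical. It is also not safe for concurrent use.

```go
package main

import (
	"fmt"
	"time"
)

// pointKey identifies one (repo, timestamp) row.
type pointKey struct {
	repo string
	unix int64
}

// ResolveFunc stands in for a gitserver "commit at timestamp" lookup.
type ResolveFunc func(repo string, at time.Time) (string, error)

// CommitCache fills in the commit for a point the first time it is requested
// and reuses the stored value afterwards.
type CommitCache struct {
	commits map[pointKey]string
	resolve ResolveFunc
}

func NewCommitCache(resolve ResolveFunc) *CommitCache {
	return &CommitCache{commits: make(map[pointKey]string), resolve: resolve}
}

// CommitFor returns the stored commit if present; otherwise it resolves the
// commit once and stores it.
func (c *CommitCache) CommitFor(repo string, at time.Time) (string, error) {
	key := pointKey{repo: repo, unix: at.Unix()}
	if commit, ok := c.commits[key]; ok {
		return commit, nil
	}
	commit, err := c.resolve(repo, at)
	if err != nil {
		return "", err
	}
	c.commits[key] = commit
	return commit, nil
}

// countingResolver is a fake resolver that always returns commit and counts
// how often it is called.
func countingResolver(commit string, calls *int) ResolveFunc {
	return func(string, time.Time) (string, error) {
		*calls++
		return commit, nil
	}
}

var examplePoint = time.Date(1996, time.June, 28, 0, 0, 0, 0, time.UTC)

func main() {
	calls := 0
	cache := NewCommitCache(countingResolver("abcde", &calls))
	for i := 0; i < 3; i++ {
		if _, err := cache.CommitFor("repo1", examplePoint); err != nil {
			fmt.Println("error:", err)
			return
		}
	}
	fmt.Println("resolver calls:", calls) // resolver calls: 1
}
```

The trade-off is that the first drilldown on an unpopulated point still pays for the lookup; only later requests for the same point skip it.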

sourcegraph / sourcegraph-public-snapshot

Code Insights: change the behavior of drilldown #60739