Closed camdencheek closed 3 months ago
This customer, who originally reported the issue in this ticket, are asking for it to be prioritized as it's "basically made Code Insights completely unusable for our users and we have many internal customers asking us for an update." This is their only hard ask from us at this time.
I started looking into this today because I was hopeful it was going to be a relatively easy change.
Narrator: It will not be a relatively easy change.
Attempting to capture what I've found so far below:
(repo:repo1@abcde OR repo:repo2@bcdef) needle
because we need to query each repo in the set at a specific revision corresponding to that point in time, and we do not support anything like rev:head.at(1996-06-28)
(though that would be really cool!)So, to implement this change, we would need to: 1) Fetch data from the database in unaggregated form (we currently aggregate by the timestamp, which I expect provides significantly better performance) 2) For each repo represented in each point, we would need to look up the commit at that timestamp 3) We would generate a query with one OR entry per repo queried for each data point
My biggest concern here is I expect the non-aggregated data to be large. For N
data points and M
repos, we'll have N*M
unique commits for a series. That means we'll have to make N*M
requests to gitserver to get the commit at that timestamp, we'll need to send back queries with N*M repo:repo1@commit
filters, and we'll need to make M
independent subqueries at search time, which will be somewhat slow as well. Since insights already struggles at enterprise scale, I do not think it would be a good idea to implement it as described.
Some options to reduce cost:
M*N
gitserver operations. We probably already have to get this information when backfilling, so it should already be readily available. The downside is this would either require an expensive OOB migration or some lazy-populating logic.repo:_at.insight.timestamp(<series_id>, 1996-06-28)
. This would allow us to push the commit resolution to search time so we don't pay that ahead of time, which keeps our series payloads and search queries smaller.
When drilling down into a code insights data point, we generate a diff search for the difference between that data point and the point before it. This is not ideal for many reasons: 1) It is somewhat confusing semantics. You don't see what the data point represents, you only see the diff 2) Many code insights searches cannot actually be represented as a diff search 3) The query rewriting has been the source of a bunch of different bugs 4) Diff search is slow
Instead, we want to change the behavior to drill down into a search at the commit that data point represents.