insights: test monorepo improvements and confirm success

Joelkw commented 2 years ago

We've been making improvements to code insights running over large monorepos. This issue is focused on the testing and confirming we are "done" with our monorepo support work, and monorepo codebases (repo size 4GB+) are supported as well as any other codebase.

To test this, we should create common persisted (backend) insights over the megarepo and gigarepo for some common flows: 1) A single-series insight with a literal search 2) A single-series insight with a regexp search 3) A 5-series insight with literal searches, at least one file: or lang: filter on a series 4) A 5-series insight with regexp searces (using OR modifiers as well, like |), at least one file: or lang: filter on a series 5) A capture group series over just the monorepo for something with 20+ match types (think licenses, go versions, docker versions) 6) A capture group series over all repos including the monorepo for something with 20+ match types 7) a language pie chart insight over the monorepo

We should then test that those insights (where applicable) handle:

Filtering to only include the monorepo
Filtering to exclude the monorepo
A context that includes the monorepo

When we test this, we should track (roughly) time to completion to see if performance is an order of magnitude worse. Heuristically, if these test insights take >24 hours to run individually, that approaches the upper bound of acceptable first-class-supported performance.

sourcegraph-bot-2 commented 2 years ago

Heads up @joelkw @felixfbecker @vovakulikov @unclejustin - the "team/code-insights" label was applied to this issue.

felixfbecker commented 2 years ago

We should probably test this rather sooner in the iteration than later, in case there are issues found we need to fix. Assigning @leonore tentatively since you and @coury-clark worked on this

leonore commented 2 years ago

*remember we might want to tweak some of these so we don't hit caching and ensure we are actually testing the functionality

discarding test cases n5 and n7 as per Joel's comments. n5 should be tested after JIT runs on streaming search, #35306.

A single-series insight with a literal search ⚠️

lang:SCSS -file:module select:file

created at 15:45 UTC on Friday. time to completion unclear due to the weekend

Warning sign as today's snapshot seems to have returned very high values:

SELECT COUNT(*) from series_points_snapshots WHERE series_id = '297GEf0eHg1Gl8OpkZ6XSRoGjKi';
-- returns 1897

SELECT COUNT(*) from series_points WHERE series_id = '297GEf0eHg1Gl8OpkZ6XSRoGjKi' AND time = '2022-04-13 00:00:00+00';
-- returns 18

For the snapshot query the top hitters also include a gigarepo. Is there an issue with backfilling?

When filtering to sg/sg
- Values seem correct (validated on latest point and 13 Sept 2021- https://k8s.sgdev.org/search?q=context:global+repo:github.com/sourcegraph/sourcegraph%24%407f0f45f++lang:SCSS+-file:module+select:file&patternType=literal)
When filtering to megarepo
- Only returns a singular data point? ⚠️

But data is correct:

I searched for a specific diff from roughly 9 months ago, and got this :

After further investigation it seems that the issue comes from search hitting a timeout. once the search is made once, once retried it works (maybe a reindexing is triggered?)

A single-series insight with a regexp search

patterntype:regexp sourceCompatibility (\d[_\.]\d) f:build\.gradle$

insight created at 10:54 UTC As of 15:24 UTC the insight is still being processed but has data.

Similar to above data points seem to jump on the snapshot.

And once again the values for the megarepo are also just limited to the snapshot.

A 5-series insight with literal searches, at least one file: or lang: filter on a series

created with different logging searches on Friday at 15:47 UTC.

still being processed as of Monday 11 UTC. Some jobs still to be stuck in queued still.

SELECT * from insights_query_runner_jobs WHERE series_id =’297Gj5O4LWkGqMHL2lRna73KwFJ’;

we do see something similar happening though, where again the snapshot values massively jump up compared to the rest.

A capture group series over all repos including the monorepo for something with 20+ match types

patterntype:regexp var$--(.*color.*)$ lang:css

Time to backfill: I forgot to track the rough start/end but it got queued at 11:42 UTC and was done by 14:30 latest. So in that bracket

A 5-series insight with regexp searches (using OR modifiers as well, like |), at least one file: or lang: filter on a series

patterntype:regexp file:package.json "[license|licence]":\s(.*), <- notice added |

patterntype:regexp file:helm lang:YAML \sversion:\s[v']?(\d+\.\d+\.\d)

patterntype:regexp import\s.*\sfrom\s(.*) lang:TypeScript

file:readme.md (\n.*){50} select:repo patterntype:regexp

patterntype:regexp lang:go file:test context\.(\w+)

Filtering to only include the monorepo
Filtering to exclude the monorepo

(nb the timeout would happen on both direct search and search through insights so I'm counting the results as correct here)

A context that includes the monorepo

monorepo: github.com/sgtest/megarepo

https://k8s.sgdev.org/contexts/@leo.p/sg-and-megarepo

Joelkw commented 2 years ago

@leonore thanks for listing the test plan!

One other thing I wanted to add/note explicitly: can we ensure we also test "correctness" in addition to performance, for these states?

(For correctness verification, in the past we just use Sourcegraph searches, the trick is just making sure you're searching the older revisions and that those older revisions are the correct date matches, which can be done with our repo:someRepo@version filter).

Joelkw commented 2 years ago

Update:

For #5) – this may not be relevant if it runs JIT; we just want to test the backend persisted/streaming. (Customers would love it, but it wasn't part of our original work plan so I don't necessarily expect it works.) For #7) – this is more important than #5 to get working, but still ultimately less important than any other case listed here.

leonore commented 2 years ago

My recap for monorepo testing:

streaming search/compute work give results as graphQL does
time to backfill is hard to calculate but I've had insights finish as short as in 2 hours (compute ones especially), literal searches take longer as there's usually more results
however we are dependent on searcher improvements for large repos. essentially when we see this big jump of data between a snapshot value and historical data points (as shown in different test cases above), it's because we are hitting searcher for a repo/revision a second time so there are some caching benefits there
i.e. what we're seeing isn't a regression from graphql search.
snapshot data points return correct values for the megarepo, and megarepo contexts.

Joelkw commented 2 years ago

Thanks @leonore ! Are there github issues for the remaining steps we'd need to either spike on or implement to provide full monorepo support? My understanding is that your third bullet means we don't support large monorepos yet for customers to not have issues making insights over them – or that incorrect?

leonore commented 2 years ago

@Joelkw yes that's correct, because of the issues with indexed search mentioned there we can't provide full support. I've chased the search teams about improvements they can make there so that we can link issues here. ideally if we can get some streaming errors surfaced to us in insights then we could retry on them.

I'll close this issue in favour of working on that once I get more information.

Joelkw commented 2 years ago

Thanks @leonore !

leonore commented 2 years ago

@camdencheek following up here as it will be easier not to lose track of. you mentioned in slack that partial timeout errors might not get propagated. is there a plan to surface them, and if so, is there a ticket for this/can one be created so insights can then eventually use that info?

camdencheek commented 2 years ago

Okay, so looking closer at this, I think we do propagate the "timed out" message, but not as an error. We send it up as a SkippedReason. For the code insights use-case, it looks to me like receiving an event with skipped.Reason == "shard-timeout" here should be translated to an error so code insights knows it isn't getting complete results for that query and can retry.

@leonore does that make sense based on what you've seen?

leonore commented 2 years ago

@camdencheek yep I can see some logs on dogfood where we are hitting shard-timeout. so I'll open an issue for us to work on using that information, see if that fixes the issue. thanks!

sourcegraph / sourcegraph-public-snapshot