insights: search can produce non-deterministic result counts in the case of timeouts for large search corpus

coury-clark commented 2 years ago

We have observed two behaviors when creating a code insight using streaming search with a large (millions of results) result set.

1. Insights execute in two modes. In backfill mode we generate and execute one search query per repository, and each query gets the same timeout (say 60 seconds), so the backfill has a total of num_repos * 60s to perform its searches. In snapshot mode we execute a single search query over all repositories with that same timeout (say 60 seconds). As a result, the backfill typically has multiple orders of magnitude more time to search than the global query. If we instead gave the global snapshot search the same total time (num_repos * timeout), we would likely see the opposite problem: because the search runs concurrently across shards, larger repos that would otherwise exceed the single-repo timeout would get more overall time.
2. It seems from testing in the UI that global searches generate non-deterministic results when the search cannot complete within the timeout window. One hypothesis is that the searches spread across the zoekt shards non-deterministically (or perhaps the repos are sharded non-deterministically). This means that when we do hit a timeout, the result count at that point is not deterministic, which results in non-deterministic insight values.
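
To make the two behaviors concrete, here is a toy model in Python. It is only a sketch under stated assumptions: the function names, the repo count, and the per-shard costs are all hypothetical, and the "visit shards in some order until the deadline" loop illustrates the hypothesis above rather than how zoekt actually schedules work.

```python
def backfill_budget_seconds(num_repos: int, timeout_s: int) -> int:
    """Backfill runs one query per repo, so every repo gets its own timeout."""
    return num_repos * timeout_s


def snapshot_budget_seconds(timeout_s: int) -> int:
    """Snapshot runs one global query, so all repos share a single timeout."""
    return timeout_s


def count_until_deadline(shards, deadline_s):
    """Sum match counts over shards in the given order until time runs out.

    Each shard is a (cost_seconds, match_count) pair. This is a hypothetical
    model of a search that stops at the first shard it cannot finish in time.
    """
    elapsed = 0
    total = 0
    for cost_s, matches in shards:
        if elapsed + cost_s > deadline_s:
            break  # deadline hit: remaining shards are never counted
        elapsed += cost_s
        total += matches
    return total


# Hypothetical instance with 1000 repos and a 60s timeout.
print(backfill_budget_seconds(1000, 60))  # 60000
print(snapshot_budget_seconds(60))        # 60

# Same shards, two visiting orders, same 60s deadline: different counts.
shards = [(40, 1000), (30, 10), (30, 5)]
print(count_until_deadline(shards, 60))        # 1000
print(count_until_deadline(shards[::-1], 60))  # 15
# With enough time the order no longer matters.
print(count_until_deadline(shards, 100))       # 1015
```

In this model the count only becomes stable once the deadline is large enough to cover every shard, which matches the observation that the problem shows up only for very large result sets.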

According to the search team, the timeout is set universally for an instance through the site config option maxTimeoutSeconds, which is set to 60 seconds. The context for this option is that:

most customers are behind a load balancer that has a timeout, e.g. dot-com has a 60s timeout due to Cloudflare. This might be historical, and SSE with streaming would likely get around that.

sourcegraph-bot-2 commented 2 years ago

Heads up @joelkw @felixfbecker @vovakulikov @unclejustin - the "team/code-insights" label was applied to this issue.

leonore commented 2 years ago

We don't want to run scoped queries for snapshots because global queries have orders of magnitude less overhead (for example, they don't require git lookups for every repo). Consider what would happen if every snapshot for every insight on a 35k repo instance queued up 35k repos of work, daily.

Our options to get past the issue filed here are:

- Solve this timeout / fairness problem, i.e. override this global option so that we can give a bigger timeout to snapshot queries
- Find a new paradigm that is more consistent for large result sets
- Queue snapshots individually, maybe for monorepos only
  - can we get repo size when getting repo information and run those queries separately, excluding them from the global query?
  - is this too complicated?
- Something else?
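
For the third option, a rough sketch of what splitting large repos out of the global query might look like, in Python. Everything here is an assumption for discussion: `Repo`, `plan_snapshot_queries`, and the size threshold are hypothetical names and values, not existing insights code, and the plan relies on the `repo:` / `-repo:` search filters to scope and exclude repositories.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Repo:
    name: str
    size_bytes: int  # assumed to be available alongside other repo information


def plan_snapshot_queries(base_query, repos, threshold_bytes):
    """Return (global_query, scoped_queries) for one snapshot.

    Repos at or above the (hypothetical) size threshold each get their own
    scoped query and are excluded from the single global query.
    """
    large = [r for r in repos if r.size_bytes >= threshold_bytes]
    # One scoped query per large repo, anchored so only that repo matches.
    scoped = [f"{base_query} repo:^{re.escape(r.name)}$" for r in large]
    # The global query covers everything else by negating the large repos.
    exclusions = " ".join(f"-repo:^{re.escape(r.name)}$" for r in large)
    global_query = f"{base_query} {exclusions}".strip()
    return global_query, scoped


repos = [
    Repo("github.com/acme/monorepo", 50_000_000_000),
    Repo("github.com/acme/tool", 2_000_000),
]
global_query, scoped = plan_snapshot_queries("package", repos, 10_000_000_000)
print(global_query)
print(scoped)
```

This keeps the daily work at one global query plus one query per large repo instead of one query per repo, though the exclusion list grows with the number of large repos, so it may not scale to instances with many of them.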

Joelkw commented 2 years ago

@leonore thank you for filing and summarizing this! Appreciate the context. A few quick questions:

Is this due to a recent change, or has this always been the case? (Are we just discovering it now because including a monorepo causes the snapshot to time out, or for some other reason?)
Does this only affect charts where a search would also have timed out? In other words, is there no way for a user to use Sourcegraph to exhaustively run the search that's timing out on the insight?
Is this problem going to get worse on even larger instances? Is this something we should have on our radar to solve for strategic-sized customers?

leonore commented 2 years ago

Is this due to a recent change, or has this always been the case? (Are we just discovering it now because including a monorepo causes the snapshot to time out, or for some other reason?)

My belief is that this has always been the case; we're just noticing it now because of our improved performance capabilities (and also because I was running a query with a large result set in the millions, e.g. package, rather than something more granular).

Does this only affect charts where a search would also have timed out? In other words, is there no way for a user to use Sourcegraph to exhaustively run the search that's timing out on the insight?

I think a search is more likely to return closer to the correct number of results, but we've found that the actual number of results can vary across searches (see point 2. in the issue description). Again, this is for queries with large result sets.

Is this problem going to get worse on even larger instances? Is this something we should have on our radar to solve for strategic-sized customers?

I'm not able to give the most confident answer on this, but my view is that the issue lies with both the global timeout on search's side and with the unfair timeout this gives to a global snapshot query on our side. I don't know if a large instance would make a difference here, or if it's more about the kind of query the customers are running.

Joelkw commented 2 years ago

Thanks for those answers! Useful background to think about how far in advance we might need to prioritize this (backlog is fine for now).

sourcegraph / sourcegraph-public-snapshot

insights: search can produce non-deterministic result counts in the case of timeouts for large search corpus #37859