sourcegraph / sourcegraph-public-snapshot

Code AI platform with Code Search & Cody
https://sourcegraph.com
Other
10.11k stars 1.29k forks source link

Large customers: Code Insights results may be incomplete #62295

Open bahrmichael opened 6 months ago

bahrmichael commented 6 months ago

This ticket originated from a Slack message, reported by @mike-r-mclaughlin.

We’ve been struggling to get Code Insights to work at Sentry. the screenshot below is what we see in the UI. From the screenshot it seems like we hit a timeout. Which timeout is it referring to and can that be configured?

image

It’s not entirely clear which log to check. we checked searcher (no warnings/errors during the time period this was being generated) and gitserver (also no seemingly relevant warning/error messages). we did find some relevant errors/warnings in frontend

{
  "SeverityText": "ERROR",
  "Timestamp": 1712170603579550779,
  "InstrumentationScope": "Resolver.live_preview_resolver",
  "Caller": "query/streaming_query_executor.go:154",
  "Function": "github.com/sourcegraph/sourcegraph/internal/insights/query.(*StreamingRepoQueryExecutor).ExecuteRepoList",
  "Body": "repo search encountered skipped events",
  "Resource": {
    "service.name": "frontend",
    "service.version": "5.2.7",
    "service.instance.id": "da92ab9c4de7"
  },
  "Attributes": {
    "reasons": "[shard-match-limit: Not all results have been returned due to hitting a match limit. Sourcegraph has limits for the number of results returned from a line, document and repository.]",
    "query": "fork:yes archived:yes repo:.* count:21"
  }
}

Logs: https://sourcegraph.slack.com/files/U01H4LDHVMF/F0703Q1EU2X/sourcegraph-frontend-0_--tail_100.txt

bahrmichael commented 6 months ago

We’ve been struggling to get Code Insights to work at Sentry. the screenshot below is what we see in the UI. From the screenshot it seems like we hit a timeout. Which timeout is it referring to and can that be configured?

I assume that the datapoints come from the backfill process.

Unfortunately this is a bit hard to pin point, since there are multiple timeouts. The one that we're seeing is most likely coming from a downstream service, or an error falsely being interpreted as a timeout. When I finished cloning chromium I should be able to reproduce some timeout errors and find the right places in the code that are causing this.

It’s not entirely clear which log to check. we checked searcher (no warnings/errors during the time period this was being generated) and gitserver (also no seemingly relevant warning/error messages). we did find some relevant errors/warnings in frontend

From what I'm seeing locally, the logs should be in the worker with the scope worker.insights-job.background.Handle. Here I can see error messages like "error during backfill execution", "insights backfill progress handler loaded", "backfill set to completed state", "backfill disabled due to exceeding error threshold", and others. You need to set the log level to INFO to see all of them.

A log line like the one below should tell us which downstream this error may be coming from.

[         worker] ERROR .../backfill_state_inprogress_handler.go:206 error during backfill execution ... "handle": {"seriesId": 12, "seriesUniqueId": "2gBFsFt6a1A5DmoRJTbJKLfsCex", "backfillId": 6, "totalDuration": "3m41.207475831s", "repoTotalCount": 17, "errorCount": 8, "percentComplete": 0.8235294117647058, "erroredRepos": 1, "error": "searchHandler: streaming search: errors: [rpc error: code = Unknown desc = hybrid search failed: iterating over changed files in git diff: rpc error: code = Unknown desc = failed to get changed file: encountered unknown file status \"T\" for file \"internal/telemetrygateway/v1/telemetrygateway.proto\"]\nrateLimiter.Wait: context canceled\nrateLimiter.Wait: context canceled\nrateLimiter.Wait: context canceled\nrateLimiter.Wait: context canceled\nrateLimiter.Wait: context canceled\nrateLimiter.Wait: context canceled\nrateLimiter.Wait: context canceled\nrateLimiter.Wait: context canceled\nrateLimiter.Wait: context canceled\nrateLimiter.Wait: context canceled"}}

Based on the line below from your logs I guess that the search limits are stopping us here. They are not configurable in the currently released version.

{"SeverityText":"ERROR","Timestamp":1712170603579550779,"InstrumentationScope":"Resolver.live_preview_resolver","Caller":"query/streaming_query_executor.go:154","Function":"github.com/sourcegraph/sourcegraph/internal/insights/query.(*StreamingRepoQueryExecutor).ExecuteRepoList","Body":"repo search encountered skipped events","Resource":{"service.name":"frontend","service.version":"5.2.7","service.instance.id":"da92ab9c4de7"},"Attributes":{"reasons":"[shard-match-limit: Not all results have been returned due to hitting a match limit. Sourcegraph has limits for the number of results returned from a line, document and repository.]","query":"fork:yes archived:yes repo:.* count:21"}}

I'm curious about the count:21, and if the query is the full query we're using in the code insight. I've seen some places in the code where it checks against the given limit, and aborts if there are more entries.

What still sounds odd to me is that shard-match-limit would be translated to a timeout.

camdencheek commented 6 months ago

hybrid search failed: iterating over changed files in git diff: rpc error: code = Unknown desc = failed to get changed file: encountered unknown file status \"T\" for file \"internal/telemetrygateway/v1/telemetrygateway.proto\"

This seems like a bug. It looks like we only recognize file statuses A, M, and D, but according to man git-diff-tree, there are various other valid statuses.

       Possible status letters are:
       •   A: addition of a file
       •   C: copy of a file into a new one
       •   D: deletion of a file
       •   M: modification of the contents or mode of a file
       •   R: renaming of a file
       •   T: change in the type of the file (regular file, symbolic link or
           submodule)
       •   U: file is unmerged (you must complete the merge before it can be
           committed)
       •   X: "unknown" change type (most probably a bug, please report it)
bahrmichael commented 6 months ago

FYI @camdencheek I copied the previous text to a local file as I iterate on the problem and get closer to the root cause. Just in case you're wondering where it just went.

camdencheek commented 6 months ago

Thanks! I went ahead and poked the Source team here about the diff iterator bug

bahrmichael commented 6 months ago

To help with excluding repos that cause problems I added this issue: https://github.com/sourcegraph/sourcegraph/issues/62578