sourcegraph / sourcegraph-public-snapshot

Code AI platform with Code Search & Cody
https://sourcegraph.com

โญ Q3B1 Mitigate Observability expenses #42140

Closed jhchabran closed 1 year ago

jhchabran commented 2 years ago

This was brought to us by @bobheadxi in a Slack message.

Problem

Dotcom is seeing continuous growth in trace exports, which now amounts to significant expense ($22k in Grafana this month). @bobheadxi worked on a few mitigation efforts, but I think we need a more concerted strategy for how we handle tracing in the application (see below), and we need to investigate why this is happening and how we can keep costs under control.

We are missing guidance on how to write good tracing code, and tracing policy may be misused because of how tracing policies have long been implemented in Sourcegraph. I suspect a significant issue is that background jobs always enable tracing (https://sourcegraph.sourcegraph.com/github.com/sourcegraph/sourcegraph@af1c9d3a8ef24cc35f3b248b66b0e89dfed46258/-/blob/internal/workerutil/worker.go?L315) because there's no concept of "selective" for background jobs yet, which causes libraries like otelsql to emit large amounts of traces for every background job that runs. @jhchabran and @burmudar previously did some work on options for otel-level sampling. IIRC (@bobheadxi), one of the conclusions was that because of selective we can't reliably suggest generalized probabilistic sampling, so we need a solution designed and implemented here (whether that be more feature flags, research into more filtering/sampling, etc.).
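To make the "selective for background jobs" idea concrete, here is a minimal sketch in Go using only the standard library. It is not the Sourcegraph implementation: `TracePolicy`, `WithShouldTrace`, `ShouldTrace`, `JobSampler`, and `CountTraced` are hypothetical names, and the 1-in-N opt-in is only one possible choice, not something decided in this issue. The assumption is that each background job decides once, at its root, whether to opt in to tracing, and that downstream code (for example a database wrapper) consults the same flag before emitting spans.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
)

// TracePolicy is a hypothetical stand-in for an application-wide tracing policy.
type TracePolicy string

const (
	TraceNone      TracePolicy = "none"      // never trace
	TraceSelective TracePolicy = "selective" // trace only work that opted in
	TraceAll       TracePolicy = "all"       // trace everything
)

type shouldTraceKey struct{}

// WithShouldTrace records on the context whether this unit of work opted in.
func WithShouldTrace(ctx context.Context, optIn bool) context.Context {
	return context.WithValue(ctx, shouldTraceKey{}, optIn)
}

// ShouldTrace combines the policy with the per-context opt-in flag.
// A missing flag counts as "not opted in".
func ShouldTrace(ctx context.Context, policy TracePolicy) bool {
	switch policy {
	case TraceAll:
		return true
	case TraceSelective:
		optIn, _ := ctx.Value(shouldTraceKey{}).(bool)
		return optIn
	default:
		return false
	}
}

// JobSampler opts in the first job and then every Every-th job after it.
// Every == 0 means no job is ever opted in.
type JobSampler struct {
	Every uint64
	count uint64
}

// Next reports whether the next job should opt in. Safe for concurrent use.
func (s *JobSampler) Next() bool {
	if s.Every == 0 {
		return false
	}
	n := atomic.AddUint64(&s.count, 1)
	return (n-1)%s.Every == 0
}

// CountTraced simulates running `jobs` background jobs and returns how many
// of them would be traced under the given policy.
func CountTraced(policy TracePolicy, s *JobSampler, jobs int) int {
	traced := 0
	for i := 0; i < jobs; i++ {
		// Decide once, at the root of the job, instead of always enabling tracing.
		ctx := WithShouldTrace(context.Background(), s.Next())
		if ShouldTrace(ctx, policy) {
			traced++
		}
	}
	return traced
}

func main() {
	traced := CountTraced(TraceSelective, &JobSampler{Every: 100}, 1000)
	fmt.Printf("traced %d of 1000 jobs\n", traced)
}
```

Because the decision is made at the job root and carried on the context, everything the job calls can follow the same flag. That differs from generalized probabilistic sampling at the exporter, which the discussion above suggests does not combine reliably with selective.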

Scope

We're not looking to entirely refactor the tracing API, nor to make the selectiveness property work perfectly across all components, as this would require migrating all the existing opentracing code to OTEL.

Instead, in order of importance:

  1. Mitigate the spending by fixing the biggest source of traces that do not respect the selective property.
  2. Provide a clear path to write tracing code that respects the type of tracing.
  3. Sketch a plan for taking this further.

Boundaries

Approach

Payout

Reduce the spending on tracing by TODO. Possibly also some guidance on how to write correct tracing code.

Tracked issues

@unassigned

Completed

@burmudar

Completed


jhchabran commented 2 years ago

Notes from the meeting with @bobheadxi https://docs.google.com/document/d/1yToLQhOgQ7uKQpq-UswK6YYiCXa9CCCy8ctyGxD4UjY/edit

jhchabran commented 2 years ago
burmudar commented 2 years ago

Additional PRs:

bobheadxi commented 2 years ago

The PRs linked by @burmudar have stopped the usage for now: https://sourcegraph.slack.com/archives/C07KZF47K/p1664810671427159?thread_ts=1662734633.832289&cid=C07KZF47K