Closed jhchabran closed 1 year ago
Notes from the meeting with @bobheadxi https://docs.google.com/document/d/1yToLQhOgQ7uKQpq-UswK6YYiCXa9CCCy8ctyGxD4UjY/edit
The PRs linked by @burmudar have stopped the usage for now: https://sourcegraph.slack.com/archives/C07KZF47K/p1664810671427159?thread_ts=1662734633.832289&cid=C07KZF47K
This was brought to us by @bobheadxi in this Slack message.
Problem
Dotcom is seeing continuous growth in trace exports, which is amounting to significant expenses ($22k in Grafana this month). @bobheadxi worked on a few mitigation efforts, but I think we need a more concerted strategy for how we handle tracing in the application (see below), and we need to investigate why this is happening and how we can keep costs under control.
We are missing guidance on how to write good tracing code, and tracing policies are potentially being misused because of how they have long been implemented in Sourcegraph. I suspect a significant issue is that background jobs always enable tracing (https://sourcegraph.sourcegraph.com/github.com/sourcegraph/sourcegraph@af1c9d3a8ef24cc35f3b248b66b0e89dfed46258/-/blob/internal/workerutil/worker.go?L315) because there is no concept of "selective" for background jobs yet, which causes libraries like otelsql to emit large amounts of traces for every background job that runs. @jhchabran and @burmudar previously did some work on options for OTel-level sampling. IIRC (@bobheadxi), one of the conclusions is that because of selective we can't reliably suggest generalized probabilistic sampling, so we need a solution designed and implemented here (whether that be more feature flags, research into more filtering/sampling, etc.).
Scope
We're not looking to entirely refactor the tracing API, nor to make the selectiveness property work perfectly across all components, as this would require migrating all the existing opentracing code to OTEL.
Instead, by order of importance:
selective property.
Boundaries
Approach
Payout
Mitigate the spending on tracing by TODO. Possibly also provide some guidance on how to write correct tracing code.
Tracked issues
@unassigned
Completed
@burmudar
Completed