telemetry: always best-effort write to v1 store

sourcegraph / sourcegraph-public-snapshot

Code AI platform with Code Search & Cody

https://sourcegraph.com

Other

10.11k stars 1.29k forks source link

telemetry: always best-effort write to v1 store #64526

Closed keegancsmith closed 2 months ago

keegancsmith commented 2 months ago

In another PR addressing event_log tables being an issue, it was mentioned that we can just write to the legacy store without monitoring if it works or not. This commit implements that. Most of the code change is just prop drilling for the logger. We could simplify even further and not even log, but it feels like not being able to see the error anywhere would be hard to debug.

Test Plan: CI

Context: https://github.com/sourcegraph/sourcegraph/pull/64481#issuecomment-2291526398

github-actions[bot] commented 2 months ago

[!WARNING] Test ownership level below 70% on this branch.

We aim to have ~70% of Bazel test targets to help us (dev-infra) and you get insight into the impact of tests on CI times. This is a non-blocking requirement, but keeping on top of test attribution to maintain this threshold is appreciated.

See the list of tag variable names to use here, and reach out to #discuss-dev-infra if you need to add/change this list.

See an example of how to use the variables here.

keegancsmith commented 2 months ago

CI is legit failure. Will follow-up on a fix since looks like my "find references" failed to find some. But should still be reviewable.

akalia25 commented 2 months ago

Sorry if this is in your context link its coming across as broken for me.

To confirm, the main goal of this best effort v1, is that we've noticed that sending v1 telemetry has slowed down performance and that we are willing to "not wait" for v1 to send events, and whatever gets sent is sent. Would this be a fair picture of the goal? @keegancsmith

keegancsmith commented 2 months ago

Sorry context broke after the rename. I will port this over to the new repo, but might as well discuss on here before doing that work. Here is the working link https://github.com/sourcegraph/sourcegraph-public-snapshot/pull/64481#issuecomment-2291526398

The context is this code path itself has been responsible for downtime on dotcom due to the high volume of writes (we think) conflicting with a migration trying to get a table lock. So in the other PR we are introducing a mechanism which applies backpressure on v1 store if there are too many concurrent writes. Given that, v1 writes can take a very long time and we don't want v1 not working to affect v2 writes.

I think that is a fair summary @bobheadxi? Does that make sense? If so, does this feel like a good approach?

keegancsmith commented 2 months ago

Lets continue discussion in https://github.com/sourcegraph/sourcegraph/pull/53