stripe / veneur

A distributed, fault-tolerant pipeline for observability data
MIT License
1.74k stars 174 forks source link

allow only a single flush to process at once #1024

Closed akutta closed 1 year ago

akutta commented 1 year ago

Summary

Wrap Flush in a mutex. In general this is called only in the flusher goroutine; however, if flush on shutdown is enabled we could prematurely shutdown veneur without waiting for the Flush to complete.

Motivation

Discovered dropped metrics for short lived services and veneur running as a sidecar.

Test plan

I have automated tests in a different branch. Requires either introducing a delayed blackhole sink, or introducing mocks which adds additional vendored deps. The test asserts order of flushes while using different delays during the Flush call.

Additionally, we have two testbeds to validate this behaviour:

Rollout/monitoring/revert plan

CLAassistant commented 1 year ago

CLA assistant check
All committers have signed the CLA.

akutta commented 1 year ago

Three separate use cases were tested with this fix and it resolved their issues. Two for AWS Lambdas, one on EKS.