allow only a single flush to process at once

akutta commented 1 year ago

Summary

Wrap Flush in a mutex. In general Flush is called only from the flusher goroutine; however, if flush on shutdown is enabled, we could shut down veneur prematurely without waiting for an in-progress Flush to complete.

Motivation

Discovered dropped metrics for short-lived services with veneur running as a sidecar.

Test plan

I have automated tests in a different branch. Including them requires either introducing a delayed blackhole sink or introducing mocks, which would add vendored deps. The test asserts the order of flushes while using different delays during the Flush call.

Additionally, we have two testbeds to validate this behaviour:

AWS Lambdas - Running veneur in a Lambda Layer to collect and flush metrics. On single-invocation Lambdas we have noticed that metrics are occasionally dropped. We can hide some of this behaviour by reducing the flush interval, but would rather fix the root cause.
AWS EKS - Running short-lived pods with veneur as a sidecar. When the application finishes, we call /quitquitquit and observe that metrics are not flushed.

Rollout/monitoring/revert plan

Testing in EKS first, followed by Lambda; both in dev/staging first. Will update

CLAassistant commented 1 year ago

All committers have signed the CLA.

akutta commented 1 year ago

Three separate use cases were tested with this fix, and it resolved the issue in each: two on AWS Lambda and one on EKS.

stripe / veneur