vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev

PubSub source: messages are not being acknowledged properly #14964

Open pjay-shopify opened 1 year ago

pjay-shopify commented 1 year ago


Problem

We're trying to use Vector in our logging pipeline. All of our logs are sent to a PubSub topic and then processed by Vector. However, we've been running into a strange issue: some messages are being picked up but never acknowledged. The number of such messages is relatively low (5-10 every 20-30 minutes), but the pattern is worrying πŸ€” For troubleshooting, I've reduced my config to the following (the issue still persists):

[sources.pubsub_source]
endpoint = "https://pubsub.googleapis.com"
project = "our-project-id"
subscription = "subscription-id"
type = "gcp_pubsub"

[sinks.blackhole]
type = "blackhole"
inputs = ["pubsub_source"]

I'm attaching two charts below showing our PubSub metrics: the number of sent messages (to give you a better understanding of our traffic pattern) and the oldest unacked message age, which grows to 10 minutes every 20-30 minutes (10 minutes is our ack deadline):

[charts: sent_messages, unacked_messages]

We also mirror the traffic to our Grafana agent, so it processes the same set of messages, and we don't see any ack issues there. It therefore looks like the problem might be on Vector's end.
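To cross-check this from Vector's side, it may help to expose Vector's internal metrics and compare the source's component-level counters against the PubSub subscription metrics. A minimal sketch added on top of the repro config above (the internal_metrics source and prometheus_exporter sink are standard components; the component names and exporter address below are illustrative, and the exact counter names to watch should be checked against the docs for the installed version):

[sources.vector_metrics]
# Emits Vector's own telemetry (per-component event counters, errors, etc.)
type = "internal_metrics"

[sinks.vector_metrics_out]
# Exposes those internal counters for scraping so Vector-side ingest/ack
# behaviour can be compared with the PubSub-side metrics.
type = "prometheus_exporter"
inputs = ["vector_metrics"]
address = "0.0.0.0:9598"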

Configuration

[sources.pubsub_source]
endpoint = "https://pubsub.googleapis.com"
project = "our-project-id"
subscription = "subscription-id"
type = "gcp_pubsub"

[sinks.blackhole]
type = "blackhole"
inputs = ["pubsub_source"]
acknowledgements.enabled = false
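For context on the ack path: if I understand Vector's end-to-end acknowledgements correctly, with acknowledgements.enabled = false on the sink the source acks messages to PubSub once it has ingested them, without waiting on the sink. The variant below is a sketch for comparison only, not a suggested fix; it makes the source hold the PubSub ack until the sink confirms delivery:

[sinks.blackhole]
type = "blackhole"
inputs = ["pubsub_source"]
# With end-to-end acknowledgements enabled, the gcp_pubsub source only acks a
# message to PubSub after this sink reports the corresponding event as delivered.
acknowledgements.enabled = true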

Version

0.24.2

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

pjay-shopify commented 1 year ago

FWIW, I've also observed some correlation between occasional StreamingPullResponses coming back with the Unavailable status code and growth of the oldest unacked message age metric around the same time.

Below is a chart showing that:

[chart: unavailable_vs_oldest]

In theory, if these two things are related, we should always see the yellow line overlapping with the spikes of the green line. However, these metrics are sampled by Google, and if the number of Unavailable responses is relatively low, there's a high chance of missing them in some intervals.

neuronull commented 1 year ago

Thanks for reporting this @pjay-shopify !

Adding some notes from the Discord support thread discussing this:

mozhi-bateman commented 5 months ago

@pjay-shopify Were you able to resolve the issue with any workaround? Thank you