vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

"Source send cancelled." #20305

Closed · fvarg00 closed this issue 2 months ago

fvarg00 commented 4 months ago

A note for the community

No response

Problem

Hi, we see the error below when there is high CPU load on the Vector pods. Is this a known problem? Any help is appreciated. Thanks!

ERROR source{component_kind="source" component_id=datadog_agents component_type=datadog_agent}:http-request{method=POST path=/api/v2/series}: vector_common::internal_event::component_events_dropped: Events dropped intentional=false count= reason="Source send cancelled." internal_log_rate_limit=true

Configuration

No response

Version

0.28.0

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

jszwedko commented 4 months ago

Hi @fvarg00 ,

This appears to be an incomplete bug report. Do you mind filling out all of the fields (in particular, how you ran into this situation)? Otherwise it will be difficult to reproduce, or even to tell whether this is a bug.

JustinJKelly commented 4 months ago

Hello @jszwedko,

Problem

Hi, we see the error below when there is high CPU load on the Vector pods. Is this a known problem? Any help is appreciated. Thanks!

ERROR source{component_kind="source" component_id=datadog_agents component_type=datadog_agent}:http-request{method=POST path=/api/v2/series}: vector_common::internal_event::component_events_dropped: Events dropped intentional=false count= reason="Source send cancelled." internal_log_rate_limit=true

Configuration

No response

Version

image: docker.io/timberio/vector:0.37.0-distroless-libc

Debug Output

N/A

Example Data

N/A

Additional Context

We are using DataDog Agent to send logs, metrics, traces to vector.

We use transforms to modify tags for every event that goes through vector, as well as route them to different sinks.

We use ClusterIP for the Kubernetes service and there is no explicit LoadBalancer to distribute traffic among vector pods.
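As a rough sketch of the shape of that topology (the component names, VRL snippet, and route condition below are made up for illustration, not our real configuration):

```toml
# Illustrative only: component names, the VRL snippet, and the route
# condition are hypothetical, not taken from our actual configuration.

[sources.datadog_agents]
type    = "datadog_agent"
address = "0.0.0.0:8080"

[transforms.modify_tags]
type   = "remap"
inputs = ["datadog_agents"]
source = '''
# Example: add/override a tag on every event passing through.
.tags.environment = "production"
'''

[transforms.route_events]
type   = "route"
inputs = ["modify_tags"]

  [transforms.route_events.route]
  to_datadog = '.tags.team == "platform"'

[sinks.datadog_metrics]
type            = "datadog_metrics"
inputs          = ["route_events.to_datadog"]
default_api_key = "${DD_API_KEY}"
```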

For reference, here is a screenshot of CPU/memory load from the pods where the errors are coming from. [Screenshot: CPU/memory usage, 2024-04-16 10:40 AM]

References

N/A

jszwedko commented 4 months ago

Thanks @fvarg00. I'm guessing what you are seeing is request timeouts from the client, which will cancel the send downstream. Can you share your configuration? I'm particularly interested in whether you are using the acknowledgements feature or not.
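For reference, end-to-end acknowledgements are enabled per sink in recent releases; when enabled, the datadog_agent source holds the HTTP response until the sink has accepted the data. A minimal sketch, with placeholder sink and input names:

```toml
# Sketch only: sink and input names are placeholders.
[sinks.datadog_logs]
type            = "datadog_logs"
inputs          = ["some_transform"]
default_api_key = "${DD_API_KEY}"

  [sinks.datadog_logs.acknowledgements]
  enabled = true
```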

JustinJKelly commented 4 months ago

Hello @jszwedko, we are not using the acknowledgements feature. We see that the acknowledgements field is deprecated. Is that field something you think could cause this issue, or could enabling it be a possible fix?

Which part of the configuration would you need to see?

jszwedko commented 4 months ago

> Hello @jszwedko, we are not using the acknowledgements feature. We see that the acknowledgements field is deprecated. Is that field something you think could cause this issue, or could enabling it be a possible fix?
>
> Which part of the configuration would you need to see?

Gotcha, if you aren't using the acknowledgements feature, then it seems likely that the topology is just applying back-pressure to the Datadog Agent source: that is, the downstream components aren't processing fast enough, so data is buffering up in the source. The fix would be to identify and resolve the bottleneck (in your case it seems like it might be CPU-bound). To identify the bottleneck you can use the utilization metric published by internal_metrics: the first component in the pipeline whose utilization is at (or close to) 1 usually indicates the bottleneck.
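A minimal sketch of how those internal metrics can be exposed (component names and the port are illustrative):

```toml
# Sketch only: expose Vector's internal metrics so the per-component
# `utilization` gauge can be scraped and graphed.
[sources.vector_internal]
type = "internal_metrics"

[sinks.vector_prom]
type    = "prometheus_exporter"
inputs  = ["vector_internal"]
address = "0.0.0.0:9598"
```

The utilization gauge is tagged by component_id, so walking the topology from the source towards the sinks and finding the first component that sits at (or near) 1 usually points at the bottleneck.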

jszwedko commented 2 months ago

Closing this since I think we've narrowed in on the issue, but let me know if you have additional questions!