newrelic / infrastructure-agent

New Relic Infrastructure Agent
https://docs.newrelic.com/docs/infrastructure/install-configure-manage-infrastructure
Apache License 2.0

Significant amount of "could not queue event: queue is full" #1857

Closed yannh closed 3 months ago

yannh commented 4 months ago

Our deployment of New Relic's infrastructure agent regularly throws the following error on some clusters: "error="could not queue event: queue is full"". We already increased the maximum queue size from 1000 to 3000, but the error still happens (hundreds to thousands of times per hour).

I have a guess as to why this is happening, but would love some :+1: that I am not completely misreading what is going on!

So this error comes from here, when we enqueue an event in the eventQueue.
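If the enqueue is a non-blocking send on a bounded channel, the drop path would look roughly like this sketch (Go; `tryQueue`, the error value, and the capacity are illustrative stand-ins, not the agent's actual code):

```go
package main

import (
	"errors"
	"fmt"
)

var errQueueFull = errors.New("could not queue event: queue is full")

// tryQueue mimics a non-blocking send on a bounded channel: when the
// consumer lags and the buffer is full, the select falls through to the
// default branch and the event is dropped with the "queue is full" error.
func tryQueue(queue chan []byte, event []byte) error {
	select {
	case queue <- event:
		return nil
	default:
		return errQueueFull
	}
}

func main() {
	// Capacity 2 stands in for EVENT_QUEUE_CAPACITY; no consumer runs,
	// so the third send is rejected exactly like the reported error.
	queue := make(chan []byte, 2)
	for i := 0; i < 3; i++ {
		if err := tryQueue(queue, []byte("event")); err != nil {
			fmt.Println("error =", err)
		}
	}
}
```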

That queue is consumed in accumulateBatches(), which batches the events every second, or whenever a batch reaches maxMetricsBatchSizeBytes. The batches are then sent to the batchQueue channel.
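The time-or-size flush just described can be sketched like this (hypothetical names throughout; `maxBytes` stands in for maxMetricsBatchSizeBytes and the channels for eventQueue and batchQueue):

```go
package main

import (
	"fmt"
	"time"
)

// accumulate mimics accumulateBatches: events are appended to the current
// batch, which is flushed to batchQueue either when the ticker fires or
// when the accumulated payload reaches maxBytes.
func accumulate(events <-chan string, batchQueue chan<- []string, maxBytes int, tick <-chan time.Time) {
	var batch []string
	size := 0
	flush := func() {
		if len(batch) > 0 {
			batchQueue <- batch
			batch, size = nil, 0
		}
	}
	for {
		select {
		case ev, ok := <-events:
			if !ok { // producer closed: flush what remains and stop
				flush()
				close(batchQueue)
				return
			}
			batch = append(batch, ev)
			size += len(ev)
			if size >= maxBytes { // size-based flush
				flush()
			}
		case <-tick: // time-based flush (every second in the agent)
			flush()
		}
	}
}

func main() {
	events := make(chan string, 4)
	batchQueue := make(chan []string, 4)
	for _, e := range []string{"aaaa", "bbbb", "cc"} {
		events <- e
	}
	close(events)
	go accumulate(events, batchQueue, 8, time.Tick(time.Second))
	for b := range batchQueue {
		fmt.Println(len(b)) // batch sizes: 2 (size flush), then 1 (final flush)
	}
}
```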

The batchQueue channel is then consumed in sendBatches(), which essentially calls doPost() for every batch, one batch after the other.

Now - if there are a lot of events to send, or if the events are fairly big, there will be a lot of events being sent to the eventQueue. If we push events into the batchQueue faster than sendBatches() can process them, events back up in the eventQueue until EVENT_QUEUE_CAPACITY is reached.

This can be the case either when New Relic's API is a little slow to respond, or when a lot of metrics are sent in a very short amount of time. Unfortunately, the events are then dropped.

Increasing the speed at which sendBatches() consumes batches would likely help. I could not find any parallelisation or pooling for sendBatches - did I miss it, or is the lack of parallelisation deliberate?

Possible solution 1: start several sendBatches workers in parallel, so that a single slow request does not stall sendBatches too much and we generally benefit from increased processing throughput.

Possible solution 2: within sendBatches, add concurrency so we can process several batches in parallel. I do find the first solution more intuitive though.

Obviously we would need to check the concurrency-safety of sendBatches, which might be the largest part of the effort required.
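Possible solution 1 could look roughly like this worker-pool sketch (hypothetical: `sendBatchesPool` and `doPost` are stand-ins for the agent's functions, and the real sendBatches would still need the concurrency-safety review mentioned above):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// doPost stands in for posting one batch to the backend; the real
// sendBatches calls its equivalent serially, one batch after the other.
func doPost(batch []string, sent *int64) {
	atomic.AddInt64(sent, int64(len(batch)))
}

// sendBatchesPool drains batchQueue with n workers, so one slow request
// only stalls one worker instead of the whole pipeline. It returns the
// total number of events posted once the queue is closed and drained.
func sendBatchesPool(batchQueue <-chan []string, n int) int64 {
	var sent int64
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for batch := range batchQueue {
				doPost(batch, &sent)
			}
		}()
	}
	wg.Wait()
	return sent
}

func main() {
	batchQueue := make(chan []string, 8)
	for i := 0; i < 8; i++ {
		batchQueue <- []string{"a", "b"}
	}
	close(batchQueue)
	// 8 batches of 2 events drained by 4 workers.
	fmt.Println(sendBatchesPool(batchQueue, 4))
}
```

One caveat with any pooled variant: batches may then arrive at the backend out of order, which would need checking against the backend's expectations.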

I could give this a go - but only if someone more familiar with the code could confirm my reading / validate the possible solutions :bow:

Thanks!

Description

Events are dropped several times per hour with "error="could not queue event: queue is full""

Expected Behavior

All events are correctly sent to New Relic.

Troubleshooting or NR Diag results

Provide any other relevant log data. TIP: Scrub logs and diagnostic information for sensitive information

Steps to Reproduce

Your Environment

Additional context

For context: the agent in question monitors PostgreSQL databases using custom queries, and likely generates quite a few events.

For Maintainers Only or Hero Triaging this bug

Suggested Priority (P1, P2, P3, P4, P5):
Suggested T-Shirt size (S, M, L, XL, Unknown):

workato-integration[bot] commented 4 months ago

https://new-relic.atlassian.net/browse/NR-267523

rubenruizdegauna commented 3 months ago

Hi @yannh , thanks for the detailed issue.

Unfortunately, this is not trivial: this part of the code is not simple, and we need to take the backend's limits into account.

You can increase the queue size (~10K), but this will increase memory consumption too. You could also increase the interval of the Samples (https://docs.newrelic.com/docs/infrastructure/install-infrastructure-agent/configuration/infrastructure-agent-configuration-settings/#samples-variables) or, if you are running integrations, increase the interval of the integrations.
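For reference, a minimal sketch of what those mitigations might look like in the agent's YAML config (illustrative values only; verify the option names against the configuration settings docs linked above for your agent version):

```yaml
# newrelic-infra.yml -- illustrative values, not recommendations.
event_queue_depth: 10000         # larger event queue, at the cost of memory
metric_batch_queue_depth: 1000   # batch queue, raised alongside the event queue
metrics_system_sample_rate: 30   # Sample intervals in seconds; larger = fewer events
metrics_storage_sample_rate: 30
metrics_network_sample_rate: 30
metrics_process_sample_rate: 60
```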

Furthermore, you can enable self_instrumentation, noting that:

Infrastructure agent self-instrumentation is an experimental feature. The instrumented telemetry might change (metrics, transactions, custom events). We recommend enabling it only for complex troubleshooting scenarios. Standard pricing for data ingest applies.

That lets you see the payloads of the integrations and the sizes of the queues, to understand where this comes from.

workato-integration[bot] commented 3 months ago

This issue was not approved.