Is your feature request related to a problem? Please describe.
My team and I have encountered an issue where our collectors consume a large amount of memory when re-ingesting telemetry from a file storage queue after a disruption event with our backend. In these tests we simulated an hour of connection failures to the backend to let the file storage queue grow. After the hour passed, we restored the connection and observed a spike in exported telemetry and memory usage.
Here is an example of the behavior we see from the persistent sending queue during the test period. Notice how the sending queue drops to zero immediately after reconnecting to the backend.
It seems that on reconnecting to the backend, everything in the file storage queue gets consumed into an in-memory queue. We are hoping to control this memory spike so we can ensure memory won't exceed a certain threshold when running on Windows VMs.
Describe the solution you'd like
Is there a feature we could add that would throttle how quickly the consumers pull from a file storage queue and send to the backend endpoint? Something that lets us configure how many batches are pulled from the queue over a specified time frame?
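As a sketch of what we have in mind, the setting might look something like this (the `drain_rate` block and its fields are hypothetical and do not exist in the current collector config; the surrounding `sending_queue` fields are real):

```yaml
exporters:
  otlp:
    endpoint: backend:4317
    sending_queue:
      enabled: true
      storage: file_storage
      # Hypothetical throttle: cap how many batches the consumers
      # may drain from the persistent queue per interval.
      drain_rate:
        max_batches: 10
        interval: 1s
```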
Describe alternatives you've considered
We have tried utilizing the memory limiter and the GOMEMLIMIT environment variable, but neither has been successful. My guess is the garbage collector won't reclaim the memory since the telemetry is still being actively sent. We have also tried reducing the number of consumers and the size of batches, but we are still seeing the spikes.
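For reference, the mitigation we tried was along these lines (values illustrative, not our exact settings):

```yaml
# memory_limiter processor: checks memory on an interval and applies
# backpressure / forces GC when the soft limit is exceeded.
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
```

combined with setting a soft memory limit on the Go runtime, e.g. `GOMEMLIMIT=512MiB` in the collector's environment.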
Additional context
Collector version: v0.99 contrib
Tested on Windows 2016 server
Here is the config we used for testing in case there are any config changes we could make to improve memory usage with the current version of the collector.
The num_consumers setting on the sending queue is capable of throttling the export path; consider lowering it to 1 and working back up if recovery is too slow.
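That suggestion would look roughly like this against the existing config (endpoint and queue_size are placeholders; num_consumers and storage are real sending_queue fields):

```yaml
exporters:
  otlp:
    endpoint: backend:4317
    sending_queue:
      enabled: true
      storage: file_storage   # persist the queue via the file_storage extension
      num_consumers: 1        # start at 1; raise gradually if recovery is too slow
      queue_size: 5000
```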