scalyr / scalyr-fluentd

The fluentd plugin for inserting log messages and metrics in Scalyr.
Apache License 2.0

Memory issue after docker image update. #44

Closed tophercullen closed 2 months ago

tophercullen commented 3 months ago

We run the scalyr/fluentd:latest docker image as a side-car for a few different Fargate tasks. Since the Docker Hub latest tag was updated to 0.8.18 in the past 24 hours, these have all been going OOM more or less instantly on the scalyr container. Pinning everything to 0.8.17 works fine again.

One of these tasks is grossly over-provisioned in terms of memory (for CPU reasons), having something like 1gb+ more than it uses, and it's still going OOM. So it seems unlikely this is an unfortunate edge case in multiple tasks with different uses, libraries, and resource allocations.

weilliu commented 3 months ago

@tophercullen Are you seeing the issue with the fargate integration? https://app.scalyr.com/solutions/fargate The only difference in 0.8.18 is adding the new fluentd docker image. https://github.com/scalyr/scalyr-fluentd/commit/3d6caf67dd3391805efe8e63187872c1c2c696e6 That itself shouldn't trigger any overutilization of the container resources.

Could you open a ticket on support.dataset.com and send us the screenshots/cmdline outputs showing the OOM? Granting us access to your Scalyr team would help us troubleshoot the issue.

tophercullen commented 3 months ago

Yes, we are using the scalyr image in a side-car in fargate.

I am aware of the change that was made. It's why I first tried reverting to the previous image version, which again still works as-is. Clearly there are differences beyond a simple version change, as even the resulting image sizes are substantially different.

Reviewing the config again, I now see there's a memory limit on the container (100mb, same as in the linked docs). Previously, I thought the scalyr container was just subject to the overarching task resource constraints (which it seemed absurd for it to be hitting).

Increasing this limit to 200mb appears to allow the new container version to function properly. At the documented 100mb, the current latest image (0.8.18) does not function properly for us. Somewhere between 100mb and 200mb appears to be the actual required memory now. 0.8.17 still works with 100mb.

Given the nature of this memory issue, and the documented limit published by scalyr, I would classify this as a breaking change for this version. I recommend reverting the latest tag until an updated memory allocation can be determined, the documentation can be updated, and customers can be notified of the new memory requirement.

weilliu commented 3 months ago

Thanks for the additional context. I just created an internal ticket for engineering to look at the issue.

weilliu commented 2 months ago

Engineering has reviewed the issue and confirmed that the higher memory requirement is caused by a fluentd process update. The recommended memory limit is now 300mb for running the Fargate agent.
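
For anyone landing here later, a minimal sketch of what the side-car entry might look like with the new limit. It builds the container definition as a plain Python dict and prints it as JSON for an ECS task definition. The helper name `scalyr_sidecar` and the container name `scalyr-fluentd` are assumptions for illustration only; `memory` is the per-container hard limit (in MiB) that ECS container definitions accept, and the image tag is pinned rather than left on latest.

```python
import json


def scalyr_sidecar(image_tag: str, memory_mib: int) -> dict:
    """Hypothetical helper: build the side-car entry for an ECS task definition."""
    return {
        "name": "scalyr-fluentd",  # assumed container name, not from the thread
        "image": f"scalyr/fluentd:{image_tag}",  # pin a tag instead of using latest
        "memory": memory_mib,  # hard limit in MiB; the container is killed above this
    }


# 0.8.18 with the 300mb limit recommended at the end of this thread
sidecar = scalyr_sidecar("0.8.18", 300)
print(json.dumps(sidecar, indent=2))
```

If staying on 0.8.17 for now, the same helper with `scalyr_sidecar("0.8.17", 100)` matches the previously documented limit reported as working above.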