davidgiga1993 closed this issue 5 months ago
The Terraform memory leak is a nuisance. Thanks for opening an issue.
Looking around upstream, I found suggestions to set requests and limits on the ControllerConfig as a stop-gap solution: https://github.com/upbound/provider-aws/issues/325#issuecomment-1474056956
Linked from that same issue, there is another solution called ProviderScheduler: https://github.com/crossplane/upjet/pull/178 I don't know whether we already implement that, but it is definitely worth investigating.
Example implementation of the ProviderScheduler solution: https://github.com/upbound/provider-aws/pull/627/files
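For reference, the stop-gap of setting requests and limits could look roughly like the manifest below. This is a minimal sketch, assuming a Crossplane version that still supports the `ControllerConfig` API. The object name and all resource values are illustrative assumptions, not recommendations taken from the linked issue.

```yaml
# Hypothetical ControllerConfig that caps the provider pod's resources.
# The name and the request/limit values are placeholders; tune them to your cluster.
apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  name: provider-grafana-limits   # assumed name
spec:
  resources:
    requests:
      cpu: 250m
      memory: 512Mi
    limits:
      cpu: "1"
      memory: 1Gi
```

A `Provider` object would reference this through `spec.controllerConfigRef.name`. Note that a memory limit does not fix the leak; it only bounds how much memory the pod can take before it is restarted.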
We have a few hundred resources and observed that as well. You should check the queue of reconciles. Probably they pile up because the requests are not completed fast enough.
What helped us is the configuration with
- --poll=12h
- --sync=12h
to reduce the load.
Every change to the resource will trigger a reconcile anyway. The poll and sync intervals only matter if somebody made manual changes to a resource, which then get reset on the next reconcile.
But the whole setup with the Crossplane provider seems very fragile; we often have to do manual cleanups. :-/
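The `--poll` and `--sync` flags above are passed to the provider as container arguments. A sketch of how that might be wired up, assuming the `ControllerConfig` API is in use. The object name is made up for illustration; only the two flags and their values come from this thread.

```yaml
# Hypothetical ControllerConfig that passes longer poll/sync intervals to the provider.
apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  name: provider-grafana-slow-poll   # assumed name
spec:
  args:
    - --poll=12h   # flag and value as quoted in this thread
    - --sync=12h   # flag and value as quoted in this thread
```

As with any `ControllerConfig`, it only takes effect once the `Provider` object references it via `spec.controllerConfigRef.name`.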
To me, it looks like the poll interval doesn't even work 🤔. I've got dashboards being refreshed every minute anyway.
Over the course of the last week, I reimplemented parts of this provider using the Grafana Go client instead of Terraform as a proof of concept.
Please note that I don't want to advertise my implementation as a replacement, since only a few of the resources are implemented and everything is quite young. Instead, I want to show it to you so you can have a look and decide for yourselves whether it could be an option to replace the current Terraform/upjet-based implementation.
For @davidgiga1993 and me, the new implementation resolved the memory leak and the high CPU usage (see screenshots above).
Before 15:00 the provider from this repository was used; after that, my implementation was used. (If wanted, I can post an update after a longer observation period.)
Here you can find the source code used: https://github.com/Argannor/provider-grafana
You can definitely advertise your implementation. The Terraform implementation is sub-optimal, but I also do not have enough time to maintain a manually written provider. So, unfortunately, we will keep using upjet regardless of the performance issues.
Fixed in v0.13.0
See https://github.com/grafana/crossplane-provider-grafana/issues/107 for more info!
When running the provider with a medium number of resources (100+ Users, 20+ Orgs), the memory consumption increases until the pod hits its resource limit and gets OOM-killed. The `provider` process is consuming all the memory in that case. Also, the memory usage in general is insanely high for what this provider is doing, especially when compared to the others. Additionally, we're facing the CPU resource issue, where the Crossplane provider consumes the CPUs of an entire node the entire time.
As far as I understand, most of this probably comes from upjet? Wouldn't it make more sense to build a "proper" provider and not rely on Terraform internally, as it seems to be the root cause of some of those issues?