grafana / crossplane-provider-grafana

Crossplane provider of https://github.com/grafana/terraform-provider-grafana. Generated by https://github.com/upbound/upjet
Apache License 2.0

Memory leak in provider #87

Closed davidgiga1993 closed 5 months ago

davidgiga1993 commented 6 months ago

When running the provider with a medium number of resources (100+ users, 20+ orgs), memory consumption keeps increasing until the pod is OOM-killed at its resource limit. The provider process itself is what consumes all the memory in that case.

Also, the memory usage in general is insanely high for what this provider is doing, especially when compared to the others. Additionally, we're facing the CPU resource issue, where the Crossplane provider consumes the CPUs of an entire node the entire time.

[image]

As far as I understand, most of this probably comes from upjet? Wouldn't it make more sense to build a "proper" provider and not rely on Terraform internally, as it seems to be the root cause of some of those issues?

Duologic commented 6 months ago

The terraform memory leakage is a nuisance. Thanks for making an issue.

Looking around upstream I find suggestions to set requests and limits on the ControllerConfig as a stop-gap solution: https://github.com/upbound/provider-aws/issues/325#issuecomment-1474056956
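For illustration, a minimal sketch of what that stop-gap could look like. The object name and all resource values below are placeholders chosen for the example, not recommendations from this thread; a Provider would point at this object through `spec.controllerConfigRef.name`. Note that newer Crossplane releases deprecate ControllerConfig in favor of DeploymentRuntimeConfig, so check which one your version expects.

```yaml
apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  name: grafana-provider-resources   # hypothetical name
spec:
  resources:
    requests:
      cpu: 250m        # placeholder value
      memory: 512Mi    # placeholder value
    limits:
      cpu: "1"         # placeholder value
      memory: 1Gi      # placeholder value; the pod is OOM-killed and restarted at this limit
```

This only caps the damage: the leak is still there, and the provider pod gets restarted whenever it reaches the memory limit.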

Linked from that same issue, there is another solution called ProviderScheduler: https://github.com/crossplane/upjet/pull/178. I don't know if we already implement that, but it's definitely worth investigating.

Duologic commented 6 months ago

Example implementation of the ProviderScheduler solution: https://github.com/upbound/provider-aws/pull/627/files

patst commented 6 months ago

We have a few hundred resources and observed that as well. You should check the queue of reconciles. Probably they pile up because the requests are not completed fast enough.

What helped us is the configuration with

    - --poll=12h
    - --sync=12h

to reduce the load.

Every change to a resource will trigger a reconcile anyway. The poll and sync intervals only matter if somebody made manual changes to a resource, which then get reset on the next reconcile.
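As a sketch of where those two flags could go, assuming the provider is configured through a Crossplane ControllerConfig (the object name is made up for the example; the flag values are the ones quoted above):

```yaml
apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  name: grafana-provider-args   # hypothetical name
spec:
  args:
    # Flags as quoted in this comment; longer intervals mean fewer periodic reconciles
    - --poll=12h
    - --sync=12h
```

The Provider object would then reference it via `spec.controllerConfigRef.name`. The trade-off is that out-of-band changes made directly in Grafana may go unnoticed for up to the configured interval.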

But the whole setup with the crossplane provider seems very fragile, we often have to do manual cleanups. :-/

julienduchesne commented 6 months ago

To me, it looks like the poll interval doesn't even work 🤔. I've got dashboards being refreshed every minute anyway.

Argannor commented 5 months ago

Over the course of the last week I implemented parts of this provider using the Grafana Go client instead of Terraform as a proof of concept.

Please note that I don't want to advertise my implementation as a replacement, since only a few of the resources are implemented and everything is quite young. Instead I want to show this to you guys to have a look at it and decide for yourselves if this could be an option to replace the current terraform/upjet based implementation.

For @davidgiga1993 and me, the new implementation solved the leak and the CPU usage (see the screenshots).

Before 15:00 the provider from this repository was used; after that, my implementation was used. [image] [image] (If wanted, I can post an update after a longer observation period.)

Here you can find the source code used: https://github.com/Argannor/provider-grafana

julienduchesne commented 5 months ago

You can definitely advertise your implementation. The Terraform implementation is sub-optimal, but I also do not have enough time to maintain a manually written provider. So, unfortunately, I can tell you that we will keep using upjet regardless of the performance issues.

julienduchesne commented 5 months ago

Fixed in v0.13.0. [image]

See https://github.com/grafana/crossplane-provider-grafana/issues/107 for more info!