Closed exokernel closed 5 months ago
@exokernel: thanks for filing this. We got bitten by this as well as there are a couple of metrics (mainly the colocation metrics) that have a host
label and since we generate random subdomains, there are potentially endless host
values so we eventually run out of memory and container gets restarted (even with the problematic metrics in the metrics denylist which, as you've discovered, doesn't stop the exporter from collecting data)
@martinhaus: any chance this PR can be reviewed/merged?
closed in favor of #111
@martinhaus - I might be missing it but I don't actually see this change set in #111 ? Was it implemented a different way?
Hello @martinhaus and @haad, I am also interested in the change @exokernel made.
I might be missing something as well, but I don't think #111 introduced the change.
Could we re-open this PR?
In the meantime @exokernel, did you find a workaround?
Hello @martinhaus and @haad, I am also interested in the change @exokernel made.
I might be missing something as well, but I don't think #111 introduced the change.
Could we re-open this PR?
In the meantime @exokernel, did you find a workaround?
@PierreBart - For now our workaround is simply to use our own fork of the exporter based on my branch. We can't run the official version because without this change the exporter quickly exhausts available memory.
What?
Only create metrics that are not in the deny list.
Why?
Currently the deny list only applies to what is exposed in the
/metrics
endpoint but all metrics are collected internally regardless of if they are being reported or not. If you have a lot of sites in Cloudflare the high cardinality of the metrics causes them to use a lot of memory. I've seen the exporter use over 2GB of memory even when excluding most of the metrics.This change fixes that by passing the deny list down into the functions that collect the data and create the metrics so that it will only create new counters for a metric if it is not in the deny list.
I've tested this change on a large site with most metrics excluded and have observed a dramatic reduction in memory usage.
Would be great to get this into your repo so that everyone using the cloudflare_exporter can benefit. Please let me know if you need more proofs of working, have feedback, etc.