michaldo opened this issue 5 months ago
Could you please verify that the two outputs are the same and that you don't have four times more time series in the case of Prometheus 1.x?
How many time series (or lines) do you have in your Prometheus output, approximately?
Also, will the second and subsequent scrapes consume an equal amount of memory, or are they more lightweight?
What tool did you use for the flame graph?
Will you get similar results if you look at a heap dump?
Also, Micrometer's overhead on the flame graphs seems pretty minimal to me; depending on the answers to the questions above, maybe this should be a Prometheus Client question in the end?
Hi Jonatan, I checked memory usage on a test cloud cluster where only micrometer-registry-prometheus was changed. So the answer to questions 1 and 3 is that the environment is stable and memory consumption is stable.
It may be important that my application uses Kafka heavily, and the Prometheus output contains 4703 Kafka rows out of 4850 in total. The flame graph was generated with IntelliJ.
After inspecting the profiler output, the suspected fragments of code are io.micrometer.core.instrument.config.NamingConvention#toSnakeCase and io.prometheus.metrics.model.snapshots.PrometheusNaming#replaceIllegalCharsInMetricName. They can be refactored to use less memory.
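To illustrate the kind of refactoring I have in mind, here is a rough sketch (hypothetical code, not taken from Micrometer) of a wrapper that caches converted names, so a stable set of Kafka meter names is converted once instead of on every scrape:

```java
import java.util.concurrent.ConcurrentHashMap;

import io.micrometer.core.instrument.Meter;
import io.micrometer.core.instrument.config.NamingConvention;

// Hypothetical sketch: memoize converted names on top of an existing convention.
public class CachingNamingConvention implements NamingConvention {

    private final NamingConvention delegate;

    private final ConcurrentHashMap<String, String> names = new ConcurrentHashMap<>();

    private final ConcurrentHashMap<String, String> tagKeys = new ConcurrentHashMap<>();

    public CachingNamingConvention(NamingConvention delegate) {
        this.delegate = delegate;
    }

    @Override
    public String name(String name, Meter.Type type, String baseUnit) {
        // The key ignores type/baseUnit for brevity; a real implementation would
        // include them if they influence the converted name.
        return names.computeIfAbsent(name, n -> delegate.name(n, type, baseUnit));
    }

    @Override
    public String tagKey(String key) {
        return tagKeys.computeIfAbsent(key, delegate::tagKey);
    }
}
```

Something like this could be plugged in with registry.config().namingConvention(new CachingNamingConvention(NamingConvention.snakeCase)), assuming the conversion really is the hot spot.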
After inspecting the profiler output, the suspected fragments of code are io.micrometer.core.instrument.config.NamingConvention#toSnakeCase and io.prometheus.metrics.model.snapshots.PrometheusNaming#replaceIllegalCharsInMetricName. They can be refactored to use less memory.
This conflicts with your flame graphs; I don't see any of these classes on them.
NamingConvention has not been changed for about two years. PrometheusNamingConvention did change in 1.13 (Boot 3.3), but now it is doing less work and it is delegating to io.prometheus.metrics.model.snapshots.PrometheusNaming.
If you really suspect that io.prometheus.metrics.model.snapshots.PrometheusNaming is the root cause, you might want to create an issue for Prometheus Client.
Indeed, the suspected code I found has a small impact on the flame graph.
It is hard to determine the root cause of the memory consumption. I see that high-level code has changed while the memory consumption shows up in low-level code - is the root cause the recently changed high-level code, or the low-level code that has not changed for years?
Anyway, I keep my opinion that the methods I pointed out are inefficient and that micrometer-registry-prometheus 1.13.0 uses more memory than micrometer-registry-prometheus-simpleclient 1.13.0.
It is hard to determine the root cause of the memory consumption.
Analyzing a heap dump might help.
I see that high-level code has changed while the memory consumption shows up in low-level code - is the root cause the recently changed high-level code, or the low-level code that has not changed for years?
I don't understand this question.
Anyway, I keep my opinion that the methods I pointed out are inefficient and that micrometer-registry-prometheus 1.13.0 uses more memory than micrometer-registry-prometheus-simpleclient 1.13.0.
I never stated the opposite; I was trying to point out that you might have opened the issue in the wrong repo. Micrometer is not the same as Prometheus Client (which is where io.prometheus.metrics.model.snapshots.PrometheusNaming comes from).
If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 7 days this issue will be closed.
I reported that "micrometer-registry-prometheus" allocates more memory than "micrometer-registry-prometheus-simpleclient". The impact is visible in an application with heavy Kafka usage, because the Kafka client uses thousands of metrics.
I reported a real-load flame graph because I expected the problem to be more or less known/suspected/predicted. I wanted to give a hint for tracking down the problem under real load.
Because the problem was not known, I was asked to provide more details. That was hard to do in the real environment, because the real environment has been switched to the simple client. To provide details, I ran experiments on my laptop with the plain client. On the laptop I have a different load, and the laptop-load details differ from the real-load details, but the overconsumption is still visible.
Continuing with the laptop load, I found that io.micrometer.core.instrument.config.NamingConvention#toSnakeCase is implemented inefficiently. I made https://github.com/micrometer-metrics/micrometer/pull/5271 for that.
We also discussed which repo is the correct target for a fix. In my opinion the target repo is Micrometer. We can see that "prometheus-metrics-model" provides the method PrometheusNaming#sanitizeLabelName, which calls replaceIllegalCharsInMetricName(), which allocates a new String for every given String.
Does that mean the "prometheus-metrics-model" library holds a gun to the Micrometer team's head and demands that the method be called on every scrape and for each of the 7K Kafka metrics? No.
micrometer-registry-prometheus is responsible for a well-defined and well-bounded task: return a bunch of metrics on each request. Basically, it must print out something like
http_client_requests_seconds_sum{client_name="localhost",error="none",exception="none",method="GET",outcome="SUCCESS",status="200",uri="/productDetails?productCode={productCode}"} 0.4315684
If that task works inefficiently, it must be solved within micrometer-registry-prometheus, not by putting the blame on "prometheus-metrics-model".
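For reference, a minimal sketch of that bounded task (class names assume the 1.13 registry and its io.micrometer.prometheusmetrics package; the meter itself is made up):

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.prometheusmetrics.PrometheusConfig;
import io.micrometer.prometheusmetrics.PrometheusMeterRegistry;

public class ScrapeExample {

    public static void main(String[] args) {
        PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

        // Register one meter; a Kafka-heavy application registers thousands of these.
        Counter.builder("http.client.requests")
                .tag("client.name", "localhost")
                .register(registry)
                .increment();

        // Every call renders all registered meters as Prometheus text exposition.
        System.out.println(registry.scrape());
    }
}
```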
See #5288, which should bring things back into alignment in terms of NamingConvention not being called during PrometheusMeterRegistry#scrape.
Continuing with the laptop load, I found that io.micrometer.core.instrument.config.NamingConvention#toSnakeCase is implemented inefficiently. I made https://github.com/micrometer-metrics/micrometer/pull/5271 for that.
It might be the case that it is inefficient, but the section you fixed (it's hard to tell without a JMH benchmark) has been there for 7 years, there was no change there between 1.12 and 1.13 where you are experiencing the issue, and I'm not sure it is significant on the flame graph.
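For illustration, a JMH benchmark along these lines (the class and sample meter name are made up for the example) would make the comparison concrete; running it with the JMH GC profiler (-prof gc) would also report allocated bytes per call:

```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

import io.micrometer.core.instrument.Meter;
import io.micrometer.core.instrument.config.NamingConvention;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class SnakeCaseBenchmark {

    // A Kafka-style meter name, similar in shape to the ones dominating the output.
    private final String meterName = "kafka.consumer.fetch.manager.records.consumed.total";

    @Benchmark
    public String snakeCase() {
        // Exercises the snake-case conversion discussed above.
        return NamingConvention.snakeCase.name(meterName, Meter.Type.COUNTER);
    }
}
```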
Does that mean the "prometheus-metrics-model" library holds a gun to the Micrometer team's head and demands that the method be called on every scrape and for each of the 7K Kafka metrics? No.
On the other hand, we would like Micrometer to be as close to the "official" behavior as possible, hence the calls to the Prometheus Client.
micrometer-registry-prometheus is responsible for a well-defined and well-bounded task: return a bunch of metrics on each request. Basically, it must print out something like ... If that task works inefficiently, it must be solved within micrometer-registry-prometheus, not by putting the blame on "prometheus-metrics-model"
I don't think I blamed prometheus-metrics-model by saying that the issue might be in Prometheus Client, because Micrometer's overhead on the flame graphs (the data you provided) seemed pretty minimal to me. I think there is a big difference between the two. :)
I think that IF there is a performance issue in a downstream dependency, it should be fixed there, not upstream; if I follow your logic, then a perf issue in any library your app uses should be fixed in your app since it is using it? I don't think so - the issue should be fixed closest to its cause. Based on #5288, it seems Micrometer is calling the convention many times for each scrape. Would you be able to check if that PR fixes the issue you are seeing?
The changes from #5288 are available in 1.13.2
Here are the results of my experiments on real usage.
I checked profiling on laptop usage, but I realized that the profiler output is not precise. The profiler output is a hint about where to search rather than a proof of inefficiency.
Nevertheless, my findings:
@michaldo thank you for trying out the changes and reporting back. It seems things are improved but still not as good as before. Do you think it makes sense to report your latest finding to the Prometheus Java client and close this issue, or do you think there's more we can do in Micrometer for this?
The answer requires an understanding of the Micrometer design. Simply put, it depends on the Prometheus Client's role. If the role is a black box, performance is a Prometheus Client issue. If the role is a building block, performance is a Micrometer issue or a Prometheus Client issue.
No doubt metrics should have as low an impact as possible. In my case several Kafka topics are used, which makes the number of metrics really high: 7K. Some performance bottlenecks that are not observable with several dozen metrics become observable with a few thousand. That is important context.
Nevertheless, all Kafka metrics work out of the box - I didn't add a single line of code. Three parties are involved in metric transmission: Spring Boot, Micrometer, and Prometheus Client. In the current case the Prometheus Client always modifies the given string to escape special characters. Considering that the input is very often stable, the escaping is a waste of resources. When all parties involved in the transmission are standardized libraries, they can agree on a transmission format and avoid the runtime conversion, as sketched below.
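As an illustration of avoiding the repeated conversion (a hypothetical helper, not existing Micrometer or Prometheus Client code):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import io.prometheus.metrics.model.snapshots.PrometheusNaming;

// Hypothetical helper: sanitize each label name once and reuse the result,
// instead of allocating a new String on every scrape for a stable input set.
public final class SanitizedNames {

    private static final Map<String, String> CACHE = new ConcurrentHashMap<>();

    private SanitizedNames() {
    }

    public static String labelName(String original) {
        return CACHE.computeIfAbsent(original, PrometheusNaming::sanitizeLabelName);
    }
}
```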
There is a difference in computer science between an API and a protocol. Here the API is the Prometheus Client and the protocol is
http_client_requests_seconds_sum{client_name="localhost",error="none",exception="none",method="GET",outcome="SUCCESS",status="200",uri="/productDetails?productCode={productCode}"} 0.4315684
The API is friendly, but not necessarily extremely efficient.
To sum up, I think it is worth considering the concept that Micrometer is responsible for efficient metrics transmission. Micrometer should be aware of how the transmission looks from source to target and arrange the process to avoid redundant, repeatable, resource-consuming actions. In particular, unnecessary String conversions should be avoided. With all of the above known, Micrometer should check whether the Prometheus Client meets the low-impact requirements and decide to: a) use the Prometheus Client efficiently, b) request a change in the Prometheus Client, or c) use the Prometheus protocol without the client.
I observed that the Micrometer registry consumes more memory when I switch Spring Boot from 3.2.6 to 3.3.1.
Environment
I attached a profiler report collected by async-profiler.
Green is the memory consumption for micrometer-registry-prometheus: scrape() takes 446 MB. Red is the memory consumption for micrometer-registry-prometheus-simpleclient (keeping Spring Boot 3.3.1 but switching to the simple client): 121 MB.
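For anyone who wants to reproduce this, a rough harness along these lines (package names assume the 1.13 registry; meter names and counts only approximate my Kafka load) can be run under an allocation profiler to compare the two registries:

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.prometheusmetrics.PrometheusConfig;
import io.micrometer.prometheusmetrics.PrometheusMeterRegistry;

public class ScrapeAllocationRepro {

    public static void main(String[] args) {
        PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

        // Roughly mimic thousands of Kafka meters with a few tags each.
        for (int i = 0; i < 7000; i++) {
            Counter.builder("kafka.consumer.fetch.manager.records.consumed.total")
                    .tag("client.id", "consumer-" + i)
                    .tag("topic", "topic-" + (i % 50))
                    .register(registry)
                    .increment();
        }

        // Repeated scrapes approximate what Prometheus does on every scrape interval;
        // attach an allocation profiler to this loop.
        for (int i = 0; i < 1_000; i++) {
            registry.scrape();
        }
    }
}
```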