WilliamLHarms opened 4 years ago
I am trying to understand what I see here. Before starting to dig deeper, I'd like to first understand some things: Could you share your promregator.yaml? Especially the values for cf.threadPool.size and promregator.scraping.threads are interesting to me.

An easily configurable option, through env var or setting in the promregator yml, that will allow us to set the necessary timeout interval.

From what I currently see, I'd say Promregator is totally overloaded. We must find the reason for this first - and I doubt that a timeout is the root cause for that...
"foundation" is a unique PCF instance on different cloud providers
Yes, 158 - and it's just the beginning.
9 orgs
We're using the auto-discovery endpoint with the Prometheus /discovery job config:
- job_name: ap01_promregator
  honor_timestamps: true
  scrape_interval: 15s
  scrape_timeout: 14s
  metrics_path: /metrics
  scheme: http
  file_sd_configs:
  - files:
    - /etc/prometheus/discovery/ap01_discovery.json
    refresh_interval: 5m
  relabel_configs:
  - source_labels: [job]
    separator: ;
    regex: (ap01)_promregator
    target_label: foundation
    replacement: $1
    action: replace
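For reference, the file referenced in file_sd_configs is a standard Prometheus file-based service discovery document. A sketch of what one entry produced by Promregator's /discovery endpoint might look like (the exact label set and hostname are assumptions - check your actual ap01_discovery.json; the GUID is one mentioned later in this thread, used purely as an illustration):

```json
[
  {
    "targets": [ "promregator.example.com:8080" ],
    "labels": {
      "__metrics_path__": "/singleTargetMetrics/13f26934-e03f-4673-88e7-448466e71d65/0",
      "org_name": "example-org",
      "space_name": "example-space",
      "app_name": "example-app"
    }
  }
]
```

The relevant point is that each target carries its own __metrics_path__ label pointing at a /singleTargetMetrics/... path, which takes precedence over the job-level metrics_path setting.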
We are not setting cf.threadPool.size nor promregator.scraping.threads
Our next steps were going to be setting up a Promregator in each org for each platform: 2 platforms x 2 for prod and non-prod, then multiplied by 2 for failover redundancy. That's a lot of Promregator instances to maintain. So yes, we are looking for better options that do not require us to have so many Promregators to maintain.
Also, the app names typically end in -blue/-green and 1/2 of the apps are typically on while the other 1/2 are off.
Our next steps were going to be setting up a Promregator in each org for each platform: 2 platforms x 2 for prod and non-prod, then multiplied by 2 for failover redundancy. That's a lot of Promregator instances to maintain. So yes, we are looking for better options that do not require us to have so many Promregators to maintain.
I'd suggest to understand first what's up with the current promregator. I don't know what it is right now, but I have the impression that we are still missing one vital point here.
Also, the app names typically end in -blue/-green and 1/2 of the apps are typically on while the other 1/2 are off.
That should be the least of the problem, but may cause you delay when you b/g deploy. But that's on a totally different sheet of paper.
I am still trying to get a picture of what is going on in your case. You could help me a little by providing the values/graphs of these metrics:
Unfortunately, the latter three metrics are only available when promregator.metrics.internal is set to true (see also https://github.com/promregator/promregator/blob/3c33ddb632c91ee1b3af0eb836ae2b6d9fa101f7/docs/config.md#option-promregatormetricsinternal-optional). Is it possible to set this option in your environment?
PS: It's in the middle of the night for me now. I'll have a look at this case tomorrow again.
If you want to experiment a little, here are some thoughts:
- Set cf.connectionPool.size to 32.
- Set promregator.scraping.threads to 14 (warning! memory consumption might go up and - if applicable - you might have to adjust the JVM memory configuration in case you configured it manually).
- Set cf.request.timeout.appInSpace to 3000.

@eaglerainbow - thanks for the tips. Very useful. I'll add all but promregator.metrics.internal to our non-prod env; if it seems to have an impact, I'll add it to prod. Typically, through the day when the devs are hard at work running builds / cf commands, we have issues with timeouts. I'll turn on that property in production then.
Get some sleep and thank you for the assist.
Just for my better understanding: Are you scraping the same set of applications in "non-prod" as in the "prod" environment?
Typically, through the day when the devs are hard at work running builds / cf commands, we have issues with timeouts.
That might or might not be a coincidence. Based on your graphs I would more assume that the problem is building up over time (see also JVM threads), due to the overload situation on the threads: The old threads are still blocked working (though the scraping interval for which they are running is already way gone) and hence are not available for the new scraping interval. The next scraping interval then has even less resources, which provokes even further issues. I have seen this pattern already too often. However, there is no simple cure for that (or at least I haven't seen it yet).
I'll turn on that property in production then.
If you are referring to promregator.metrics.internal: This property does not change Promregator's scraping behavior as such. It just triggers Promregator to fetch additional internal metrics, which helps with analyzing the situation. The reason why the switch is off by default is that the number of metrics gets huge (so the data volume is the problem). Hence, if you want to play it safe, I would suggest enabling it in non-prod first; if you see that you can live with the data volume, you may also want to enable it in the prod env.
Analyzing the graphs that you have sent above, here is what I get (in the sense of "thinking aloud"):
- Some of the figures I would like to check are not visible yet (they require promregator.metrics.internal).
- promregator.scraping.threads is currently set to 5 (default) ==> that's way too low. The question will be whether we have enough memory available to run more scraping threads.
- The default of cf.connectionPool.size is max(cores available, 8) * 2. Assuming that we have 8 cores available, this would be 16 threads. Hence, we would be able to cf-metadata fetch all the spaces in parallel, but would not have any connection free for orgs and other metadata requests (such as for the instance scanner or the target resolver) ==> looks tight, hence could cause some latency due to an exhausted pool, but at least the order of magnitude still fits. Setting the connection pool size to 32 would be a good idea (but most likely is not the "killer"). However, increasing this value is "bad" for the CF CC.

What puzzles me right now are the following observations:
- There are far more JVM threads than expected (promregator.scraping.threads is still set to 5)! Moreover, on the graph this value increases each time in a step-fixed manner. I don't get a clue from where this is coming. However, all these threads then also explain to me why the memory behavior gets so weird afterwards (too many threads doing non-sense things).
- What does the metrics_path: /metrics in your Prometheus configuration do there? The impact isn't 100% clear to me right now. However, if this ends up in Prometheus trying, for each single target, to query Promregator at endpoint /metrics (where all metrics are provided) and not at /singleTargetMetrics/*, then we will have both a big data mess and a huge processing issue! (Note: the endpoint's path is defined by the content of the file mentioned in file_sd_configs.) Can we ease my mind by either removing that line completely or setting it to some non-sense value (to ensure that, if that line gets used somehow, we will see an error message and hence notice that "something's really going wrong here")?

Pivotal resource is stating we should increase the timeouts to > 2.5s as the cloud controller timeout is at 30s when using it from the command line.
Requests from CF CLI are not comparable to CF CC API calls from Promregator. The CF CLI is intended for "single manual use", whilst Promregator is under permanent fire (with a cadence of 15s in your case). Requests to Promregator cannot be completed, if certain data from CF CC is not available in time. Setting CF CC timeout values in Promregator above the duration of the scraping interval is a very tricky thing and could cause major consistency issues in its caches. Promregator tries to be as efficient as possible to only ask for information which should be rather easy to determine. Unfortunately, there is rather a lot of by-catch that is rather expensive to generate. The client has no possibility to indicate to the CF CC that certain values do not matter and hence do not need to be generated.
Our Prometheus admin states that we have too many metrics on the containers and these need to be reduced.
Out of the data provided so far, I don't see any evidence for this statement. In contrast, the scrape request sizes look rather common to me (I don't want to claim anything on the load on their Prometheus' instances - so my statement is restricted to Promregator only here). However, there could be a point in that, by noticing that some app containers are rather slow in answering the scraping requests (see statement on foundation "an01" and "vn01" above). If a reduction of metrics on these containers allowed accelerated scraping by Promregator, then this would simplify the entire scraping process.
@eaglerainbow , thank you kindly for the great support in answering my questions.
To clarify: ap01, vp01, an01, vn01 are PCF platforms/foundations - p for prod, n for non-prod.
Also, I made all the property changes you suggested to our an01 environment, moving forward all data and graphs will be based upon an01 only.
cf:
  api_host: ${promregator.url}
  username: ${promregator.username}
  password: ${promregator.password}
  connectionPool:
    size: 32
  request:
    timeout:
      appInSpace: 3000
logging:
  level:
    org:
      cloudfoundry:
        promregator: ERROR
promregator:
  metrics:
    internal: true
  scraping:
    threads: 14
We are still seeing drops in container count (3:15-ish in the panels below). While I would expect some random scaling, I highly doubt we're getting devs shutting off apps.
--To your first set of questions--
We have significantly fewer containers running in the an01 foundation than in the ap01 foundation and are seeing the same problem as in ap01. The problem does materialize itself in the an01 env. Moving forward I'll be sharing data for an01, for simplicity.
That's an interesting point about the thread contention, old thread is blocked because it's still scraping while the next thread is starting up.
I have turned promregator.metrics.internal on in the an01 environment and built Grafana panels; their screen caps are below.
---- second post ---
Question 1: 50s scrape duration:
Yes, I agree - this does allude to your earlier point about JVM thread contention. For some perspective on the "worst app" from my Prometheus admin, here is the timing of curling the app's Prometheus scrape endpoint from the Promregator server itself (just getting a random container picked by the Go Router):
an01 - 126 lines
time_namelookup: 0.006
time_connect: 0.093
time_appconnect: 0.410
time_pretransfer: 0.410
time_redirect: 0.000
time_starttransfer: 0.548
----------
time_total: 0.636
ap01 - 1607 lines
time_namelookup: 0.006
time_connect: 0.093
time_appconnect: 0.402
time_pretransfer: 0.402
time_redirect: 0.000
time_starttransfer: 0.501
----------
time_total: 0.764
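For context, timings like the above can be produced with curl's --write-out option; a sketch of such a command (the URL is illustrative, not from this thread):

```shell
curl -s -o /dev/null \
  -w 'time_namelookup: %{time_namelookup}\ntime_connect: %{time_connect}\ntime_appconnect: %{time_appconnect}\ntime_pretransfer: %{time_pretransfer}\ntime_redirect: %{time_redirect}\ntime_starttransfer: %{time_starttransfer}\ntime_total: %{time_total}\n' \
  https://worst-app.apps.example.com/metrics
```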
The native scrape is very fast. Is there a way that I can isolate the scrape duration from the call to the Cloud Controller via Promregator's internal Prometheus metrics? Isolate that time and the time it takes to ingest the scrapes from the containers? It would help out a lot to know where the problem lies.
Also, when I ask for more Cloud Controller workers, the PCF admin says "you're throwing 499s; increase your timeout."
Question 2: Scrape request size: Yeah - there are times, under heavy usage, when the scrape grows due to the custom metrics of one of the apps, which tracks the response times of the methods that massage external dependencies. See the flux in line count between prod and non-prod above.
Question 3: See below for promregator.metrics.internal grafana panels
Question 4: seems like that on these landscapes we have some poorly performing implementations of the /metrics endpoint!
Oh boy, yeah, that's my Prometheus admin channeling his rage through you :) Honestly, we are using the Micrometer libs in Spring, and have switched to quantiles from histograms. There's some Steeltoe in there as well, with the Prometheus libs on top of that. Let me know if you want specific NuGet or Spring versions.
I set the promregator scraping threads to your suggestion of 14, should I go higher?
Question 5: MORE THREADS!!!!
Yes, we have a very big cloud foundry footprint and have only started rolling product teams onto the idea of using more white box metrics with Micrometer.
Everything works swell except our promregator. We know that at some point, because of the magnitude we will need multiple promregators per platform but we don't know the trade offs. I really appreciate your write up here on the implications of the cf.connectionPool.size. I'll check with our Platform team to make sure they can scale the cloud controller workers as we light up more teams. But one thing that stands out, the 2.5s 499 response from the Promregator makes them very grumpy. Can that be tweaked? What are the implications to our scrape interval, etc? Can we just double everything?
Question 6: It is unclear whether this is really latency introduced by the CF Cloud Controller (waiting for the response over the network) or whether we even did not ask the CF CC yet due to a) missing connection pool capacity (see above), b) due to underlying JVM overload situation
I'll continue to go down the route of 1 Promregator for the big orgs on each platform. Maybe we can isolate the bigger orgs from the smaller. I'll have a PoC of this in production, using the latest Promregator you pushed with the JSON logs, on Monday.
Also, we have some other APM tools that I'll get on the promregator box to see what's what.
Would a thread dump when the Promregator is janky help? Also, logs? (But those would have to be shared privately.)
--Puzzlement Questions-- Q1: promregator.scraping.threads is set to 14 now.. New panels below.
Q2: new panels only show an01
Q3: "What does the metrics_path: /metrics in your prometheus configuration do there?" I think that's just noise; when I check our Prometheus targets under Service Discovery, we are using /singleTargetMetrics and the scrapes contain data relevant to only one app - all containers within that app. I almost always see context deadline exceeded in the targets with state down. (Is it as simple as adjusting our scrape duration and iteration up?)
--Last post-- CF CLI / CF CCAPI What would you like to see our prometheus iterations be set to given what you see?
Simplifying container metrics: That does align with the Prometheus admin's comments. To that end I've taken the worst-performing ingest and factored out the histograms. While I'll miss the heatmaps, I need something rather than gaps in Grafana and our "absent" alerts firing and waking me up at night :)
Charts and Graphs - Please note that all property file changes were applied by 23:30
Thank you for cleaning everything up - and testing my suggestions! On first glance, the graphs already look significantly better. Apparently still we have the same problem. As you were able to reduce the scope (by still keeping the issue), it makes life for us much easier for sure!
I have only limited time right now, but here are the "low-hanging fruits":
Is there a way that I can isolate the scrape duration from the call to the Cloud Controller via Promregator's internal Prometheus metrics? Isolate that time and the time it takes to ingest the scrapes from the containers? It would help out a lot to know where the problem lies.
I think what you are searching for is promregator_request_latency: that's the time Promregator needs to open a connection to the ha_proxy, get the connection to the scraping endpoint, read the response from there (whether failed or not) and close the connection. Any processing of the data received, or any time necessary for fetching other kinds of metadata, is not included.
I set the promregator scraping threads to your suggestion of 14, should I go higher?
Currently, I don't see any issue with thread congestion on the pool. With the heap memory free that you have, you can safely do this. (In general, increasing that pool value is perfectly fine, as long as you don't overshoot memory usage. If you over-do it, nothing severe will happen immediately - the resources offered just won't be used; however, you may run into issues if you over-commit and the thread resources do get used whilst your server is not able to stand the load.)
However, I still have a little issue with understanding the scenario:
But one thing that stands out, the 2.5s 499 response from the Promregator makes them very grumpy.
I can understand :smile: - Promregator is also "grumpy" about not getting the data in time :smile:. It would be even worse for them if we did not 499 them but just went on - then over time we would also fill up their inbound HTTP workers' thread queue with requests which had already died (and which no-one is listening for anymore). So, yes, 499 isn't nice, but still the better choice.
I have the impression that in the meantime we have moved away from the initial scraping-thread problem, turning more into the metadata issue.
Can that be tweaked? What are the implications to our scrape interval, etc? Can we just double everything?
In essence, yes. If you have a 30s scraping interval, then you can also spare twice as much time for reading metadata. In a nutshell it is this: Promregator may need up to 4 metadata requests to the CF CC. Metadata stuff is cached, so in many cases you are just fine with it. But once in a while you may have bad luck and all four requests have a cache miss. If the sum of the latencies of all your metadata requests, plus the time required for a single scrape request to the container, is longer than your Prometheus scraping interval, you have a severe situation: Threads will overtake themselves, as Prometheus will already have fired the next request to Promregator even before the "paperwork has finished". If you have too many targets (a.k.a. containers) to be scraped, you will run into thread pooling issues, as the old threads aren't freed up yet. (It's a little more complicated than this, but that's the major risk here.)
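To make the arithmetic concrete, here is a small sketch with assumed numbers (the latencies are placeholders for illustration, not measurements from this thread):

```python
# Worst case per scrape: all four CF CC metadata caches miss at once,
# and each metadata request takes close to its timeout.
metadata_requests = 4
metadata_latency_s = 3.0   # assumed CF CC latency (cf. appInSpace = 3000 ms)
container_scrape_s = 0.7   # assumed time to scrape the container itself
scrape_interval_s = 15.0   # Prometheus scrape_interval

worst_case_s = metadata_requests * metadata_latency_s + container_scrape_s
print(f"worst case: {worst_case_s:.1f}s vs interval {scrape_interval_s:.0f}s")
# If worst_case_s exceeded the interval, threads would start overtaking
# each other; here it still fits, but only tightly.
```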
I'll have to get going now; I will have another closer look at the new graphs again (and all your other statements), but that'll be quite some time later...
The native scrape is very fast.
I agree to this assessment.
It would help out a lot to know where the problem lies.
Given your figures, I currently would assume that we have fewer issues on "an01" with scraping the metrics itself, but more with fetching the metadata.
Also, when I ask for more Cloud Controller workers, the PCF admin says "you're throwing 499s; increase your timeout."
From their perspective this is very logical. From Promregator's PoV this is a non-trivial thing.
Honestly, we are using the Micrometer libs in Spring, and have switched to quantiles from histograms. There's some Steeltoe in there as well, with the Prometheus libs on top of that. Let me know if you want specific NuGet or Spring versions.
There is nothing bad with using Micrometer in Spring, even with histograms. The registry implementation there usually performs very well if you store the values in the registry. However, if you use the callback logic (there is one approach in each of the libraries, Micrometer and the Prometheus client lib), then you can make big mistakes - calling just another getter method (such as "getSize()") certainly isn't one, but you could also place logic there with significant algorithmic complexity, which then would yield TCO for Promregator/Prometheus.
I set the promregator scraping threads to your suggestion of 14, should I go higher?
As mentioned above, you may do so. Just keep an eye on your jvm heap memory that it does not skyrocket. On the other hand, in the current scenario (assuming that the Promregator instance is responsible for an01 only), I don't think that it will provide any further positive result.
Yes, we have a very big cloud foundry footprint and have only started rolling product teams onto the idea of using more white box metrics with Micrometer.
:smile:
Everything works swell except our promregator
That's bad :cry:
But one thing that stands out, the 2.5s 499 response from the Promregator makes them very grumpy. Can that be tweaked?
All CF Cloud Controller timeouts can be tweaked in Promregator. It can be done using the parameters cf.request.timeout.*. The property of most relevance for you most likely is cf.request.timeout.appInSpace. Theoretically, however, the same problem may also arise in all other request cases (all the other cf.request.timeout.* parameters).
From the right hand side of your graph at https://user-images.githubusercontent.com/37589680/68530613-c2f0b300-02d7-11ea-945e-0f887f7ad938.png you can see that the new value of 3000 ms "sometimes" is enough to get the necessary metadata from the CFCC. At roughly 50% of the time, the CFCC can respond much faster - every other request, however, it is slower. This currently creates the impression to me that there might be two (sets of) instances - one running on "slower" hardware, the other on faster hardware. If your request lands on the faster one, we can keep the timeout; if it is handled by the slower one, we have a situation. That might be something worthwhile to ask your CF admins: whether they could check how uniformly computing power is available across their CFCC instances.
I would be curious to get to know what the "required timeout value" would be. I'd suggest running two experiments to achieve this:
First, set cf.request.timeout.appInSpace to 30000 and cf.cache.timeout.application to 30. The latter will ensure that we will ask for the application-level metadata every 30 seconds (which is way too short for productive usage, but we want to understand how things are going). Look at the promregator_cffetch_latency_bucket for the allAppsInSpace single-page metric. What's the 98th percentile's value there? That would be the value you would have to enter as cf.request.timeout.appInSpace. Is this value realistic? Can scraping still work without causing thread contention?
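In Grafana/PromQL, that 98th percentile could be read off with a query along these lines (a sketch - the exact label name and value used to select the allAppsInSpace request type is an assumption; check the label set your Promregator actually emits):

```promql
histogram_quantile(0.98,
  sum by (le) (rate(promregator_cffetch_latency_bucket{request_type="allAppsInSpace"}[5m])))
```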
There is a second approach by which we can derive that value (and hence triangulate it): Run

$ time cf curl /v2/apps?q:space_guid=xyz

several times manually, where xyz should be replaced by one of the spaces' guids (kindly let me know in case you don't know how to get it).
(Note: The request sent by Promregator is much more complex than this, but with this we should get a rather good first estimate.)
For instance, with a space containing two apps, I currently get here
real 0m0,627s
user 0m0,000s
sys 0m0,015s
and that's from my home...
I'll have a PoC of this in production, using the latest Promregator you pushed with the JSON logs, on Monday.
I am interested in hearing your results on this.
Also, we have some other APM tools that I'll get on the promregator box to see what's what.
Might be helpful, but I don't expect much more insight than we have already today.
Would a thread dump when the promregator is jenky help?
I don't think so: we already know where the madness is mainly coming from. But we should keep this in the back of our minds (for instance, on this "large number of threads" thing, it could eventually become helpful).
Also, logs?
Those might help a little more, as they might give more insight on when which issue (and which metadata request) starts to fail.
(But those would have to be shared privately).
Sure. There is even a policy and a procedure for that case. For details, you might want to give https://github.com/promregator/promregator/blob/5ddd47583f5aeac7e0b80d76de83ef117c815898/docs/analysis-material.md a read.
I think that's just noise; when I check our Prometheus targets under Service Discovery, we are using /singleTargetMetrics and the scrapes contain data relevant to only one app - all containers within that app.
Then I feel reassured.
CF CLI / CF CCAPI What would you like to see our prometheus iterations be set to given what you see?
I currently don't know. That is why I have suggested above determining the timeout value that the CFCC in practice requires us to have. Let's determine this value and then let's talk about its implications. One idea: Perhaps we can reduce the outer timeout values, which would then buy us time for the appsInSpace topic.
That does align with the Prometheus admin's comments. To that end I've taken the worst-performing ingest and factored out the histograms. While I'll miss the heatmaps, I need something rather than gaps in Grafana and our "absent" alerts firing and waking me up at night :)
I would rather not lose hope on that. First, I'd like to see a graph of promregator_request_latency against promregator_scrape_duration.
With all the computing resources that you are willing to throw on Promregator, scraping request latency does not seem to be our major concern (and that would be the only thing you could influence with having the histograms removed).
-- Some Corollaries:
Side remark: If we had #116 available, this could potentially also help here in this case.
Thanks for the tips. :) After letting the new config run for a while, I'm not seeing failures on the Promregator side; the issue has now moved to the Prometheus scrape interval, given what I've learned from this thread (thank you).
These errors seem to rotate across different apps/containers, always at 10s - even on apps that I know have lightning-fast scrapes and only 2 containers.
/singleTargetMetrics/13f26934-e03f-4673-88e7-448466e71d65/2: context deadline exceeded
/singleTargetMetrics/b24aa087-b6d4-4120-80c9-15705344dbd7/0: context deadline exceeded
Is this because cf.request.timeout.appInSpace was increased, and in turn it pushed the Prometheus scrape to time out? Maybe set the Prometheus config for 30s scrape intervals? Do you concur, Dr. Eagle?
Also, for our prod env ap01, I think it's safe to update the pools, threads, and I'll add the promregator metrics to the mix. Just to help monitor the situation.
I'll follow up on our NP CFCC and CCAPI hardware. Honestly, since this is all running on a large cloud provider, I think we're on the same hardware. We may just have different quantities of CFCCs and CCAPI workers - more in prod and fewer in NP. In NP the scrape durations are sub-second, then during the week they go up to 7/8/9s and start to bomb out.
We have some apps that we scale out to 200+ containers at certain times of the year.. What are your opinions on CCAPI and Promregator sizing given the scope?
Also, regarding your point about the cache ticket #116, if you build it, I'll test it.
--- Experiment ---- Update the appInSpace timeout to 30000 and cache timeout to 30 with the threading and connection pools as previously discussed.
See how Prometheus reacts. Dr. Eagle, for the next test, can you share your hypothesis and expectations for the next test we try?
cf:
  api_host: ${promregator.url}
  username: ${promregator.username}
  password: ${promregator.password}
  connectionPool:
    size: 32
  request:
    timeout:
      appInSpace: 30000
  cache:
    timeout:
      application: 30
logging:
  level:
    org:
      cloudfoundry:
        promregator: ERROR
promregator:
  metrics:
    internal: true
  scraping:
    threads: 14
  authenticator:
    type: none
Results:
real 0m4.344s
user 0m0.113s
sys 0m0.031s
real 0m4.312s
user 0m0.137s
sys 0m0.034s
real 0m3.884s
user 0m0.115s
sys 0m0.032s
real 0m4.061s
user 0m0.115s
sys 0m0.028s
real 0m4.492s
user 0m0.110s
sys 0m0.036s
real 0m4.156s
user 0m0.114s
sys 0m0.033s
These errors seem to rotate across different apps/containers, always at 10s - even on apps that I know have lightning-fast scrapes and only 2 containers.
That is hardly surprising: The majority of latency is induced by the requests to the CFCC. In most cases the values are cached. It remains random which container is being requested when the cache expires (and a new request to the CFCC is being triggered). Until the CFCC request is completed, no scraping of the container can take place (side remark: changing the latter is exactly the idea of #116).
Is this because cf.request.timeout.appInSpace was increased, and in turn it pushed the Prometheus scrape to time out? Maybe set the Prometheus config for 30s scrape intervals? Do you concur, Dr. Eagle?
Yes, the error messages produced by Prometheus tell exactly this: We did not get a response from the Promregator endpoint in time, i.e. before the next scraping request (to the same target) was initiated.
Having done so, you may then "pray" that we always ask "concurrently" for only "one or two" responses from the CFCC; in practice we will almost never have the situation that we send all four requests at once (besides perhaps once, when we start Promregator initially). Even if that situation occurred, the worst thing to happen then would be a single synchronous missed data point in all metrics (it may happen in multiple application containers in parallel).
What do you think about this approach?
BTW No. 1: In the graph "Promregator latency by foundation" we always see that the maximal latency is below 10s. If the scrape interval was 15s, then I would assume that performance was good enough. So it's a little surprising to me that we see error messages from Prometheus like context deadline exceeded. :confused: Or have you set scrape_timeout to 10 (at Prometheus) in the meantime (initially you mentioned 14s)?
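If you do move to a 30s cadence, the corresponding Prometheus job settings would look something like this (a sketch only; the key point is keeping scrape_timeout below scrape_interval):

```yaml
- job_name: an01_promregator
  scrape_interval: 30s
  scrape_timeout: 25s
  file_sd_configs:
  - files:
    - /etc/prometheus/discovery/an01_discovery.json
    refresh_interval: 5m
```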
BTW No. 2: I am a little confused about the graph "Promregator latency by Foundation" in general: What is so different in vn01 that (apparently) makes the metadata requests so much faster that promregator_scrape_duration_seconds is halved? :confused: Can we see the promregator_request_latency in parallel to that (perhaps then not maxed individually, but their average or so) -- to prove that the difference comes from fetching metadata?
Then during the week, they go up to 7/8/9 then start to bomb out.
Indeed, this could be an indicator that NP does not have enough hardware resources provisioned to keep up with the load during the week. This makes setting the appropriate timeouts in Promregator a nightmare. You can also go into this situation optimistically by just setting the timeouts high enough - but then you risk the `context deadline exceeded` messages again: Prometheus is requesting data faster than Promregator can fetch the metadata to deliver it. There is little way out of this.
We have some apps that we scale out to 200+ containers at certain times of the year.. What are your opinions on CCAPI and Promregator sizing given the scope?
Disclaimer: I have too little operations/sizing knowledge about the CFCC to give any sizing suggestions.
In general, many containers of the same app are less problematic than having "many apps" in a space: the CFCC will just tell us "there are 200 containers for the same app" - Promregator will unfold it internally only. So the CFCC can be rather relaxed (it's always only "one line" for the CFCC).
Promregator has the more critical case here: For your "200 instances per app" situation, there would also be 200 /singleTargetMetrics lines in the discovery document. Prometheus would then start to scrape with 200 targets in parallel, creating 200 HTTP requests to Promregator within its scrape interval (I assume 15s). This means ~14 connections per second. The vast majority of metadata for these requests can be fetched efficiently from the cache, so we "only" need to forward the connection requests properly to the containers. However, this means that you must have enough promregator.scraping.threads configured (in this case, assuming that the average latency to scrape the metrics from the containers is 1s: at least 14). Note that these 14 would be necessary for these 200 containers alone, i.e. if you also have other containers to be scraped at the same time, you have to increase promregator.scraping.threads accordingly. As already mentioned some time before, this means that you will have more CPU and memory consumption on the Promregator's host. There is no way of determining upfront what this means in GBs - for instance, it also depends on the data volume the metrics endpoints provide (as Promregator has to decode and repackage all metrics on the fly). Therefore, my suggestion would be to simply test it out, looking at the JVM heap (and scale up linearly, if applicable).
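The sizing logic above can be sketched numerically (numbers taken from the scenario described; the 1s average latency is the assumption stated above):

```python
import math

targets = 200              # /singleTargetMetrics lines in the discovery document
scrape_interval_s = 15.0   # Prometheus scrape interval
avg_latency_s = 1.0        # assumed average per-container scrape latency

# ~13.3 new connections per second arrive; each occupies a thread for ~1 s.
requests_per_s = targets / scrape_interval_s
min_threads = math.ceil(requests_per_s * avg_latency_s)
print(min_threads)  # 14 - for these 200 containers alone
```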
NB: If you get more than 16 connections to Promregator in parallel, you might also need to increase the Spring property server.tomcat.max-threads (from 16 to some higher value, see also https://github.com/promregator/promregator/blob/5ddd47583f5aeac7e0b80d76de83ef117c815898/docs/config.md#changing-the-number-of-http-worker-threads). Otherwise, inbound HTTP requests from Prometheus would start to get queued by Tomcat.
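A minimal sketch of how that property could look in the Promregator configuration YAML (the value 32 is just an illustrative choice, matching the connection pool size used later in this thread):

```yaml
server:
  tomcat:
    max-threads: 32  # default is 16; raise when Prometheus opens more parallel connections
```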
Also, regarding your point about the cache ticket #116, if you build it, I'll test it.
Thanks. But exactly this building is the critical thing here. All this caching stuff in Promregator is rather complicated, and you can easily shoot yourself in the foot if you forget to consider some race condition that may happen (and you know what Murphy's Law says on this). Hence, please don't expect anything soon. I'd still consider this topic to be in the "research" stage so far.
See how Prometheus reacts. Dr. Eagle, can you share your hypothesis and expectations for the next test we try?
Post-delivering it: Hypothesis: We might run into "context deadline exceeded". Some metrics will be dropped (either by Promregator or Prometheus). Promregator will keep running over 5-10 minutes. I would not dare to run it too long, though, as you might get threads overtaking each other - and you'll cause a big mess.
Expectation: We get to know the realistic value of how long the CFCC requires to answer our requests. With that we can determine "reasonable" timeout values for Promregator. We need (say) the 98th percentiles of all the different types of CFCC requests that we send. Once we have all of them determined, we can check whether it is reasonable to stay on scrape interval of 15s. If not, we know it's impossible to keep the pace due to what the platform delivers to us. We then might have to increase it to 20s or even 30s.
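The percentile-based derivation described above can be sketched in a few lines of Python. The latency samples here are hypothetical; in practice they would come from the promregator_cffetch_latency histograms:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile (p in 0..100) of a list of latency samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical CFCC request latencies (ms) observed for one request type.
latencies_ms = [350, 400, 420, 500, 650, 700, 900, 1200, 3800, 4500]

p98 = percentile(latencies_ms, 98)
timeout_ms = int(p98 * 1.1)  # ~10% headroom on top of the 98th percentile
print(p98, timeout_ms)  # -> 4500 4950
```

Repeating this per request type yields a "reasonable" timeout for each of the cf.request.timeout.* options.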
Does this make sense?
-- On the results:
We had even more context deadline exceeded errors from Prometheus, almost 1/2 of the targets
That's not really surprising. I did not intend to look at what Prometheus is able to retrieve; the purpose was only to get the "reasonable timeout values" determined.
(10s is the config)
What do you mean by that?
real response from the curl, from the same box hosting the promregator that's reaching out to an01: 0m3.884s / real 0m4.492s
That means that our cf.request.timeout.appInSpace should be around 4500-4800 or so. Strange that the graphs show slightly lower values (4000).
When I curl the application's /actuator/prometheus endpoint (just getting a random container from the Gorouter), I get the data in < 1s.
That's clear: You are measuring promregator_request_latency here, but you (apparently) try to compare it against promregator_scrape_duration, which mainly is "promregator_request_latency + metadata retrieval time + some_really_small_overhead".
While with the promregator sometimes prometheus hits the 10s wall.
Yes, that's (to all we currently know) due to metadata retrieval time. You see that only "sometimes", because usually the metadata cache of Promregator is already filled (and valid). Only if some cache is not filled/must be refreshed, you get the peaks.
Next steps to my mind here would be: look at promregator_request_latency and promregator_scrape_duration per foundation.

I hope that all my chatting above (it's a lot of text, obviously) helps you to get a better understanding of the situation!
@eaglerainbow - Thank you for taking the time to answer all my questions, bearing with the large amount of data I have been throwing at you. I really do appreciate the help and hope to help return the favor one day.
I'm using the following configuration in our ap01 (prod environments, running in the prod space of the same foundation). Also, we moved Prometheus to a 25s scrape timeout with a 30s scrape interval.
Our platform engineers, who called me out for calling them "resources", are taking a look at why the performance is so bad on the a*01 platforms and so good on the v*01 platforms. I am hoping they will find the problem (I wrote up a decent set of steps to take to troubleshoot this). If I get to a shareable RCA, I will share it. They were very surprised to see the metrics regarding latency from within the platform and the details from the test below.
Regarding the graphs, one warning: I couldn't find the metric for metadata retrieval time. Tiles and Prometheus queries are below.
I came to these settings by doing the test below to get the basic timings, then increasing the timeouts to be above the average (15s). The promregator.scraping timeouts that I have set seemed to get rid of the Promregator errors that were being generated. But yeah, our a*01 environments' Cloud Controller API is very latent.
cf:
api_host: secret_api_host
username: promlogreader
connectionPool:
size: 32
request:
timeout:
appInSpace: 15000
org: 15000
appSummary: 15000
cache:
timeout:
application: 10
logging:
level:
org:
cloudfoundry:
promregator: ERROR
promregator:
discovery:
hostname: secret-hostname-for-discover
port: 443
authenticator:
type: none
metrics:
internal: true
requestLatency: true
scraping:
threads: 14
connectionTimeout: 10000
socketReadTimeout: 10000
The Test
time nc -zw30 api.sys.ap01.somePlace.com 443
time nc -zw30 api.sys.ap01.somePlace.com 443
Both had sub-second responses.
Next, isolate how long it takes to do a curl to pull details from the org. Do a cf login to each platform then:
cf org --guid the-org-name
time cf curl /v2/apps?q:space_guid=secretGuid
In vp01 it's close to time nc command. In ap01, it's 10-20s
Our Platform Engineers are going to take a deeper dive into why.
Charts And Graphs On The New Config
1.0 - promregator_request_latency_bucket
max(histogram_quantile(1.0, sum(rate(promregator_request_latency_bucket{job="$job"}[1m])) by (le, job )) )
max - promregator_scrape_duration_seconds
max(promregator_scrape_duration_seconds{job="$job"})
MAX-promregator_scrape_duration_seconds - 1.0 promregator_request_latency_bucket = meta_data_calculation
sum(max(promregator_scrape_duration_seconds{job="$job"})) - sum(max(histogram_quantile(1.0, sum(rate(promregator_request_latency_bucket{job="$job"}[1m])) by (le, job ))) )
^^ That last blip is when I did a cf push with the connection and socket read timeouts increased to 10s from 5s. I was getting a lot of errors without that increase.
@eaglerainbow - one follow-up question: I'm having a real hard time understanding the Prometheus metrics for Promregator itself. Can you help me narrow down how I can obtain:
I am trying to isolate the cost of each call within Promregator to provide a scrape to prometheus.
Also, to scale, should I increase containers or more unique application instances?
time cf curl /v2/apps?q:space_guid=secretGuid
In vp01 it's close to time nc command. In ap01, it's 10-20s
Wahoo! 😄 That's really a lot!
Our Platform Engineers are going to take a deeper dive into why.
Sounds like a good plan to me. Either the CFCC containers are hopelessly overloaded or perhaps the backing database has degraded. In any case the performance is too sluggish for doing something such time-critical as scraping.
If I get to a shareable RCA, I will do so.
Would be really great to have that here as a reference, because I wouldn't be too surprised if similar things also happened to other Promregator users.
I'm having a real hard time understanding the prometheus metrics for promregator itself.
Yeah, I do understand; also the graph's configuration above is a little strange.
In general, I'd propose giving https://github.com/promregator/promregator/blob/5ddd47583f5aeac7e0b80d76de83ef117c815898/docs/enrichment.md#additional-metrics-measuring-the-communication-to-the-targets a read. However, not all metrics are mentioned there (some of the metrics we are using here are not documented officially). To get these "additional" metrics, you have to enable promregator.metrics.internal.
So, let me try to explain a little:
Duration and volume of calls to the CF API
Promregator performs several calls to the CF API. All the performance information on that one is stored in the metric promregator_cffetch_latency_bucket, which is only visible with promregator.metrics.internal enabled. The number of requests sent can be found at promregator_cffetch_latency_count.
Each metric promregator_cffetch_latency_* has a label called request_type, which describes the type of CF API call that is performed. For example: request_type="org" is fetching the CF org data (mapping of org names to org IDs) - see also https://github.com/promregator/promregator/blob/5ddd47583f5aeac7e0b80d76de83ef117c815898/src/main/java/org/cloudfoundry/promregator/cfaccessor/ReactiveCFAccessorImpl.java#L228
Note that the different request_types have dependencies amongst each other: For example, you can only retrieve space data if you have read the org data in which those spaces are located (otherwise you wouldn't know for which spaces you should be querying, as "reading everything" is a Bad Idea™️). That is also why one scraping request may trigger multiple sequential metadata requests to the CFCC, and all of them need to be completed before you may send the scraping request (and if the metadata requests are too slow, ... well - you get me 😉).
There are two special request_types out there: space and allApps (maybe even more in the future). Each of them in fact splits up into two request_types: space + space_singlePage and allApps + allApps_singlePage. The background to this is that it may happen that not all necessary data can be fetched with one page (i.e. with one API call), but data must be read in a paged manner. The *_singlePage request_type provides the time for retrieving each page, while the request_type without the suffix documents the "end-2-end time" for all pages (combined).
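As a hedged PromQL sketch (assuming promregator.metrics.internal is enabled; the 0.95 quantile and the 5m window are arbitrary choices), the end-2-end and per-page latencies of a paged request type can be compared like this:

```promql
# 95th percentile of the end-2-end "space" metadata request time (all pages combined)
histogram_quantile(0.95,
  sum(rate(promregator_cffetch_latency_bucket{request_type="space"}[5m])) by (le))

# 95th percentile of the retrieval time per individual page of the same request type
histogram_quantile(0.95,
  sum(rate(promregator_cffetch_latency_bucket{request_type="space_singlePage"}[5m])) by (le))
```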
Timeouts for these requests are controlled by the configuration options cf.request.timeout.*.
Unfortunately, the names of the CF API calls do not 100% match the configuration options (this has historic reasons, plus I don't want you to get exposed to all the internal complexity - the metrics are anyhow a "geek tool", so you need to know what you are doing...).
Just for the sake of completeness of this post, there are also the promregator_autorefreshingcachemap_* metrics, which deal with the caches of the various request_types. For example, we can see there how many cache hits and misses we had, when the last cleanup scan took place, whether we had broken cache records, how long scanning took, and how large the cache is (that is how I can determine from the metrics how many orgs/spaces/apps you have on the platform).
Duration and volume of calls to the scrape pages of each application instance
The latency (duration) for scraping at the containers (application instances) is available at promregator_request_latency_bucket.
The number of scraping requests sent to the containers is available at promregator_request_latency_count.
(Both metric values are only available if you enable promregator.metrics.requestLatency. Warning! Be aware of the data volume, as the granularity is "per application instance"! Don't have it enabled with 100 containers, or Prometheus might take it amiss.)
The data volume returned by the containers when being scraped is available at promregator_request_size_bucket.
Where is the timeout set in Promregator for the cffetch_latency_bucket? I would like to raise it above 5s and see if that lowers the number of promregator_request_failure (I'm assuming that counter is a generic failure at any place in the flow of providing metrics).
Well, not really π
promregator_request_failure is used to indicate whether the scraping from a container failed (or the necessary post-processing of the metrics provided by the scraping target was not possible). See also https://github.com/promregator/promregator/blob/5ddd47583f5aeac7e0b80d76de83ef117c815898/src/main/java/org/cloudfoundry/promregator/fetcher/CFMetricsFetcher.java#L179 on this. At that point, calling the CFCC has already happened long ago - so these two topics should be rather uncorrelated.
If you run into a timeout situation and get promregator_request_failure logged, something went wrong with the containers. Then, my next look would be into Promregator's logs, because each failure should raise a log warning there (usually by the logger named org.cloudfoundry.promregator.fetcher.CFMetricsFetcher).
The timeout to wait for the scraping results of containers is controlled by the configuration option promregator.scraping.maxProcessingTime (and partially also by promregator.scraping.connectionTimeout and promregator.scraping.socketReadTimeout).
Where the timeout is set in Promregator for the cffetch_latency_bucket
As said above, promregator_cffetch_latency_* and cf.request.timeout.* belong together.
I am trying to isolate the cost of each call within Promregator to provide a scrape to prometheus.
That you can find in the metric promregator_scrape_duration_seconds: It's the latest value of the "Prometheus-perceived latency that Promregator produces". Being an end-2-end metric, it includes everything:
I had briefly thought about making this a histogram, but no-one really had asked for that so far. And as in Single Target Scraping mode the metric is also on "per application instance/container" level, a histogram would have blown up Prometheus again.
Also, to scale, should I increase containers or more unique application instances?
Depends on what you want to create pressure in Promregator:
(stressing only the metadata fetching cannot be achieved).
Please let me know in case you still have any issues in understanding "what is what" - this stuff really is tricky (some might even call it "wicked").
@WilliamLHarms just fyi: there is some movement at #116
@eaglerainbow - some good news. We figured out the issue. Our UAA was a victim of a "noisy neighbor". We scaled the UAA and restarted the VMs for the existing ones, and latency for our off-prem (an01/ap01) cloud platforms just about matched with the standard configuration:
We did however leave the config in there for the additional threads and pools.
cf.connectionPool.size: 32
promregator.scraping.threads: 14
I put a quick write up of our final config yamls to get us through the "dark times" below.
Good to hear about #116 getting some traction. I would be happy to try it out in one of our non-prod clouds. (Still working on the rollouts for the JSON-enabled Promregator.)
One final question, I'm trying to work out a simple Grafana panel, using Promregator debug metrics that show:
This is what I have so far in Grafana. Please let me know if this is what I'm looking for. (Yes, I did read the metrics pages you posted before, but I couldn't align them with the available metrics I had.)
I swap out the max() for stddev() for some panels, avg() as well. Where $job is the promregator instance.
--worst possible round trip - meta / scrape
max(histogram_quantile(1.0, sum(rate(promregator_request_latency_bucket{job="$job"}[1m])) by (le, job )) )
--worst possible scrape duration
max(promregator_scrape_duration_seconds{job="$job"})
--net meta data
abs(sum(max(promregator_scrape_duration_seconds{job="$job"})) - sum(max(histogram_quantile(1.0, sum(rate(promregator_request_latency_bucket{job="$job"}[1m])) by (le, job ))) ) )
Most resilient Promregator/Prometheus config for my situation
---
applications:
- name: promregator-INSTANCE
path: promregator-0.6.2.jar
memory: 1g
instances: 1
env:
JAVA_OPTS: '-Dspring.config.location=file:$PWD/promregator_INSTANCE.yml'
PUBLISH_PORT: 8080
SKIPVERIFY: true
spring.application.name: promregator-INSTANCE
CF_PASSWORD: secret
To get this to apply, you have to add the promregator.yml file to the promregator distribution. Root level. That maps back to the -Dspring.config.location.
server:
port: XXXXX
cf:
api_host: XXX.ap01.XXXXXXX.com
username: xxxxx
connectionPool:
size: 32
request:
timeout:
appInSpace: 22073
org: 22073
appSummary: 22073
cache:
timeout:
application: 20
logging:
level:
org:
cloudfoundry:
promregator: ERROR
promregator:
discovery:
hostname: promregator-INSTANCE.xxx.ap01.xxx.com
port: 443
authenticator:
type: none
metrics:
internal: true
requestLatency: true
scraping:
threads: 14
connectionTimeout: 22073
socketReadTimeout: 22073
maxProcessingTime: 22073
targets:
Prometheus configuration: You can get the promregator_INSTANCE_XXX_discovery.json via a wget to the Promregator discovery endpoint. We found a simple cron that rebuilt the file every 5 minutes to be a good option.
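One possible shape for such a cron job (the paths and the write-then-rename step are illustrative assumptions, not the exact setup used here):

```
# /etc/cron.d/promregator-discovery: refresh the file_sd document every 5 minutes.
# Writing to a temp file and renaming keeps Prometheus from reading a half-written JSON.
*/5 * * * * prometheus wget -q -O /etc/prometheus/discovery/.promregator_tmp.json https://promregator-INSTANCE.xxx.ap01.xxx.com/discovery && mv /etc/prometheus/discovery/.promregator_tmp.json /etc/prometheus/discovery/promregator_INSTANCE_XXX_discovery.json
```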
- job_name: INSTANCE_promregator
honor_timestamps: true
scrape_interval: 30s
scrape_timeout: 25s
scheme: https
file_sd_configs:
- files:
- /etc/prometheus/discovery/promregator_INSTANCE_XXX_discovery.json
refresh_interval: 5m
relabel_configs:
- source_labels: [job]
separator: ;
regex: .*(ap01)_promregator
target_label: foundation
replacement: $1
action: replace
some good news. We figured out the issue. Our UAA was a victim of a "noisy neighbor". We scaled the UAA and restarted the VMs for the existing ones, and latency for our off-prem (an01/ap01) cloud platforms just about matched with the standard configuration
That is really some good news! This sounds to me like Promregator revealed some underlying issue just by testing out SLAs in practice 😄
We did however leave the config in there for the additional threads and pools.
That's okay; if you can spare the necessary resources (CPU + memory) for that, then this is totally fine.
Good to hear about #116 getting some traction. I would be happy to try it out in one of our non prod clouds
Progress on #116 is better than expected: Finding Caffeine really did the trick - Ben (the author of Caffeine) has generically solved many problems that I thought we would stumble into with a cache. I hope that I can provide a release candidate soon for a first test. Locally on my dev machine, the results all look very promising.
This is what I have so far in Grafana.. Please let me know if this is what I'm looking for..
The first two formulas look good to me.
The last one is rather inaccurate, as with it you would attribute all efforts that Promregator is doing for processing the scraping to metadata fetching. Even worse, the formula does not consider parallel scraping (i.e. if two targets are scraped in parallel, you'll get "strange values").
I think you are searching more for promregator_cffetch_latency_bucket. The bad thing is that you then have to deal with the label request_type. For the sake of simplicity, however, you could simply sum up all the les across the request_types first - except those with regex .*_singlePage. This approach is also a little wrong, because it implies that all metadata requests could be fired in parallel - but given that you have such a high cf.connectionPool.size, this assumption is "less invalid" than the previous one.
The resulting histogram then would give an indication of the overall metadata request time.
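That suggestion could look roughly like this in PromQL (hedged: the regex excludes the per-page request types as described; the 0.95 quantile and the 5m window are arbitrary choices):

```promql
histogram_quantile(0.95,
  sum(rate(promregator_cffetch_latency_bucket{request_type!~".*_singlePage"}[5m])) by (le))
```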
cf: request: timeout: appInSpace: 22073 org: 22073 appSummary: 22073
Using odd request timeouts like that is rather arbitrary. Prime numbers for expiry values are much more interesting.
We found a simple cron that built the file every 5 min as a good option.
That's even how it was intended 😄
Coming back to your initial request:
A way to understand the interactions between the promregator, cloud controller, and container scrape duration.
An easily configurable option, through env var or setting in the promregator yml that will allow us to set the necessary timeout interval.
What do you think would be an appropriate extension to Promregator permitting you to achieve your goal easier/faster/more convenient?
@eaglerainbow - thanks for the feedback. I'll try out those metric changes and get them in our panels to see if I can clean things up.
Perhaps both could be accomplished by updating the Promregator Grafana template with metrics from this discussion. I would suggest a cleanup of this post in the documentation as a "fall back to 25s/30s" scrape timeouts and intervals if thresholds are reached.
Maybe even a few posted Prometheus alert queries to let you know "when the bad things happen". To get really fancy, maybe one of the labels is static to a page out here on GitHub. That's how we run our SOPs (those that aren't automated... yet).
Once again, I really do appreciate all of your help. Please do reach out if you need anything from my end.
Just thinking aloud: Given the good progress at #116 (with which this whole scraping latency topic becomes much less critical, as with high latency the old cache data is still kept and not evicted anymore), I am wondering how much effort we should still put into documentation on "fallback to 25s/30s" or an alert query. I am rather convinced that @WilliamLHarms would not have fallen into the issue here if we had had the new cache in place in the first place.
My current idea is to keep this issue here open for a while to permit #116 progress. If we are able to move to the Caffeine cache soon, then we should not invest too much here. If we can't go there soon, then we should reconsider and provide some more documentation.
@eaglerainbow - good news. Running 0.7.3 resolved the majority of our CCAPI timeouts. Thank you for adding the Caffeine cache solution. We rolled out this change to all platforms and we are no longer dropping metrics.
Thanks for the feedback. I assume it's okay to close this issue here for you?
I just wanted to add that this thread helped me figure out a very tricky issue. My UAA (in a dev network) was also slow/overloaded, so the default timeouts didn't work for me. I had to set these properties to get it going. Obviously in a production environment, you'd probably want to fix the underlying issue, but this is helpful to know:
cf:
request:
timeout:
org: 10000
space: 10000
domain: 10000
app: 10000
routeMapping: 10000
sharedDomain: 10000
process: 10000
appSummary: 10000
appInSpace: 10000
@rrileyca Thanks for sharing. And just to explain for others, he is setting the timeout to '10 seconds' in his dev UAA.
Environment
Promregator running from java jar on a Cent OS server. Prometheus is federated. There are 2 Promregator instances per foundation.
Summary / Problem Statement
Our promregator is timing out.
A Pivotal resource is stating we should increase the timeouts to > 2.5s, as the cloud controller timeout is at 30s when using it from the command line.
Our Prometheus admin states that we have too many metrics on the containers and these need to be reduced.
Observed Behavior
Expected Behavior
A way to understand the interactions between the promregator, cloud controller, and container scrape duration.
An easily configurable option, through env var or setting in the promregator yml that will allow us to set the necessary timeout interval.