promregator / promregator

Prometheus Aggregator for Cloud Foundry (Core Component)
Apache License 2.0

Promregator Time outs in logs / Setting appropriate configuration parameter values #121

Open WilliamLHarms opened 4 years ago

WilliamLHarms commented 4 years ago

Environment

Promregator running from a Java jar on a CentOS server. Prometheus is federated. There are 2 Promregator instances per foundation.

Summary / Problem Statement

Our promregator is timing out.

Pivotal resource is stating we should increase the timeouts to > 2.5s as the cloud controller timeout is at 30s when using it from the command line.

Our Prometheus admin states that we have too many metrics on the containers and these need to be reduced.

Observed Behavior

promregator_up by foundation:
{foundation="ap01"} 53
{foundation="vp01"} 102
{foundation="gp01"} 3

Nov 08 16:07:50 xxxx.server.com java[26381]: 2019-11-08 16:07:50.938 ERROR 26381 --- [     parallel-8] .c.p.c.ReactiveCFPaginatedRequestFetcher : Async retrieval of retrieveAllApplicationIdsInSpace with key acd4809e-1182-4fbf-90d4-208067f3ce92|0bd207e5-4c73-4674-ac34-20d3119ec5ca caused a timeout after 2500ms even though we tried three times

Nov 08 16:06:25 xxxx.server.com java[26381]: 2019-11-08 16:06:25.502 ERROR 26381 --- [     parallel-3] o.c.p.s.ReactiveAppInstanceScanner       : retrieving summary for space id 'd708f5a7-ea31-4b59-ac98-17e990d2853f' resulted in an exception

[Grafana screenshots attached]

Expected Behavior

A way to understand the interactions between the promregator, cloud controller, and container scrape duration.

An easily configurable option, through env var or setting in the promregator yml that will allow us to set the necessary timeout interval.

eaglerainbow commented 4 years ago

I'm trying to understand everything I see here. Before starting to dig deeper, I'd like to first understand some things:

eaglerainbow commented 4 years ago

An easily configurable option, through env var or setting in the promregator yml that will allow us to set the necessary timeout interval.

From what I currently see, I'd say Promregator is totally overloaded. We must find the reason for this first - and I doubt that a timeout is the root cause for that...

WilliamLHarms commented 4 years ago

"foundation" is a unique PCF instance on different cloud providers

Yes, 158, and it's just the beginning.

9 orgs

We're using the auto-discovery endpoint (/discovery) with Prometheus. Job config:

- job_name: ap01_promregator
  honor_timestamps: true
  scrape_interval: 15s
  scrape_timeout: 14s
  metrics_path: /metrics
  scheme: http
  file_sd_configs:
  - files:
    - /etc/prometheus/discovery/ap01_discovery.json
    refresh_interval: 5m
  relabel_configs:
  - source_labels: [job]
    separator: ;
    regex: (ap01)_promregator
    target_label: foundation
    replacement: $1
    action: replace

We are not setting cf.threadPool.size nor promregator.scraping.threads

Our next steps were going to be setting up a promregator in each org for each platform. 2 platforms X 2 for prod and non-prod. Then multiply by 2 for failover redundancy. So that's a lot of promregator instances to maintain. So yes, we are looking for better options that do not require us to maintain so many promregators.

Also, the app names typically end in -blue/-green and 1/2 of the apps are typically on while the other 1/2 are off.

eaglerainbow commented 4 years ago

Our next steps were going to be setting up a promregator in each org for each platform. 2 platforms X 2 for prod and non-prod. Then multiply by 2 for failover redundancy. So that's a lot of promregator instances to maintain. So yes, we are looking for better options that do not require us to maintain so many promregators.

I'd suggest first understanding what's up with the current Promregator. I don't know what it is right now, but I have the impression that we are still missing one vital point here.

Also, the app names typically end in -blue/-green and 1/2 of the apps are typically on while the other 1/2 are off.

That should be the least of the problem, but may cause you delay when you b/g deploy. But that's on a totally different sheet of paper.

I am still trying to get a picture on what is going on in your case. You could help me a little by providing the values/graphs of the metrics:

Unfortunately, the three latter metrics are only available when promregator.metrics.internal is set to true (see also https://github.com/promregator/promregator/blob/3c33ddb632c91ee1b3af0eb836ae2b6d9fa101f7/docs/config.md#option-promregatormetricsinternal-optional). Is it possible to set this option in your environment?
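For reference, a minimal snippet enabling that option would look like this (a sketch only; the surrounding keys follow the usual Promregator config layout that also shows up later in this thread):

promregator:
  metrics:
    internal: true    # exposes Promregator's additional internal metrics for analysis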

PS: It's in the middle of the night for me now. I'll have a look at this case tomorrow again.

eaglerainbow commented 4 years ago

If you want to experiment a little, here are some thoughts:

WilliamLHarms commented 4 years ago

@eaglerainbow - thanks for the tips. Very useful. I'll add all but the promregator.metrics.internal to our non-prod env; if it seems to have an impact, I'll add it to prod. Typically, through the day when the devs are hard at work running builds / cf commands, we have issues with timeouts. I'll turn on that property in production then.

Get some sleep and thank you for the assist.

eaglerainbow commented 4 years ago

Just for my better understanding: are you scraping the same set of applications in "non-prod" as in the "prod" environment?

Typically, through the day when the devs are hard at work running builds / cf commands, we have issues with timeouts.

That might or might not be a coincidence. Based on your graphs I would more assume that the problem is building up over time (see also JVM threads), due to the overload situation on the threads: The old threads are still blocked working (though the scraping interval for which they are running is already way gone) and hence are not available for the new scraping interval. The next scraping interval then has even less resources, which provokes even further issues. I have seen this pattern already too often. However, there is no simple cure for that (or at least I haven't seen it yet).

I'll turn on that property in production then.

If you are referring to promregator.metrics.internal: this property does not change Promregator's scraping behavior as such. It just triggers Promregator to fetch additional internal metrics, which helps with analyzing the situation. The reason the switch is off by default is that the number of metrics gets huge (so the data volume is the problem). Hence, if you want to play it safe, I would suggest enabling it in non-prod first, and if you see that you can live with the data volume, you may also want to enable it in the prod env.

eaglerainbow commented 4 years ago

Analyzing the graphs that you have sent above, here is what I get (in the sense of "thinking aloud"):

What puzzles me right now are the following observations:

eaglerainbow commented 4 years ago

Pivotal resource is stating we should increase the timeouts to > 2.5s as the cloud controller timeout is at 30s when using it from the command line.

Requests from CF CLI are not comparable to CF CC API calls from Promregator. The CF CLI is intended for "single manual use", whilst Promregator is under permanent fire (with a cadence of 15s in your case). Requests to Promregator cannot be completed, if certain data from CF CC is not available in time. Setting CF CC timeout values in Promregator above the duration of the scraping interval is a very tricky thing and could cause major consistency issues in its caches. Promregator tries to be as efficient as possible to only ask for information which should be rather easy to determine. Unfortunately, there is rather a lot of by-catch that is rather expensive to generate. The client has no possibility to indicate to the CF CC that certain values do not matter and hence do not need to be generated.

Our Prometheus admin states that we have too many metrics on the containers and these need to be reduced.

Out of the data provided so far, I don't see any evidence for this statement. In contrast, the scrape request sizes look rather common to me (I don't want to claim anything on the load on their Prometheus' instances - so my statement is restricted to Promregator only here). However, there could be a point in that, by noticing that some app containers are rather slow in answering the scraping requests (see statement on foundation "an01" and "vn01" above). If a reduction of metrics on these containers allowed accelerated scraping by Promregator, then this would simplify the entire scraping process.

WilliamLHarms commented 4 years ago

@eaglerainbow, thank you kindly for the great support in answering my questions.

To clarify: ap01, vp01, an01, vn01 are PCF platforms/foundations. P for prod, n for non-prod.

Also, I made all the property changes you suggested to our an01 environment; moving forward, all data and graphs will be based on an01 only.

cf:
  api_host: ${promregator.url}
  username: ${promregator.username}
  password: ${promregator.password}
  connectionPool:
    size: 32
  request:
    timeout:
      appInSpace: 3000
logging:
  level:
    org:
      cloudfoundry:
        promregator: ERROR
promregator:
  metrics:
    internal: true
  scraping:
    threads: 14

We are still seeing drops in container count (around 3:15 in the panels below). While I would expect some random variation, I highly doubt we're getting devs shutting off apps.

--To your first set of questions--

We have significantly fewer containers running in the an01 foundation than in the ap01 foundation and are seeing the same problem as in ap01. The problem does materialize in the an01 env. Moving forward I'll be sharing data for an01, for simplicity.

That's an interesting point about the thread contention: the old thread is blocked because it's still scraping while the next thread is starting up.

I have turned promregator.metrics.internal on in the an01 environment and built Grafana panels; their screen caps are below.

---- second post ---

Question 1: 50s scrape duration:
Yes, I agree, this does allude to your earlier point about JVM thread contention. For some perspective on the "worst app" from my Prometheus admin, here is a curl of the app's Prometheus scrape from the promregator server itself (just getting a random container picked by the go router):

an01 - 126 lines
    time_namelookup:  0.006
       time_connect:  0.093
    time_appconnect:  0.410
   time_pretransfer:  0.410
      time_redirect:  0.000
 time_starttransfer:  0.548
                    ----------
         time_total:  0.636

ap01 - 1607 lines
    time_namelookup:  0.006
       time_connect:  0.093
    time_appconnect:  0.402
   time_pretransfer:  0.402
      time_redirect:  0.000
 time_starttransfer:  0.501
                    ----------
         time_total:  0.764

The native scrape is very fast. Is there a way that I can isolate the scrape duration from the call to the Cloud Controller via promregator internal prometheus metrics? Isolate that time and the time it takes to ingest the scrapes from the containers? It would help out a lot to know where the problem lies.

Also, when I ask for more cloud controller workers, the PCF admin says: "you're throwing 499s, increase your timeout."

Question 2: Scrape request size: Yeah, there are times under heavy usage when the custom metrics of one of the apps (which tracks the response times of the methods that massage external dependencies) grow. See the flux in line count between prod and non-prod above.

Question 3: See below for promregator.metrics.internal grafana panels

Question 4: seems like that on these landscapes we have some poorly performing implementations of the /metrics endpoint!

Oh boy, yeah, that's my Prometheus admin channeling his rage through you :) Honestly, we are using the Micrometer libs in Spring, and have switched to quantiles from histograms. There's some Steeltoe in there as well, with the Prometheus libs on top of that. Let me know if you want specific NuGet or Spring versions.

I set the promregator scraping threads to your suggestion of 14, should I go higher?

Question 5: MORE THREADS!!!!

Yes, we have a very big cloud foundry footprint and have only started rolling product teams onto the idea of using more white box metrics with Micrometer.

Everything works swell except our promregator. We know that at some point, because of the magnitude, we will need multiple promregators per platform, but we don't know the trade-offs. I really appreciate your write-up here on the implications of the cf.connectionPool.size. I'll check with our Platform team to make sure they can scale the cloud controller workers as we light up more teams. But one thing that stands out, the 2.5s 499 response from the Promregator makes them very grumpy. Can that be tweaked? What are the implications to our scrape interval, etc? Can we just double everything?

Question 6: It is unclear whether this is really latency introduced by the CF Cloud Controller (waiting for the response over the network) or whether we even did not ask the CF CC yet due to a) missing connection pool capacity (see above), b) due to underlying JVM overload situation

I'll continue to go down the route of 1 promregator for the big orgs on each platform. Maybe we can isolate the bigger orgs from the smaller. I'll have a PoC of this in production using the latest promregator you pushed with the JSON logs on Monday.

Also, we have some other APM tools that I'll get on the promregator box to see what's what.

Would a thread dump when the promregator is jenky help? Also, logs? (But those would have to be shared privately).

--Puzzlement Questions-- Q1: promregator.scraping.threads is set to 14 now.. New panels below.

Q2: new panels only show an01

Q3: What does the metrics_path: /metrics in your prometheus configuration do there? I think that's just noise; when I check our prometheus targets under Service Discovery, we are using /singleTargetMetrics and the scrapes contain data relevant to only one app, all containers within that app. I almost always see context deadline exceeded in the targets with state down. (Is it as simple as adjusting our scrape duration and interval up?)

--Last post-- CF CLI / CF CCAPI: What would you like to see our Prometheus scrape intervals set to, given what you see?

Simplifying container metrics: That does align with the Prometheus admin's comments. To that end I've taken the worst-performing ingest and factored out the histograms. While I'll miss the heatmaps, I need something rather than gaps in Grafana and our "absent" alerts firing and waking me up at night :)

Charts and Graphs - Please note that all property file changes were applied by 23:30. [screenshots attached]


eaglerainbow commented 4 years ago

Thank you for cleaning everything up - and testing my suggestions! At first glance, the graphs already look significantly better. Apparently we still have the same problem. As you were able to reduce the scope (while still keeping the issue), that makes life much easier for us, for sure!

I have only limited time right now, but here are the "low-hanging fruits": 😁

Is there a way that I can isolate the scrape duration from the call to the Cloud Controller via promregator internal prometheus metrics? Isolate that time and the time it takes to ingest the scrapes from the containers? It would help out a lot to know where the problem lies.

I think what you are searching for is promregator_request_latency: That's the time Promregator needs to open a connection to the ha_proxy, get the connection to the scraping endpoint, reading the response from there (whether failed or not) and closing the connection. Any processing of the data received or any time necessary for fetching other kinds of metadata is not included.

I set the promregator scraping threads to your suggestion of 14, should I go higher?

Currently, I don't see any issue with thread congestion on the pool. With the heap memory free that you have, you can safely do this. (In general, increasing that pool value is perfectly fine, as long as you don't overshoot memory usage. If you over-do it, nothing severe will happen immediately - the resources offered just won't be used; however, you may run into issues if you over-commit and the thread resources do get used while your server is not able to stand the load.)

However, I still have a little issue with understanding the scenario:

But one thing that stands out, the 2.5s 499 response from the Promregator makes them very grumpy.

I can understand 😁 Promregator is also "grumpy" about not getting the data in time 😈 . It would be even worse for them if we did not 499 them but just went on - then over time we would also fill up their inbound HTTP workers' thread queue with requests which have already died (and which no one is listening for anymore). So, yes, 499 isn't nice, but still the better choice. πŸ€“

I have the impression that in the meantime we have moved away from the initial scraping-thread problem and are turning more toward the metadata issue.

Can that be tweaked? What are the implications to our scrape interval, etc? Can we just double everything?

In essence, yes. If you have a 30s scraping interval, then you can also spare twice as much time for reading metadata. In a nutshell it is this: Promregator may need up to 4 metadata requests to the CF CC. Metadata is cached, so in many cases you are just fine with it. But once in a while you may have bad luck and all four requests have a cache miss. If the sum of the latency of all your metadata requests plus the time required for a single scrape request to the container is longer than your Prometheus scraping interval, you have a severe situation: threads will overtake themselves, as Prometheus will already have fired the next request to Promregator even before the "paperwork has finished". If you have too many targets (a.k.a. containers) to be scraped, you will run into thread pooling issues, as the old threads aren't freed up yet. (It's a little more complicated than this, but that's the major risk here.)
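As a rough back-of-the-envelope illustration of that budget, using the 15s Prometheus scrape interval and the cf.request.timeout.appInSpace value of 3000 ms from the configs in this thread (worst case, all four metadata requests miss the cache):

  4 metadata requests x 3000 ms timeout each                    = up to 12.0 s
  + 1 container scrape (~0.6-0.8 s per the curl timings above)  = ~12.6-12.8 s total

That still fits into the 15 s scrape interval, but with almost no headroom; any additional CFCC latency pushes past the interval and threads start piling up.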

I'll have to get going now; I will have another closer look at the new graphs again (and all your other statements), but that'll be quite some time later...

eaglerainbow commented 4 years ago

The native scrape is very fast.

I agree with this assessment.

It would help out a lot to know where the problem lies.

Given your figures, I currently would assume that we have fewer issues on "an01" with scraping the metrics themselves, and more with fetching the metadata.

Also, when I ask for more cloud controller workers, the PCF admin says: "you're throwing 499s, increase your timeout."

From their perspective this is very logical. From Promregator's PoV this is a non-trivial thing.

Honestly, we are using the Micrometer libs in Spring, and have switched to quantiles from histograms. There's some Steeltoe in there as well, with the Prometheus libs on top of that. Let me know if you want specific NuGet or Spring versions.

There is nothing bad about using Micrometer in Spring, even with histograms. The registry implementation there usually performs very well if you store the values in the registry. However, if you use the callback logic (there is one approach in each of the libraries, Micrometer and the Prometheus client lib), then you can make big mistakes - calling just another getter method (such as "getSize()") certainly isn't one, but you could also place logic with significant algorithmic complexity there, which would then drive up the total cost for Promregator/Prometheus.

I set the promregator scraping threads to your suggestion of 14, should I go higher?

As mentioned above, you may do so. Just keep an eye on your JVM heap memory so that it does not skyrocket. On the other hand, in the current scenario (assuming that the Promregator instance is responsible for an01 only), I don't think that it will provide any further positive result.

Yes, we have a very big cloud foundry footprint and have only started rolling product teams onto the idea of using more white box metrics with Micrometer.

πŸ‘

Everything works swell except our promregator

That's bad :cry:

But one thing that stands out, the 2.5s 499 response from the Promregator makes them very grumpy. Can that be tweaked?

All CF Cloud Controller timeouts can be tweaked in Promregator. This can be done using the parameters cf.request.timeout.*. The property of most relevance for you is most likely cf.request.timeout.appInSpace. Theoretically, however, the same problem may also arise in all other request cases (all the other cf.request.timeout.* parameters). From the right-hand side of your graph at https://user-images.githubusercontent.com/37589680/68530613-c2f0b300-02d7-11ea-945e-0f887f7ad938.png you can see that the new value of 3000 ms "sometimes" is enough to get the necessary metadata from the CFCC. Roughly 50% of the time, the CFCC can respond much faster - every other request, however, is slower. This currently creates the impression to me that there might be two (sets of) instances - one running on "slower" hardware, the other on faster hardware. If your request lands on the faster one, we can keep the timeout; if it is handled by the slower one, we have a situation. That might be something worth asking your CF admins: whether they could check how uniformly computing power is available across their CFCC instances.

I would be curious to know what the "required timeout value" would be. I'd suggest running two experiments to achieve this:

Look at promregator_cffetch_latency_bucket for the allAppsInSpace single-page metric (see the query sketch below). What's the 98th percentile's value there? That would be the value you would have to enter as cf.request.timeout.appInSpace. Is this value realistic? Can scraping still work without causing thread contention?
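A sketch of such a query; the exact request_type label value should be taken from what your Promregator instance actually exposes, and the regex below is only a guess at matching the allApps-related series:

# 98th percentile of CFCC metadata fetch latency for the allApps-related request_type(s)
histogram_quantile(0.98,
  sum(rate(promregator_cffetch_latency_bucket{request_type=~"allApps.*"}[5m])) by (le)
)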

There is a second approach how we can derive that value (and hence triangulate it):

Run

$ time cf curl /v2/apps?q:space_guid=xyz

several times manually, where xyz should be replaced by one of the space GUIDs (kindly let me know in case you don't know how to get it). (Note: the request sent by Promregator is much more complex than this, but with this we should get a rather good first estimate.) For instance, with a space containing two apps, I currently get this:

real    0m0,627s
user    0m0,000s
sys     0m0,015s

and that's from my home...
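To get a feel for the spread rather than a single sample, a small shell loop along these lines can be used (xyz is the same placeholder space GUID as in the command above):

for i in 1 2 3 4 5; do
  # the timing output (real/user/sys) goes to stderr and stays visible
  time cf curl "/v2/apps?q:space_guid=xyz" > /dev/null
done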

I'll have a PoC of this in production using the latest promregator you pushed with the JSON logs on Monday.

I am interested in hearing your results on this.

Also, we have some other APM tools that I'll get on the promregator box to see what's what.

Might be helpful, but I don't expect much more insight than we have already today.

Would a thread dump when the promregator is jenky help?

I don't think so: we already know where the madness is mainly coming from. But we should keep this in the back of our minds (for instance, on this "large number of threads" thing, it could eventually become helpful).

Also, logs?

Those might help a little more, as they might give more insight into when which issue (and which metadata request) starts to fail.

(But those would have to be shared privately).

Sure. There is even a policy and a procedure for that case. For details, you might want to give https://github.com/promregator/promregator/blob/5ddd47583f5aeac7e0b80d76de83ef117c815898/docs/analysis-material.md a read.

I think that's just noise; when I check our prometheus targets under Service Discovery, we are using /singleTargetMetrics and the scrapes contain data relevant to only one app, all containers within that app.

Then I feel reassured.

CF CLI / CF CCAPI: What would you like to see our Prometheus scrape intervals set to, given what you see?

I currently don't know. That is why I have suggested above determining the timeout value that the CFCC effectively imposes on us in practice. Let's determine this value and then let's talk about its implications. One idea: perhaps we can reduce the outer timeout values, which would then buy us time for the appsInSpace topic.

That does align with the Prometheus admin's comments. To that end I've taken the worst-performing ingest and factored out the histograms. While I'll miss the heatmaps, I need something rather than gaps in Grafana and our "absent" alerts firing and waking me up at night :)

I would rather not lose hope on that yet. First, I'd like to see a graph of promregator_request_latency against promregator_scrape_duration. With all the computing resources that you are willing to throw at Promregator, scraping request latency does not seem to be our major concern (and that would be the only thing you could influence by having the histograms removed).

-- Some Corollaries:

eaglerainbow commented 4 years ago

Side remark: If we had #116 available, this could potentially also help here in this case.

WilliamLHarms commented 4 years ago

Thanks for the tips. :) After letting the new config run for a while, I'm not seeing failures on the promregator side; the issue has now moved to the Prometheus scrape interval. Given what I've learned from this thread (thank you):

These errors seem to rotate across different apps/containers, always at 10s, even on apps that I know have lightning-fast scrapes and only 2 containers.

/singleTargetMetrics/13f26934-e03f-4673-88e7-448466e71d65/2: context deadline exceeded
/singleTargetMetrics/b24aa087-b6d4-4120-80c9-15705344dbd7/0: context deadline exceeded

Is this because cf.request.timeout.appInSpace was increased and in turn pushed the Prometheus scrape to time out? Maybe set the Prometheus config to 30s scrape intervals? Do you concur, Dr. Eagle?

Also, for our prod env ap01, I think it's safe to update the pools, threads, and I'll add the promregator metrics to the mix. Just to help monitor the situation.

I'll follow up on our NP CFCC and CCAPI hardware. Honestly, since this is all running on a large cloud provider, I think we're on the same hardware. We may just have different quantities of CFCCs and CCAPI workers - more in prod and fewer in NP. In NP the scrape durations are sub-second. Then during the week, they go up to 7/8/9 then start to bomb out.

We have some apps that we scale out to 200+ containers at certain times of the year.. What are your opinions on CCAPI and Promregator sizing given the scope?

Also, regarding your point about the cache ticket #116, if you build it, I'll test it.

--- Experiment --- Update the appInSpace timeout to 30000 and the cache timeout to 30, with the threading and connection pools as previously discussed.

See how Prometheus reacts. Dr. Eagle, can you share your hypothesis and expectations for the next test we try?

cf:
  api_host: ${promregator.url}
  username: ${promregator.username}
  password: ${promregator.password}
  connectionPool:
    size: 32
  request:
    timeout:
      appInSpace: 30000
  cache:
    timeout:
      application: 30
logging:
  level:
    org:
      cloudfoundry:
        promregator: ERROR
promregator:
  metrics:
    internal: true
  scraping:
    threads: 14
  authenticator:
    type: none

Results:


real    0m4.344s
user    0m0.113s
sys     0m0.031s

real    0m4.312s
user    0m0.137s
sys     0m0.034s

real    0m3.884s
user    0m0.115s
sys     0m0.032s

real    0m4.061s
user    0m0.115s
sys     0m0.028s

real    0m4.492s
user    0m0.110s
sys     0m0.036s

real    0m4.156s
user    0m0.114s
sys     0m0.033s

[screenshots of the results attached]

eaglerainbow commented 4 years ago

These errors seem to rotate across different apps/containers, always at 10s, even on apps that I know have lightning-fast scrapes and only 2 containers.

That is hardly surprising: the majority of the latency is induced by the requests to the CFCC. In most cases the values are cached. It remains random which container is being requested when the cache expires (and a new request to the CFCC is triggered). Until the CFCC request is completed, no scraping of the container can take place (side remark: changing the latter is exactly the idea of #116).

Is this because cf.request.timeout.appInSpace was increased and in turn pushed the Prometheus scrape to time out? Maybe set the Prometheus config to 30s scrape intervals? Do you concur, Dr. Eagle?

Yes, the error message produced by Prometheus tells exactly this: we did not get a response from the Promregator endpoint in time, i.e. before the next scraping request (to the same target) was initiated (or scrape_timeout was reached). Prometheus feared that requests would be overtaking each other. If you increase the scrape interval (at Prometheus), you'd get rid of the error message, but this comes at the price of data accuracy. However, if this is regularly the case, I currently don't see much of a chance for anything other than a high scrape interval. The only alternative that I see would be this:

Having done so, you may then "pray" that we always "concurrently" ask for only "one or two" responses from the CFCC, and we'll practically never have the situation that we send all four requests (besides perhaps once when Promregator starts initially). Even if that situation occurred, the worst thing to happen would be a single synchronous missed data point in all metrics (it may happen in multiple application containers in parallel).

What do you think about this approach?

BTW No. 1: In the graph "Promregator latency by foundation" we always see that the maximal latency is below 10s. If the scrape interval was 15s, then I would assume that performance was good enough. So it's a little surprising to me that we see error messages from Prometheus like context deadline exceeded. :confused: Or have you set scrape_timeout to 10 (at Prometheus) in the meantime (initially you mentioned 14s)?

BTW No. 2: I am a little confused about the graph "Promregator latency by Foundation" in general: what is so different in vn01 that it (apparently) makes the metadata requests so much faster that promregator_scrape_duration_seconds is halved? :confused: Can we see promregator_request_latency in parallel to that (perhaps not maxed individually, but their average or so) -- to prove that the difference comes from fetching metadata? (A query sketch is below.)
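A sketch of such an "average request latency" panel, assuming the usual Prometheus histogram convention that a promregator_request_latency_sum series exists next to the _bucket/_count series mentioned in this thread:

# average scraping-request latency as seen by Promregator, for the selected job
  sum(rate(promregator_request_latency_sum{job="$job"}[5m]))
/
  sum(rate(promregator_request_latency_count{job="$job"}[5m]))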

Then during the week, they go up to 7/8/9 then start to bomb out.

Indeed, this could be an indicator that NP does not have enough hardware resources provisioned to keep up with the load during the week. This makes setting appropriate timeouts in Promregator a nightmare. You can also go into this situation optimistically by just setting the timeouts high enough - but then you risk the "context deadline exceeded" messages again: Prometheus is requesting data faster than Promregator can fetch the metadata to deliver it. There is little way out of this.

We have some apps that we scale out to 200+ containers at certain times of the year.. What are your opinions on CCAPI and Promregator sizing given the scope?

Disclaimer: I have too little operations/sizing knowledge about the CFCC to give any sizing suggestions.

In general, containers of the same app are less problematic than having "many apps" in a space: CFCC will just tell us "there are 200 containers for the same app" - Promregator will unfold it internally only. So CFCC can be rather relaxed (it's always only "one line" for CFCC). Promregator has the more critical case here: for your "200 instances per app" situation, there would also be 200 /singleTargetMetrics lines in the discovery document. Prometheus would then start to scrape with 200 targets in parallel, creating 200 HTTP requests to Promregator within its scrape interval (I assume 15s). This means ~14 connections per second. The vast majority of metadata for these requests can be fetched efficiently from the cache, so we "only" need to forward the connection requests properly to the containers. However, this means that you must have enough promregator.scraping.threads configured (in this case, assuming that the average latency to scrape the metrics from the containers would be around 1s, at least 14; a config sketch follows after the NB below). Note that these 14 would be necessary for these 200 containers alone, i.e. if you also have other containers to be scraped at the same time, you have to increase the promregator.scraping.threads accordingly. As already mentioned some time before, this means that you will have more CPU and memory consumption on the Promregator host. There is no way of determining upfront what this means in GBs - for instance it also depends on the data volume the metrics endpoints provide (as Promregator has to decode and repackage all metrics on the fly). Therefore, my suggestion would be to simply test it out, looking at the JVM heap (and scale up linearly, if applicable).

NB: If you get more than 16 connections to Promregator in parallel, you might also need to increase the spring property server.tomcat.max-threads (from 16 to some higher value, see also https://github.com/promregator/promregator/blob/5ddd47583f5aeac7e0b80d76de83ef117c815898/docs/config.md#changing-the-number-of-http-worker-threads). Otherwise inbound HTTP requests from Prometheus would start to get queued by Tomcat.
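A combined sketch of the two knobs from the paragraphs above; the numbers are illustrative only and need to be sized to your actual number of parallel scrape targets:

promregator:
  scraping:
    threads: 14            # ~200 targets over a 15s interval at ~1s per scrape
server:
  tomcat:
    max-threads: 32        # hypothetical value; raise once >16 scrape connections arrive in parallel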

Also, regarding your point about the cache ticket #116, if you build it, I'll test it.

Thanks. But building it is exactly the critical thing here. All this caching stuff in Promregator is rather complicated and you can easily shoot yourself in the foot if you forget to consider some race condition that may happen (and you know what Murphy's Law says about this). Hence, please don't expect anything soon. I'd still consider this topic to be in the "research" stage so far.

See how Prometheus reacts. Dr. Eagle, can you share your hypothesis and expectations for the next test we try?

Delivering it after the fact: Hypothesis: we might run into "context deadline exceeded". Some metrics will be dropped (either by Promregator or Prometheus). Promregator will keep running for 5-10 minutes. I would not dare to run it for too long, though, as you might get threads overtaking each other - and that would cause a big mess.

Expectation: we get to know the realistic value of how long the CFCC requires to answer our requests. With that we can determine "reasonable" timeout values for Promregator. We need (say) the 98th percentiles of all the different types of CFCC requests that we send. Once we have all of them determined, we can check whether it is reasonable to stay on a scrape interval of 15s. If not, we know it's impossible to keep the pace due to what the platform delivers to us. We then might have to increase it to 20s or even 30s.

Does this make sense?

-- On the results:

We had even more context deadline exceeded errors from Prometheus, almost 1/2 of the targets

That's not really surprising. I did not intend to look at what Prometheus is able to retrieve; the purpose was only to get the "reasonable timeout values" determined.

(10s is the config)

What do you mean by that?

real response from the curl, from the same box hosting the promregator that's reaching out to an01: 0m3.884s / real 0m4.492s

That means that our cf.request.timeout.appInSpace should be around 4500-4800 or so. Strange that the graphs show slightly lower values (4000).

when I curl the applications /actuator/prometheus (just getting a random container from the go router), I get the data < 1s.

That's clear: You are measuring promregator_request_latency here, but you (apparently) try to compare it against promregator_scrape_duration, which mainly is "promregator_request_latency + metadata retrieval time + some_really_small_overhead".

While with the promregator sometimes prometheus hits the 10s wall.

Yes, that's (to all what we currently know) due to metadata retrieval time. You see that only "sometimes", because usually the metadata cache of Promregator is already filled (and valid). Only if some cache is not filled/must be refreshed, you get the peaks.

Next steps to my mind here would be:

I hope that all my chatting above (it's a lot of text obviously) helps you to get a better understanding of the situation!

WilliamLHarms commented 4 years ago

@eaglerainbow - Thank you for taking the time to answer all my questions, bearing through the large amount of data I have been throwing at you. I really do appreciate the help and hope to help return the favor one day.

I'm using the following configuration in our ap01 (prod environment, running in the prod space of the same foundation). Also, we moved Prometheus to a 25s scrape timeout with a 30s scrape interval.

Our platform engineers, who called me out for calling them "resources", are taking a look at why the performance is so bad on the a01 platforms and so good on the v01 platforms. I am hoping they will find the problem (I wrote up a decent set of steps to take to troubleshoot this). If I get to a shareable RCA, I will do so. They were very surprised to see the metrics regarding latency from within the platform and the details from the test below.

Regarding the graphs, one warning: I couldn't find the metric for metadata retrieval time. Tiles and Prometheus queries are below.

I came to these settings by doing the test below to get the basic timings, then started increasing the timeouts to be above the average (15s). The promregator.scraping timeouts that I have set seemed to get rid of the promregator errors that were being generated. But yeah, our a*01 envs' cloud controller API is very latent.

cf:
  api_host: secret_api_host
  username: promlogreader
  connectionPool:
    size: 32
  request:
    timeout:
      appInSpace: 15000
      org: 15000
      appSummary: 15000
  cache:
    timeout:
      application: 10

logging:
  level:
    org:
      cloudfoundry:
        promregator: ERROR

promregator:
  discovery:
    hostname: secret-hostname-for-discover
    port: 443
  authenticator:
    type: none
  metrics:
    internal: true
    requestLatency: true
  scraping:
    threads: 14
    connectionTimeout: 10000
    socketReadTimeout: 10000

The Test

time nc -zw30 api.sys.ap01.somePlace.com 443
time nc -zw30 api.sys.ap01.somePlace.com 443

Both had sub-second responses.

Next, isolate how long it takes to do a curl to pull details from the org. Do a cf login to each platform then:

cf org --guid the-org-name
time cf curl /v2/apps?q:space_guid=secretGuid

In vp01 it's close to the time nc result. In ap01, it's 10-20s.

Our Platform Engineers are going to take a deeper dive into why.

Charts And Graphs On The New Config: [screenshots attached]

1.0 - promregator_request_latency_bucket
max(histogram_quantile(1.0, sum(rate(promregator_request_latency_bucket{job="$job"}[1m])) by (le, job )) )

max - promregator_scrape_duration_seconds
max(promregator_scrape_duration_seconds{job="$job"})

MAX-promregator_scrape_duration_seconds -   1.0 promregator_request_latency_bucket =  meta_data_calculation 
sum(max(promregator_scrape_duration_seconds{job="$job"})) - sum(max(histogram_quantile(1.0, sum(rate(promregator_request_latency_bucket{job="$job"}[1m])) by (le, job ))) )  

[screenshot attached]

^^ That last blip is when I did a cf push with the connection and socket read timeouts increased to 10s from 5s. I was getting a lot of errors without that increase.

WilliamLHarms commented 4 years ago

@eaglerainbow - one follow-up question. I'm having a really hard time with the Prometheus metrics for Promregator itself. Can you help me narrow down how I can obtain:

I am trying to isolate the cost of each call within Promregator to provide a scrape to prometheus.

WilliamLHarms commented 4 years ago

Also, to scale, should I increase containers or add more unique application instances?

eaglerainbow commented 4 years ago

time cf curl /v2/apps?q:space_guid=secretGuid In vp01 it's close to time nc command. In ap01, it's 10-20s

Wahoo! πŸ™„ That's really a lot!

Our Platform Engineers are going to take a deeper dive into why.

Sounds like a good plan to me. Either the CFCC containers are hopelessly overloaded or perhaps the backing database has degraded. In any case the performance is too sluggish for doing something such time-critical as scraping.

If I get to a shareable RCA, I will do so.

Would be really great to have that here as a reference, because I wouldn't be too surprised if similar things also happened to other Promregator users.

I'm having a really hard time with the Prometheus metrics for Promregator itself.

Yeah, I do understand; also the graph's configuration above is a little strange. In general, I'd propose giving https://github.com/promregator/promregator/blob/5ddd47583f5aeac7e0b80d76de83ef117c815898/docs/enrichment.md#additional-metrics-measuring-the-communication-to-the-targets a read. However, not all metrics are mentioned there (some of the metrics we are using here are not documented officially). To get these "additional" metrics you have to enable promregator.metrics.internal. So, let me try to explain a little:

Duration and volume of calls to the CF API

Promregator performs several calls to the CF API. All the performance information on that one is stored in the metrics promregator_cffetch_latency_bucket, which is only visible with promregator.metrics.internal enabled. The number of requests sent can be found at promregator_cffetch_latency_count.

Each metric promregator_cffetch_latency_* has a label called request_type which describes the type of CF API call that is performed. For example: request_type="org" is fetching the CF org data (mapping of org names to org Ids) - see also https://github.com/promregator/promregator/blob/5ddd47583f5aeac7e0b80d76de83ef117c815898/src/main/java/org/cloudfoundry/promregator/cfaccessor/ReactiveCFAccessorImpl.java#L228

Note that the different request_types have dependencies amongst each other: For example, you can only retrieve space data, if you have read the org data in which those spaces are located (otherwise you wouldn't know for which spaces you should be querying, as "reading everything" is a Bad Ideaℒ️). That is also why one scraping request may trigger multiple sequential metadata requests to the CFCC and all of them need to be completed before you may send the scraping request (and if the metadata requests are too slow, ... well - you get me πŸ˜ƒ).

There are two special request_types out there: space and allApps (maybe even more in the future). They split up in two request_types in fact: space + space_singlePage and allApps + allApps_singlePage. Background to this is that it may happen that not all data necessary can be fetched with one page (i.e. with one API call), but data must be read in a paged manner. The *_singlePage request_type provides the time for retrieving each page, while the request_type without the suffix documents the "end-2-end time" for all pages (combined).

Timeouts for these requests are controlled by the configuration options cf.request.timeout.*. Unfortunately the names of the CF API calls do not 100% match the configuration options (this has historic reasons, plus I don't want you to get exposed to all the internal complexity - the metrics are anyhow a "geek tool", so you need to know what you are doing...).

Just for the sake of completeness of this post, there are also the promregator_autorefreshingcachemap_* metrics, which deal with the caches of the various request_types. For example, we can see there how many cache hits and misses we had, when the last cleanup scan took place, whether we had broken cache records, how long scanning took, and how large the cache is (that is how I can determine from the metrics how many orgs/spaces/apps you have on the platform).

Duration and volume of calls to the scrape pages of each application instance

The latency (duration) for scraping at the containers (application instances) is available at promregator_request_latency_bucket.

The number of scraping requests sent to the containers is available at promregator_request_latency_count.

(Both metric values are only available, if you enable promregator.metrics.requestLatency. Warning! Be aware of data volume, as the granularity is "per application instance"! Don't have it enabled with 100 containers, or Prometheus might take it amiss.)

The data volume returned by the containers when being scraped is available at promregator_request_size_bucket.

Where the timeout is set in Promregator for the cffetch_latency_bucket I would like to raise it above 5s and see if that lowers the number of promregator_request_failure (I'm assuming that counter is a generic failure at any place in the flow of providing metrics)

Well, not really πŸ˜€ promregator_request_failure is used to indicate whether the scraping from a container failed (or the necessary post-processing of the metrics provided by the scraping target was not possible). See also https://github.com/promregator/promregator/blob/5ddd47583f5aeac7e0b80d76de83ef117c815898/src/main/java/org/cloudfoundry/promregator/fetcher/CFMetricsFetcher.java#L179 on this. By that time, the call to the CFCC has already happened long ago - so these two topics should be rather uncorrelated.

If you run into a timeout situation and get promregator_request_failure logged, something went wrong with the containers. Then, my next look would be into Promregator's logs, because each failure should raise a log warning there (usually by the logger named org.cloudfoundry.promregator.fetcher.CFMetricsFetcher).

The timeout to wait for the scraping results of containers is controlled by configuration option promregator.scraping.maxProcessingTime (and partially also with promregator.scraping.connectionTimeout and promregator.scraping.socketReadTimeout).

Where the timeout is set in Promregator for the cffetch_latency_bucket

As said above, promregator_cffetch_latency.* and cf.request.timeout.* belong together.

I am trying to isolate the cost of each call within Promregator to provide a scrape to prometheus.

That you can find in the metric promregator_scrape_duration_seconds: it's the latest value of the "Prometheus-perceived latency" that Promregator produces. Being an end-to-end metric, it includes everything:

I had briefly thought about making this a histogram, but no-one really had asked for that so far. And as in Single Target Scraping mode the metric is also on "per application instance/container" level, a histogram would have blown up Prometheus again.

Also, to scale, should I increase containers or add more unique application instances?

Depends on where you want to create pressure in Promregator:

(Stressing only the metadata fetching cannot be achieved that way.)

Please let me know in case you still have any issues in understanding "what is what" - this stuff really is tricky (some might even call it "wicked").

eaglerainbow commented 4 years ago

@WilliamLHarms just fyi: there is some movement at #116

WilliamLHarms commented 4 years ago

@eaglerainbow - some good news. We figured out the issue. Our UAA was a victim of a "noisy neighbor". We scaled the UAA and restarted the VMs for the existing ones, and latency for our off-prem (an01/ap01) cloud platforms just about matched the standard configuration:

We did however leave the config in there for the additional threads and pools.

cf.connectionPool.size: 32
promregator.scraping.threads: 14

I put a quick write up of our final config yamls to get us through the "dark times" below.

Good to hear about #116 getting some traction. I would be happy to try it out in one of our non-prod clouds. (Still working on the rollouts for the JSON-enabled promregator.)

One final question: I'm trying to work out a simple Grafana panel, using Promregator debug metrics, that shows:

This is what I have so far in Grafana. Please let me know if this is what I'm looking for. (Yes, I did read the metrics pages you posted before, but I couldn't align them with the available metrics I had.)

I swap out the max() for std() in some panels, and avg as well. $job is the promregator instance.

--worst possible round trip - meta / scrape
max(histogram_quantile(1.0, sum(rate(promregator_request_latency_bucket{job="$job"}[1m])) by (le, job )) )

--worst possible scrape duration
max(promregator_scrape_duration_seconds{job="$job"})

--net meta data
abs(sum(max(promregator_scrape_duration_seconds{job="$job"})) - sum(max(histogram_quantile(1.0, sum(rate(promregator_request_latency_bucket{job="$job"}[1m])) by (le, job ))) )  )

Most resilient Promregator/Prometheus config for my situation

---
applications:
- name: promregator-INSTANCE
  path: promregator-0.6.2.jar
  memory: 1g
  instances: 1
  env:
    JAVA_OPTS: '-Dspring.config.location=file:$PWD/promregator_INSTANCE.yml'
    PUBLISH_PORT: 8080
    SKIPVERIFY: true
    spring.application.name: promregator-INSTANCE
    CF_PASSWORD: secret

To get this to apply, you have to add the promregator yml file to the promregator distribution at root level. That maps back to the -Dspring.config.location setting.

server:
  port: XXXXX
cf:
  api_host: XXX.ap01.XXXXXXX.com
  username: xxxxx
  connectionPool:
    size: 32
  request:
    timeout:
      appInSpace: 22073
      org: 22073
      appSummary: 22073
  cache:
    timeout:
      application: 20

logging:
  level:
    org:
      cloudfoundry:
        promregator: ERROR

promregator:
  discovery:
    hostname: promregator-INSTANCE.xxx.ap01.xxx.com
    port: 443
  authenticator:
    type: none
  metrics:
    internal: true
    requestLatency: true
  scraping:
    threads: 14
    connectionTimeout: 22073
    socketReadTimeout: 22073
    maxProcessingTime: 22073

  targets:

Prometheus Configuration: You can generate the promregator_INSTANCE_XXX_discovery.json with a wget against the Promregator /discovery endpoint. We found a simple cron that rebuilds the file every 5 minutes to be a good option; a sketch of such an entry follows, and the Prometheus job configuration comes after it.
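A hedged sketch of such a cron entry (hostname and file path are placeholders for your environment, mirroring the names used in the configs above):

# refresh the discovery file every 5 minutes
*/5 * * * * wget -q -O /etc/prometheus/discovery/promregator_INSTANCE_XXX_discovery.json https://promregator-INSTANCE.xxx.ap01.xxx.com/discovery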

- job_name: INSTANCE_promregator
  honor_timestamps: true
  scrape_interval: 30s
  scrape_timeout: 25s
  scheme: https
  file_sd_configs:
  - files:
    - /etc/prometheus/discovery/promregator_INSTANCE_XXX_discovery.json
    refresh_interval: 5m
  relabel_configs:
  - source_labels: [job]
    separator: ;
    regex: .*(ap01)_promregator
    target_label: foundation
    replacement: $1
    action: replace

eaglerainbow commented 4 years ago

some good news. We figured out the issue. Our UAA was a victim of a "noisy neighbor". We scaled the UAA and restarted the VMs for the existing ones, and latency for our off-prem (an01/ap01) cloud platforms just about matched the standard configuration

That is really good news! It sounds to me like Promregator revealed some underlying issue just by testing out SLAs in practice 😁

We did however leave the config in there for the additional threads and pools.

That's okay; if you can spare the necessary resources (CPU+memory) for that, then this is totally fine.

Good to hear about #116 getting some traction. I would be happy to try it out in one of our non-prod clouds.

Progress on #116 is better than expected: finding Caffeine really did the trick - Ben (the author of Caffeine) has generically solved many problems that I thought we would stumble over in a cache. I hope that I can provide a release candidate soon for a first test. Locally on my dev machine, the results all look very promising.

This is what I have so far in Grafana. Please let me know if this is what I'm looking for.

The first two formulas look good to me. The last one is rather inaccurate, as with it you would attribute all the effort Promregator spends on processing the scrape to metadata fetching. Even worse, the formula does not consider parallel scraping (i.e. if two targets are scraped in parallel, you'll get "strange values"). I think you are searching more for promregator_cffetch_latency_bucket. The bad thing is that you then have to deal with the label request_type. For the sake of simplicity, however, you could simply sum up all the le buckets across the request_types first - except those matching the regex .*_singlePage (see the sketch below). This approach is also a little wrong, because it implies that all metadata requests could be fired in parallel - but given that you have such a high cf.connectionPool.size, this assumption is "less invalid" than the previous one. The resulting histogram would then give an indication of the overall metadata request time.
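A sketch of that approximation, with the same caveat as described above (it pretends all metadata request types are fired in parallel); 0.98 is just one reasonable percentile choice:

# rough overall metadata fetch time, combining all non-paged request_types
histogram_quantile(0.98,
  sum(rate(promregator_cffetch_latency_bucket{job="$job", request_type!~".*_singlePage"}[5m])) by (le)
)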

cf:
  request:
    timeout:
      appInSpace: 22073
      org: 22073
      appSummary: 22073

Using odd request timeout values is rather arbitrary. Prime numbers for expiry values would be much more interesting.

We found a simple cron that built the file every 5 min as a good option.

That's even how it was intended 😁

eaglerainbow commented 4 years ago

Coming back to your initial request:

A way to understand the interactions between the promregator, cloud controller, and container scrape duration.

An easily configurable option, through env var or setting in the promregator yml that will allow us to set the necessary timeout interval.

What do you think would be an appropriate extension to Promregator permitting you to achieve your goal easier/faster/more convenient?

WilliamLHarms commented 4 years ago

@eaglerainbow - thanks for the feedback. I'll try out those metric changes and get them in our panels to see if I can clean things up.

Perhaps both could be accomplished by updating the Promregator Grafana template with metrics from this discussion. I would suggest a cleaned-up version of this post in the documentation, as a "fall back to 25s/30s scrape timeouts and intervals if thresholds are reached" recommendation.

Maybe even a few posted Prometheus alert queries to let you know "when the bad things happen". To get really fancy, maybe one of the labels could be a static link to a page out here on GitHub. That's how we run our SOPs (the ones that aren't automated... yet).

Once again, I really do appreciate all of your help. Please do reach out if you need anything from your end.

eaglerainbow commented 4 years ago

Just thinking aloud: given the good progress at #116 (with which this whole scraping latency topic becomes much less critical, as even with high latency the old cache data is kept and no longer evicted), I am wondering how much effort we should still put into documentation on "fallback to 25s/30s" or an alert query. I am rather convinced that @WilliamLHarms would not have fallen into the issue here if we had had the new cache in place in the first place.

My current idea is to keep this issue here open for a while to permit #116 progress. If we are able to move to the Caffeine cache soon, then we should not invest too much here. If we can't go there soon, then we should reconsider and provide some more documentation.

WilliamLHarms commented 4 years ago

@eaglerainbow - good news. Running 0.7.3 resolved the majority of our CCAPI timeouts. Thank you for adding the Caffeine cache solution. We rolled out this change to all platforms and we are no longer dropping metrics.

eaglerainbow commented 4 years ago

Thanks for the feedback. I assume it's okay to close this issue here for you?

rrileyca commented 2 years ago

I just wanted to add that this thread helped me figure out a very tricky issue. My UAA (in a dev network) was also slow/overloaded, so the default timeouts didn't work for me. I had to set these properties to get it going. Obviously in a production environment, you'd probably want to fix the underlying issue, but this is helpful to know:

cf:
  request:
    timeout:
      org: 10000
      space: 10000
      domain: 10000
      app: 10000
      routeMapping: 10000
      sharedDomain: 10000
      process: 10000
      appSummary: 10000
      appInSpace: 10000

checketts commented 2 years ago

@rrileyca Thanks for sharing. And just to explain for others, he is setting the timeout to '10 seconds' in his dev UAA.