mkey-cdx opened 4 years ago
Tweaking Jolokia with executor=cached helped improve scraping reliability. Jolokia stays responsive while seastat is running:
-javaagent:/usr/share/java/jolokia-jvm-1.6.2-agent.jar=executor=cached,allowDnsReverseLookup=false,discoveryEnabled=false
However, we are still facing issues with some table stats read timeouts, and the related metrics are missing. I found that there is a hard-coded 3s timeout in the Jolokia client (https://github.com/suhailpatel/seastat/blob/master/jolokia/client.go#L24), which looks a bit harsh. Perhaps we should have the ability to configure those timeouts.
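To illustrate, making that timeout a parameter on the Go client would be enough for our case. Just a rough sketch, assuming a constructor along these lines (the signature is made up, not seastat's actual API):

    package jolokia

    import (
        "net/http"
        "time"
    )

    // Client wraps the HTTP client used to talk to the Jolokia agent.
    type Client struct {
        baseURL string
        http    *http.Client
    }

    // NewClient takes the request timeout as a parameter instead of
    // hard-coding it; a zero value keeps the current 3s default.
    func NewClient(baseURL string, timeout time.Duration) *Client {
        if timeout <= 0 {
            timeout = 3 * time.Second
        }
        return &Client{
            baseURL: baseURL,
            http:    &http.Client{Timeout: timeout},
        }
    }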
Hi @mkey-cdx, thanks for your issue report. I'm quite surprised that it's timing out with that number of tables. Typically, these stats are most expensive when you have lots of keyspaces and tables, but 7 keyspaces and 67 tables is not a large number, nor is a 5-node cluster particularly big.
Would you be able to tell me how many cores and how much RAM are available to your C* instance per node? That may be a factor.
I have created a 0.5.2 release with some modifications to add more configurability to seastat server:
--concurrency int maximum number of concurrent requests to Jolokia (default 10)
--timeout duration how long before we timeout a Jolokia request (default 3s)
I would recommend first trying to change the --concurrency flag to something lower, like 2, and working your way up until you find a sweet spot. The default is 10, which is what I found works reasonably well on a setup with 8-16 dedicated cores available. I wouldn't expect any individual request to take over 3s unless Jolokia is stuck.
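For example, something along these lines as a starting point (the values here are just a guess to experiment with, adjust for your setup):

seastat server --interval 30s --concurrency 2 --timeout 10s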
I'd love to hear what combination of parameters works for you. Additionally, I hope once you find the right combination, it's zippy enough to bring that interval period right down. For context, Seastat is used on a cluster scraping over 1500 tables every 30s and it does this with ease.
Hi @suhailpatel. The new --timeout parameter really helps with usability in my case, and I'm trying different numbers of concurrent Jolokia requests to find the sweet spot, as you suggested. It seems like Jolokia responsiveness really depends on I/O performance: the scraping duration varies widely and I can correlate it with system I/O wait. Here is an example with the default 10 concurrent requests:
./seastat-linux-amd64-0.5.2 server --port 5666 --interval 30s --timeout 30s -v debug
Seastat Cassandra Exporter 0.5.2
INFO[2020-12-07T13:53:42+01:00] Communicating with Jolokia 1.6.2 (http://localhost:8778)
INFO[2020-12-07T13:53:42+01:00] Starting scraper (interval: 30s)
INFO[2020-12-07T13:53:42+01:00] Listening on :5666
DEBU[2020-12-07T13:53:42+01:00] Refreshed table list, got 67 tables (took 94 ms)
DEBU[2020-12-07T13:53:44+01:00] Finished scrape for 67 tables (took 1353 ms)
DEBU[2020-12-07T13:54:15+01:00] Finished scrape for 67 tables (took 1135 ms)
DEBU[2020-12-07T13:54:45+01:00] Finished scrape for 67 tables (took 1255 ms)
DEBU[2020-12-07T13:55:16+01:00] Finished scrape for 67 tables (took 2440 ms)
DEBU[2020-12-07T13:55:45+01:00] Finished scrape for 67 tables (took 1291 ms)
DEBU[2020-12-07T13:56:15+01:00] Finished scrape for 67 tables (took 1267 ms)
DEBU[2020-12-07T13:57:14+01:00] Could not get table stats for XXXX: err reading table: Post "http://localhost:8778/jolokia/read": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
DEBU[2020-12-07T13:57:14+01:00] Could not get table stats for XXXX: err reading table: Post "http://localhost:8778/jolokia/read": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
DEBU[2020-12-07T13:57:15+01:00] Finished scrape for 67 tables (took 30915 ms)
DEBU[2020-12-07T13:57:29+01:00] Finished scrape for 67 tables (took 14035 ms)
DEBU[2020-12-07T13:57:49+01:00] Finished scrape for 67 tables (took 5075 ms)
DEBU[2020-12-07T13:58:29+01:00] Finished scrape for 67 tables (took 14907 ms)
DEBU[2020-12-07T13:58:44+01:00] Refreshed table list, got 67 tables (took 52 ms)
DEBU[2020-12-07T13:59:14+01:00] Could not get table stats for XXXX: err reading table: Post "http://localhost:8778/jolokia/read": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
DEBU[2020-12-07T13:59:14+01:00] Could not get table stats for XXXX: err reading table: Post "http://localhost:8778/jolokia/read": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
DEBU[2020-12-07T13:59:15+01:00] Finished scrape for 67 tables (took 30879 ms)
DEBU[2020-12-07T13:59:44+01:00] Finished scrape for 67 tables (took 29075 ms)
DEBU[2020-12-07T14:00:12+01:00] Finished scrape for 67 tables (took 28047 ms)
DEBU[2020-12-07T14:00:15+01:00] Finished scrape for 67 tables (took 1279 ms)
DEBU[2020-12-07T14:00:45+01:00] Finished scrape for 67 tables (took 1315 ms)
Timeouts start to occur when the system is suffering from I/O wait (at about 13:57 and then 13:59 in this example). Because the cluster was idle at that time, I suspect the I/O was caused by Jolokia itself. The I/O activity comes from the Cassandra process (Jolokia is attached as a Java agent) and drops completely once I stop the seastat process. I'm wondering if the concurrency setting can mitigate this behavior.
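My understanding is that --concurrency simply bounds how many table-stats requests are in flight against Jolokia at once, roughly like the sketch below (my own illustration, not seastat's actual code), so lowering it should at least smooth out bursts of JMX reads:

    package main

    import (
        "fmt"
        "sync"
    )

    // scrapeTables fetches stats for each table, allowing at most
    // `concurrency` requests to be in flight against Jolokia at once.
    func scrapeTables(tables []string, concurrency int) {
        sem := make(chan struct{}, concurrency) // counting semaphore
        var wg sync.WaitGroup
        for _, table := range tables {
            wg.Add(1)
            sem <- struct{}{} // blocks while `concurrency` scrapes are already running
            go func(t string) {
                defer wg.Done()
                defer func() { <-sem }()
                // placeholder for the real POST to /jolokia/read
                fmt.Println("scraping", t)
            }(table)
        }
        wg.Wait()
    }

    func main() {
        scrapeTables([]string{"ks1.table_a", "ks1.table_b", "ks2.table_c"}, 2)
    }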
The JMX exporter we have in production is configured to whitelist/blacklist MBeans from being exposed during a scrape. Do you know if we can implement a similar mechanism in seastat, or perhaps at the Jolokia level?
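On the Jolokia side, I believe the agent supports a policy file (pointed to via the policyLocation agent option) that can deny access to specific MBean patterns, which might approximate a blacklist. A rough sketch, with a placeholder pattern:

    <?xml version="1.0" encoding="utf-8"?>
    <restrict>
      <!-- deny MBeans we don't want exposed; the pattern below is a placeholder -->
      <deny>
        <mbean>
          <name>org.apache.cassandra.metrics:type=Table,keyspace=system,*</name>
          <attribute>*</attribute>
        </mbean>
      </deny>
    </restrict>

Though I guess denying beans that seastat itself reads would just turn those requests into errors rather than make the scrape cheaper.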
FYI, the cluster I use for testing has 5 nodes with 24 cores/128G RAM each. Data is stored on HDD JBODs.
Hello @suhailpatel, we started considering your tool as an alternative to the Cassandra JMX exporter, to help with performance and memory usage issues we are facing. However, even though the collector itself has good performance and memory usage, the debug logs show that the scraper often times out during Jolokia requests. This test was done with seastat v0.5.1 on a running 5-node Cassandra v3.11 cluster holding 150TB of data across 7 keyspaces and 67 tables (including system and reaper ones). Jolokia v1.6.2 is attached as a Java agent.
Checking Jolokia directly in parallel shows that it becomes unresponsive a few seconds after we start seastat.
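A direct check here can be as simple as polling the agent's version endpoint while seastat runs (default agent port assumed):

curl -s http://localhost:8778/jolokia/version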
In practice, this results in metrics not being exported at all by the collector, and subsequent scraper runs will occasionally give up on the Jolokia scraping attempt entirely:
Could not fetch version, bailing out
Two concerns here: