suhailpatel / seastat

A standalone Cassandra Prometheus Exporter with speed 🏎️
MIT License

Jolokia timeouts while running seastat #11

Open mkey-cdx opened 3 years ago

mkey-cdx commented 3 years ago

Hello @suhailpatel, we have started to consider using your tool as an alternative to the Cassandra JMX exporter to help with performance and memory usage issues we are facing. However, even though the collector itself has good performance and memory usage, the debug logs show that the scraper often times out during Jolokia requests. This test was done with seastat v0.5.1 on a running 5-node Cassandra v3.11 cluster holding 150TB of data across 7 keyspaces and 67 tables (including the system and Reaper ones). Jolokia v1.6.2 is attached as a Java agent.

./seastat-linux-amd64-0.5.1 server --port 5666 --interval 120s -v debug
🌊 Seastat Cassandra Exporter 0.5.1
INFO[2020-10-27T10:20:05+01:00] ☕ Communicating with Jolokia 1.6.2 (http://localhost:8778)
INFO[2020-10-27T10:20:05+01:00] 🕷️ Starting scraper (interval: 2m0s)
INFO[2020-10-27T10:20:05+01:00] 👂 Listening on :5666
DEBU[2020-10-27T10:20:05+01:00] 🐝 Refreshed table list, got 67 tables (took 104 ms)
DEBU[2020-10-27T10:20:22+01:00] 🦂 Could not get table stats for system.available_ranges: err reading table: Post "http://localhost:8778/jolokia/read": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
DEBU[2020-10-27T10:20:22+01:00] 🦂 Could not get table stats for system_distributed.repair_history: err reading table: Post "http://localhost:8778/jolokia/read": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
DEBU[2020-10-27T10:20:22+01:00] 🦂 Could not get table stats for system_schema.triggers: err reading table: Post "http://localhost:8778/jolokia/read": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
DEBU[2020-10-27T10:20:22+01:00] 🦂 Could not get table stats for system_schema.types: err reading table: Post "http://localhost:8778/jolokia/read": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
DEBU[2020-10-27T10:20:22+01:00] 🦂 Could not get table stats for reaper_db.cluster: err reading table: Post "http://localhost:8778/jolokia/read": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
DEBU[2020-10-27T10:20:22+01:00] 🦂 Could not get table stats for system_schema.tables: err reading table: Post "http://localhost:8778/jolokia/read": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
DEBU[2020-10-27T10:20:22+01:00] 🦂 Could not get table stats for system_auth.role_permissions: err reading table: Post "http://localhost:8778/jolokia/read": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
DEBU[2020-10-27T10:20:22+01:00] 🦂 Could not get table stats for system.peer_events: err reading table: Post "http://localhost:8778/jolokia/read": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
DEBU[2020-10-27T10:20:22+01:00] 🦂 Could not get table stats for reaper_db.running_reapers: err reading table: Post "http://localhost:8778/jolokia/read": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[...]
^CINFO[2020-10-27T10:20:34+01:00] 🏁 Received OS signal interrupt, shutting down
INFO[2020-10-27T10:20:34+01:00] 😴 Server has shut down

Querying Jolokia directly in parallel shows that it becomes unresponsive a few seconds after we start seastat:

time curl http://localhost:8778/jolokia/version
{"request":{"type":"version"},"value":{"agent":"1.6.2","protocol":"7.2","config":{"listenForHttpService":"true","maxCollectionSize":"1000","authIgnoreCerts":"false","agentId":"10.200.40.200-22482-378bf509-jvm","debug":"false","agentType":"jvm","policyLocation":"classpath:\/jolokia-access.xml","agentContext":"\/jolokia","serializeException":"false","mimeType":"text\/plain","maxDepth":"15","authMode":"basic","authMatch":"any","discoveryEnabled":"false","streaming":"true","canonicalNaming":"true","historyMaxEntries":"10","allowErrorDetails":"true","allowDnsReverseLookup":"false","realm":"jolokia","includeStackTrace":"true","maxObjects":"0","useRestrictorService":"false","debugMaxEntries":"100"},"info":{}},"timestamp":1603790576,"status":200}
real    0m0.031s
user    0m0.004s
sys     0m0.008s
Mon Oct 26 14:34:46 CET 2020

time curl http://localhost:8778/jolokia/version
[...]
real    4m27.056s
user    0m0.007s
sys     0m0.033s

In practice, this results in no metrics being exported by the collector at all, and subsequent scraper runs will occasionally abort the Jolokia scraping attempt entirely: 🦂 Could not fetch version, bailing out

Two concerns here:

  1. Is there a way to limit the scope of data retrieved by the scraper? It looks like Jolokia gets stuck under heavy data load and would benefit from object blacklisting/whitelisting.
  2. How can we detect such a scraping failure? It seems we can compare the seastat_last_scrape_timestamp value against the configured interval (a sketch follows this list), but something like a seastat_last_scrape_status metric would be more convenient.
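For point 2, a minimal detection sketch, assuming seastat_last_scrape_timestamp exposes the Unix time of the last scrape in seconds (the 120s value is just our current interval setting):

    # Fires when no scrape has completed for more than two scrape intervals
    time() - seastat_last_scrape_timestamp > 2 * 120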
mkey-cdx commented 3 years ago

Tweaking Jolokia with executor=cached helped improve scraping reliability (as far as I can tell, the JVM agent's built-in HTTP server defaults to a single-threaded executor, so concurrent requests queue behind each other, while executor=cached switches it to a cached thread pool). Jolokia now stays responsive while seastat is running.

-javaagent:/usr/share/java/jolokia-jvm-1.6.2-agent.jar=executor=cached,allowDnsReverseLookup=false,discoveryEnabled=false

However, we are still facing issues with some table stats reads timing out, and the related metrics are missing. I found that there is a hard-coded 3s timeout here: https://github.com/suhailpatel/seastat/blob/master/jolokia/client.go#L24. This looks a bit harsh. Perhaps we should have the ability to configure those timeouts.
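Something along these lines is what I have in mind, as a rough sketch only (package and names here are illustrative, not the actual seastat code): the timeout becomes a constructor argument that the server command could wire up to a flag.

    // Rough sketch only: pass the Jolokia HTTP timeout in as a parameter
    // instead of hard-coding it in the client.
    package jolokia

    import (
        "net/http"
        "time"
    )

    // Client wraps an HTTP client pointed at a Jolokia agent.
    type Client struct {
        baseURL    string
        httpClient *http.Client
    }

    // NewClient builds a client with a caller-supplied timeout rather than
    // a fixed 3s value.
    func NewClient(baseURL string, timeout time.Duration) *Client {
        return &Client{
            baseURL:    baseURL,
            httpClient: &http.Client{Timeout: timeout},
        }
    }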

suhailpatel commented 3 years ago

Hi @mkey-cdx, thanks for your issue report. I'm quite surprised that it's timing out with that number of tables. Typically, these stats are most expensive when you have lots of keyspaces and tables, but 7 keyspaces and 67 tables is not a large number, nor is the number of nodes (5) particularly large.

Would you be able to tell me how many cores and how much RAM are available to your C* instance per node? That may be a factor.

I have created a 0.5.2 release with some modifications to add more configurability to seastat server:

      --concurrency int     maximum number of concurrent requests to Jolokia (default 10)
      --timeout duration    how long before we timeout a Jolokia request (default 3s)

I would recommend first trying the --concurrency flag at something lower like 2 and working your way up until you find a sweet spot. The default is 10, which is what I found works reasonably on a setup with 8-16 dedicated cores available. I wouldn't expect any individual request to take over 3s unless Jolokia is stuck.
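For example, a conservative starting point might look something like this (the values are just an illustration to tune from, not a recommendation for your hardware):

    ./seastat-linux-amd64-0.5.2 server --port 5666 --interval 120s --concurrency 2 --timeout 3s -v debug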

I'd love to hear what combination of parameters works for you. Additionally, I hope once you find the right combination, it's zippy enough to bring that interval period right down. For context, Seastat is used on a cluster scraping over 1500 tables every 30s and it does this with ease.

mkey-cdx commented 3 years ago

Hi @suhailpatel, the new --timeout parameter really helps with usability in my case, and I'm trying different numbers of concurrent Jolokia requests to find the sweet spot as you suggested. It seems like Jolokia responsiveness really depends on IO performance: the scraping duration varies widely, and I can correlate it with system iowait. Here is an example with the default 10 concurrent requests:

./seastat-linux-amd64-0.5.2 server --port 5666 --interval 30s --timeout 30s -v debug
🌊 Seastat Cassandra Exporter 0.5.2
INFO[2020-12-07T13:53:42+01:00] ☕ Communicating with Jolokia 1.6.2 (http://localhost:8778)
INFO[2020-12-07T13:53:42+01:00] 🕷️ Starting scraper (interval: 30s)
INFO[2020-12-07T13:53:42+01:00] 👂 Listening on :5666
DEBU[2020-12-07T13:53:42+01:00] 🐝 Refreshed table list, got 67 tables (took 94 ms)
DEBU[2020-12-07T13:53:44+01:00] 🕸️ Finished scrape for 67 tables (took 1353 ms)
DEBU[2020-12-07T13:54:15+01:00] 🕸️ Finished scrape for 67 tables (took 1135 ms)
DEBU[2020-12-07T13:54:45+01:00] 🕸️ Finished scrape for 67 tables (took 1255 ms)
DEBU[2020-12-07T13:55:16+01:00] 🕸️ Finished scrape for 67 tables (took 2440 ms)
DEBU[2020-12-07T13:55:45+01:00] 🕸️ Finished scrape for 67 tables (took 1291 ms)
DEBU[2020-12-07T13:56:15+01:00] 🕸️ Finished scrape for 67 tables (took 1267 ms)
DEBU[2020-12-07T13:57:14+01:00] 🦂 Could not get table stats for XXXX: err reading table: Post "http://localhost:8778/jolokia/read": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
DEBU[2020-12-07T13:57:14+01:00] 🦂 Could not get table stats for XXXX: err reading table: Post "http://localhost:8778/jolokia/read": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
DEBU[2020-12-07T13:57:15+01:00] 🕸️ Finished scrape for 67 tables (took 30915 ms)
DEBU[2020-12-07T13:57:29+01:00] 🕸️ Finished scrape for 67 tables (took 14035 ms)
DEBU[2020-12-07T13:57:49+01:00] 🕸️ Finished scrape for 67 tables (took 5075 ms)
DEBU[2020-12-07T13:58:29+01:00] 🕸️ Finished scrape for 67 tables (took 14907 ms)
DEBU[2020-12-07T13:58:44+01:00] 🐝 Refreshed table list, got 67 tables (took 52 ms)
DEBU[2020-12-07T13:59:14+01:00] 🦂 Could not get table stats for XXXX: err reading table: Post "http://localhost:8778/jolokia/read": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
DEBU[2020-12-07T13:59:14+01:00] 🦂 Could not get table stats for XXXX: err reading table: Post "http://localhost:8778/jolokia/read": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
DEBU[2020-12-07T13:59:15+01:00] 🕸️ Finished scrape for 67 tables (took 30879 ms)
DEBU[2020-12-07T13:59:44+01:00] 🕸️ Finished scrape for 67 tables (took 29075 ms)
DEBU[2020-12-07T14:00:12+01:00] 🕸️ Finished scrape for 67 tables (took 28047 ms)
DEBU[2020-12-07T14:00:15+01:00] 🕸️ Finished scrape for 67 tables (took 1279 ms)
DEBU[2020-12-07T14:00:45+01:00] 🕸️ Finished scrape for 67 tables (took 1315 ms)

Timeouts start to occur when the system is suffering from iowait (at about 13:57 and then 13:59 in this example). Because the cluster was idle at that time, I suspect the IO was caused by Jolokia itself. The IO activity comes from the Cassandra process (Jolokia is attached as a Java agent) and drops off completely as soon as I stop the seastat process. I'm wondering if the concurrency setting can mitigate this behaviour.

The JMX exporter we have in production is configured to whitelist/blacklist MBeans from being exposed during scrapes. Do you know if we could implement a similar mechanism in seastat, or perhaps at the Jolokia level?
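At the Jolokia level, I guess the access policy the agent already points at (jolokia-access.xml, visible in the config output above) could deny specific MBean patterns. A rough fragment, assuming the documented deny syntax applies to Cassandra's table metric beans (the keyspace pattern is only an example):

    <!-- fragment of jolokia-access.xml: hide table-level metric beans for one keyspace -->
    <restrict>
      <deny>
        <mbean>
          <name>org.apache.cassandra.metrics:type=Table,keyspace=reaper_db,*</name>
          <attribute>*</attribute>
          <operation>*</operation>
        </mbean>
      </deny>
    </restrict>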

FYI, the cluster I use for testing has 5 nodes with 24 cores and 128GB of RAM each. Data is stored on HDD JBODs.