pyr / cyanite

cyanite stores your metrics
http://cyanite.io

Cyanite PID Hangs #239

Closed jacobrichard closed 7 years ago

jacobrichard commented 8 years ago

We have a stack sending around 300k metrics through a cyanite pid into cassandra. Running the latest copy of master, performance seemed to have improved, but after being up for approximately 8 hours the cyanite process appears to be frozen.

We have a carbon-c-relay sending to cyanite, and I see this in the carbon relay logs:

[2016-08-30 11:55:47] (ERR) failed to write() to 127.0.0.1:2109: Resource temporarily unavailable
[2016-08-30 11:55:47] (ERR) server 127.0.0.1:2109: OK

The cyanite log does not show anything (debug logging was not enabled). Here is a thread dump of the running state of the process in question:

cyanite-insert.txt

ifesdjeen commented 8 years ago

Are you running from the latest master? There was a problem with the ring buffer: it hangs on recursive dispatch and overflow. It's fixed by one of the outstanding PRs, although @tehlers320 reported that it raises CPU pressure a lot. I've added a small fix (parking the thread) to avoid the busy-spin and hope it will help: https://github.com/pyr/cyanite/pull/237
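Roughly, the idea is the following (just a sketch in plain Java, not the actual code from the PR): when the consumer finds nothing to do, it parks for a short interval instead of spinning on the queue, which keeps CPU usage down at the cost of a tiny bit of latency.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.locks.LockSupport;

// Sketch only (not the code from the PR): a consumer that parks briefly when
// the queue is empty instead of busy-spinning, trading a little latency for
// much lower CPU pressure.
public class ParkingConsumer implements Runnable {
    private final Queue<String> queue = new ConcurrentLinkedQueue<>();
    private volatile boolean running = true;

    @Override
    public void run() {
        while (running) {
            String metric = queue.poll();
            if (metric == null) {
                // A busy-spin would just `continue;` here and peg a core at 100%.
                LockSupport.parkNanos(100_000L); // back off for ~0.1 ms before retrying
                continue;
            }
            process(metric);
        }
    }

    private void process(String metric) {
        // hand the metric off to the store / aggregation engine
    }
}
```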

The only downside currently is that the new scheduling system requires more configuration:

queue:
  queue-capacity: 1048576
  pool-size: 4

I'd set pool-size to the number of cores. Queue capacity should withstand your max load; if you're not sure, you can set it to an arbitrarily large power of 2.
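For a rough sanity check on the capacity value (purely back-of-the-envelope, with illustrative numbers rather than anything measured in this thread), you can size the queue to cover the stall window you want to absorb and round up to the next power of two:

```java
// Back-of-the-envelope sizing sketch; the rates below are illustrative,
// not measured values from this thread.
public class QueueSizing {
    public static void main(String[] args) {
        long metricsPerSecond = 5_000;                        // assumed peak ingest rate
        long stallSeconds = 120;                              // stall window to absorb
        long needed = metricsPerSecond * stallSeconds;        // 600,000 slots
        long capacity = Long.highestOneBit(needed - 1) << 1;  // round up to the next power of two
        System.out.println(capacity);                         // prints 1048576 (2^20)
    }
}
```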

Could you let it bake for 10-12 hours and see if it helps to avoid the block?

ifesdjeen commented 8 years ago

Hey @jacobrichard just checking if the previous message went through.

jacobrichard commented 8 years ago

Yes, sorry -- I'm going to pull in 237 and try running this overnight.

ifesdjeen commented 8 years ago

👍

jacobrichard commented 7 years ago

I did quite a bit of load testing with that pull request and did not see a recurrence of the blocking. With a suitably large (~12G) heap configured, I was able to push two instances to around 4M metrics per minute.

I'm not sure why such a large heap is required, aside from the aggregation engine in cyanite needing to hold object references for a long time, thus preventing them from being GC'd. A larger heap definitely resulted in fairly substantial performance increases.

A separate issue of blocking has arisen when attempting to traverse an extremely deep metric tree, but I believe @tehlers320 is opening a separate issue for that.

ifesdjeen commented 7 years ago

I'd need to profile the heap; it's non-obvious to me why we need 12G. I assume 4M per minute comes down to roughly 66K per second, which means that at any given point in time we're holding 66K strings and as many 4 × long values, which is not that much. There might be some overhead involved with the Atoms, since they wrap pretty much everything.

I'm going to profile on my own, although I don't have any machines available with 12G heaps. It would be extremely helpful if you could take a heap dump and check the top contributors, although I might be able to figure them out from a smaller heap as well.
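(For reference, and assuming a HotSpot JDK is installed on the box, something like `jmap -dump:live,format=b,file=cyanite.hprof <pid>` should capture a dump; the dominator view in a tool such as Eclipse MAT or VisualVM would then show the top contributors. The exact tooling here is just a suggestion.)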

ifesdjeen commented 7 years ago

Fixed in latest master.