Are you running from the latest master? There was a problem with the ring buffer: it hangs on recursive dispatch and overflow. It's fixed by one of the outstanding PRs, although @tehlers320 reported that it raises CPU pressure a lot. I've added a small fix (parking the thread) to avoid busy-spinning, and I hope it'll help: https://github.com/pyr/cyanite/pull/237
The only downside for now is that the new scheduling system requires more configuration:
queue:
  queue-capacity: 1048576
  pool-size: 4
I'd set pool-size to the number of cores. The queue capacity should withstand your maximum load; if you're not sure, set it to an arbitrarily large power of 2.
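For intuition about both points -- why the capacity wants to be a power of 2, and what parking the thread buys over busy-spinning -- here's a minimal single-producer/single-consumer ring buffer sketch in Java. This is an illustration only, not cyanite's actual implementation (cyanite is Clojure); the class name and timing values are invented for the example.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.LockSupport;

// Minimal SPSC ring buffer. With a power-of-two capacity the slot
// index is a cheap bit mask (index & (capacity - 1)) instead of a
// modulo, and a consumer that parks when the buffer is empty does
// not pin a core at 100% the way a busy-spin loop does.
final class SpscRingBuffer<T> {
    private final Object[] slots;
    private final long mask;                          // capacity - 1
    private final AtomicLong head = new AtomicLong(); // next slot to read
    private final AtomicLong tail = new AtomicLong(); // next slot to write

    SpscRingBuffer(int capacity) {
        if (Integer.bitCount(capacity) != 1)
            throw new IllegalArgumentException("capacity must be a power of two");
        this.slots = new Object[capacity];
        this.mask = capacity - 1;
    }

    boolean offer(T value) {
        long t = tail.get();
        if (t - head.get() == slots.length) return false; // full; caller decides
        slots[(int) (t & mask)] = value;
        tail.lazySet(t + 1);
        return true;
    }

    @SuppressWarnings("unchecked")
    T poll() {
        long h = head.get();
        while (h == tail.get()) {
            // Empty: park briefly instead of busy-spinning.
            LockSupport.parkNanos(1_000L);
        }
        int i = (int) (h & mask);
        T value = (T) slots[i];
        slots[i] = null;            // drop the reference so it can be GC'd
        head.lazySet(h + 1);
        return value;
    }
}
```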
Could you let it bake for 10-12 hours and see if it helps avoid the blocking?
Hey @jacobrichard, just checking that the previous message went through.
Yes, sorry -- I'm going to pull in 237 and try running this overnight.
👍
I did quite a bit of load testing with that pull request and did not see a recurrence of the blocking. With a suitably large (~12G) heap configured, I was able to push two instances to around 4M metrics per minute.
I'm not sure why such a large heap is required, unless the aggregation engine in cyanite holds object references for a long time, preventing them from being GC'd. A larger heap definitely resulted in fairly substantial performance increases.
A separate issue of blocking has arisen when attempting to traverse an extremely deep metric tree, but I believe @tehlers320 is opening a separate issue for that.
I'd need to profile the heap; it's not obvious to me why we need 12G. For 4M per minute, that comes down to roughly 66K per second, which means that at any given point in time we're holding roughly 66K strings and as many sets of 4 longs,
which is not that much. There might be some overhead involved with the Atoms, since they wrap pretty much everything.
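Spelling out the back-of-envelope arithmetic (a sketch; the per-metric byte count is a rough JVM guess, not a measurement):

```java
public class HeapEstimate {
    public static void main(String[] args) {
        long metricsPerMinute = 4_000_000L;
        long metricsPerSecond = metricsPerMinute / 60;   // ~66,666
        // Rough guess: a short path String (~64 bytes with header and
        // backing char[]) + 4 longs (32 bytes) + wrapper overhead,
        // call it ~128 bytes per metric in flight.
        long bytesPerMetric = 128;
        double mbPerSecond = metricsPerSecond * bytesPerMetric / 1e6;
        System.out.printf("~%,d metrics/s, ~%.1f MB in flight per second%n",
                metricsPerSecond, mbPerSecond);
        // => on the order of 10 MB/s, nowhere near 12G unless references
        // are retained much longer than a second (e.g. by aggregation
        // windows) or per-object overhead is far larger than assumed.
    }
}
```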
I'm going to profile on my own, although I don't have any machines available with a 12G heap. It would be extremely helpful if you could take a heap dump and check the top contributors, although I might be able to figure them out from a smaller heap as well.
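For reference, a dump of a running JVM can be taken externally with `jmap -dump:live,format=b,file=heap.hprof <pid>`; from inside the process, the standard HotSpot diagnostic MBean does the same (the file path below is just an example -- and note this API only dumps the JVM it runs in, so for an already-running cyanite, jmap is the practical route):

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;

public class HeapDump {
    public static void main(String[] args) throws IOException {
        // Locate the HotSpot diagnostic MBean of the current JVM.
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        HotSpotDiagnosticMXBean diag = ManagementFactory.newPlatformMXBeanProxy(
                server, "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        // 'true' dumps live objects only (triggers a full GC first).
        diag.dumpHeap("/tmp/cyanite-heap.hprof", true);
    }
}
```

The resulting .hprof opens in Eclipse MAT or VisualVM, which list the top contributors by retained size.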
Fixed in latest master.
We have a stack sending around 300k metrics through a cyanite process into Cassandra. Running the latest copy of master, performance seemed to have improved, but after being up for approximately 8 hours, the cyanite process seems to be frozen.
We have a carbon-c-relay sending to cyanite, and I see this in the carbon relay logs:
[2016-08-30 11:55:47] (ERR) failed to write() to 127.0.0.1:2109: Resource temporarily unavailable
[2016-08-30 11:55:47] (ERR) server 127.0.0.1:2109: OK
The cyanite log does not show anything (debug logging was not enabled). Here is a thread dump of the running state of the process in question:
cyanite-insert.txt