vibe-d / vibe.d

Official vibe.d development
MIT License

`processTimers` consumes a lot of CPU #721

Open · MartinNowak opened this issue 10 years ago

MartinNowak commented 10 years ago

My MySQL->MongoDB process spends about 10% of its time in __vdso_clock_gettime and __vdso_gettimeofday. This will be worse in a VirtualBox VM, where gettimeofday is considerably more expensive. Maybe it's possible to use the cheaper TickDuration, or at least to increase the timer granularity in runEventLoop.

s-ludwig commented 10 years ago

Using just TickDuration would make the timeouts too unreliable. Maybe, if it is consistently much faster, it could be used as a limited pre-check, for example skipping the Clock.currTime whenever the TickDuration indicates that no timers have fired until now, and it's ahead of the last known Clock.currTime by less than some relatively small amount of time. That would at least help whenever there are no timers firing.

For now, I've added a simple early-out when there are no timers pending at all.
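Both ideas together might look like the following sketch. This is a hypothetical C++ analogue (vibe.d is D; the type, field names, and millisecond granularity here are all illustrative assumptions, not vibe.d's actual code): the monotonic clock plays the role of TickDuration, and the wall-clock read is skipped both when no timers are pending and when the monotonic elapsed time proves the earliest deadline cannot have been reached.

```cpp
#include <chrono>
#include <cstdint>
#include <map>

// Sketch of the pre-check idea: consult the cheap monotonic clock first and
// only pay for the expensive wall-clock read when a timer could have fired.
struct TickPrecheckQueue {
    std::multimap<std::int64_t, int> timers;  // wall-clock deadline (ms) -> timer id
    std::int64_t lastWallTime = 0;            // last known wall-clock reading (ms)
    std::chrono::steady_clock::time_point lastTick =
        std::chrono::steady_clock::now();     // monotonic reading taken alongside it

    // True when processTimers may skip the wall-clock call entirely.
    bool canSkipClockRead() const {
        if (timers.empty())
            return true;                      // early-out: nothing pending at all
        auto elapsedMs = std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::steady_clock::now() - lastTick).count();
        // If even lastWallTime advanced by the monotonic elapsed time is still
        // before the earliest deadline, no timer can have fired yet.
        return lastWallTime + elapsedMs < timers.begin()->first;
    }
};
```

The monotonic estimate can only run ahead of the real wall clock by clock drift, so the check stays conservative: a false "can't skip" costs one extra clock read, never a missed timer.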

MartinNowak commented 10 years ago

Well, this resolved my performance problem.

> Maybe, if it is consistently much faster, it could be used as a limited pre-check, for example skipping the Clock.currTime whenever the TickDuration indicates that no timers have fired until now, and it's ahead of the last known Clock.currTime by less than some relatively small amount of time.

Yeah, that's what I meant by using TickDuration. Basically just adding some sort of granularity to processTimers. I think I noticed yesterday that processTimers was called from both runEventLoop and from notifyIdle through processEvents. So there seem to be 2 calls per event round.

s-ludwig commented 10 years ago

> I think I noticed yesterday that processTimers was called from both runEventLoop and from notifyIdle through processEvents. So there seem to be 2 calls per event round.

This should only happen when the loop in notifyIdle is executed more than once (because the custom idle handler returned true, or because a task was yielded again after being resumed). The logic in that loop is a little complicated, though; maybe it can be simplified a bit.

s-ludwig commented 10 years ago

Interestingly, the tick based clocks both seem to be slower than getting the absolute time using Clock.currStdTime (not sure how it behaves in a VM though). Times normalized to currStdTime:

| Method | Windows | Linux |
| --- | --- | --- |
| currTime(UTC()) | 2.79 | 1.14 |
| currStdTime | 1.00 | 1.00 |
| currSystemTick | 1.49 | 1.08 |
| currAppTick | 2.15 | 1.16 |

So, since the time is stored as a long internally anyway, directly using Clock.currStdTime seems like the best solution.
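A table like the one above can be reproduced with a micro-benchmark along these lines (a C++ stand-in for illustration: `system_clock` plays the role of `Clock.currStdTime`, `steady_clock` the role of the tick-based clocks; the iteration count is an arbitrary choice):

```cpp
#include <chrono>
#include <cstdint>

// Times n back-to-back calls to ClockT::now() and returns nanoseconds per call.
template <typename ClockT>
double nsPerCall(int n) {
    std::int64_t sink = 0;  // accumulate the readings so they stay observable
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i)
        sink += ClockT::now().time_since_epoch().count();
    auto elapsed = std::chrono::steady_clock::now() - start;
    (void)sink;
    return std::chrono::duration<double, std::nano>(elapsed).count() / n;
}
```

On bare-metal Linux both clocks typically go through the vDSO, so the absolute numbers come out small and close together; in a VM the gap the table shows should be considerably more pronounced.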

s-ludwig commented 10 years ago

Looking at the code, it seems strange that currTime(UTC()) is significantly slower than currStdTime, though. All it does is instantiate a SysTime struct with the standard time and a static instance of UTC.

etcimon commented 10 years ago

There are currently two issues in libevent, one of which I've fixed.

1) When processing timers, consumeTimers may reschedule timers from tasks, and the main thread will also reschedule timers (going along with all the OS calls) when leaving that function. https://github.com/etcimon/vibe.d/blob/native-events/source/vibe/core/drivers/libasync.d#L369

2) When a timer is removed, TimerQueue.destroy only removes it from TimerQueue.m_timers, but not from m_timeoutHeap. Each of those stale timeouts is then consecutively rescheduled in the OS for no reason: the actual timer from waitForData stays active even though its callback no longer exists. This comparatively CPU-expensive rescheduling of timers causes a higher load on servers.
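The stale-heap-entry problem in 2) is the classic case for "lazy deletion": cancelling a timer only drops its callback, and the pop path silently discards heap entries whose timer is no longer live instead of re-arming an OS timer for them. A hedged C++ sketch (class and member names are hypothetical, not vibe.d's actual TimerQueue):

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <unordered_map>
#include <vector>

// Lazy-deletion timer queue: destroy() removes the timer from the id map
// only; its heap entry becomes a tombstone that process() discards later.
class LazyTimerQueue {
    struct Entry { std::int64_t deadline; int id; };
    struct Later {
        bool operator()(const Entry& a, const Entry& b) const {
            return a.deadline > b.deadline;  // min-heap on deadline
        }
    };
    std::priority_queue<Entry, std::vector<Entry>, Later> heap;
    std::unordered_map<int, std::function<void()>> live;  // id -> callback

public:
    void schedule(int id, std::int64_t deadline, std::function<void()> cb) {
        live[id] = std::move(cb);
        heap.push({deadline, id});
    }

    // Cancellation leaves the heap entry behind on purpose.
    void destroy(int id) { live.erase(id); }

    // Fires everything due at `now`, skipping tombstones; returns fire count.
    int process(std::int64_t now) {
        int fired = 0;
        while (!heap.empty() && heap.top().deadline <= now) {
            Entry e = heap.top();
            heap.pop();
            auto it = live.find(e.id);
            if (it == live.end()) continue;  // cancelled: discard silently
            auto cb = std::move(it->second);
            live.erase(it);
            cb();
            ++fired;
        }
        return fired;
    }
};
```

The trade-off is memory: cancelled entries linger in the heap until their deadline passes, which is usually acceptable for short waitForData-style timeouts.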

s-ludwig commented 10 years ago

1) Good point, I've added that early-out to the libevent backend, too.

2) Interesting. The code to remove a timer from the heap, at least when it is the first one, seems to have gotten lost along the way... But ultimately I think we need to switch to a red-black tree anyway, because a heap doesn't really allow removing arbitrary elements. The only issue is that this would have to be an implementation that doesn't GC-allocate each node, so (as far as I understand) the one in Phobos isn't really suitable.

BTW, I just realized now that LibasyncDriver uses TimerQueue. Wouldn't it make more sense to implement efficient timers directly in libasync and directly wrap those?

etcimon commented 10 years ago

> But ultimately I think that we need to switch to a red-black tree anyway, because a heap doesn't really allow to remove arbitrary elements.

It would probably be best to check whether the next elements in the timeout heap still have a callback set before using them as the next timeout. To avoid permuting the entire heap every time an element is removed from the front, we can keep an index to the front and remove elements in batches (5k elements at a time); that would be a very decent optimization. For inserting, I think Array and BinaryHeap should be replaced with AllocArray and std.sort.

Rather than red-black trees, we could use array buckets for multiple timeout ranges (x < 50ms, 50ms <= y < 1s, 1s <= z). This would allow an O(1) fast path when the timeout is higher than the one at the back of the array (which could be 100% of inserts if only waitForData is abusing timers with a single timeout value), and it would avoid even the best-case O(log n) insert cost of an RBT.

> BTW, I just realized now that LibasyncDriver uses TimerQueue. Wouldn't it make more sense to implement efficient timers directly in libasync and directly wrap those?

Yes, I had been thinking of this solution for AsyncTimer =) should I?

s-ludwig commented 10 years ago

With the bucket approach, wouldn't you have to constantly move timers between the different buckets (i.e. a timer > 1s will eventually be < 1s and eventually < 50ms)? Since this is basically a sorting problem, I don't think that you can get any better than O(N*log(N)) for N timers.

But array + sort will definitely be slower than a heap due to the number of elements that need to be shifted. It will basically be O(N²) at best (binary search for insertion position, O(log(N)), array insertion, O(N), and that for each element = O(N*(log(N)+N)) = O(N²)).

> To avoid permuting the entire heap every time an element is removed from the front, we can keep an index to the front and remove them in batch (5k elements at a time), that'll be a very decent optimization.

That doesn't really work for a heap, because the elements inside are not sorted.

Overall, I think that an RBT is a very decent solution. It would be interesting to perform a simple benchmark against the heap solution.
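A quick way to run that comparison is a micro-benchmark along these lines (C++ stand-ins for illustration: `std::multimap`, a red-black tree in every mainstream standard library, plays the role of the RBT; the workload of "insert, cancel half, drain the rest" and the seed are arbitrary assumptions):

```cpp
#include <chrono>
#include <cstdint>
#include <map>
#include <random>
#include <vector>

// Inserts n random deadlines, cancels every second one by handle, then
// drains the remainder in deadline order; returns elapsed microseconds.
// The tree supports O(log n) removal of arbitrary timers, which is exactly
// what the binary heap lacks.
double rbtDrain(int n) {
    std::mt19937 rng(42);
    auto start = std::chrono::steady_clock::now();

    std::multimap<std::int64_t, int> timers;  // deadline -> timer id
    std::vector<std::multimap<std::int64_t, int>::iterator> handles;
    for (int i = 0; i < n; ++i)
        handles.push_back(timers.insert({static_cast<std::int64_t>(rng()), i}));

    for (int i = 0; i < n; i += 2)
        timers.erase(handles[i]);             // direct removal by handle

    while (!timers.empty())
        timers.erase(timers.begin());         // pop earliest deadline

    return std::chrono::duration<double, std::micro>(
        std::chrono::steady_clock::now() - start).count();
}
```

The heap side of the comparison would pair `std::priority_queue` with lazy deletion (cancelled entries skipped at pop time); which side wins in practice depends heavily on the cancellation rate and on node allocation cost, which is why measuring is worthwhile.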

etcimon commented 10 years ago

Doesn't look too hard to adapt std.container.rbtree to vibe.d's free lists. The nodes are allocated only here:

https://github.com/D-Programming-Language/phobos/blob/master/std/container/rbtree.d#L739

and deallocated only here:

https://github.com/D-Programming-Language/phobos/blob/master/std/container/rbtree.d#L483

If I'm not mistaken, the module would only need to be copied next to memory.d and use FreeListObjectAlloc!RBNode's alloc and free at those two locations.
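The free-list idea amounts to recycling nodes instead of handing them back to the allocator (or, in D, the GC). A minimal C++ sketch of such a fixed-size free list (this only approximates the concept; it is not vibe.d's FreeListObjectAlloc interface):

```cpp
#include <new>

// Recycles fixed-size nodes: free() pushes the node's memory onto an
// intrusive list, alloc() pops from it and only touches the global
// allocator on a miss. (Slots are never returned to the OS; a sketch
// simplification.)
template <typename T>
class FreeList {
    union Slot {
        Slot* next;                                // link while on the free list
        alignas(T) unsigned char storage[sizeof(T)];  // payload while in use
    };
    Slot* head = nullptr;

public:
    template <typename... Args>
    T* alloc(Args&&... args) {
        Slot* s = head;
        if (s) head = s->next;     // reuse a recycled slot
        else   s = new Slot;       // free-list miss: real allocation
        return ::new (s->storage) T(static_cast<Args&&>(args)...);
    }

    void free(T* p) {
        p->~T();
        auto* s = reinterpret_cast<Slot*>(p);  // storage sits at offset 0
        s->next = head;
        head = s;
    }
};
```

Under a steady timer churn the list reaches a high-water mark quickly, after which schedule/cancel cycles allocate nothing at all, which is the property that matters for the RBT node problem above.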

MartinNowak commented 10 years ago

How many timers are there? If the number isn't too big (on the order of thousands), an array + sort (bubble sort for very small arrays) will be faster than any sorted tree container. You can readily use BinaryHeap from std.container.
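For small timer counts the array approach looks roughly like this sketch (a C++ illustration using binary search plus contiguous shifting; the struct and its names are made up for the example). Despite the O(n) shift on insert, the contiguous memory traffic is what lets it beat pointer-chasing tree containers at small n:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Keeps deadlines sorted ascending. Insertion shifts the tail by one slot,
// which is a single contiguous move - cheap for small n.
struct SmallTimerArray {
    std::vector<std::int64_t> deadlines;

    void insert(std::int64_t d) {
        auto pos = std::lower_bound(deadlines.begin(), deadlines.end(), d);
        deadlines.insert(pos, d);
    }

    // Pops every deadline <= now and returns how many fired.
    int fireDue(std::int64_t now) {
        auto end = std::upper_bound(deadlines.begin(), deadlines.end(), now);
        int fired = static_cast<int>(end - deadlines.begin());
        deadlines.erase(deadlines.begin(), end);
        return fired;
    }
};
```

Cancelling an arbitrary timer is likewise a binary search plus one contiguous erase, so (unlike the heap) this layout does not need lazy deletion at all.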

s-ludwig commented 10 years ago

BinaryHeap is used right now. The problem is that you can't remove elements from the middle. We could of course lazily switch from a sorted array to an RBT, but I'm not sure if that's actually worth the gain (assuming that node allocation is not the main overhead of the tree implementation).