MartinNowak opened this issue 10 years ago
Using just `TickDuration` would make the timeouts too unreliable. Maybe, if it is consistently much faster, it could be used as a limited pre-check, for example skipping the `Clock.currTime` whenever the `TickDuration` indicates that no timers have fired until now, and it's ahead of the last known `Clock.currTime` by less than some relatively small amount of time. That would at least help whenever there are no timers firing.
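A minimal sketch of what such a pre-check could look like (the cached variables and the 10 ms threshold are hypothetical, not existing vibe.d symbols):

```d
import core.time : TickDuration, msecs;
import std.datetime : Clock, SysTime, UTC;

// Illustrative module-level cache; these names are made up for the sketch.
SysTime gLastKnownTime;       // last result of the expensive Clock.currTime call
TickDuration gLastKnownTick;  // tick counter captured at the same moment
SysTime gNextTimeout;         // due time of the earliest pending timer

SysTime currentTimeForTimers()
{
    // Cheap pre-check: estimate the current time from the tick counter.
    auto elapsed = TickDuration.currSystemTick - gLastKnownTick;
    auto estimate = gLastKnownTime + msecs(elapsed.msecs);

    // If only a small amount of time can have passed and no timer is due
    // before the estimate, skip the Clock.currTime call entirely.
    if (elapsed.msecs < 10 && estimate < gNextTimeout)
        return estimate;

    // Otherwise refresh the cache from the reliable wall clock.
    gLastKnownTime = Clock.currTime(UTC());
    gLastKnownTick = TickDuration.currSystemTick;
    return gLastKnownTime;
}
```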
For now, I've added a simple early-out when there are no timers pending at all.
Well, this resolved my performance problem.
> Maybe, if it is consistently much faster, it could be used as a limited pre-check, for example skipping the `Clock.currTime` whenever the `TickDuration` indicates that no timers have fired until now, and it's ahead of the last known `Clock.currTime` by less than some relatively small amount of time.
Yeah, that's what I meant by using TickDuration. Basically just adding some sort of granularity to processTimers. I think I noticed yesterday that processTimers was called from both runEventLoop and from notifyIdle through processEvents. So there seem to be 2 calls per event round.
> I think I noticed yesterday that processTimers was called from both runEventLoop and from notifyIdle through processEvents. So there seem to be 2 calls per event round.
This should only happen when the loop in `notifyIdle` is executed more than once (because the custom idle handler returned `true`, or because a task was `yield`ed again after being resumed). The logic in that loop is a little complicated, though; maybe it can be simplified a bit.
Interestingly, the tick based clocks both seem to be slower than getting the absolute time using `Clock.currStdTime` (not sure how it behaves in a VM though). Times normalized to `currStdTime`:
Method | Windows | Linux
---|---|---
`currTime(UTC())` | 2.79 | 1.14
`currStdTime` | 1.00 | 1.00
`currSystemTick` | 1.49 | 1.08
`currAppTick` | 2.15 | 1.16
So since the time is stored as a `long` internally anyway, directly using `Clock.currStdTime` seems like the best solution.
Looking at the code, it seems strange that `currTime(UTC())` is significantly slower than `currStdTime`, though. All it does is instantiate a `SysTime` struct with the standard time and a static instance of `UTC`.
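For reference, a standalone comparison along these lines could look roughly like this (iteration count and formatting are arbitrary; the tick-based calls can be measured the same way):

```d
import std.datetime : Clock, UTC;
import std.datetime.stopwatch : StopWatch, AutoStart;
import std.stdio : writefln;

enum iterations = 1_000_000;

// Calls `fn` repeatedly and prints the average cost per call in nanoseconds.
void measure(alias fn)(string name)
{
    auto sw = StopWatch(AutoStart.yes);
    long sink;                       // keeps the calls from being optimized away
    foreach (i; 0 .. iterations)
        sink += fn();
    sw.stop();
    writefln("%-16s %6.1f ns/call (%s)", name,
             cast(double) sw.peek.total!"nsecs" / iterations, sink != 0);
}

void main()
{
    measure!(() => Clock.currStdTime)("currStdTime");
    measure!(() => Clock.currTime(UTC()).stdTime)("currTime(UTC())");
}
```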
There are currently two issues in libevent, one of which I've fixed.
1) When processing timers, `consumeTimers` may reschedule timers from tasks, and the main thread will also reschedule timers (along with all the associated OS calls) when leaving that function. https://github.com/etcimon/vibe.d/blob/native-events/source/vibe/core/drivers/libasync.d#L369
2) When a timer is removed, `TimerQueue.destroy` only removes it from `TimerQueue.m_timers`, but not from `m_timeoutHeap`. This reschedules each of those timeouts in the OS consecutively for no reason, and it means the actual timer in `waitForData` stays active while the callback is non-existent. The comparatively CPU-expensive work of rescheduling the timer causes a higher load on servers.
1) Good point, I've added that early-out to the libevent backend, too.
2) Interesting. The code to remove a timer from the heap, at least if it is the first one, seems to have gotten lost along the way... But ultimately I think that we need to switch to a red-black tree anyway, because a heap doesn't really allow removing arbitrary elements. The only issue is that this would have to be an implementation that doesn't GC-allocate each node, so (as far as I understood) the one in Phobos isn't really suitable.
BTW, I just realized now that `LibasyncDriver` uses `TimerQueue`. Wouldn't it make more sense to implement efficient timers directly in libasync and directly wrap those?
> But ultimately I think that we need to switch to a red-black tree anyway, because a heap doesn't really allow removing arbitrary elements.
It would probably be best to check the next elements in the timeout heap to see if a callback is set before using them as the next timeout. To avoid permuting the entire heap every time an element is removed from the front, we can keep an index to the front and remove them in batch (5k elements at a time); that'll be a very decent optimization. For inserting, I think `Array` and `BinaryHeap` should be replaced with `AllocArray` and `std.sort`. Rather than red-black trees, we could have array buckets for multiple timeout ranges (x < 50ms, 50ms <= y < 1s, 1s <= z), as sketched below. This would allow an O(1) optimization when the timeout is higher than the one at the back of the array (which could be 100% of inserts if only `waitForData` is abusing timers with a single timeout value), and it would avoid the O(log n) best-case insert times of an RBT.
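A rough sketch of the bucket idea (thresholds, names, and layout are illustrative only, not an existing vibe.d or libasync type):

```d
import core.time : msecs, seconds;
import std.datetime : SysTime;
import std.range : assumeSorted;
import std.array : insertInPlace;

// Buckets chosen by the remaining duration at insertion time.
struct TimeoutBuckets
{
    SysTime[][3] buckets;  // [0]: < 50ms remaining, [1]: 50ms..1s, [2]: >= 1s

    void insert(SysTime due, SysTime now)
    {
        auto remaining = due - now;
        size_t idx = remaining < 50.msecs ? 0 : remaining < 1.seconds ? 1 : 2;

        // O(1) fast path: the new deadline is not earlier than the last one
        // in its bucket (typical when many identical timeouts are set).
        if (buckets[idx].length == 0 || due >= buckets[idx][$ - 1])
        {
            buckets[idx] ~= due;
            return;
        }

        // Slow path: sorted insert, O(n) worst case due to element shifting.
        auto pos = buckets[idx].assumeSorted.lowerBound(due).length;
        buckets[idx].insertInPlace(pos, due);
    }
}
```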
> BTW, I just realized now that `LibasyncDriver` uses `TimerQueue`. Wouldn't it make more sense to implement efficient timers directly in libasync and directly wrap those?
Yes, I had been thinking of this solution for `AsyncTimer` =) Should I?
With the bucket approach, wouldn't you have to constantly move timers between the different buckets (i.e. a timer > 1s will eventually be < 1s and eventually < 50ms)? Since this is basically a sorting problem, I don't think that you can get any better than O(N*log(N)) for N timers.
But array + sort will definitely be slower than a heap due to the number of elements that need to be shifted. It's basically O(N²) (binary search for the insertion position is O(log(N)), array insertion is O(N), and doing that for each element gives O(N*(log(N)+N)) = O(N²)).
> To avoid permuting the entire heap every time an element is removed from the front, we can keep an index to the front and remove them in batch (5k elements at a time), that'll be a very decent optimization.
That doesn't really work for a heap, because the elements inside are not sorted.
Overall, I think that an RBT is a very decent solution. It would be interesting to perform a simple benchmark against the heap solution.
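A minimal standalone benchmark along those lines might look like this (insert plus pop-min only, with plain `long` deadlines; removing arbitrary elements, which the heap cannot do, is not covered here):

```d
import std.container.array : Array;
import std.container.binaryheap : heapify;
import std.container.rbtree : RedBlackTree;
import std.datetime.stopwatch : StopWatch, AutoStart;
import std.random : uniform;
import std.stdio : writefln;

void main()
{
    enum n = 100_000;
    long[] deadlines;
    foreach (i; 0 .. n)
        deadlines ~= uniform(0L, 1_000_000L);

    // Min-heap over an Array store, similar to the current timer queue.
    auto sw = StopWatch(AutoStart.yes);
    auto heap = heapify!"a > b"(Array!long());
    foreach (d; deadlines) heap.insert(d);
    while (!heap.empty) heap.removeFront();
    writefln("BinaryHeap:   %s ms", sw.peek.total!"msecs");

    // Red-black tree with duplicates allowed (several timers may share a deadline).
    sw.reset();
    auto tree = new RedBlackTree!(long, "a < b", true);
    foreach (d; deadlines) tree.insert(d);
    while (!tree.empty) tree.removeFront();
    writefln("RedBlackTree: %s ms", sw.peek.total!"msecs");
}
```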
Doesn't look too hard to adapt `std.container.rbtree` with the vibe.d freelists. The nodes are definitely allocated only here:
https://github.com/D-Programming-Language/phobos/blob/master/std/container/rbtree.d#L739
and definitely deallocated only here:
https://github.com/D-Programming-Language/phobos/blob/master/std/container/rbtree.d#L483
If I'm not mistaken, it would only need to be copied near memory.d and use `FreeListObjectAlloc!RBNode`'s `alloc` and `free` at those locations.
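For illustration, a generic node free list could look like this; it is only a sketch of the alloc/free pattern, not vibe.d's actual `FreeListObjectAlloc`:

```d
// Recycle fixed-size tree nodes instead of GC-allocating each one.
struct NodeFreeList(T)
{
    private T*[] freeNodes;   // stack of recycled nodes

    T* alloc()
    {
        if (freeNodes.length)
        {
            auto n = freeNodes[$ - 1];
            freeNodes = freeNodes[0 .. $ - 1];
            *n = T.init;      // reset the recycled node
            return n;
        }
        return new T;         // fall back to a fresh allocation
    }

    void release(T* n)
    {
        freeNodes ~= n;       // keep the node around for reuse
    }
}

unittest
{
    static struct RBNodeLike { int value; RBNodeLike* left, right, parent; }

    NodeFreeList!RBNodeLike pool;
    auto a = pool.alloc();
    a.value = 42;
    pool.release(a);
    auto b = pool.alloc();    // reuses and resets the node released above
    assert(b is a && b.value == 0);
}
```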
How many timers? If the number isn't too big (~thousands) an array + sort (bubble sort for very small arrays) will be faster than any sorted tree container. You can readily use BinaryHeap from std.container.
`BinaryHeap` is used right now. The problem is that you can't remove elements from the middle. We could of course lazily switch from a sorted array to an RBT, but I'm not sure if that's actually worth the gain (assuming that node allocation is not the main overhead of the tree implementation).
My MySQL->MongoDB process spends about 10% of its time in `__vdso_clock_gettime` and `__vdso_gettimeofday`. This will be worse in a virtual box, where `gettimeofday` is way more expensive. Maybe it's possible to use the cheaper `TickDuration`, or at least increase the timer granularity in `runEventLoop`.