Closed wathenjiang closed 3 months ago
Drive-by comments.
I have no idea whether the maintainers will accept this idea. On the surface, this is what appears to be happening with these changes.
A timer wheel's slot's entries are now counted. This increases the timer wheel sizes in memory but is meant to allow cycling through them once.
When running through a timer wheel slot list, the original count is used to decide when to stop popping from the back of the list, because logic has been added that pushes entries back to the front. But because the lock has to be released and reacquired when the local stack-frame slice of temporarily expired events becomes full, another thread may come in and add an entry (conceivably with a 'now' deadline), so it seems this new entry will not be seen by the loop.
This will be a very rare event, since it must also coincide with the local stack-frame slice having filled up. So one conceivable solution is to punt out of the loop at that point and restart at the top of the function: get the new total size of the list, start with an empty stack-frame slice, and begin again looking for entries that expired 'now'.
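The counting-and-restart idea described above can be sketched with ordinary std types. This is a hypothetical stand-in (a `VecDeque` under a `Mutex`, with made-up names like `drain_expired`), not Tokio's intrusive list: the snapshot count bounds the number of `pop_back`s, non-expired entries go back to the front, and a full local batch forces a lock release followed by a fresh snapshot so nothing inserted while unlocked is missed.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

// Stand-in for one timer-wheel slot: entries are (deadline, id).
// Like the intrusive slot list, only push_front and pop_back are used,
// which is what makes the snapshot-count trick necessary.
fn drain_expired(slot: &Mutex<VecDeque<(u64, u64)>>, now: u64) -> Vec<u64> {
    const BATCH: usize = 4; // size of the local stack buffer
    let mut fired = Vec::new();
    'restart: loop {
        let mut list = slot.lock().unwrap();
        // Snapshot the count so entries pushed back to the front
        // are not examined again in this pass.
        let mut remaining = list.len();
        let mut batch = Vec::with_capacity(BATCH);
        while remaining > 0 {
            remaining -= 1;
            let (deadline, id) = list.pop_back().unwrap();
            if deadline <= now {
                batch.push(id);
                if batch.len() == BATCH {
                    // Local buffer full: release the lock, hand off the
                    // batch, then restart from a fresh snapshot so entries
                    // inserted while unlocked are not missed.
                    drop(list);
                    fired.extend(batch);
                    continue 'restart;
                }
            } else {
                // Not yet expired: reinsert at the front so the
                // snapshot count still bounds this loop.
                list.push_front((deadline, id));
            }
        }
        fired.extend(batch);
        return fired;
    }
}
```

A drain never loops forever here: each pass pops at most the snapshot count, and a restart only happens after real progress (a full batch was handed off).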
It is nice that the poll function logic is simplified. (If that's actually correct - I haven't grokked why that would be the case, but I'm probably not seeing the whole picture.)
The `Expiration` `Clone` isn't necessary, as the one use of `clone()` can be avoided by passing a reference, or just passing `Expiration.deadline` directly.
There is a new use of `unsafe` in wheel/mod.rs, in the new function `add_entry`. To match the other coding style in that file, `add_entry` might be labeled `unsafe`, and the SAFETY comment that would be added in time/mod.rs would say why it was safe to be added back.
The idea is commendable. It seems in the common case the work the runtime has to do is being streamlined. But there's at least the one corner case I mentioned above, and the logic is spread out over four files, so someone with a clearer idea of all the workings would have to look at this and say that the timer wheel logic is still tight, because the corner cases seem too hard to test for exhaustively, and even fuzz testing isn't going to be a guarantee that the whole thing still hangs together.
But also, is the added size of the timer wheels worth it? The sharding and the lazy initialization changes that went in recently already made the timer wheel much more efficient for busy systems.
Thanks for your comments.
> A timer wheel's slot's entries are now counted. This increases the timer wheel sizes in memory but is meant to allow cycling through them once.
Yes. The current approach comes at the cost of increased space usage. :) And what, Gul'dan, must we give in return?
> When running through a timer wheel slot list, the original count is used to try to know when to stop popping from the back of the list, because logic is now added that pushes back to the front. But because the lock has to be released and then reacquired when the local stack-frame slice of temporary expired events has become full, another thread may have come in and added an entry (that conceivably has a 'now' deadline) so it seems this new entry will not be seen by the loop.
> This is going to be a very rare event, as it also must coincide with the local stack-frame's slice having filled up, so it is conceivable a solution is to punt out of the looping at that point and restart at the top of the function, getting the new total size of the list, starting with an empty stack frame slice too, and just starting again to look for entries that expired 'now'.
> It is nice that the poll function logic is simplified. (If that's actually correct - I haven't grokked why that would be the case but I'm probably not seeing the whole picture.)
Yes. But the current design also does not ensure that such a very new entry is seen by the second traversal loop. Because we call `set_elapsed` immediately, this will not happen.
> The `Expiration` `Clone` isn't necessary as the one use of `clone()` can be avoided by passing a reference, or just passing `Expiration.deadline` directly.
Yes, the `Expiration` `Clone` isn't necessary, and I will remove the `#[derive(Clone)]` from that type.
> There is a new use of `unsafe` in wheel/mod.rs, in the new function `add_entry`. To match other coding style in that file, the `add_entry` might be labeled `unsafe` and the SAFETY comment that would be added in time/mod.rs would say why it was safe to be added back.
Yes, we should add SAFETY comments to the new function `add_entry` in wheel/mod.rs. I have renamed it `reinsert_entry` now.
> The idea is commendable. It seems in the common case, the work the runtime has to do is being streamlined. But there's at least the one corner case I mentioned above and the logic is spread out over four files so someone with a clearer idea of all the workings would have to look at this and say that the timer wheel logic was still tight because the corner cases seem too hard to test for exhaustively and even fuzz testing isn't going to be a guarantee the whole thing still hangs together.
> But also, is the added size of the timer wheels worth it? The sharding and the lazy initialization changes that went in recently already made the timer wheel much more efficient for busy systems.
Yes. The benefit here depends on how we use the timer. But if there is a chance to traverse the list only once instead of twice, it is definitely a performance win.
The benchmark code is as follows:
```rust
#[tokio::main]
async fn main() {
    let now = tokio::time::Instant::now();
    let five_seconds_later = now + tokio::time::Duration::from_secs(5);
    let mut handles = vec![];
    for _ in 0..1024 * 1024 {
        handles.push(tokio::spawn(async move {
            tokio::time::sleep_until(five_seconds_later).await;
        }));
    }
    for h in handles {
        h.await.unwrap();
    }
    println!("{:?}", now.elapsed());
}
```
The benchmark results of the above are as follows:
- original version: 5.846156162s
- this PR version: 5.72068938s

On Linux with an Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz. This shows there is a time reduction of (0.846156162 - 0.72068938) / 0.846156162 = 14.8%.
Why are times 5.8 and 5.7 in one part and 0.8 and 0.7 in another?
Here, I want to measure the time taken for timer-wheel-related operations (mainly traversing the timer wheel, and waking up and executing tasks). The 5 seconds is the time we have to wait, while the 0.8 and 0.7 (the total elapsed time minus the 5-second sleep) measure the time of the related operations on the timer wheel.
As FrankReh worried, introducing counters for all levels will lead to additional memory consumption. This memory problem is exacerbated because the timer wheel is sharded by worker threads.
I have decided to use a counter only for the highest level. Now, if our timer wheel shard size is 8, the total memory increment of the timer wheels would be 3968 bytes. As far as I know, 8 worker threads will take at least 16 MB (2 MB * 8), so this ~4 KB of extra memory might be acceptable.
@wathenjiang Apologies if I have it backwards or if I read the new code incorrectly. Don't you mean to provide counters only for the shortest intervals? I think that would mean level 0, not level 5. But I could easily be missing something.
Also, I do like the idea of treating the most frequently accessed level differently from the rest. It made me wonder if the whole sharding idea could have been done differently, only sharding the first one, two, or three levels. But that would come at the cost of more code that needs to be maintained. And I'm not a maintainer, hardly even a contributor these days. But I hold timer wheel ideas very dear, and I haven't been asked not to throw in my 2 cents now and then.
Also, it would be nice if you got some indication from @Darksonn on whether you might even be on the right track or should just stop. Best I can tell from the comment left by @Darksonn above, your fundamental idea hasn't been given much of a go-ahead yet, and you shouldn't take from my comments the wrong impression that your efforts will be rewarded.
I appreciate your comments. If I have been unclear, let me make it clear: we should only deal with the infinite loop issue at the highest level.
The idea of the counter is to avoid infinite loops; if there are other solutions, I'm open to them.
Not using counters for the lower levels is implementation-specific, not design-specific. I'd love to see the memory savings. On the other hand, if the memory usage is not affected much, we can also use the previous version.
Yes, feedback is important.
Thank you @FrankReh for prodding me. I will try to summarize my thoughts on this PR. The idea of using a counter to avoid the infinite loop makes me a bit worried, and I can't really tell whether it's correct. It seems to me that the counter is being used to solve a problem that we have encountered many times in Tokio. Previous times we encountered this problem, we solved it using `GuardedLinkedList` rather than a counter. If we are going to use a different solution here, I need to understand why `GuardedLinkedList` can't work.

Let me try to summarize my understanding of the PR.

Currently we move items to the `pending` list one-by-one, and then we traverse `pending` afterwards. The perf issue is that moving them one-by-one means that we traverse the entries twice.

However, as far as I understand, the items that are moved correspond exactly to those that are stored in a specific bucket, correct? Could we move all of those items into `pending` in one operation? One of the advantages of linked lists is that you can make that kind of batch operation cheaply, after all.

However, in this comment you object that when we move all of the items into the `GuardedLinkedList`, we would need to traverse the entire list to mark the entries as being in `pending` rather than the timer wheel. And that this is because the entries could be dropped while they are in `pending`.

However, the documentation for `LinkedList::remove` says this:

That is to say, you are allowed to call `remove` on a linked list in the timer wheel even though the entry is actually in some other `GuardedLinkedList`, as long as the timer wheel and the `GuardedLinkedList` are protected by the same mutex. In the other users of `GuardedLinkedList` we have been able to avoid a traversal to mark entries as being in a `GuardedLinkedList`, so I'm thinking that we can probably also avoid that traversal in this case.
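The O(1) batch-move property being relied on here can be illustrated with std's `LinkedList` (a rough analogy only; Tokio's intrusive list and `GuardedLinkedList` have a different API, and `move_bucket_to_pending` is an invented name): `append` splices an entire bucket onto another list without visiting the individual entries.

```rust
use std::collections::LinkedList;

// Analogy for moving a whole timer-wheel bucket into a pending list in
// one operation: `append` relinks the head/tail pointers in O(1), with
// no per-entry traversal and no reallocation. `bucket` is left empty.
fn move_bucket_to_pending(bucket: &mut LinkedList<u64>, pending: &mut LinkedList<u64>) {
    pending.append(bucket);
}
```

This is exactly the kind of batch operation that moving entries one-by-one into `pending` gives up.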
Thanks! I had never noticed the comment on the `remove` method in `linked_list`. Sorry for missing such important information.
I believe this is the best way to address our purpose.
I looked it over and actually didn't have anything to add. The `GuardedLinkedList` is a very nice resource that Tokio provides for some of its drivers. Glad that didn't have to be invented for this. I will say it removes the `pending` list, which simplifies things in a way, and I didn't notice any new uses of `unsafe`. There is a new constant, and the tests around the constants have changed a bit, so it requires someone who has internalized all the timer wheel workings to say the changes don't open up any holes.
Just wait for @Darksonn to find time. There are a lot of changes, so it probably takes more than a few minutes to decide if it looks right.
But from my perspective, nicely done.
Overall looks reasonable to me.
This change has been reverted in #6715.
Motivation
Currently we use a `pending` linked list to temporarily store the timer entries to be fired.
If a large number of timers need to be woken at once, there will be a cache-miss issue here.
Solution
To synchronize state between threads, when we want to modify the state of entries in the wheel, we must hold the wheel's lock. However, while holding the lock, in order to avoid deadlock, we are not allowed to wake the wakers of entries.
Furthermore, to avoid the infinite loop issue, we need to take all of the entries off the slot before processing any of them. And throughout this process, we always hold the lock.
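The lock-versus-wake constraint above can be sketched as follows (an illustrative pattern with invented names such as `fire_expired`, not Tokio's actual code): wakers are collected while the lock is held but invoked only after it is dropped, since `wake()` may re-enter the timer and attempt to take the same lock.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::task::{Wake, Waker};

static WOKEN: AtomicUsize = AtomicUsize::new(0);

// A trivial waker that just counts invocations, for demonstration.
struct CountingWaker;
impl Wake for CountingWaker {
    fn wake(self: Arc<Self>) {
        WOKEN.fetch_add(1, Ordering::SeqCst);
    }
}

// Collect the wakers under the lock, but call `wake()` only after the
// guard is dropped: waking while locked could deadlock if the woken
// task immediately touches the same timer state.
fn fire_expired(wakers: &Mutex<Vec<Waker>>) {
    let to_wake: Vec<Waker> = wakers.lock().unwrap().drain(..).collect();
    // The mutex guard is dropped at the end of the statement above,
    // before any waker runs.
    for w in to_wake {
        w.wake();
    }
}
```

The same two-phase shape (mutate under the lock, notify after releasing it) is what forces the current design to stage entries somewhere before waking them.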
The main reasons for currently using a `pending` linked list are mentioned above.
In addition to supporting random deletion, the `slot` linked lists only have FIFO `insert_head` and `pop_back`. Given this, if a counter is introduced for each slot, the maximum number of `pop_back`s can be determined before traversing the slot, which solves the infinite loop issue.
Furthermore, by solving the infinite loop issue, we no longer need to take all of the entries off the slot before processing any of them. Finally, in this PR, we remove the `pending` linked list.
Expected benefits are as follows