tokio-rs / tokio

A runtime for writing reliable asynchronous applications with Rust. Provides I/O, networking, scheduling, timers, ...
https://tokio.rs
MIT License

time: avoid traversing entries in the time wheel twice #6584

Closed · wathenjiang closed this 3 months ago

wathenjiang commented 4 months ago

Motivation

Currently we use a pending linked list to temporarily store the timer entries to be fired.

If a large number of timers need to be woken at once, this causes a cache-miss issue, because each entry is traversed twice: once to move it to the pending list, and once to fire it.

Solution

To synchronize state between threads, we must hold the wheel's lock whenever we modify the state of entries in the wheel. However, while holding that lock we are not allowed to wake the wakers of the entries, in order to avoid deadlock.

Furthermore, to avoid an infinite loop, we need to take all of the entries off the slot before processing any of them, and we hold the lock throughout this process.

These are the main reasons the pending linked list is currently used.

In addition to supporting random deletion, the slot linked list only has the FIFO operations insert_head and pop_back. Given this, if a counter is introduced for each slot, the maximum number of pop_back calls can be determined before traversing the slot, which solves the infinite-loop issue.

Furthermore, once the infinite-loop issue is solved, we no longer need to take all of the entries off the slot before processing any of them. Finally, in this PR, we remove the pending linked list.
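
In pseudocode, the idea looks roughly like this (a sketch only: VecDeque stands in for the intrusive slot linked list, and all names are illustrative rather than the actual Tokio internals):

use std::collections::VecDeque;

struct Slot {
    list: VecDeque<u64>, // stand-in for the intrusive list of timer entries
    len: usize,          // the per-slot counter this PR introduces
}

impl Slot {
    fn process(&mut self, now: u64, wake: impl Fn(u64)) {
        // Fix the bound before traversing: we pop at most `bound` times, so
        // entries pushed back to the front can never cause an infinite loop.
        let bound = self.len;
        for _ in 0..bound {
            let entry = match self.list.pop_back() {
                Some(e) => e,
                None => break,
            };
            if entry <= now {
                self.len -= 1;
                wake(entry); // the real driver drops the lock before waking
            } else {
                // Not yet expired: reinsert at the head so this pass
                // will not revisit it.
                self.list.push_front(entry);
            }
        }
    }
}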

Expected benefits are as follows

FrankReh commented 4 months ago

Drive-by comments.

I have no idea if the maintainers will accept this idea. On the surface, this is what appears to be happening with these changes.

A timer wheel's slot's entries are now counted. This increases the timer wheel sizes in memory but is meant to allow cycling through them once.

When running through a timer wheel slot list, the original count is used to try to know when to stop popping from the back of the list, because logic is now added that pushes back to the front. But because the lock has to be released and then reacquired when the local stack-frame slice of temporary expired events has become full, another thread may have come in and added an entry (that conceivably has a 'now' deadline) so it seems this new entry will not be seen by the loop.

This is going to be a very rare event, as it also must coincide with the local stack-frame's slice having filled up, so it is conceivable a solution is to punt out of the looping at that point and restart at the top of the function, getting the new total size of the list, starting with an empty stack frame slice too, and just starting again to look for entries that expired 'now'.
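
Purely as an illustration of that restart idea (hypothetical names and a toy entry type, not code from the PR), the shape could be:

use std::collections::VecDeque;
use std::sync::Mutex;

const BATCH: usize = 32; // size of the local stack-frame slice

fn drain_expired(slot: &Mutex<VecDeque<u64>>, now: u64, wake: impl Fn(u64)) {
    'restart: loop {
        let mut batch = Vec::with_capacity(BATCH);
        let mut list = slot.lock().unwrap();
        // Re-read the length on every restart, picking up any entries that
        // other threads inserted while the lock was released.
        let bound = list.len();
        for _ in 0..bound {
            let entry = list.pop_back().unwrap();
            if entry <= now {
                batch.push(entry);
                if batch.len() == BATCH {
                    // Slice is full: release the lock, wake, then rescan
                    // the slot from scratch with a fresh count.
                    drop(list);
                    batch.into_iter().for_each(&wake);
                    continue 'restart;
                }
            } else {
                list.push_front(entry); // not expired: keep it in the slot
            }
        }
        drop(list);
        batch.into_iter().for_each(&wake);
        return;
    }
}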

It is nice that the poll function logic is simplified. (If that's actually correct - I haven't grokked why that would be the case but I'm probably not seeing the whole picture.)

The Expiration Clone isn't necessary as the one use of clone() can be avoided by passing a reference, or just passing the Expiration.deadline directly.

There is a new use of unsafe in wheel/mod.rs, in the new function add_entry. To match other coding style in that file, the add_entry might be labeled unsafe and the SAFETY comment that would be added in time/mod.rs would say why it was safe to be added back.
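
For the shape of that convention, a purely illustrative sketch (hypothetical names and types, not the actual code):

struct Wheel {
    slots: Vec<Vec<u64>>, // hypothetical stand-in for the wheel's levels
}

impl Wheel {
    /// Reinserts an entry that was previously removed from the wheel.
    ///
    /// # Safety
    ///
    /// The caller must hold the wheel's lock for the duration of this call,
    /// and `entry` must not currently be linked into any other list.
    unsafe fn add_entry(&mut self, entry: u64, level: usize) {
        self.slots[level].push(entry);
    }
}

// And at the call site (here, hypothetically in time/mod.rs):
// SAFETY: the driver lock is held, and the entry was just popped above,
// so it is safe to add it back.
// unsafe { wheel.add_entry(entry, level) };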

The idea is commendable. It seems that in the common case, the work the runtime has to do is being streamlined. But there's at least the one corner case I mentioned above, and the logic is spread out over four files, so someone with a clearer idea of all the workings would have to look at this and say that the timer wheel logic is still tight, because the corner cases seem too hard to test for exhaustively and even fuzz testing isn't going to guarantee the whole thing still hangs together.

But also, is the added size of the timer wheels worth it? The sharding and the lazy initialization changes that went in recently already made the timer wheel much more efficient for busy systems.

wathenjiang commented 4 months ago

Thanks for your comments.

> A timer wheel's slot's entries are now counted. This increases the timer wheel sizes in memory but is meant to allow cycling through them once.

Yes. The current approach comes at the cost of increased space usage. :) And what, Gul'dan, must we give in return?

> When running through a timer wheel slot list, the original count is used to try to know when to stop popping from the back of the list, because logic is now added that pushes back to the front. But because the lock has to be released and then reacquired when the local stack-frame slice of temporary expired events has become full, another thread may have come in and added an entry (that conceivably has a 'now' deadline) so it seems this new entry will not be seen by the loop.
>
> This is going to be a very rare event, as it also must coincide with the local stack-frame's slice having filled up, so it is conceivable a solution is to punt out of the looping at that point and restart at the top of the function, getting the new total size of the list, starting with an empty stack frame slice too, and just starting again to look for entries that expired 'now'.
>
> It is nice that the poll function logic is simplified. (If that's actually correct - I haven't grokked why that would be the case but I'm probably not seeing the whole picture.)

Yes. But the current design also does not ensure that such a brand-new entry will be seen by the second traversal loop.

Because we call set_elapsed immediately, this will not happen.

> The Expiration Clone isn't necessary as the one use of clone() can be avoided by passing a reference, or just passing the Expiration.deadline directly.

Yes, the Expiration Clone isn't necessary, and I will remove the #[derive(Clone)] from that type.

> There is a new use of unsafe in wheel/mod.rs, in the new function add_entry. To match other coding style in that file, the add_entry might be labeled unsafe and the SAFETY comment that would be added in time/mod.rs would say why it was safe to be added back.

Yes, we should add SAFETY comments to the new function add_entry in wheel/mod.rs.

I have renamed it to reinsert_entry now.

> The idea is commendable. It seems that in the common case, the work the runtime has to do is being streamlined. But there's at least the one corner case I mentioned above, and the logic is spread out over four files, so someone with a clearer idea of all the workings would have to look at this and say that the timer wheel logic is still tight, because the corner cases seem too hard to test for exhaustively and even fuzz testing isn't going to guarantee the whole thing still hangs together.

> But also, is the added size of the timer wheels worth it? The sharding and the lazy initialization changes that went in recently already made the timer wheel much more efficient for busy systems.

Yes. The benefit here depends on how we use the Timer. But if there is a chance to traverse the list only once instead of twice, it is definitely a performance win.

wathenjiang commented 4 months ago

The benchmark code is as follows:

#[tokio::main]
async fn main() {
    let now = tokio::time::Instant::now();
    let five_seconds_later = now + tokio::time::Duration::from_secs(5);

    // Spawn ~1M tasks that all sleep until the same deadline, so the timer
    // wheel has to fire a large batch of entries at once.
    let mut handles = vec![];
    for _ in 0..1024 * 1024 {
        handles.push(tokio::spawn(async move {
            tokio::time::sleep_until(five_seconds_later).await;
        }));
    }
    for h in handles {
        let _ = h.await.unwrap();
    }
    // Total elapsed = the 5s of sleeping + the timer-wheel/wake-up overhead.
    println!("{:?}", now.elapsed());
}

The benchmark results of the above are as follows:

  • original version: 5.846156162s
  • this pr version: 5.72068938s

On Linux with an Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz.

This shows there is a time reduction of (0.846156162 - 0.72068938) / 0.846156162 = 14.8%.

FrankReh commented 4 months ago

> The benchmark results of the above are as follows:
>
>   • original version: 5.846156162s
>   • this pr version: 5.72068938s
>
> On Linux with an Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz.
>
> This shows there is a time reduction of (0.846156162 - 0.72068938) / 0.846156162 = 14.8%.

Why are times 5.8 and 5.7 in one part and 0.8 and 0.7 in another?

wathenjiang commented 4 months ago

> The benchmark results of the above are as follows:
>
>   • original version: 5.846156162s
>   • this pr version: 5.72068938s
>
> On Linux with an Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz. This shows there is a time reduction of (0.846156162 - 0.72068938) / 0.846156162 = 14.8%.
>
> Why are times 5.8 and 5.7 in one part and 0.8 and 0.7 in another?

Here, I want to measure the time taken by the time-wheel-related operations (mainly traversing the time wheel, and waking up and executing the tasks).

The 5 seconds is the time we have to wait in any case; subtracting it, 0.846156162 s (5.846156162 − 5) and 0.72068938 s (5.72068938 − 5) are the time spent on the time-wheel-related operations.

wathenjiang commented 4 months ago

As FrankReh worried, introducing counters for all levels will lead to additional memory consumption. This memory problem is exacerbated because the timer wheel is sharded by worker threads.

I have decided to use a counter only for the highest level. Now, if our time wheel shard size is 8, the total memory increase of the timer wheels would be 3968 bytes. As far as I know, 8 worker threads will take at least 16 MB (2 MB * 8), so this ~4 KB of extra memory might be acceptable.

FrankReh commented 4 months ago

@wathenjiang Apologies if I have it backwards or if I read the new code incorrectly. Don't you mean to provide counters only for the shortest intervals? I think that would mean level 0, not level 5. But I could easily be missing something.

Also, I do like the idea of treating the most frequently accessed level differently from the rest. It made me wonder if the whole sharding idea could have been done differently, and only shard the first one, two or three levels. But that would come at the cost of more code that needs to be maintained. And I'm not a maintainer. Hardly even a contributor these days. But I hold timer wheel ideas very dear and I haven't been asked not to throw in my 2 cents now and then.

Also, it would be nice if you got some indications from @Darksonn if you might even be on the right track or if you should just stop. Best I can tell from the comment left by @Darksonn above, your fundamental idea hasn't been given much of a go-ahead yet and you shouldn't take the wrong impression from my comments that your efforts will be rewarded.

wathenjiang commented 4 months ago

> @wathenjiang Apologies if I have it backwards or if I read the new code incorrectly. Don't you mean to provide counters only for the shortest intervals? I think that would mean level 0, not level 5. But I could easily be missing something.
>
> Also, I do like the idea of treating the most frequently accessed level differently from the rest. It made me wonder if the whole sharding idea could have been done differently, and only shard the first one, two or three levels. But that would come at the cost of more code that needs to be maintained. And I'm not a maintainer. Hardly even a contributor these days. But I hold timer wheel ideas very dear and I haven't been asked not to throw in my 2 cents now and then.
>
> Also, it would be nice if you got some indications from @Darksonn if you might even be on the right track or if you should just stop. Best I can tell from the comment left by @Darksonn above, your fundamental idea hasn't been given much of a go-ahead yet and you shouldn't take the wrong impression from my comments that your efforts will be rewarded.

I appreciate your comments. If I have been unclear, let me make it clear: we should only deal with the highest level's infinite loop issue.

The idea of the counter is to avoid infinite loops; if there are other solutions, I'm open to them.

Not using counters for the lower levels is an implementation detail, not a design decision. I'd love to see the memory savings. On the other hand, if the memory usage is not affected much, we can also use the previous version.

Yes, feedback is important.

Darksonn commented 4 months ago

Thank you @FrankReh for prodding me. I will try to summarize my thoughts on this PR. The idea of using a counter to avoid the infinite loop makes me a bit worried, and I can't really tell whether it's correct. It seems to me that the counter is being used to solve a problem that we have encountered many times in Tokio. Previous times we encountered this problem, we solved it using GuardedLinkedList rather than a counter. If we are going to use a different solution here, I need to understand why GuardedLinkedList can't work.

Let me try to summarize my understanding of the PR.

Currently we move items to the pending list one-by-one, and then we traverse pending afterwards. The perf issue is that moving them one-by-one means that we traverse the entries twice.

However, as far as I understand, the items that are moved correspond exactly to those that are stored in a specific bucket, correct? Could we move all of those items into pending in one operation? One of the advantages of linked lists is that you can make that kind of batch operation cheaply, after all.

However, in this comment you object that when we move all of the items into the GuardedLinkedList, we would need to traverse the entire list to mark the entries as being in pending rather than the timer wheel. And that this is because the entries could be dropped while they are in pending.

However, the documentation for LinkedList::remove says this:

https://github.com/tokio-rs/tokio/blob/8e15c234c60cf8132c490ccf03dd31738cfeaca8/tokio/src/util/linked_list.rs#L192-L200

That is to say, you are allowed to call remove on a linked list in the timer wheel even though the entry is actually in some other GuardedLinkedList as long as the timer wheel and the GuardedLinkedList are protected by the same mutex. In the other users of GuardedLinkedList we have been able to avoid a traversal to mark entries as being in a GuardedLinkedList, so I'm thinking that we can probably also avoid that traversal in this case.
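
As a purely illustrative sketch of that batch move (std's LinkedList, whose append is an O(1) splice, stands in for Tokio's intrusive list and GuardedLinkedList; all names are hypothetical):

use std::collections::LinkedList;

struct Entry {
    deadline: u64,
}

fn fire_slot(slot: &mut LinkedList<Entry>, now: u64, wake: impl Fn(Entry)) {
    // O(1): splice every node out of the wheel's slot in one operation,
    // instead of moving entries into `pending` one-by-one.
    let mut pending = LinkedList::new();
    pending.append(slot);

    // Single traversal: each entry is visited exactly once.
    while let Some(entry) = pending.pop_front() {
        if entry.deadline <= now {
            wake(entry); // the real driver drops the lock before waking
        } else {
            slot.push_back(entry); // not expired: put it back on the wheel
        }
    }
}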

wathenjiang commented 4 months ago

> Thank you @FrankReh for prodding me. I will try to summarize my thoughts on this PR. The idea of using a counter to avoid the infinite loop makes me a bit worried, and I can't really tell whether it's correct. It seems to me that the counter is being used to solve a problem that we have encountered many times in Tokio. Previous times we encountered this problem, we solved it using GuardedLinkedList rather than a counter. If we are going to use a different solution here, I need to understand why GuardedLinkedList can't work.
>
> Let me try to summarize my understanding of the PR.
>
> Currently we move items to the pending list one-by-one, and then we traverse pending afterwards. The perf issue is that moving them one-by-one means that we traverse the entries twice.
>
> However, as far as I understand, the items that are moved correspond exactly to those that are stored in a specific bucket, correct? Could we move all of those items into pending in one operation? One of the advantages of linked lists is that you can make that kind of batch operation cheaply, after all.
>
> However, in this comment you object that when we move all of the items into the GuardedLinkedList, we would need to traverse the entire list to mark the entries as being in pending rather than the timer wheel. And that this is because the entries could be dropped while they are in pending.
>
> However, the documentation for LinkedList::remove says this:
>
> https://github.com/tokio-rs/tokio/blob/8e15c234c60cf8132c490ccf03dd31738cfeaca8/tokio/src/util/linked_list.rs#L192-L200
>
> That is to say, you are allowed to call remove on a linked list in the timer wheel even though the entry is actually in some other GuardedLinkedList as long as the timer wheel and the GuardedLinkedList are protected by the same mutex. In the other users of GuardedLinkedList we have been able to avoid a traversal to mark entries as being in a GuardedLinkedList, so I'm thinking that we can probably also avoid that traversal in this case.

Thanks! I had never noticed the comment on the remove method in linked_list. Sorry for missing such important information.

I believe this is the best way to achieve our purpose.

FrankReh commented 4 months ago

I looked it over and actually didn't have anything to add. The GuardedLinkedList is a very nice resource that tokio provides for some of its drivers. Glad that didn't have to be invented for this. I will say it removes the pending list, which simplifies things in a way, and I didn't notice any new uses of unsafe. There is a new constant, and the tests around the constants have changed a bit, so it requires someone who has internalized all the timer wheel workings to say the changes don't open up any holes.

Just wait for @Darksonn to find time. There are a lot of changes, so it will probably take more than a few minutes to decide if it looks right.

But from my perspective... nicely done.

Darksonn commented 4 months ago

Overall looks reasonable to me.

Darksonn commented 2 months ago

This change has been reverted in #6715.