lemunozm / message-io

Fast and easy-to-use event-driven network library.
Apache License 2.0
1.12k stars 75 forks source link

Cancelling timed messages fails sometimes #136

Closed garin1000 closed 1 year ago

garin1000 commented 1 year ago

Hi,

In my project, I send/cancel timed messages very rapidly. They are used as per-message timeouts for receiving a sequence of messages containing bulk data, which is quite fast on localhost (2.5ms per message). After a received message, I cancel the corresponding timeout and create a new one for the next block of data.

Rarely (not exactly reproducable, maybe every 1000 timed sends), cancelling does not work, so that I receive a timed message which should already have been cancelled.

I added some debut prints and noticed an issue when inserting and cancelling a lot of timed messages in short time:

Sometimes, my code calls EventSender::send_with_timer() and EventSender::cancel_timer() while EventReceiver::enque_timers() is still working on removing timers. The remove_timer_receiver queue will then also iterate over the new remove message for a timer, which hadn't even been added to the timer BTreeMap. In the next call of EventSender::enque_timers(), the timer_receiver queue contains the timer, which should already have been cancelled. Of course, this timer then fires at some time.

The EventReceiver::enque_timers() functions seems to be called too rarely and then handles several crossbeam messages at once, which increases the probability of this race condition. In my opinion, the actual issue is that timer creation and cancellation are handled in different queues. Due to this, the creation/cancellation order can be reversed. If you have them in a single queue, the order would always be correct, and it would be guaranteed, that a timer is created before is is cancelled.

Looking forward to your opinion on this issue :-)

Best, Garin

lemunozm commented 1 year ago

Thanks for sharing this and for doing this excellent debug process. Good catch! It took me really little time to get into the problem, and yes, indeed:

the actual issue is that timer creation and cancellation are handled in different queues

That is exactly what is happening. And the solution, as you say, is to merge both queues. Probably using an enum to differentiate the event could be enough:

enum Timer<E> {
    Create(Duration, E)
    Cancel(TimerId)
} 

Would you like to make a PR with the fix?

garin1000 commented 1 year ago

I didn't implement a fix, yet. I try to find some time this evening and make a PR.

I will probably keep the tuples, and just change timer.1 to take the new Timer enum. That then would be

enum Timer<E> {
    Create(E),
    Cancel
}
lemunozm commented 1 year ago

Thanks @garin1000, no hurry!