Closed Yspadadden closed 7 years ago
I think a better solution would be, do not send more heartbeat requests, while we are waiting for the heartbeat response.
Replacing https://github.com/msemys/esjc/blob/master/src/main/java/com/github/msemys/esjc/tcp/handler/HeartbeatHandler.java#L51 with if (evt instanceof IdleStateEvent && timeoutTask == null)
should be enough.
That was my first attempt as well. The problem with that approach is thread safety. If you first check for null on timeoutTask and then assign something to timeoutTask in a non-atomic way, this is a race condition. If two threads attempt this almost at the same time you could end up with the same situation I am trying to fix. Namely that one timeoutTask gets overwritten by another and can't be canceled anymore.
Of course you could argue that there can't be two threads doing this simultaneously because idle events are 500 ms apart. And you might be right in practice 99.9% of the time. My solution works always. No matter how many idle events appear simultaneously. And when I fix things I like to fix all edge cases no matter how improbable.
So, please give my approach another consideration. I don't think it adds a lot of complexity to the code and is thread safe. Also, ConcurrentHashMap should be fast as it locks sparingly.
Your fix is good and it works, but I would like to avoid of collecting heartbeat timeout futures, because we are interesting only in one timeouted heartbeat (we don't need to schedule them all and wait for only one). What about if we use atomic flag or lock (or synchronized methods)? That be thread safe and it won't send more heartbeat requests while waiting.
private final Object timeoutTaskLock = new Object();
...
...
@Override
public void userEventTriggered(ChannelHandlerContext ctx, Object evt) throws Exception {
synchronized (timeoutTaskLock) {
if (evt instanceof IdleStateEvent && timeoutTask == null) { ... }
}
}
private void cancelTimeoutTask() {
synchronized (timeoutTaskLock) { ... }
}
That would also work. I'll update the pull request with this solution.
If a second heartbeat was triggered while we have not received the response to the first heartbeat, the timeout task of the first heartbeat was overwritten. This caused the timeout task to be triggered always as there is no way to cancel it anymore.
We now remember the timeout per heartbeat and support having multiple pending heartbeats.
Also, there was a race condition when setting up the timeout task. If the heartbeat returned really fast, it was possible for the timeout task to be scheduled after the heartbeat response has already been processed.