lightninglabs / lightning-node-connect


gbn: set resend timeout dynamically #88

Closed ViktorTigerstrom closed 5 months ago

ViktorTigerstrom commented 7 months ago

Based on #87

This PR updates the resend timeout so that it is set dynamically, based on how long the other party took to respond during the handshake. The timeout is set to the time it took for the server to respond multiplied by the resendMultiplier, unless that duration is shorter than the default resend timeout, in which case the default is used. If the resend timeout has been set manually, that value is always used and the timeout is never set dynamically.

Prior to this PR, the timeout before a client resends the queue of packets was always a fixed value. A fixed timeout isn't suitable for all clients, as latency varies from client to client.
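For illustration, here is a minimal Go sketch of the rule described above. The names (`chooseResendTimeout`, `manualTimeout`, `defaultTimeout`) are invented for the example and are not the actual identifiers used in the gbn package:

```go
package main

import (
	"fmt"
	"time"
)

// chooseResendTimeout sketches the rule from the PR description: derive
// the resend timeout from the measured handshake response time, never go
// below the default, and always honour a manually configured timeout.
func chooseResendTimeout(manualTimeout, defaultTimeout,
	handshakeResponseTime time.Duration, resendMultiplier int) time.Duration {

	// A manually set resend timeout always wins and disables the
	// dynamic calculation.
	if manualTimeout != 0 {
		return manualTimeout
	}

	dynamic := handshakeResponseTime * time.Duration(resendMultiplier)
	if dynamic < defaultTimeout {
		return defaultTimeout
	}

	return dynamic
}

func main() {
	// E.g. a 300ms handshake round trip with a multiplier of 5 gives a
	// 1.5s resend timeout, assuming a 1s default.
	fmt.Println(chooseResendTimeout(0, time.Second, 300*time.Millisecond, 5))
}
```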

ViktorTigerstrom commented 7 months ago

Rebased on the latest version of #87.

ViktorTigerstrom commented 7 months ago

Great idea regarding modularising this into a timeout manager, @ellemouton! I've updated this PR to add a new TimeoutManager struct, which handles all the different timeouts throughout the gbn package. Currently it only updates the resend timeout based on the handshake messages, but it can easily be extended to update on the other messages we send/receive as well. I hope this addresses your feedback the way you meant it :).
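To make the idea concrete, here is a rough Go sketch of what such a timeout manager can look like. All names and methods below are illustrative and are not the actual gbn API:

```go
package gbnexample

import "time"

// timeoutManager sketches the "one struct owns all gbn timeouts" idea.
// The handshake code calls the Sent/Received hooks; everything else just
// asks for the current timeouts.
type timeoutManager struct {
	handshakeTimeout time.Duration
	resendTimeout    time.Duration

	// synSentAt records when the SYN went out, so the SYNACK handler
	// can derive the resend timeout from the measured round trip.
	synSentAt time.Time
}

// SentSYN is called when the handshake message is sent.
func (m *timeoutManager) SentSYN() {
	m.synSentAt = time.Now()
}

// ReceivedSYNACK updates the resend timeout from the handshake round
// trip, using the same multiply-and-floor rule as the earlier sketch.
func (m *timeoutManager) ReceivedSYNACK(multiplier int, floor time.Duration) {
	timeout := time.Since(m.synSentAt) * time.Duration(multiplier)
	if timeout < floor {
		timeout = floor
	}
	m.resendTimeout = timeout
}

// Getters hand the current values to the rest of the gbn package.
func (m *timeoutManager) GetResendTimeout() time.Duration    { return m.resendTimeout }
func (m *timeoutManager) GetHandshakeTimeout() time.Duration { return m.handshakeTimeout }
```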

ViktorTigerstrom commented 7 months ago

Rebased on #87

ellemouton commented 7 months ago

try rebasing linter fix: https://github.com/lightninglabs/lightning-node-connect/actions/runs/6930173289?pr=92

ViktorTigerstrom commented 7 months ago

> Nice addition! Gonna be a game changer!

Thanks :fire::rocket:!! I've addressed your feedback with the latest push 🚀!

I'll provide a more detailed answer to this tomorrow, but here is a short one for now. I should probably have mentioned this in the PR description earlier: the reason I based this on #87 is that I noticed some potential compatibility issues when testing manually, if only one of the two counterparties had updated to a version of LNC that sets the resend timeout to a value other than the default. It led to a resend loop of the packets that seemed potentially even worse than the original packet resend bug, but the fix in #87 largely resolved that, provided the other party had also picked up that fix. I'll do more thorough testing tomorrow though, to see whether my initial tests were correct.

> • A test for TimeoutMgr should please be added :)

This has been added now 🔥

ViktorTigerstrom commented 7 months ago

I've now rebased this PR on upstream, and updated it so that the TimeoutManager is configured through functional options in the config. This was added with commit 25c0c68. I've also updated the PR so that the resend timeout is now updated when we receive ACKs for sent PacketData msgs, as long as they weren't resent. I added this in a separate commit, fd36941, as the prior commit was starting to get quite large. Let me know if you think it's better that I squash it into the commit above though!
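As a rough illustration of the functional-options approach mentioned here (option names and defaults below are invented for the example, not the ones from commit 25c0c68):

```go
package gbnexample

import "time"

// config holds the timeout-related settings that the timeout manager is
// built from in this sketch.
type config struct {
	staticResendTimeout time.Duration // 0 means "derive dynamically"
	resendMultiplier    int
}

// Option mutates the config before the connection is set up.
type Option func(*config)

// WithStaticResendTimeout pins the resend timeout to a fixed value and
// disables the dynamic calculation.
func WithStaticResendTimeout(d time.Duration) Option {
	return func(c *config) {
		c.staticResendTimeout = d
	}
}

// WithResendMultiplier overrides the multiplier applied to the measured
// handshake response time.
func WithResendMultiplier(m int) Option {
	return func(c *config) {
		c.resendMultiplier = m
	}
}

// newConfig applies the options over illustrative defaults.
func newConfig(opts ...Option) *config {
	cfg := &config{resendMultiplier: 5}
	for _, opt := range opts {
		opt(cfg)
	}

	return cfg
}
```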

Finally, I'm also planning to make a small update to the TimeoutManager tomorrow that temporarily increases the HandshakeTimeout when we are resending SYN msgs. It got too late to add that today. This will be a small additional feature though, so this PR can be reviewed before it's been added.

ViktorTigerstrom commented 6 months ago

Added two new commits at the end of the PR that add the following functionality:

When we need to resend a data packet, or resend the SYN message during the handshake, and the other side does not respond within the given timeout, we boost the timeouts by 50% each time we resend without receiving any response. This ensures that if the original timeouts are too short for the current network latency, we will eventually boost them to a duration long enough for the other side to respond within the timeout.

Note that this boost is currently also applied to the resend timeout when we resend because we received a NACK from the other side due to packet loss during the transfer. If you'd like, I can refactor and pass along a bool indicating whether the resend happened due to a timeout or due to packet loss. I opted not to do that in the current implementation, since there's no real tangible downside to temporarily boosting the resend timeout in such situations, and passing along the bool made things less clean.
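A tiny sketch of the 50% boost described above (the function name and shape are illustrative, not the actual gbn code):

```go
package gbnexample

import "time"

// boostTimeout applies the backoff described above: every resend that
// goes unanswered boosts the timeout by another 50%.
func boostTimeout(base time.Duration, unansweredResends int) time.Duration {
	boosted := float64(base)
	for i := 0; i < unansweredResends; i++ {
		boosted *= 1.5
	}

	// E.g. base = 2s with 3 unanswered resends gives 6.75s.
	return time.Duration(boosted)
}
```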

ViktorTigerstrom commented 6 months ago

Thanks for the review @guggero! I've addressed your feedback :)

lightninglabs-deploy commented 6 months ago

@ellemouton: review reminder
@viktortigerstrom, remember to re-request review from reviewers when ready

ViktorTigerstrom commented 5 months ago

Thanks a lot for the review @guggero! I've addressed your feedback with the latest push.

> Same question here about the integration tests though: Do we want them to pass in this PR already or are there further changes needed in follow-up PRs? Do we know what's causing the failures?

I will fully investigate this before lunch tomorrow, but I'm quite confident that the reason this has become worse since the addition of #87 (even though the issue existed prior to that) is that we now end up waiting for the syncer to time out in a bunch of the reconnection itests.

ViktorTigerstrom commented 5 months ago

Rebased on the new version of #87.

ViktorTigerstrom commented 5 months ago

As explained in #87, the latest CI workflow now failed on the pre-existing issue (2).

ViktorTigerstrom commented 5 months ago

Rebased on #87 which has now added the itest fix :rocket:

ViktorTigerstrom commented 5 months ago

Rebased on new version of #87

ViktorTigerstrom commented 5 months ago

Just rebased on master now that #87 has been merged, to make it clear which commits are part of this PR.

ViktorTigerstrom commented 5 months ago

Hmm got a new itest error I haven't seen before. Seems likely to be caused by something in this PR. Looking into what's happening.

ViktorTigerstrom commented 5 months ago

Ok, so there turned out to be a few more issues affecting the itests, some of which were pre-existing and some of which were made a bit worse by this PR now that it increases the resend timeout:

(1) With this PR, we may back off and wait for the sync for a potentially longer duration than the ping ticker interval. If we have just sent a previous ping, both the pingTicker and the pongTicker may have ticked while we were waiting to sync. That can cause us to continuously resend a ping without ever timing out on the pongTicker tick. I addressed this with fa371d8.

(2) This is the issue that occurred earlier above, and it was pre-existing. I'd like to make sure you agree that my solution is the correct way to handle it: we had a bug where the server could end up in a scenario in which it would never complete the handshake. This happened if the server restarted the handshake after sending its SYN but before receiving the SYNACK response from the client, e.g. because it timed out due to latency or packet loss. If the client then received the SYN, it would proceed to send the SYNACK and complete the handshake regardless of whether the server received the SYNACK in time. When this happened, the server would get stuck in the handshake while the client left the handshake phase.

I therefore changed the server's handshake logic so that if the server receives a SYNACK or a PacketData message after having sent its SYN but having timed out in the handshake, it treats that as completion of the handshake (a rough sketch of this behaviour follows below, after point (4)). This was updated with b447950. Do you agree that this is the correct way to handle this in the handshake?

(3) In the itests, because the client closed its context before closing the connection, any FIN message sent by the client would always fail. I didn't actually see any itest failures due to this, but I corrected it with 6b5ca8b to make sure it doesn't cause errors in the future.

@ellemouton, given the discussion in https://github.com/lightninglabs/lightning-node-connect/pull/87#discussion_r1442604224, is this way of ensuring that c.cancel() gets executed even if c.grpcConn.Close() errors OK, since it closely resembles how we normally do it with returnErr in the itests (see the second sketch below)? Or should I explicitly check the result of c.grpcConn.Close() and run c.cancel() regardless of whether an error was returned?

(4) Finally, I bumped the itest timeout to 60 seconds in 256546e, as a precaution for users with a really bad connection. I did this because I realised that if a user ends up with a boosted resendTimeout of more than 10 seconds, they are bound to time out in the reconnection tests if they're unlucky. This is because, when the other party (server or client) is down during the test and the syncer has just resent the packets, it will take 30 seconds (10 seconds x 3) for the syncing to time out and complete, which will then time out the itest.

I should probably add that I haven't actually had latency that bad, so it hasn't been an issue for me yet, but I've come close to it when testing. I therefore changed this as a precaution, but I'll drop the commit if you think it's an unnecessary addition.
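To make the change in (2) easier to picture, here is a small sketch of the new server-side handling. Message and type names are made up for the example and are not the gbn wire types:

```go
package gbnexample

// msgType stands in for the gbn message types in this sketch.
type msgType int

const (
	msgSYNACK msgType = iota
	msgPacketData
	msgOther
)

// serverHandshakeStep sketches how the server now reacts to an incoming
// message after it has sent its SYN. If it already timed out waiting for
// the SYNACK but the client is sending a SYNACK or data, the client has
// clearly moved past the handshake, so the server treats the handshake
// as complete instead of restarting it and getting stuck.
func serverHandshakeStep(received msgType, timedOutWaiting bool) (done bool) {
	switch received {
	case msgSYNACK:
		// Normal completion of the handshake.
		return true

	case msgPacketData:
		// Data from the client after our timeout implies the client
		// already considers the handshake complete.
		return timedOutWaiting

	default:
		return false
	}
}
```

And for the question about c.cancel() in (3): one way to guarantee the cancel runs even when closing the gRPC connection errors is to defer it, roughly like this (the clientConn type below is a stand-in, not the actual itest code):

```go
package gbnexample

import (
	"context"
	"io"
)

// clientConn is a minimal stand-in for the connection wrapper discussed
// above.
type clientConn struct {
	cancel   context.CancelFunc
	grpcConn io.Closer
}

// Close cancels the context on every return path, including when
// closing the underlying connection returns an error.
func (c *clientConn) Close() error {
	defer c.cancel()

	return c.grpcConn.Close()
}
```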

ViktorTigerstrom commented 5 months ago

Thanks a lot for the reviews @guggero & @ellemouton! I've addressed your feedback with the latest push!

@ellemouton, please specifically check my response to https://github.com/lightninglabs/lightning-node-connect/pull/88#discussion_r1451870412, as that wasn't addressed with this latest push!

ViktorTigerstrom commented 5 months ago

Thanks a lot for the review and testing @ellemouton :rocket::fire:!!

Let me know if you disagree with https://github.com/lightninglabs/lightning-node-connect/pull/88#discussion_r1453334124, else this should be good to go :tada::rocket:!!

ViktorTigerstrom commented 5 months ago

After discussing this with @ellemouton offline, I've updated the TimeoutManager's Received function so that it grabs the main mu at the beginning of the function. This ensures that any GetResendTimeout call that arrives concurrently, after one thread has just gotten a Received call that will update the resend timeout, returns the updated timeout!
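For clarity, this is roughly the locking order being described; the names are simplified and not the exact TimeoutManager code:

```go
package gbnexample

import (
	"sync"
	"time"
)

// timeoutLocking sketches why grabbing the main mutex at the start of
// Received matters: a GetResendTimeout call that arrives while Received
// is already updating the timeout blocks until the new value is in
// place, rather than returning the stale one.
type timeoutLocking struct {
	mu            sync.Mutex
	resendTimeout time.Duration
	sentAt        time.Time
	multiplier    time.Duration
}

func (m *timeoutLocking) Received() {
	// Lock before computing the new timeout, not after.
	m.mu.Lock()
	defer m.mu.Unlock()

	m.resendTimeout = time.Since(m.sentAt) * m.multiplier
}

func (m *timeoutLocking) GetResendTimeout() time.Duration {
	m.mu.Lock()
	defer m.mu.Unlock()

	return m.resendTimeout
}
```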