private-octopus / picoquic

Minimal implementation of the QUIC protocol
MIT License

Setting idle timeout does not take effect for connection timeout #1595

Closed — TimEvens closed 6 months ago

TimEvens commented 6 months ago

Use-case:

The server is not listening/running, which results in a connection timeout. We are trying to tune the client to handle situations where the server isn't running or the user enters the wrong IP/port/etc. In those cases the connection will never succeed because no packets arrive in response to the client's connection attempt.

This could also happen from server to client. Regardless, we want the configured idle timeout to take effect.

Details:

Apparently there is no connection timeout. It would be great if we did have a connection timeout option and local error reason that conveys that. Right now we do get the IDLE TIMEOUT local error for connection timeouts.

Considering there is no remote, there is no negotiation of idle_timeout with the remote.

In our attempts with the API, setting the idle timeout via picoquic_set_default_tp() and/or picoquic_set_default_idle_timeout() has no effect on the default of 30 seconds. For example, I tried changing it to 10 seconds, but timeouts still occurred at 30 seconds. I believe this might be related to issue #1594.

From looking at the code, https://github.com/private-octopus/picoquic/blob/master/picoquic/quicctx.c#L1227 appears to hard-code the value instead of using the one supplied in *tp. Maybe I missed it, but I did not see where in the code it is updated based on either of the API methods mentioned above.

When does default_idle_timeout get applied and how does that affect the order of API calls to picoquic_set_default_tp()?

Example API usage that we tried; it did not change the idle timeout from 30 seconds:

    local_tp_options.idle_timeout = 10000; // 10 seconds
    //...    
    picoquic_set_default_tp(quic_ctx, &local_tp_options);
    picoquic_set_default_idle_timeout(quic_ctx, 10000);

Regarding negotiation of idle timeout

Idle timeout negotiation appears to take the lower of the local and remote values. For example, if the remote idle timeout is lower than the local one, the connection uses the remote value? Is that correct? I'm a bit concerned with this logic because it would require us to always set the server higher than the clients. Some clients need a longer timeout depending on their network conditions. If the server is set too low, the client can do nothing to improve its timeout. At a minimum, it would be good to notify the client (or server) that its idle timeout is being overridden by the remote. TCP keepalives and idle timers are unidirectional, which has its pros and cons, but they have worked well in production. What is the background on why negotiation is used instead of unidirectional idle timeouts?

huitema commented 6 months ago

The timeout is indeed the lower of the local and remote values. This is specified in RFC 9000, Section 10.1, Idle Timeout:

_If a max_idle_timeout is specified by either endpoint in its transport parameters (Section 18.2), the connection is silently closed and its state is discarded when it remains idle for longer than the minimum of the max_idle_timeout value advertised by both endpoints._

TimEvens commented 6 months ago

Thanks. That explains why it uses the lower of the two. max_idle_timeout makes more sense when considering server vs. client configuration. On the server side, we would either not set max_idle_timeout at all or set it to the maximum allowed for a client connection.

In our case, it makes more sense for the server to be set to a high max value, or nothing at all, which in effect lets the client's idle_timeout apply bi-directionally.

huitema commented 6 months ago

Just to be sure, I am adding a test of idle timer negotiation in PR #1596. The test verifies the expected behavior. It passes, with no code change required.

And no, we cannot use "max of 2" -- that would break interop. The idle timer is really meant to protect servers against the common case of abrupt client departure. The server will be programmed, for example, to delete a connection if it was inactive for more than 10 seconds. If that's the case, it does not make sense for the server to accept larger values, because that would allow attackers to negotiate large values and DOS the server. It also does not make sense for the client to use larger values than the server, because that would mean trying to use a connection after the server has deleted it.

TimEvens commented 6 months ago

The issue I'm seeing is with https://github.com/private-octopus/picoquic/blob/master/picoquic/sender.c#L3863. Negotiation is working; the problem occurs before negotiation.

It's hitting the override below of our configured default idle_timeout.

    if (cnx->cnx_state >= picoquic_state_ready) {
        uint64_t rto = picoquic_current_retransmit_timer(cnx, cnx->path[0]);
        idle_timer = cnx->idle_timeout;
        if (idle_timer < 3 * rto) {
            idle_timer = 3 * rto;
        }
        idle_timer += cnx->latest_receive_time;

        if (idle_timer < cnx->idle_timeout) {
            idle_timer = UINT64_MAX;
        }
    }
    else if (cnx->local_parameters.idle_timeout > (PICOQUIC_MICROSEC_HANDSHAKE_MAX / 1000)) {
        idle_timer = cnx->start_time + cnx->local_parameters.idle_timeout*1000ull;
    }
    else {
       /* =======================>
        * This is where the 30-second timeout comes from; it overrides the
        * configured default idle_timeout. The branch above only honors the
        * configured value when it exceeds PICOQUIC_MICROSEC_HANDSHAKE_MAX.
        * <======================= */
        idle_timer = cnx->start_time + PICOQUIC_MICROSEC_HANDSHAKE_MAX;
    }

The state is not yet connected and the configured idle_timeout is less than 30 seconds, so the final else block is taken: a 30-second timeout regardless of the lower value we set. Based on this code, a value longer than 30 seconds would be honored. In our case, we need the connection timeout to be shorter, because often the reason a connection is not established is a typo in the IP and/or port.

huitema commented 6 months ago

Oh, I see. You want to lower the "handshake timer", and make sure that the connection gives up if the server does not respond in the expected time. Makes sense. What is your expectation if the timer is set to zero? Wait forever?

TimEvens commented 6 months ago

What is your expectation if the timer is set to zero? Wait forever?

Challenge is that idle timer is used for connection timeout as well.

For idle_timeout, I didn't see RFC 9000 state that the idle timer MUST be set. Therefore, if it's zero, the remote value should be used. If both remote and local are zero, the idle timer should be disabled. Personally, we would never use zero on both sides; we would always ensure that at least one side is set.

For connection_timeout, it needs to be some value that at least allows the handshake, which depends on the RTT. Since we don't know the RTT when the server doesn't respond, we need to draw a line in the sand. I would put that at 10 seconds: if it takes longer than 10 seconds to connect by default, we should probably time out. The endpoint making the connection can decide to increase or decrease this timeout, but it should probably be at least 125 ms.

For our use-case, we would like a 5-second connection timeout. This is just under the end-user annoyance threshold; anything longer than 5 seconds starts to frustrate end-users, especially when it's their own fault for typing the wrong port/IP. :) BTW, our connect times, including all setup for data flows, are expected to be within 3 seconds.

huitema commented 6 months ago

So basically: if the timeout is set, use that for the handshake, and if not, use a default. We can live with that until someone complains, but I think the "deep space" folks will be disappointed...

huitema commented 6 months ago

I am fixing this in PR #1596. Might need to change the title, of course. There will be two variables:

There is argument to be made about not establishing a connection if it takes too long (short handshake timer) and not dropping it too quickly if it is established, maybe setting the idle timeout to zero.

I am pretty sure that this needs more work, but it is probably good enough for now.

TimEvens commented 6 months ago

Excellent, thanks so much!