Need guidance regarding using HTTP client in slow networks to download files

warmcat / libwebsockets

canonical libwebsockets.org networking library

https://libwebsockets.org

Other

4.77k stars 1.49k forks source link

Need guidance regarding using HTTP client in slow networks to download files #2279

Closed p-ranav closed 3 years ago

p-ranav commented 3 years ago

Hello,

Thank you for this library. I am currently working on a HTTPS file downloader using libwebsockets and have hit a sort of hurdle.

lws_protocols is https
The parameter that I'm struggling with is lws_context_creation_info.timeout_secs_ah_idle
- Let's say I set this to 25 mins
- If the download doesn't complete within 25 mins, I see the error: ah excessive hold: wsi 0xa1adf0
- If the connection fails during the download process, the downloader waits until the 25 mins are over for the timeout to trigger - at which point an lws_callback_reasons = LWS_CALLBACK_CLOSED_CLIENT_HTTP is handled
In this issue, it is mentioned that this timeout must be set to some large value (e.g., the author of that issue sets it to 1 hour). However, if the connection failed for some reason, the process waits for the 1 hour timeout to end before I am notified with LWS_CALLBACK_CLOSED_CLIENT_HTTP

Here is a summary table of my observations

Attempt	Description	Idle Timeout	Connection Interrupted?	Network Throttled?NetLimiterDownload Speed	Download Successful?	Notes
1	Download the file in gigabit network	5 mins			YES	Best case scenario - Fast, reliable network
2	Start download and disable network	5 mins	YES		NO	Timeout correctly at 5 mins
4	Limit the network download speed to 900 KB/s	5 mins		YES - 900 KB/s	YES	Download took 2.391𝑚𝑖𝑛 (143.46 secs)
5	Limit the network download speed to 700KB/s	5 mins		YES - 700 KB/s	YES	Download took 3.757𝑚𝑖𝑛 (225.43 secs)
6	Limit the network download speed to 500 KB/s	5 mins		YES - 500 KB/s	NO	ah excessive hold: wsi 0x1884df0
7	Limit the network download speed to 150 KB/s	25 mins		YES - 150 KB/s	YES	Download took 14.876𝑚𝑖𝑛 (892.56 secs)
8	Limit the network download speed to 100 KB/s	25 mins		YES - 100 KB/s	NO	ah excessive hold: wsi 0xa1adf0Timeout at 25.994𝑚𝑖𝑛

So, I need some guidance here. I am okay with setting this timeout to some large value (to correctly download even in very slow networks). That said, I would also like to be able to detect connection problems early without having to wait for the timeout to expire. Can you guide me regarding how I can achieve this with the library?

Thank you.

lws-team commented 3 years ago

There are a few different things here.

ah are relatively expensive to allocate, on a linux box they're something over 4KB by default. So we don't want them to be treated as immortal and get all our resources used up, or have them leak. So the immediate thing about "excessive" ah allocation time is a sanity check to detect and protect against that. If in your case an ah being allocated for (ie, an http transaction living for) 1h is reasonable, you can set the ah timeout accordingly, and end of that problem; all of the fails on "excessive hold" in your table will just work then.
ah used to be guarded more jealously before h2, held in a pool and connections made to wait to acquire a free one, etc. With h2, one connection may demand many ah immediately and we either have to come up with them right away, or fail the stream. So with h2, no choice but to be more relaxed about it and allocate them on demand. They are still protected from leaking.
this is not the connection timeout, this is a separate sanity check about protecting ah over and above the wsi timeout. You should set it to be bigger the longest time any http transaction should ever reasonably be allowed to live on your server under any conditions, 10 seconds, 10 hours, depends on your policy about it. Normally, you should never see it fire, but eg, slowloris type attacks will trigger it.

However, if the connection failed for some reason, the process waits for the 1 hour timeout to end before I am notified with LWS_CALLBACK_CLOSED_CLIENT_HTTP

wsi also have their own individual timeout. If you are serving a file, once it gets past the other steps to set up the transaction, lws should reset the timeout to what you set in info->timeout_secs every time it sends a chunk to the client, ie, it lets it live another 15s by default each time it sends something. If the client stops accepting things, it should time it out after info->timeout_secs for that reason, nothing to do with ah monitoring.

soderstrom-rikard commented 3 years ago

Hi @lws-team,

@p-ranav is a collegue of mine that is working on this issue from our side. But I am quite confused about what this answer here.

Isn't the allocated_headers (https://github.com/warmcat/libwebsockets/blob/v4.2-stable/lib/roles/http/private-lib-roles-http.h#L104 ) a library "concept"? As a user of the library, why should the user have to know about this depending on the network speed?

From what I can see, the reason that it becomes a "total lifetime" is that the ah->assigned is assigned during first connect and from that point on compared with the current time. https://github.com/warmcat/libwebsockets/blob/v4.2-stable/lib/roles/http/header.c#L592

    ah = pt->http.ah_list;
    while (ah) {
        int len;
        char buf[256];
        const unsigned char *c;

        if (!ah->in_use || !ah->wsi || !ah->assigned ||
            (ah->wsi->a.vhost &&
             (now - ah->assigned) <
             ah->wsi->a.vhost->timeout_secs_ah_idle + 360)) {
            ah = ah->next;
            continue;
        }

Finally ending up here https://github.com/warmcat/libwebsockets/blob/v4.2-stable/lib/roles/http/header.c#L619

        lwsl_notice("%s: ah excessive hold: wsi %p\n"
                "  peer address: %s\n"
                "  ah pos %lu\n", __func__, lws_wsi_tag(wsi),
                buf, (unsigned long)ah->pos);

When we have sucessfully recieved a chunk of data from the server, would you consider it bad practice or error-phrone to reset the ah->assigned to the current time?

From my perspective it would seem like the correct thing to do, since we know at that point that the connection is still valid and we do not want it to be interrupted. In fact, what is the rationale for libwebsockets not to do so?

Looking forward to hearing why I am wrong on this :)

Best regards, Rikard

lws-team commented 3 years ago

As a user of the library, why should the user have to know about this depending on the network speed?

... because stuff times out? Is that really such a novel concept?

When we have sucessfully recieved a chunk of data from the server, would you consider it bad practice or error-phrone to reset the ah->assigned to the current time? Looking forward to hearing why I am wrong on this :)

It's something you would think about if you did not read what I wrote in my reply properly, especially

If in your case an ah being allocated for (ie, an http transaction living for) 1h is reasonable, you can set the ah timeout accordingly wsi also have their own individual timeout.

Also you are using a version of lws that is several years old. If you want support please use a recent version.

soderstrom-rikard commented 3 years ago

@lws-team

For the sake of argument I've updated the links to the latest stable version libwebsockets in my previous comment (even thou relevant parts of the code is mostly identical). And yes, we will also update to the latest version, as you say it is way passed its bedtime ^^-

.. because stuff times out? Is that really such a novel concept?

Yes things timeout, timeouts itself is definitely not a novel concept. But I think we are misinterpreting each other here. We can probably at least agree that this is not clear as to why it exists and how users (such as myself) are supposed to use it. It definitely is not clear to me.

Lets make a hypothetical example of an infinite data stream that you receive over https. For example the video stream of a camera. Something that by definition have no timeout, except maybe infinity. Could libwebsockets handle that using the https protocol today?

I think sadly due to an implementation detail, the answer is no. So why is a total lifetime timeout of the allocated_headers enforced in all cases?

The opposite is of course also true, there are definitely situations where you would want such a timeout, but certainly not always.

It's something you would think about if you did not read what I wrote in my reply properly, especially

If in your case an ah being allocated for (ie, an http transaction living for) 1h is reasonable, you can set the ah timeout accordingly wsi also have their own individual timeout.

I assume that what you are saying is that we should set both these timeouts?

Best regards, Rikard

lws-team commented 3 years ago

In the case of ah timeout, it is not a fixed number but user-settable.

How long is reasonable for an http connection to hold an ah depends on your use-case. If it's different from the default, set it to what suits your use-case.

OP is saying "I set it to 25m, when my connections took longer than 25m, trouble!" Yes... so don't set it to 25m which is inside what you evidently consider a reasonable transaction time, figure out what is a reasonable limit to your use case and set it a bit beyond that, so it only closes connections that you also think are unreasonably long-lived. You are running a server, so you must defend it against immortal connections and resource exhaustion.

It is not that complicated it needs a tag team and the answer repeating.

I assume that what you are saying is that we should set both these timeouts?

What I said is in response to

However, if the connection failed for some reason, the process waits for the 1 hour timeout to end before I am notified with

There is another timeout dedicated to the wsi. If it is files you are serving, as I explained the approach is to keep setting the timeout a few seconds (you can also control this duration) ahead each time we send something. So stalled connections should not act as described, they should give up after 15s of not being able to send any more by default on main / v4.2 branch. It's broken on current stuff? Show me how to reproduce it and I will study it.

soderstrom-rikard commented 3 years ago

For a server I would agree.

In this use case we are the client not the server, we are not hosting the files we are retrieving the files. Therefor it seems very odd that we would ever want our download aborted due to it taking "too long".

lws-team commented 3 years ago

Well, I wrongly assumed it was serving.

But it's still the case you don't want to tie up heap beyond what is reasonable for what it is trying to do, for client, lws is used on resource-constrained targets like ESP32 where the tls tunnel costs 30%-40% of available heap.

Advice is the same, set it to a number in line with what is reasonable for your use-case, and update lws to see about the other wsi timeout problem.

soderstrom-rikard commented 3 years ago

That would depend on the use-case even for an ESP32.

The camera stream is again the perfect counter-example for this. It would cost more to allocate/de-allocate every X timeout than simply allocating once at startup and then keep going.

Now that I understand your point of view better, I could form a patch that adds a hash-define for enabling the behavior I've been referring to, while keeping the library's current behavior as its default.

For example something that starts like this (I would also remove variables and members that would become unused from this hash-define).


         if (!ah->in_use || !ah->wsi || !ah->assigned ||
            (ah->wsi->a.vhost &&
#if !defined(LWS_WITHOUT_TIMEOUT_AH_SECS)
             (now - ah->assigned) <
             ah->wsi->a.vhost->timeout_secs_ah_idle + 360)
#endif
             ) {
            ah = ah->next;
            continue;
        }

Whether or not you want that feature in your library is up to you.

Best regards, Rikard