jetty / jetty.project

Eclipse Jetty® - Web Container & Clients - supports HTTP/2, HTTP/1.1, HTTP/1.0, websocket, servlets, and more
https://eclipse.dev/jetty

Review min[Request|Response]DataRate behavior #4014

Open sbordet opened 5 years ago

sbordet commented 5 years ago

Changing the HttpConfiguration.min[Request|Response]DataRate value from 0 to non-zero causes many tests to fail, some of them unexpectedly.

The current behavior for reads differs between HTTP/1.1 and HTTP/2: HTTP/1.1 calls HttpInput.addContent() from read(...), while HTTP/2 calls it from parsing (and the read() may happen much later).

The current behavior for writes is that the content production rate must be greater than the min rate. An application that writes small chunks of content separated by pauses (e.g. GC, sleeps or DB queries) risks not producing content fast enough (Ws=write start, We=write end):

....Ws.We(10B)....(idle)....Ws.We(20B)
    |--------------------------|
    t0                         check=>(10+20)/(now-t0)

now-t0 may be large (and trip the min threshold) because the application was stalled, not because the writes were stalled (in fact, the writes were fast).

An alternative approach for writes is to just measure how long the writes take (from write begin to write end), and how many bytes they write:

....Ws.We(10B)....(idle)....Ws....(tcp stalled)....We(20B)
    |**|                    |**********************|
                                                   check=>(10+20)/(Σ(We-Ws))

This measures whether the writes themselves are stalled (the spans marked with asterisks), rather than content production, which may be better with respect to protecting the server from slow clients.
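For context, the limits under discussion are configured on HttpConfiguration; a minimal sketch of such a setup (a plain ServerConnector with the standard HttpConnectionFactory; the 1024 bytes/second values are arbitrary):

```java
import org.eclipse.jetty.server.HttpConfiguration;
import org.eclipse.jetty.server.HttpConnectionFactory;
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.ServerConnector;

public class MinDataRateExample
{
    public static void main(String[] args) throws Exception
    {
        Server server = new Server();

        HttpConfiguration httpConfig = new HttpConfiguration();
        // Require request content to arrive at >= 1024 bytes/second.
        httpConfig.setMinRequestDataRate(1024);
        // Require response content to be written at >= 1024 bytes/second.
        httpConfig.setMinResponseDataRate(1024);

        ServerConnector connector = new ServerConnector(server, new HttpConnectionFactory(httpConfig));
        connector.setPort(8080);
        server.addConnector(connector);

        server.start();
    }
}
```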

mattiaslundstrom commented 4 years ago

One additional problem with the read side of this implementation (i.e. the minRequestDataRate handling) is that the rate is only measured on the body content. This gives some protection against a slow POST DoS attack, but no protection against a classic Slowloris attack where the client sends the header data extremely slowly. It would be much better if the data rate could be measured for all network reads of the request, but that seems to require some significant rework.

gregw commented 4 years ago

It is possible for us to capture a timestamp of the very first select that relates to a request (i.e. as parsing is started). The HTTP parser could then check the data rate easily enough, as it is already tracking the header size to protect against large headers. @sbordet could h2 do the same?
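A rough sketch of that idea as a standalone helper (HeaderRateTracker is a made-up name, not Jetty's parser code): record the time of the first select/parse for the request and, as header bytes accumulate, verify that headerBytes/elapsed stays above the minimum:

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper: tracks header bytes against the timestamp of the
// first select that started parsing the request.
class HeaderRateTracker
{
    private final long minBytesPerSecond;
    private long firstSelectNanos = -1;
    private long headerBytes;

    HeaderRateTracker(long minBytesPerSecond)
    {
        this.minBytesPerSecond = minBytesPerSecond;
    }

    // Called whenever the parser consumes header bytes; returns false if the
    // headers are arriving below the minimum rate.
    boolean onHeaderBytes(int bytes)
    {
        long now = System.nanoTime();
        if (firstSelectNanos < 0)
            firstSelectNanos = now;
        headerBytes += bytes;

        long elapsedNanos = now - firstSelectNanos;
        // Ignore very short elapsed times, where the measured rate is too chaotic.
        if (elapsedNanos < TimeUnit.SECONDS.toNanos(1))
            return true;

        double rate = headerBytes / (elapsedNanos / 1e9);
        return rate >= minBytesPerSecond;
    }
}
```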

sbordet commented 4 years ago

@gregw an HTTP/2 client that wants to send headers very slowly would have to use a HEADERS frame + CONTINUATION frames. We should be able to record the time of the unfinished HEADERS frame and track the arrival of subsequent CONTINUATION frames.

sbordet commented 4 years ago

@mattiaslundstrom @gregw note also that a client sending the headers slowly is not a big problem, since we don't allocate a thread until we have all the headers. So a client sending the headers slowly will just cause NIO to wake up, read and parse the little it sent, store that as (part of) a header, and go idle threadless again. I don't think we should complicate things too much for slow headers as it's not a problem for the server.

The content, however, is a problem, because an application may read it with blocking APIs and keep threads blocked.

sbordet commented 4 years ago

For reads, the current behavior is that we record the first timestamp t0 when the first content arrives, and we sum the bytes of content that has arrived (C) in HttpInput.addContent(). Then we check the rate inside read(byte[], int, int).

In 9.4.x HTTP/1.1, content arrival is only triggered when we read:

....C(10B)....(idle)....Rs.C(20B).Re
    |-----------------------------|
    t0                            check=>(10+20)/(now-t0)

In 9.4.x HTTP/2, content arrives at any time and is queued:

....C(10B)..C(20B)....(idle)....Rs.Re
    |------------------------------|
    t0                             check=>(10+20)/(now-t0)

In Jetty 10.0.x the two behaviors will be coalesced into the 9.4.x HTTP/1.1 behavior (see #4318).

Therefore, like writes, reads suffer from the fact that the application idle time is taken into account, while it should not be.

An alternative approach is to record how long reads really take:

....C(10B)....(idle)....Rs.Re.Rs...(no content)....C(20B).Re
                        |**|  |***************************|
                                                          check=>(10+20)/(Σ(Re-Rs))

I'm thinking this new behavior should be implemented in Jetty 10.
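A hypothetical sketch of that alternative accounting (ReadRateTracker is a made-up name, not the actual HttpInput code): the clock only runs between read start (Rs) and read end (Re), and the rate is the total bytes divided by the accumulated read time:

```java
import java.util.concurrent.TimeUnit;

// Hypothetical illustration: only the time spent inside read() (blocked or
// waiting for a callback) counts towards the measured data rate.
class ReadRateTracker
{
    private final long minBytesPerSecond;
    private long totalBytes;
    private long totalReadNanos;
    private long readStartNanos = -1;

    ReadRateTracker(long minBytesPerSecond)
    {
        this.minBytesPerSecond = minBytesPerSecond;
    }

    // Rs: a read is starting.
    void onReadBegin()
    {
        readStartNanos = System.nanoTime();
    }

    // Re: the read returned bytesRead bytes; returns false if the minimum
    // data rate was not met over the accumulated read time.
    boolean onReadEnd(int bytesRead)
    {
        totalReadNanos += System.nanoTime() - readStartNanos;
        readStartNanos = -1;
        if (bytesRead > 0)
            totalBytes += bytesRead;

        // Don't check until a meaningful amount of read time has accumulated.
        if (totalReadNanos < TimeUnit.SECONDS.toNanos(1))
            return true;

        double rate = totalBytes / (totalReadNanos / 1e9);
        return rate >= minBytesPerSecond;
    }
}
```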

sbordet commented 4 years ago

@gregw @lorban your thoughts?

lorban commented 4 years ago

So you're basically suggesting to take both timestamps and check them in the HttpInput.read() method?

This sounds like the most straightforward and reasonable method to me.

gregw commented 4 years ago

I can see a problem with this new approach... ah, but it also affects our current approach. Ah, but my idea for a fix will solve both problems. Consider receiving chunked content like:

05;
Hello
fffff;
Lots of data ...........

If we block for some time waiting for this raw content, then the initial small chunk will be added and we'll read only 5 bytes, but the read could have taken a long time because the second chunk is really rather large. So if we check the data rate in either addContent or in the read, then both total_data/total_time and total_data/time_in_reads will give a false positive for the data rate limit!

The fix is to only check the data rate before we block (or schedule an async callback), i.e. if there is content to be consumed, we don't even look at the data rate. Only once we are about to block (or schedule an async callback) should we look at the data rate, and then total_data/total_time is the best way to do this because: a) it is the real data rate and thus easy to explain; b) if the application is slower than the network, then data will have arrived and we won't look at the data rate, so we won't suffer from the problem that this issue was created for.

sbordet commented 4 years ago

@gregw I don't see the problem.

the read could have taken a long time because the second chunk is really rather large

I don't think reads take time if there is content, no matter how large it is. If there is content, we copy it into the byte[] and return immediately. Even if the application passes a byte[1] while we have content available, filling that array takes a very short time, so the rate won't be tripped, even if the application does read() + sleep() + read() + sleep() + ....

Checking for the rate should be done in read() and I don't see how it could give a false positive, except for corner cases (e.g. a GC pause at exactly the right time with a super small application byte[]).

sbordet commented 4 years ago

Also, if we have gzipped content, what do we measure as data rate? The gzipped content, or the ungzipped content?

gregw commented 4 years ago

@sbordet Consider the following: the client sends a 1 byte chunk followed by a 1000 byte chunk, but both arrive at the server only after a 10 second network stall.

With the current algorithm, on the first read parseAndFill will add the 1 byte chunk and we divide 1 byte by the 10s block, so the data rate check fails. With your proposal, the first read still blocks for 10s and still only sees 1 byte, so the data rate check is again incorrectly failed.

With my proposal, we only check before we block. On the first read there is no content, so the check happens before blocking, when hardly any bytes or time have accumulated (and the initial read is protected from short times anyway). Once the two chunks have arrived, every subsequent read finds content available, so no check is made and there is no false positive.

gregw commented 4 years ago

With regard to gzip/transformations, I think we should check raw bytes. So I propose that we check the data rate of raw_bytes/total_time only before we block or schedule an async callback.

sbordet commented 4 years ago

@gregw I don't understand. If an application initiates a read where it blocks for 10s and after 10s it receives only 1 byte, that's exactly the case we want the min data rate to apply to, as it is evidently an attack.

gregw commented 4 years ago

@sbordet but it didn't receive 1 byte. The client sent 1001 bytes plus chunking framing. It is just an artefact of the transport framing, and of our implementation that does not aggregate chunks, that we see those 1001 bytes arriving in 10 seconds as 1 byte after 10 seconds followed by 1000 bytes instantly after that.

gregw commented 4 years ago

Also note that we need to protect the initial read from short times, as the measured rate will be very chaotic. We should probably also reset the total time if a 100 Continue is sent.

sbordet commented 4 years ago

@gregw let me see if I understand; this is what you're proposing:

....C(1B)....(idle)....Rs.Re....C(1000).Rs.Re.....Rs..(no content)..C(512).Re
    |------------------|----------------|---------|--------------------------
    t0                 v                v         v
                       no               no        check (1+1000)/(now-t0)
                       check            check     no data
                       data             data
                       present          present

We never check when we have data to return to the application. We only check if we are about to block (or schedule a callback), i.e. when there is no data. The check uses the total raw content and the total time, and only when the total time is > 1 s, to avoid tripping the min data rate when both the content and the elapsed time are small.

We may need to reset the time in case of Expect: 100-Continue when the application calls getInputStream() to signal that it wants content.

If the application goes into a long pause (e.g. a GC), when it comes back it will see data to read, and we don't do the check as long as there is data to read.

@gregw however, if the network buffers are small, the client may be blocked, and when the server resumes it may consume all the content before a TCP (or HTTP/2) message tells the client to send more content. So the server may consume the content quickly but then block, waiting a network round-trip before receiving more content. There we do the check and may trip it, not because the client does not send, but because the server has spent a lot of time not reading, causing network congestion that then requires a network round-trip to resolve.
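A hypothetical sketch of this check-before-block rule (BlockingRateCheck is a made-up name, not Jetty code): raw bytes and total time are tracked as content arrives, but the rate is only evaluated when a read is about to block (or schedule an async callback), and only once more than one second has elapsed:

```java
import java.util.concurrent.TimeUnit;

// Hypothetical illustration: track total raw bytes and total elapsed time,
// but evaluate the rate only when we are about to block (no content to
// return) and enough time has elapsed for the measurement to be meaningful.
class BlockingRateCheck
{
    private final long minBytesPerSecond;
    private long startNanos = -1;
    private long totalRawBytes;

    BlockingRateCheck(long minBytesPerSecond)
    {
        this.minBytesPerSecond = minBytesPerSecond;
    }

    // Called when raw content arrives from the network (before any un-gzipping).
    void onRawContent(int bytes)
    {
        if (startNanos < 0)
            startNanos = System.nanoTime();
        totalRawBytes += bytes;
    }

    // Called e.g. when a 100 Continue is sent: restart the measurement.
    void reset()
    {
        startNanos = -1;
        totalRawBytes = 0;
    }

    // Called only when a read is about to block or schedule an async callback;
    // returns false if the minimum data rate was not met.
    boolean checkBeforeBlocking()
    {
        if (startNanos < 0)
            return true;
        long elapsedNanos = System.nanoTime() - startNanos;
        // Don't trip the check for small content over a small time.
        if (elapsedNanos < TimeUnit.SECONDS.toNanos(1))
            return true;
        double rate = totalRawBytes / (elapsedNanos / 1e9);
        return rate >= minBytesPerSecond;
    }
}
```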

gregw commented 4 years ago

@sbordet That is indeed a problem, but I think it is a small one. Essentially, if the application is not reading fast enough, the pipe congests and it doesn't achieve its data rate target; then it hasn't achieved its data rate target. Now this could be because of an evil client... or simply because the app was slow, so the TCP/IP and/or h2 flow control windows were never grown (or were shrunk), leaving them too small to support the required data rate. But I'm not sure there is anything we can do to prevent that, because we can't tell the difference between an evil client and a small flow control window.

Something that helps is that if the application is slow on read N, then it will often also be slow on read N+1, so there will be time for more data to arrive.

sbordet commented 4 years ago

Essentially, if the application is not reading fast enough, the pipe congests and it doesn't achieve its data rate target; then it hasn't achieved its data rate target.

I still don't like this. It's a mechanism to protect the server against bad clients, and it doesn't feel correct to base the algorithm on what the server application does. We want to get rid of bad clients, not penalize weird/slow server applications.

Having said that, I'm not sure we can do better than what is proposed here.

Perhaps we can adjust for application idleness? I.e. rather than counting the time now-t0 we only count Σ(Re-Rs) and, like you proposed, we only check before blocking. With your example, the first read blocks for 10s, so before blocking we check: we have 0 bytes but also 0 read time. We then get the 1001 bytes in chunks [1, 1000], so we unblock and return 1 byte; we read again, there is data, so 1000 bytes are returned and we don't fail. Basically I'm proposing to reduce the time used to calculate the data rate to just the time spent in read (blocked or waiting for a callback), rather than clock time (which includes time when the application does not read).

gregw commented 4 years ago

I believe I convinced Simone on a hangout of the approach I'm suggesting (checking total_data/total_time only when about to block), but for completeness I'll respond to his comment above.

This mechanism is to protect against a bad client, but that doesn't mean we have to check before every read. The badness we are protecting against is a deliberately slow client blocking lots of application threads, thus the badness is the blocking and we only need to check if the client is bad before doing the blocking. Using total_data/total_time means that we measure the achieved data rate, which may be affected by a slow server, but if the client is faster than the server then the server will never block and the check will never be made.

The chance of a false positive arises only if the server is bursty: running slowly for a while, so that the effective data rate becomes very low, and then running fast enough to exhaust all the buffered data. If that does become a problem, then perhaps we need to check some kind of sliding average rather than the total, but I'm dubious it will be a significant problem (beyond the problems the server will be having anyway if it regularly runs slowly enough to cause this kind of behaviour).
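If a sliding average ever became necessary, one hypothetical shape for it (names and the window size are made up) is to accumulate bytes into one-second buckets and compute the rate over the last N seconds only, so a past slow period eventually stops affecting the result:

```java
// Hypothetical sliding-window rate: bytes go into one-second buckets and the
// rate is computed over the last WINDOW_SECONDS only.
class SlidingWindowRate
{
    private static final int WINDOW_SECONDS = 30;

    private final long[] buckets = new long[WINDOW_SECONDS];
    private long currentSecond = -1;

    void onBytes(long nowMillis, int bytes)
    {
        long second = nowMillis / 1000;
        if (second != currentSecond)
        {
            // Clear the buckets for the seconds that have gone by since the last update.
            for (long s = Math.max(currentSecond + 1, second - WINDOW_SECONDS + 1); s <= second; s++)
                buckets[(int)(s % WINDOW_SECONDS)] = 0;
            currentSecond = second;
        }
        buckets[(int)(second % WINDOW_SECONDS)] += bytes;
    }

    // Average bytes/second over the last WINDOW_SECONDS.
    double bytesPerSecond()
    {
        long total = 0;
        for (long bucket : buckets)
            total += bucket;
        return (double)total / WINDOW_SECONDS;
    }
}
```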

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has been a full year without activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has been a full year without activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 5 months ago

This issue has been automatically marked as stale because it has been a full year without activity. It will be closed if no further activity occurs. Thank you for your contributions.