ring-clojure / ring

Clojure HTTP server abstraction
MIT License
3.75k stars 519 forks source link

Implement Range header middleware with bytes unit as per RFC7233 #425

Closed patosai closed 3 years ago

patosai commented 3 years ago

Edit: I made a separate package for this:


Closes #423

This middleware looks at the Range header in the request.

atomist[bot] commented 3 years ago

Commit messages of this repository should follow the seven rules of a great Git commit message, as mentioned in the project's contributing guidelines. It looks like there's a few issues with the commit messages in this pull request:

atomist[bot] commented 3 years ago

Commit messages of this repository should follow the seven rules of a great Git commit message, as mentioned in the project's contributing guidelines. It looks like there's a few issues with the commit messages in this pull request:

weavejester commented 3 years ago

Thanks for the patch. Due to the size of the changes this is going to take some time to review, however from a cusorary glance, my guess is that by the time this is ready for inclusion into Ring, most - if not all - of the code will wind up being rewritten.

In the meantime if you want to use this sooner, it might be an idea to put it in a separate library, though if you choose to do this I'd be grateful if it was released under the MIT license, so that the code could potentially be included in Ring in future.

There are a few broad problems with the code as written. First, if we're adhering to a specification, the parsing needs to follow the RFC exactly. This typically means constructing a regular expression from the BNF syntax present in the RFC - if possible - and then writing a function to parse this into something meaningful. Your header-value->ranges function is partially there, but it needs to be more precise.

The next main problem is that you've written code for strings, sequences, files and inputstreams, but a Ring response body implements the StreamableResponseBody protocol, which can also include many other things. What we need is something that will turn a StreamableResponseBody into another StreamableResponseBody limited by a range. Possibly we also need to account for multipart response bodies in some way.

Rather than using a byte array, as that implies we can fit everything in memory, we should just read in bytes from one output stream, dropping the bytes not in the range, and send them to another output stream. For some particular classes, like File, String, and bytes, we can create more efficient implementations.

As a general style guide, functions should be much smaller than you've written them, and of course the patch should adhere to the contributing guidelines.

patosai commented 3 years ago

I have some extra time this weekend so I'll rework this.

weavejester commented 3 years ago

I'm not sure why you're using clojure.spec for this, but before you go any further, I should mention that this PR shouldn't add additional dependencies.

patosai commented 3 years ago

I'm not sure why you're using clojure.spec for this, but before you go any further, I should mention that this PR shouldn't add additional dependencies.

That's very unfortunate.

patosai commented 3 years ago

I'm just going to implement the parser using regexes then, I'm not really planning on implementing a BNF parser if that's cool?

weavejester commented 3 years ago

I'm just going to implement the parser using regexes then, I'm not really planning on implementing a BNF parser if that's cool?

Yes, that's what I said previously. You just need regular expressions that adhere to the BNF forms present in the RFC. You don't need a parser, as the syntax is simple enough to be parsed by a few regexes.

patosai commented 3 years ago

I managed to parse the header and the given byte positions using regexes, and also reworked the logic to use a StreamableResponseBody transformer which basically throws away the bytes it doesn't want.

For the case where a suffix byte range is given, like -350, it's unknown when the StreamableResponseBody will end so I have a buffer that's like a circular buffer that saves however many bytes the suffix needs on every write.

For the other case where the range is from a certain byte until the end, I store everything in a ByteArrayOutputStream and then get it out later.

The other normal cases with first/last positions use vectors which I add to during the writing.

The code right now is still pretty messy so I will need to clean it up a bit; thoughts appreciated.

weavejester commented 3 years ago

It's a little difficult to know quite where to begin in terms of reviewing this. As you say, the code is a little messy, so I'll wait to see how you factor out some of those giant functions into something a little more readable.

Instead, I'll focus on the general design. I assume this too is still unfinished, as you're still using a byte array for the body, but let's go over a quick overview of what's needed so we're both on the same page.

First, we parse the header into an intermediate data structure. We can do this with a few simple regular expressions.

Next, we normalize that data structure. We order start times, fix overlapping regions, and so forth. We should then have an ordered list of non-overlapping ranges.

Then we write a function that takes a StreamableResponseBody, our normalized ranges, a multipart separator string, and produces a new StreamableResponseBody that pipes its input and drops anything that isn't in range, delimiting using the the separator string if there is more than one range.

Finally, we write the function that puts it all together and turns a normal 200 response into a 206 ranged response.

Does that overview make sense to you?

atomist[bot] commented 3 years ago

Commit messages of this repository should follow the seven rules of a great Git commit message, as mentioned in the project's contributing guidelines. It looks like there's a few issues with the commit messages in this pull request:

atomist[bot] commented 3 years ago

Commit messages of this repository should follow the seven rules of a great Git commit message, as mentioned in the project's contributing guidelines. It looks like there's a few issues with the commit messages in this pull request:

patosai commented 3 years ago

I did a little bit of refactoring so it's clearer what's going on now.

I assume this too is still unfinished, as you're still using a byte array for the body, ...

Could you expand on what you mean by this? The best way I can see to put bytes together with the multipart boundary and Content-Type/Content-Range header is to have a string containing the boundary/headers/newlines, make a string out of the bytes by doing (String. bytes), and joining them.

Next, we normalize that data structure. We order start times, fix overlapping regions, and so forth. We should then have an ordered list of non-overlapping ranges.

From what I can tell, because we are using StreamableResponseBody, the entire length is not known, so it isn't possible to validate suffix byte ranges or open ended ranges (unless there actually is some way to get the length?). I try to validate everything else before actually parsing the body, then when the entire body is parsed, I go back and check if the body was actually long enough for suffix ranges or open ended ranges, and that nothing overlaps.

weavejester commented 3 years ago

The best way I can see to put bytes together with the multipart boundary and Content-Type/Content-Range header is to have a string containing the boundary/headers/newlines, make a string out of the bytes by doing (String. bytes), and joining them.

We can't guarantee that the body will fit in the available memory.

From what I can tell, because we are using StreamableResponseBody, the entire length is not known, so it isn't possible to validate suffix byte ranges or open ended ranges (unless there actually is some way to get the length?).

Yes, suffix ranges would be a special case, as we wouldn't know the starting index until the stream was fully written. We'd work out the largest suffix, then use a buffer of at least that size to pipe between streams.

patosai commented 3 years ago

It seems you have to read the initial body once then to calculate the length/validate the ranges, then set up the new stream?

weavejester commented 3 years ago

The body can only be read once. What do you mean by "validate the ranges"?

patosai commented 3 years ago

I'm not entirely understanding how you can ensure that none of the ranges overlap or that they all are valid without reading it over at least once.

For example, let's say you have a range header bytes=0-5,340-, but the body length is only 200 bytes long. If we set up a new StreamableResponseBody that reads the body until the end and streams the requested 0-5 bytes, it then realizes that it can't satisfy the 340- range. What happens then?

patosai commented 3 years ago

I reread the spec and found this on page 11 of https://tools.ietf.org/html/rfc7233

A server MUST NOT generate a multipart response to a request for a single range, since a client that does not request multiple parts might not support multipart responses. However, a server MAY generate a multipart/byteranges payload with only a single body part if multiple ranges were requested and only one range was found to be satisfiable or only one range remained after coalescing.

Also on page 5,

If the last-byte-pos value is absent, or if the value is greater than or equal to the current length of the representation data, the byte range is interpreted as the remainder of the representation (i.e., the server replaces the value of last-byte-pos with a value that is one less than the current length of the selected representation).

A client can request the last N bytes of the selected representation using a suffix-byte-range-spec.

 suffix-byte-range-spec = "-" suffix-length
 suffix-length = 1*DIGIT

If the selected representation is shorter than the specified suffix-length, the entire representation is used.

That means for byte-range-spec with the first byte/optional last byte, I just need to make sure that at least a byte exists at the first byte position.

For suffix-range-spec, I need to make sure at least 1 byte exists to satisfy it.

I think I have enough to complete the implementation.

weavejester commented 3 years ago

For example, let's say you have a range header bytes=0-5,340-, but the body length is only 200 bytes long. If we set up a new StreamableResponseBody that reads the body until the end and streams the requested 0-5 bytes, it then realizes that it can't satisfy the 340- range. What happens then?

You'd use a buffer that's at least 340 bytes in size. You'd probably need to buffer each multipart range individually as well, come to think of it, as you don't know which ranges are valid for an arbitrary input stream.

The buffer size for each range would be equal to the maximum of the range's size, or the largest suffix range. In addition, there's an issue with ranges that are delivered out of order. The RFC says:

the server SHOULD send the parts in the same order that the corresponding byte-range-spec appeared in the received Range header field

However, it also says:

Servers ought to ignore, coalesce, or reject egregious range requests... particularly when the ranges are requested out of order for no apparent reason.

So my suggestion here would be to ignore any ranges that are out of order by default for the response body. We can add exceptions for response body types that allow for random access, such as files, strings and byte arrays, which can be accessed out of order. These exceptions don't necessarily need to be added in this PR.

To guard against DOS attacks, adding a maximum number of ranges also seems like a good idea.

patosai commented 3 years ago

The buffer size for each range would be equal to the maximum of the range's size, or the largest suffix range.

If the header is something like 3-99999999 would we still want to allocate the buffer? I stated in my previous comment that I thought we could get away with streaming the body with minimal memory allocation, but I realized there's a problem with calculating the Content-Range header, because we actually have to read the body to get the enclosed byte range, but the headers are written before the body is read.

weavejester commented 3 years ago

If the header is something like 3-99999999 would we still want to allocate the buffer?

No. I think we can add configuration values for the maximum number of ranges, and the maximum size of a range, with conservative defaults. The specification says we can just ignore ranges we don't like. If there are no ranges that are suitable, we return a 416.

patosai commented 3 years ago

If the body length is unknown and there's a range like 340- which requests all bytes from offset 340 to the end, how do we handle that with a buffer of maximum size? Keep reading until the buffer max size is hit and return 416/exclude it from the body?

weavejester commented 3 years ago

Yes, we can probably just pipe the output into a ByteArrayOutputStream until we hit the size limit, then discard it.

atomist[bot] commented 3 years ago

Commit messages of this repository should follow the seven rules of a great Git commit message, as mentioned in the project's contributing guidelines. It looks like there's a few issues with the commit messages in this pull request:

patosai commented 3 years ago

I think it's ready for review again.

If the content length is available (String/File/byte-array/Content-Length response header), the ranges are sorted and a streaming body is used which uses minimal memory and reads through the body, dropping bytes that aren't in a range.

If it isn't available, a buffering body is used which reads the entire body upfront and stores the ranges of everything in memory, before finally writing everything to the output stream when it's time. There's a max number of bytes per range for buffering bodies only, and for both streaming and buffering there's a max number of ranges.

The one wrench is that the Content-Length has to be calculated beforehand the body is written, so I have a function for both streaming and buffering bodies that precalculates what the content length will be.

weavejester commented 3 years ago

I've taken a quick look through the code, and unfortunately it isn't quite what I had in mind.

The code follows a design that is suited for object-orientated languages like Java, but is unidiomatic in Clojure. At the same time, there's some wheel reinventing going on that existing Java I/O classes could handle better.

Let's take a step back. I think trying to handle streams of indeterminable length is perhaps a little ambitious for a first pass, and that's my mistake. Let's instead limit this problem to make it simpler, with the intent to adding more functionality later on, once we have the basics complete.

So we'll narrow the use-case to response bodies with known content lengths, with a request where ranges that are supplied in order. If either of those conditions is unmet, we'll just return a normal 200 OK response.

As an example, assume we have some ranges:

bytes=10-90,100-200,300-500,400-,-50

First we break this into a data structure, preserving order:

[[10 90] [100 200] [300 500] [400 nil] [nil 50]]

Next we find the content length from the request. Assume in this case it's 600. Given this we fill in the ranges and merge those that are within 80 bytes of each other, as suggested by the RFC:

[[10 200] [300 600]]

Once the ranges have been normalized, we can produce a StreamableResponseBody wrapper. We can use the PipedOutputStream and PipedInputStream classes to make this process simpler.

Some more general advice: don't confuse verbosity with readibility, and don't make functions overly long. You've split a regular expression up over 18 lines, when one or two might be clearer. You also have functions that are many lines long, which is often indicative that you're not factoring things correctly.

patosai commented 3 years ago

Thanks for your time and comments. The buffering section made the whole thing complex and I wanted to share code between the streaming and buffering versions.

A note on the regexes: I split them up because it would otherwise be cryptic. Here is the full regex in all its glory: #"bytes=(?:,[ \t]*)*(?:(?:(\d+)\-(?:(\d+))?)|(?:\-(\d+)))(?:[ \t]*,(?:[ \t]*(?:(?:(\d+)\-(?:(\d+))?)|(?:\-(\d+))))?)*"

I'm happy with the capabilities of this middleware and I don't plan on refactoring it anymore; I think it's mission accomplished. It doesn't look like it'll get merged so I made a separate package.

weavejester commented 3 years ago

I was more thinking that you could use three smaller regular expressions: match #"bytes=(.*)", then split on #"[ \t]*,[ \t]", then match #"(\d+)?-(\d+)?" per split element.

Please feel free to add your middleware to the Third Party Libraries page on the Ring wiki.

patosai commented 3 years ago

Yes that could be a way to do that, I wanted to follow the spec to the dot.

I've added the middleware to the Wiki.