tus / tus-resumable-upload-protocol

Open Protocol for Resumable File Uploads
https://tus.io
MIT License

Why not ranged PATCH instead? #148

Closed anderspitman closed 5 years ago

anderspitman commented 5 years ago

Hi there, tus is a cool protocol that solves a real problem. However, it seems more complicated than necessary, at least for simple cases. Couldn't you achieve resumable uploads by implementing PATCH requests on the server, which use Content-Range to determine what part of the indicated file to replace? So you start with a PATCH to example.com/file bytes 0-999/10000, then PATCH example.com/file bytes 1000-1999/10000, etc. The client does one chunk at a time, and only sends the next chunk after the previous gets a 200 response. If the connection is interrupted, the client simply redoes the last request; no need to keep any state on the server other than the file itself.
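
To make the proposal concrete, here is a minimal sketch of how a client might compute the Content-Range value for each chunk. The function name `content_ranges` is only illustrative, and no real HTTP request is made.

```python
def content_ranges(total_size, chunk_size):
    """Yield (start, end, header) per chunk; end is inclusive, as in HTTP byte ranges."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, total_size, chunk_size):
        # The last chunk may be shorter than chunk_size.
        end = min(start + chunk_size, total_size) - 1
        yield start, end, f"bytes {start}-{end}/{total_size}"


# The 10000-byte file from the example above, sent in 1000-byte chunks.
for start, end, header in content_ranges(10000, 1000):
    print(header)
```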

Now I'm sure you've already thought through this, and I'm curious what I'm missing here.

Thanks!

Acconut commented 5 years ago

That topic came up a few times already. Basically, your proposal would have the following issues:

  1. The Content-Range header is meant for responses and not requests, so using it for PATCH requests would not be HTTP specification compliant.
  2. Sending multiple sequential HTTP requests has a significant overhead, reducing the throughput of the upload.
  3. What should the body size of the PATCH requests be? It must be small enough that redoing a request is not painful for the user experience (i.e. the PATCH request should not send 500MB at once). However, the body size should also be big enough to keep down the number of HTTP requests, which cost time (see point 2).

While your proposal may work in a controlled and known environment, it lacks the flexibility that allows tus to be deployed in nearly any situation.
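
The trade-off in points 2 and 3 can be illustrated with some rough arithmetic. This is only a sketch under a simplified model: the 100 ms per-request overhead is an assumed figure, not a measurement, and `chunking_cost` is a hypothetical helper.

```python
import math


def chunking_cost(file_size, chunk_size, per_request_overhead_s):
    """Rough cost model for fixed-size chunks; ignores bandwidth and pipelining."""
    requests = math.ceil(file_size / chunk_size)
    return {
        "requests": requests,
        # Time spent purely on per-request overhead, under the assumed figure.
        "overhead_s": requests * per_request_overhead_s,
        # With fixed chunks, an interrupted request is redone from its start.
        "worst_case_resend_bytes": min(chunk_size, file_size),
    }


MB = 1024 * 1024
# A 500MB file with small, medium, and single-request chunk sizes.
for chunk_mb in (1, 50, 500):
    print(chunk_mb, chunking_cost(500 * MB, chunk_mb * MB, 0.1))
```

Small chunks keep the worst-case resend low but multiply the request overhead; large chunks do the opposite.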

anderspitman commented 5 years ago

Thanks @Acconut. I did search through previous issues to see if this had been discussed. I found some related issues but nothing that exactly answered my question, so thanks for responding.

  1. From what I can tell, there seems to be a lot of conflicting information about which headers are valid with which requests. Do you have a reference for where the specs say Content-Range can't be used for PATCH? Is it ok to send it with PUT? That's what OneDrive uses for partial uploads. And even if the header isn't valid (and being standards-compliant is important for some reason), you could just use a custom header, or query params.

2/3. I think something on the order of 1-50MB would work quite well for the vast majority of uploads. It could be scaled based on the connection speed. But again that's left up to the client. If they want to risk losing a bigger chunk in the name of speed, go for it.

What flexibility are you referring to exactly? Do you have an example where the PATCH approach would fall down?

Acconut commented 5 years ago

> From what I can tell, there seems to be a lot of conflicting information about which headers are valid with which requests. Do you have a reference for where the specs say Content-Range can't be used for PATCH? Is it ok to send it with PUT?

I don't have a concrete reference, as it has been more than six years since we settled on the PATCH request system :) However, https://news.ycombinator.com/item?id=18512982 may help you a bit.

> you could just use a custom header,

That's basically what we are doing with the Upload-Offset header, isn't it? If I understand you correctly, your proposal is mostly about using a fixed PATCH request size vs determining the offset on-demand?

> I think something on the order of 1-50MB would work quite well for the vast majority of uploads. It could be scaled based on the connection speed. But again that's left up to the client. If they want to risk losing a bigger chunk in the name of speed, go for it.

In my experience, it is not that simple. For example, a mobile application can be used inside your office with fast WiFi or outside in rural areas with only mobile EDGE connectivity. The upload system should be able to resume in both situations. The client could in theory use connection statistics to figure out the best request size, I agree with you. However, I believe that building those heuristics is a lot more work and more error-prone than providing an upload server with flexible offsets as tus does. In some cases, e.g. in the browser, connection information is barely available and often unreliable.

anderspitman commented 5 years ago

> I don't have a concrete reference, as it has been more than six years since we settled on the PATCH request system :) However, https://news.ycombinator.com/item?id=18512982 may help you a bit.

Useful, thanks

> That's basically what we are doing with the Upload-Offset header, isn't it? If I understand you correctly, your proposal is mostly about using a fixed PATCH request size vs determining the offset on-demand?

Not quite. The idea is to allow the client to PATCH whatever size they want, with whatever offset they want (obviously servers may need to limit file sizes for security reasons). So if a client wants to attempt the entire 4GB upload in one shot, it can. If it fails, the client would do a HEAD, check Content-Length to see how many bytes actually got copied to the filesystem, then make another request.

> In my experience, it is not that simple. For example, a mobile application can be used inside your office with fast WiFi or outside in rural areas with only mobile EDGE connectivity. The upload system should be able to resume in both situations. The client could in theory use connection statistics to figure out the best request size, I agree with you. However, I believe that building those heuristics is a lot more work and more error-prone than providing an upload server with flexible offsets as tus does. In some cases, e.g. in the browser, connection information is barely available and often unreliable.

I don't think it needs to be that complicated. As mentioned above, just attempt a large upload, and if it fails check and see how much got copied.

Acconut commented 5 years ago

> The idea is to allow the client to PATCH whatever size they want, with whatever offset they want (obviously servers may need to limit file sizes for security reasons). So if a client wants to attempt the entire 4GB upload in one shot, it can. If it fails, the client would do a HEAD, check Content-Length to see how many bytes actually got copied to the filesystem, then make another request.

That is pretty much exactly how tus works. The client sends as much data as possible in a PATCH request. If it gets interrupted, it will issue a HEAD request to see what the current offset is (i.e. how much data the server received). After that, it can send a new PATCH request starting where the previous one was interrupted. If I understood you correctly, this is what you meant (except that tus does not allow "whatever offset [the clients] want"; the client asks the server what the offset is). Or am I missing something?
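
The flow described above can be sketched roughly as follows. This is an in-memory simulation, not a real tus client: `InMemoryUpload`, its `drop_after` knob, and `upload` are hypothetical names, and the `head`/`patch` methods only stand in for the HEAD and PATCH requests.

```python
class InMemoryUpload:
    """Stand-in for one upload resource on a server; no real HTTP involved."""

    def __init__(self, drop_after=None):
        self.data = bytearray()
        # Hypothetical knob: bytes accepted before the first request is cut off.
        self.drop_after = drop_after

    def head(self):
        # In tus, the HEAD response reports this value in the Upload-Offset header.
        return len(self.data)

    def patch(self, offset, body):
        # The offset sent by the client must match what the server already has.
        if offset != len(self.data):
            raise ValueError("409 Conflict: offset mismatch")
        if self.drop_after is not None:
            kept, self.drop_after = self.drop_after, None
            self.data += body[:kept]  # only part of the body arrived
            raise ConnectionError("connection interrupted")
        self.data += body
        return len(self.data)


def upload(resource, payload):
    """Send as much as possible per PATCH; after an interruption, ask for the offset."""
    attempts = 0
    while True:
        offset = resource.head()
        if offset == len(payload):
            return attempts
        attempts += 1
        try:
            resource.patch(offset, payload[offset:])
        except ConnectionError:
            continue  # the next loop iteration re-reads the offset


resource = InMemoryUpload(drop_after=300)
payload = bytes(range(256)) * 4
print(upload(resource, payload), bytes(resource.data) == payload)
```

Note that the client never picks a chunk size here: the amount that has to be resent is whatever the server reports as missing.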

anderspitman commented 5 years ago

I think we're almost on the same page. Maybe I misunderstood the tus spec when I glanced over it. My impression was that the server has to store some amount of session information beyond just returning the current size of the file on disk. Is that incorrect?

I could see that you might want to limit the client to patching only to the exact offset of the current size of the file, but that should just require a quick filesystem check, not any state in the server.

Acconut commented 5 years ago

> I think we're almost on the same page. Maybe I misunderstood the tus spec when I glanced over it. My impression was that the server has to store some amount of session information beyond just returning the current size of the file on disk.

That's not the case, the server does not need to store client-specific session information. It is enough to store the file on disk (or on an external cloud provider) and see what the current file size is to determine the upload's offset.
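
As a rough illustration of that point, a server-side handler could derive the offset purely from the file size on disk. This is a sketch under that assumption, not how any particular tus server is implemented; `current_offset` and `append_at` are hypothetical helpers.

```python
import os
import tempfile


def current_offset(path):
    """The upload's offset is simply the size of the file on disk (0 if absent)."""
    return os.path.getsize(path) if os.path.exists(path) else 0


def append_at(path, offset, chunk):
    """Append chunk only if the client's offset matches the file size."""
    actual = current_offset(path)
    if offset != actual:
        raise ValueError(f"offset {offset} does not match file size {actual}")
    with open(path, "ab") as f:
        f.write(chunk)
    return actual + len(chunk)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "upload.bin")
    print(append_at(path, 0, b"hello "))
    print(append_at(path, 6, b"world"))
    print(current_offset(path))
```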

anderspitman commented 5 years ago

Ahh ok I think I misunderstood the core protocol section. Since the example filename looks like a hash, I assumed it was some sort of session id equivalent. But apparently the creation part is optional, and if you already have a known URL for the file, you can just PATCH to it directly.

Looks like tus is append-only, right? My proposal would be more flexible, so you could use the same PATCH endpoint to also replace a chunk of a file somewhere other than the end, assuming the client has a way to figure out what needs to be replaced. But that's outside the scope of tus, which is specifically for resuming. Do I have it right?

Acconut commented 5 years ago

> Since the example filename looks like a hash, I assumed it was some sort of session id equivalent.

No, that's not a hash; it's just a randomly generated ID.

> Looks like tus is append-only, right?

That's true. We discussed allowing PATCH requests at custom offsets (in particular as an alternative to the Concatenation extension for parallel uploads) but rejected this idea, since it would make the implementations more complex without providing much benefit for the users.

anderspitman commented 5 years ago

Cool, thanks for the info!