tus / tus-resumable-upload-protocol

Open Protocol for Resumable File Uploads
https://tus.io
MIT License

Clarification on chunks and the checksum extension #143

Closed smatsson closed 5 years ago

smatsson commented 5 years ago

Hi!

I was wondering if I could get some clarification on how chunks and the checksum extension relate. When I first implemented the checksum extension in tusdotnet it seemed pretty straightforward, but after receiving multiple issues/PRs where people have different interpretations of how chunks and checksums relate, my gut feeling tells me that this is not the case. By "chunks" I'm referring to multiple small requests instead of a single large one ("The Client SHOULD send all the remaining bytes of an upload in a single PATCH request, but MAY also use multiple small requests successively for scenarios where this is desirable.")

My interpretation has always been that chunks are not important to the spec and are just a way to work around limitations of proxies, hardware, etc., e.g. when a client cannot read data directly from disk and wants to keep only a smaller part of the file in memory. Each chunk should still be streamed using the same protocol as a large request (i.e. get the offset with a HEAD request, continue sending data with a PATCH request).

So let's say we have a 100 MB file that we split into 1 MB chunks and we want to use the checksum extension to verify each chunk. The client would split the file into 100 chunks, calculate the checksum of each chunk and then start uploading.
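To make that concrete, here is a rough sketch of the client flow I have in mind (Python with requests; the upload URL, file name and chunk size are just placeholders, and error handling is left out):

```python
import base64
import hashlib

import requests

upload_url = "https://tus.example.com/files/abc123"  # placeholder upload URL
CHUNK_SIZE = 1024 * 1024  # roughly the 1 MB chunks from the example


def upload_chunk(data: bytes, offset: int) -> int:
    """PATCH a single chunk together with its checksum; return the new offset."""
    checksum = base64.b64encode(hashlib.sha1(data).digest()).decode()
    resp = requests.patch(
        upload_url,
        data=data,
        headers={
            "Tus-Resumable": "1.0.0",
            "Content-Type": "application/offset+octet-stream",
            "Upload-Offset": str(offset),
            "Upload-Checksum": f"sha1 {checksum}",
        },
    )
    resp.raise_for_status()  # a 460 Checksum Mismatch would surface here
    return int(resp.headers["Upload-Offset"])


with open("file.bin", "rb") as f:
    offset = 0
    while chunk := f.read(CHUNK_SIZE):
        offset = upload_chunk(chunk, offset)
```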

Client: Here is my first chunk with checksum X.
Server: Perfect, thanks! The chunk has been saved and the checksum is OK.
Client: Here is my second chu... [disconnected]
Server: OK, client disconnected. I saved 90% of the chunk.
Client: Hi, it's me again! What's the offset of my file?
Server: It's 1.9 MB.
Client: Hmm... yeah, that's the second chunk and 90% in. OK, got it! Here's the rest of the data + the checksum for chunk 2.
Server: Thanks! The chunk has been saved and the checksum is OK.
[continue until all chunks are done]

This seems like the right way to do it to me, as otherwise every client disconnect would discard the last chunk. And I have some backing in the spec for this: "A Client MAY include the Upload-Checksum header in a PATCH request. Once the entire request has been received, the Server MUST verify the uploaded chunk against the provided checksum using the specified algorithm."

My interpretation is that "entire request" means that the entire request could be read and that the client did not disconnect. This interpretation is contradicted by the following: "In the first two cases the uploaded chunk MUST be discarded, and the upload and its offset MUST NOT be updated.".

I find that the above quote also contradicts this quote: "The Server SHOULD always attempt to store as much of the received data as possible."

So... how are chunks supposed to work together with the checksum extension? Are there two different upload flows depending on whether you use chunks or not? If so, how do we determine which of the flows to use?

Acconut commented 5 years ago

With "chunks" I'm referring to multiple small requests instead of a single large one

Yes, in this context, I refer to a chunk as the body of a single PATCH request.

Client: Here is my second chu... [disconnected]
Server: OK, client disconnected. I saved 90% of the chunk.
Client: Hi, it's me again! What's the offset of my file?
Server: It's 1.9 MB.

That's not how it's intended. The offset after the disconnection would again be 1 MB. (I also don't like this idea since it forces the server to store details about previous PATCH requests, doesn't it?) Let me explain by commenting on the rest:

all client disconnects would remove the last chunk.

Yes, that's true, and it is one of the major drawbacks of the checksum extension.
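To make the resumption concrete: after such a disconnect the client asks for the offset again and re-sends the interrupted chunk from its boundary. A tiny sketch of that (Python with requests; the upload URL and file name are just placeholders):

```python
import requests

upload_url = "https://tus.example.com/files/abc123"  # placeholder upload URL
CHUNK_SIZE = 1024 * 1024  # the client's own chunk size

# Ask the server where to resume after the disconnect.
resp = requests.head(upload_url, headers={"Tus-Resumable": "1.0.0"})
offset = int(resp.headers["Upload-Offset"])  # 1 MB in the example above, not 1.9 MB

# The partially received, unverified chunk was discarded, so the offset sits at
# the end of the last verified chunk and the whole second chunk is re-sent.
with open("file.bin", "rb") as f:
    f.seek(offset)
    chunk = f.read(CHUNK_SIZE)
    # ...PATCH `chunk` again with Upload-Offset: offset and its Upload-Checksum
```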

My interpretation is that "entire request" means that the entire request could be read and that the client did not disconnect. This interpretation is contradicted by the following: "In the first two cases the uploaded chunk MUST be discarded, and the upload and its offset MUST NOT be updated.".

Yes, but how is this a contradiction? Could you explain your thoughts on this?

So... how are chunks supposed to work together with the checksum extension? Are there two different upload flows depending on whether you use chunks or not? If so, how do we determine which of the flows to use?

I think you could call it two different upload flows. There is one for uploading without the checksum where you store as much data as you can receive immediately. And there is one for uploading with the checksum where you first have to buffer the entire PATCH request before calculating the checksum and eventually storing the data. Deciding between these two flows is rather easy since you just have to look for the presence of the Upload-Checksum header in the PATCH request.
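Roughly sketched in Python (the handler signature and the stored list are made up purely for illustration, this is not tusdotnet's actual API):

```python
import base64
import hashlib
from typing import Dict, Iterator, List


def handle_patch(headers: Dict[str, str], body_chunks: Iterator[bytes],
                 stored: List[bytes]) -> int:
    """Return the HTTP status for a PATCH request (illustrative only)."""
    checksum_header = headers.get("Upload-Checksum")

    if checksum_header is None:
        # Flow 1: no checksum. Persist every piece as soon as it arrives, so
        # even a disconnect mid-request advances the offset as far as possible.
        for piece in body_chunks:
            stored.append(piece)
        return 204

    # Flow 2: checksum present. Buffer the whole request body first, verify,
    # and only then store. If the client disconnects, the buffered data is
    # thrown away and the offset stays where it was.
    body = b"".join(body_chunks)
    algo, _, expected = checksum_header.partition(" ")
    digest = base64.b64encode(hashlib.new(algo, body).digest()).decode()
    if digest != expected:
        return 460  # Checksum Mismatch: discard the chunk, do not move the offset
    stored.append(body)
    return 204
```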

So, to summarize:

smatsson commented 5 years ago

Hi @Acconut!

Thanks for taking time off from your Saturday to answer my questions. Greatly appreciated! :)

That's not how it's intended. The offset after the disconnection would again be 1 MB. (I also don't like this idea since it forces the server to store details about previous PATCH requests, doesn't it?) Let me explain by commenting on the rest:

Yes, it would require the server to keep state on the current chunk. This is what tusdotnet currently does, which is complicated and error-prone. I would say that just discarding the chunk on a client disconnect is way simpler and cleaner.

Yes, that's true, and it is one of the major drawbacks of the checksum extension.

Right, this was a misunderstanding of mine.

Yes, but how is this a contradiction? Could you explain your thoughts on this?

I think my confusion is based on how tusdotnet's internals work and my misunderstanding of how chunks work with the checksum extension. So it was a contradiction in my own understanding, basically.

So, to summarize:

Crystal clear! Thanks again.

I have noticed that some clients (like the one in the linked issue) use chunks and checksums as the default transfer mode, which is, IMO, a bit unfortunate as they won't get the full power of the tus protocol.

Acconut commented 5 years ago

No worries, you're welcome :)

I have noticed that some clients (like the one in the linked issue) use chunks and checksums as the default transfer mode, which is, IMO, a bit unfortunate as they won't get the full power of the tus protocol.

Are you talking about https://github.com/georgiosd/TusDotNetClient? Would it be possible for you to contact the maintainer about this topic? I am not a .NET user and it's a bit hard for me to understand the code base. I think that would help the community a lot.

smatsson commented 5 years ago

Yes exactly. It seems to be a fork of https://github.com/gerdus/tus-dotnet-client but that one hasn't been updated in more than a year. I'll contact georgiosd and see if he's open to a PR for supporting streaming uploads.

Acconut commented 5 years ago

Thank you very much :+1:

jurgyy commented 4 years ago

I know this is an old issue but I came here because I had the same interpretation as @smatsson. Is it possible to clarify the information over at https://tus.io/protocols/resumable-upload.html#checksum if it is still the case that the unchecked chunk should be discarded?

Acconut commented 4 years ago

Is it possible to clarify the information over at https://tus.io/protocols/resumable-upload.html#checksum if it is still the case that the unchecked chunk should be discarded?

Of course, do you have a proposal on what a good wording would be?