tus / tus-resumable-upload-protocol

Open Protocol for Resumable File Uploads
https://tus.io
MIT License

Detecting file change on client vs server #139

Closed kornelski closed 4 years ago

kornelski commented 5 years ago

The checksum feature seems to verify the integrity of the newly uploaded body, but I'm not seeing verification of the combined file as a whole.

The problem:

  1. Client uploads a file partially
  2. User modifies the file
  3. Client tries to resume upload from the same file on disk

A naive implementation would end up with the first half of the first version of the file and the second half of the second version combined on the server, which is likely to be garbage.

I see two ways of improving this:

Acconut commented 5 years ago

The checksum feature seems to verify the integrity of the newly uploaded body

That's correct, the checksums are meant to detect transmission errors within a single PATCH request.
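To illustrate the scope of that per-request check, here is a minimal sketch of how a client might build the `Upload-Checksum` header value for one PATCH body, assuming the `<algorithm> <base64 digest>` form used by the tus checksum extension. The function name `upload_checksum_header` is made up for this example. Note that the digest covers only the bytes of this one request, not the file as a whole.

```python
import base64
import hashlib


def upload_checksum_header(chunk: bytes, algorithm: str = "sha1") -> str:
    """Return an Upload-Checksum value for a single PATCH body.

    Hypothetical helper: the digest covers only `chunk`, so it can detect
    corruption of this request but says nothing about earlier chunks.
    """
    digest = hashlib.new(algorithm, chunk).digest()
    return f"{algorithm} {base64.b64encode(digest).decode('ascii')}"


if __name__ == "__main__":
    # Each chunk gets its own independent checksum.
    print(upload_checksum_header(b"first chunk of the file"))
    print(upload_checksum_header(b"second chunk of the file"))
```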

The problem:

  1. Client uploads a file partially
  2. User modifies the file
  3. Client tries to resume upload from the same file on disk

This situation should be caught by the client without needing involvement from the server. The client is able to detect file modifications to a certain degree. For example, tus-js-client (and tus-java-client and tus-android-client) currently check whether the file size or the file modification date has changed between uploads. If so, they will not resume a previous upload, to ensure that no file chunks are mixed up. It's similar to the first solution you proposed, but the server does not need to be involved; the client should be able to maintain that state on its own.
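A minimal sketch of such a client-side check, assuming the client saves a small fingerprint next to the upload URL when the upload starts. The names `Fingerprint`, `fingerprint` and `can_resume` are made up for this example and are not taken from any of the tus clients.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Fingerprint:
    """State the client stores alongside the upload URL (hypothetical)."""
    size: int
    mtime_ns: int


def fingerprint(path: str) -> Fingerprint:
    """Record the file size and modification time as reported by the OS."""
    st = os.stat(path)
    return Fingerprint(size=st.st_size, mtime_ns=st.st_mtime_ns)


def can_resume(saved: Fingerprint, path: str) -> bool:
    """Resume only if size and modification time are both unchanged."""
    return fingerprint(path) == saved
```

This is a heuristic: an edit that preserves both the size and the modification time would go unnoticed, which is the "to a certain degree" caveat above.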

A better improvement would be to compute the full checksum for a file on the client side and then let the client verify the checksum whenever it resumes the upload. This would catch all file modifications, but it's not easily doable in a fast way in browsers and on mobile devices, so the costs are a bit too high for normal use cases. Personally, I have never seen modifications during an upload become a real problem.
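For reference, a full-file checksum could be computed along these lines. This is a sketch using SHA-256 as an arbitrary choice, and `file_sha256` is a made-up helper name. It reads the file in fixed-size chunks, so memory use stays bounded, but the whole file still has to be read once, which is the cost mentioned above.

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex SHA-256 digest of the whole file at `path`.

    Reads `chunk_size` bytes at a time so large files are not loaded
    into memory in one piece.
    """
    hasher = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns b"" at EOF.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```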

kornelski commented 5 years ago

That's why I wrote a "naive implementation". Client-side prevention requires storing state on the client. With the help of the server, clients could be stateless (i.e. always try to resume everything, and succeed whenever possible), and the server is by necessity already stateful, so it seems to be the better place to store state.

Acconut commented 5 years ago

Clients are usually not stateless if you want to resume an upload. In that case you have to carry some knowledge about the upload you want to resume from one upload session to the next. This is normally kept to the minimum of an upload URL pointing to your resource on the server, and is needed so that the client knows where to resume the upload. I agree that one should always be careful when adding more state to a component, but clients usually already have to deal with that.

kornelski commented 5 years ago

Bummer. I hoped I could use it to have stateless clients and treat resumption as an optimization for retries of idempotent operations (i.e. you can treat every upload/every retry as a new, separate operation, just some will be faster).

Acconut commented 5 years ago

I can understand that hope but I think it's not related to tus clients but is a general problem. If you want to resume a previous action you need some information about the previous state. Even if that information is just a URL to a remote server where you can get more details, as in tus' situation.

However, regarding:

The client could always send checksum of the full file it sends (not the body of the request, which may be partial), so that the server can immediately reject resumption of a wrong file.

You can currently do the following, which is a bit similar: when the client creates an upload, it calculates the checksum of the entire file and sends it as metadata to the server. The server will never validate this checksum, but when the client wants to resume the upload, it can send a HEAD request to get the stored checksum back from the metadata. The client can then compare it against the checksum of the local file and decide whether to resume or start a new upload.
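The workaround above could look roughly like this on the client. The sketch assumes the `Upload-Metadata` format of comma-separated `key base64(value)` pairs and leaves out the HTTP calls. The metadata key `sha256` and the helper names are made up for this example; the protocol does not assign any meaning to that key.

```python
import base64
from typing import Dict

CHECKSUM_KEY = "sha256"  # hypothetical key chosen by the client


def encode_metadata(metadata: Dict[str, str]) -> str:
    """Build an Upload-Metadata header value to send at upload creation."""
    pairs = []
    for key, value in metadata.items():
        encoded = base64.b64encode(value.encode("utf-8")).decode("ascii")
        pairs.append(f"{key} {encoded}")
    return ",".join(pairs)


def decode_metadata(header: str) -> Dict[str, str]:
    """Parse the Upload-Metadata value returned by a HEAD request."""
    result = {}
    for pair in header.split(","):
        pair = pair.strip()
        if not pair:
            continue
        key, _, encoded = pair.partition(" ")
        result[key] = base64.b64decode(encoded).decode("utf-8")
    return result


def should_resume(head_metadata: str, local_checksum: str) -> bool:
    """Resume only if the stored checksum matches the current local file."""
    stored = decode_metadata(head_metadata).get(CHECKSUM_KEY)
    return stored is not None and stored == local_checksum
```

The client still needs the upload URL to issue the HEAD request, so this removes the need to store a fingerprint locally but not the need to store anything at all.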

Acconut commented 4 years ago

Closing this issue due to inactivity. Feel free to leave a comment if you want to continue the discussion :)