tus / tus-resumable-upload-protocol

Open Protocol for Resumable File Uploads

https://tus.io

MIT License

1.48k stars 103 forks source link

Client Tag Extension #157

Closed felix-schwarz closed 3 years ago

felix-schwarz commented 4 years ago

The Client Identifier Extension to the tus protocol proposed here allows clients to resume an upload even if they did not receive the response to the creation or creation-with-upload request with which they initiated the upload.

The mechanism for this is relatively simple:

with the Client Identifier Extension, clients can generate a Client Identifier (typically a UUID) and send it alongside the initial creation or creation-with-upload request.
if the connection is interrupted before the client receives the response with the URL of the created resource, the client can use the Client Identifier as a "backup" to retrieve the URL of the created resource.

Mobile client apps benefit of this extension in various ways:

by taking advantage of creation-with-upload, complete uploads can be handed off to OS-managed background HTTP request queues (that continue to run when the apps themselves are already paused / suspended) for execution as soon as network connectivity is next available - without anxiety regarding the reliability of the connection
if a creation-with-upload request doesn't complete, the Client Identifier Extension can be used to retrieve all information necessary to resume the upload right from the point where it was interrupted - not wasting a single byte (of expensive cellular data) of bandwidth to retransmit data that has already been transmitted. The upload will finish eventually - even on a very unreliable connection. And on a reliable connection, there's the opportunity for the upload to finish with maximum efficiency - with a single request.
without the Client Identifier Extension, the only reliable way for mobile apps to ensure a piece of upload data is sent only once would be to limit itself to just the creation extension. Using that extension, however, generates a usage pattern of an initial (creation) request and an immediate (data upload) follow-on request - a usage pattern that is detected and penalized (with long delays) for background HTTP queues, by at least iOS.

In my proposal, I tried to match the language and style of the existing specification as closely as possible.

Feedback and questions welcome! 🙂

smatsson commented 4 years ago

I think this would be a great addition to the protocol. Nicely done! There has been similar discussions in tusdotnet (e.g. https://github.com/tusdotnet/tusdotnet/issues/71) where the idea was for the server to run custom code to create the file id based on something the server knows (e.g. database row id) or metadata provided by the client. In the latter the client would then be able to reassemble the file id without receiving the location header from the server. This would also allow client's to re-use some existing id on their end and just provide that to the server (e.g. object id from a document database, row id from a rdbms etc) without needing to save the location header on their end.

I think your approach is better given that it would be a formal specification and nothing specific to tusdotnet. I can see some huge benefits of this extension, especially with legacy systems, where one does not need to change database schemas etc to keep track of the location header but can instead rely on some ID that is already known to the system.

Some thoughts:

Is Upload-Client-Identifier clear enough? My initial thought was that it's a client identifier (think client_id in OAuth) and not an identifier for the file being uploaded.

The Upload-Client-Identifier request header MUST be a globally unique identifier created by the Client to avoid collissions. It is recommmended that Clients generate and use a version 4 UUID for this purpose.

I don't think this handles all use cases. In tusdotnet one splits file storage into different areas, typically one per user. This would prevent prevent user A and user B from using the same identifier. This is probably also true for other implementations. We might want to re-write this section to something more generic? IMO the client should be able to use whatever identifier they find reasonable as long as it's unique in the storage area used.

If an Upload-Client-Identifier is already used for another in-progress upload, the Server MUST respond to the POST request initiating the upload with a 409 Conflict status.

I find the term "in-progress" to be confusing here. I'm guessing you mean that any existing file location cannot re-use the client id? E.g. if there is a location /files/1234 with Upload-Client-Identifier "A", then no other location can use that Upload-Client Identifier as long as /files/1234 does not return 404 (i.e. the file has been removed)?

if a creation-with-upload request doesn't complete, the Client Identifier Extension can be used to retrieve all information necessary to resume the upload right from the point where it was interrupted - not wasting a single byte (of expensive cellular data) of bandwidth to retransmit data that has already been transmitted. The upload will finish eventually - even on a very unreliable connection. And on a reliable connection, there's the opportunity for the upload to finish with maximum efficiency - with a single request.

I was a bit sceptical regarding the creation-with-upload extension when I first read the spec as I don't see the immediate benefit. Not a lot of data can be transmitted in a safe way without the risk of losing the location header. In combination with this extension I think that creation-with-upload will be much, much better.

felix-schwarz commented 4 years ago

Thanks @smatsson, I'm happy you like it! :-)

Is Upload-Client-Identifier clear enough? My initial thought was that it's a client identifier (think client_id in OAuth) and not an identifier for the file being uploaded.

I had similar thoughts, but couldn't come up with a better term at the time. How about Upload-Client-Resource-Identifier instead (and renaming the whole extension to Client Resource Identifier)?

The Upload-Client-Identifier request header MUST be a globally unique identifier created by the Client to avoid collissions. It is recommmended that Clients generate and use a version 4 UUID for this purpose.

I don't think this handles all use cases. In tusdotnet one splits file storage into different areas, typically one per user. This would prevent prevent user A and user B from using the same identifier. This is probably also true for other implementations. We might want to re-write this section to something more generic? IMO the client should be able to use whatever identifier they find reasonable as long as it's unique in the storage area used.

I agree and would change that paragraph to:

The Upload-Client-Resource-Identifier request header MUST be an identifier created by the Client that's unique within its storage space, to avoid collisions.

Clients can satisfy this requirement by generating and using a version 4 UUID.

What do you think?

If an Upload-Client-Identifier is already used for another in-progress upload, the Server MUST respond to the POST request initiating the upload with a 409 Conflict status.

I find the term "in-progress" to be confusing here. I'm guessing you mean that any existing file location cannot re-use the client id? E.g. if there is a location /files/1234 with Upload-Client-Identifier "A", then no other location can use that Upload-Client Identifier as long as /files/1234 does not return 404 (i.e. the file has been removed)?

What I mean by that is that if an upload is ongoing (so, has not yet finished) that has been started with Upload-Client-Identifier "A", it should not be possible for clients to initiate other uploads using the same identifier - until either the first upload has completed or expired.

An ID could theoretically be reused after the previous upload that used it has either completed or expired - irrespective of what the user did with the file after that. Reusing an ID should be avoided, though. I put this here mainly to have a formal path to resolution for the highly unlikely event (in case UUIDs are used at least) of two clients coming up with the same ID.

I was a bit sceptical regarding the creation-with-upload extension when I first read the spec as I don't see the immediate benefit. Not a lot of data can be transmitted in a safe way without the risk of losing the location header. In combination with this extension I think that creation-with-upload will be much, much better.

Thanks! That's exactly what I intend to achieve :-)

smatsson commented 4 years ago

I had similar thoughts, but couldn't come up with a better term at the time. How about Upload-Client-Resource-Identifier instead (and renaming the whole extension to Client Resource Identifier)?

Hmm yes maybe. Or simply Upload-Resource-Identifier? Upload-File-Identifier? Upload-Client-Reference? As a philosophical input, all the request headers are "Client" (e.g Upload-Metadata, not Upload-Client-Metadata). Not convinced by my own argument here either though as naming is hard 🤔

I agree and would change that paragraph to:

The Upload-Client-Resource-Identifier request header MUST be an identifier created by the Client that's unique within its storage space, to avoid collisions.

Clients can satisfy this requirement by generating and using a version 4 UUID.

What do you think?

Agreed, much better! :)

What I mean by that is that if an upload is ongoing (so, has not yet finished) that has been started with Upload-Client-Identifier "A", it should not be possible for clients to initiate other uploads using the same identifier - until either the first upload has completed or expired.

An ID could theoretically be reused after the previous upload that used it has either completed or expired - irrespective of what the user did with the file after that. Reusing an ID should be avoided, though. I put this here mainly to have a formal path to resolution for the highly unlikely event (in case UUIDs are used at least) of two clients coming up with the same ID.

From experience, it's a bad idea to re-use IDs as this usually ends up with bugs that are hard to track, prevent and fix. Allowing re-use of IDs once the upload is completed would also break all implementations that currently allow you to download files (e.g. tusdotnet, tusd etc). IMO it would be a much more robust and clean solution to just save the client identifier with the file when it is created and then not allow it to be re-used unless the file is deleted (either by the server or the client using the termination extension).

The risk of a UUID collision is very, very, very small (1 in 50 billion according to https://www.quora.com/Has-there-ever-been-a-UUID-collision) and taking into account that the collision needs to happen inside the exact same storage area I don't think we even need to consider this. The worst thing that could happen if this happens (it won't ;) ) is that you will get a 409 Conflict on your POST to create the file, then create a new UUID and try again. I think this drawback is acceptable if it gives us the ability to use external ids (such as database row ids or similar) without changing existing systems.

@Acconut any input on this PR?

felix-schwarz commented 4 years ago

Hmm yes maybe. Or simply Upload-Resource-Identifier? Upload-File-Identifier? Upload-Client-Reference? As a philosophical input, all the request headers are "Client" (e.g Upload-Metadata, not Upload-Client-Metadata). Not convinced by my own argument here either though as naming is hard 🤔

Yup, naming is hard. 😉

I had been thinking about something that is not a Client Identifier and ended up with Client Reference, too. Since the two of us both ended up considering Upload-Client-Reference independently, it feels intuitively correct and would therefore likely make a good alternative name choice for the headers and extension. FWIW it captures the intention and can't easily be confused with OAuth2-style Client IDs.

I agree and would change that paragraph to: The Upload-Client-Resource-Identifier request header MUST be an identifier created by the Client that's unique within its storage space, to avoid collisions. Clients can satisfy this requirement by generating and using a version 4 UUID. What do you think?

Agreed, much better! :)

Happy you like it!

From experience, it's a bad idea to re-use IDs as this usually ends up with bugs that are hard to track, prevent and fix. Allowing re-use of IDs once the upload is completed would also break all implementations that currently allow you to download files (e.g. tusdotnet, tusd etc). IMO it would be a much more robust and clean solution to just save the client identifier with the file when it is created and then not allow it to be re-used unless the file is deleted (either by the server or the client using the termination extension).

I completely agree here that IDs should not, under any circumstance, be re-used.

But I'm critical of the requirement of the storage solution implementing this tus extension to also have to store the ID alongside the uploaded file and prevent any further uploads with the same ID for the lifetime of that file.

It definitely makes sense, of course, and from the perspective of clients, I'm totally rooting for it. But it might be hard to impossible to implement for huge, distributed storage spaces:

performance could suffer a great deal if, before accepting an upload, the upload server needed to reach out to all nodes and have them perform scans of all file table rows to check for an existing file with the same ID
consistency is hard to achieve, too, as, by the time the last node reports back, another node could already have accepted a file with that ID

When using version 4 UUIDs, it is reasonable not to expect collisions, so I wouldn't make it a requirement, just a recommendation that a 409 Conflict response also be returned in case an existing file on the server was uploaded with the same ID.

The risk of a UUID collision is very, very, very small (1 in 50 billion according to https://www.quora.com/Has-there-ever-been-a-UUID-collision) and taking into account that the collision needs to happen inside the exact same storage area I don't think we even need to consider this. The worst thing that could happen if this happens (it won't ;) ) is that you will get a 409 Conflict on your POST to create the file, then create a new UUID and try again. I think this drawback is acceptable if it gives us the ability to use external ids (such as database row ids or similar) without changing existing systems.

Absolutely. My motivation for including the 409 Conflict response in the first place, btw, was to cover the scenario where something like CVE-2008-0166 would happen to UUID generation code. Again, highly unlikely, but a possibility nonetheless.

smatsson commented 4 years ago

I had been thinking about something that is not a Client Identifier and ended up with Client Reference, too. Since the two of us both ended up considering Upload-Client-Reference independently, it feels intuitively correct and would therefore likely make a good alternative name choice for the headers and extension. FWIW it captures the intention and can't easily be confused with OAuth2-style Client IDs.

👍

I completely agree here that IDs should not, under any circumstance, be re-used.

But I'm critical of the requirement of the storage solution implementing this tus extension to also have to store the ID alongside the uploaded file and prevent any further uploads with the same ID for the lifetime of that file.

I think we are using two different approaches here which is great when doing this kind of work :) I absolutely get where you're coming from here. Distributed systems are indeed hard and all of your arguments are valid. I might be missing something but if we don't store the client reference forever, wouldn't the server still need to keep track of the ID to check that there are no ongoing uploads? Also wouldn't one use some kind of partitioning for this record so that it can be easily obtained?

I think that this extension would really shine if it would allow the client to not care about the server's id (basically the location header) at all and just use their own id (file name, database id, whatever). As an example, let's pretend that we are building a Dropbox clone. As a user I have access to my own storage area and no other user's files. In this case I want to use the path of the file on my system to reference the file. I don't care what ID the server has of that file. While the file still exists in my account I must not be allowed to create a new file with the same path. Does this make sense? If the storage area is bigger (e.g. a whole organization or the same for all users) I would obviously need to use something that is way more unique (e.g. UUID).

Maybe we should add a paragraph that the server is free to reject client references due to security/performance/whatever? In your use case, anything but UUIDs could be rejected due to performance reasons and in my use case the server could allow pretty much anything. As a real world example; tusdotnet by default uses writes files on disk. It would probably be a very bad idea to allow client reference that contain traversing paths (e.g. ../../somesecret.lib) as this would open up for attacks. If we allow the server to reject client reference patterns this could be avoided.

Not saving the client reference for the life time of the file would also mean that no lookups can be done once the file is completed so if I want to download the file I still need to keep the original location header, which is something I hope that we can avoid with this extension.

Absolutely. My motivation for including the 409 Conflict response in the first place, btw, was to cover the scenario where something like CVE-2008-0166 would happen to UUID generation code. Again, highly unlikely, but a possibility nonetheless.

I think it's absolutely crucial to have this check during creation to prevent overwriting files by accident. Even with UUIDs there is nothing that guarantees that the client generated a new UUID instead of just re-using an existing one they found somewhere. Bugs do happen =)

felix-schwarz commented 4 years ago

I might be missing something but if we don't store the client reference forever, wouldn't the server still need to keep track of the ID to check that there are no ongoing uploads?

Absolutely, it would still need to keep track of IDs still in use. But that implementation can be self-contained and relatively simple… f.ex. a low-tech implementation of Client Reference Support could: 1) create a file using the Client Reference (or, better, a SHA-256 checksum of it) as the name, containing (f.ex.) the actual URL to the created Resource 2) use these files to easily map the Client Reference to the Resource URL, detect re-use of IDs while another download with the same ID is unfinished, etc. 3) remove the file when the upload is complete

In comparison, requiring servers to be able to resolve Client References to completed uploads would move beyond the boundaries of the upload code and reach into the core of the service. Another complication could be that - at least as far as I've understood the tus protocol so far (please correct me if I'm wrong) - the URLs of the Resources created by uploads are allowed to become invalid once the upload has completed or expired. And this would kind of change that, too.

Also wouldn't one use some kind of partitioning for this record so that it can be easily obtained?

Yeah, absolutely. The main criteria for determining the partition of an item would likely be something different (paths - or checksums thereof, to get something more evenly distributed) than the Upload Reference, though.

Maybe we should add a paragraph that the server is free to reject client references due to security/performance/whatever? In your use case, anything but UUIDs could be rejected due to performance reasons and in my use case the server could allow pretty much anything. As a real world example; tusdotnet by default uses writes files on disk. It would probably be a very bad idea to allow client reference that contain traversing paths (e.g. ../../somesecret.lib) as this would open up for attacks. If we allow the server to reject client reference patterns this could be avoided.

I think this is a really good idea. The server could respond with a 420 Policy Not Fulfilled response if a Reference doesn't meet its requirements. What do you think?

Not saving the client reference for the life time of the file would also mean that no lookups can be done once the file is completed so if I want to download the file I still need to keep the original location header, which is something I hope that we can avoid with this extension.

I can totally see the value of persisting the Client Reference with an uploaded file - and having the ability to use it for lookups at any later point, for the lifetime of the uploaded file.

However, I believe this is outside the scope of this extension, and feel that such a feature deserves an extension of its own (much like create-with-upload builds on top of create)… maybe as client-reference-persistence.

Factoring this feature out into its own extension would have the benefit that simple & self-contained implementations like the one outlined above remain sufficient to support the extension - while keeping the door open for permanent usage of the Client Reference for servers that can support it.

smatsson commented 4 years ago

Absolutely, it would still need to keep track of IDs still in use. But that implementation can be self-contained and relatively simple… f.ex. a low-tech implementation of Client Reference Support could:

create a file using the Client Reference (or, better, a SHA-256 checksum of it) as the name, containing (f.ex.) the actual URL to the created Resource use these files to easily map the Client Reference to the Resource URL, detect re-use of IDs while another download with the same ID is unfinished, etc. remove the file when the upload is complete In comparison, requiring servers to be able to resolve Client References to completed uploads would move beyond the boundaries of the upload code and reach into the core of the service. Another complication could be that - at least as far as I've understood the tus protocol so far (please correct me if I'm wrong) - the URLs of the Resources created by uploads are allowed to become invalid once the upload has completed or expired. And this would kind of change that, too.

I think we are saying the same thing but with different semantics 😊 I think it's easier to visualize what I'm trying to say by thinking about the part handling the upload/tus protocol and the part handling any processing/core system as two different "applications". The tus application would only handle the upload and once the file is completely uploaded it would pass it to the processing application. The client identifier is basically just a lookup that exist inside the tus application to find a location of a resource (i.e. the location header from the create extension). Once the file is uploaded and handed over to the processing app the file no longer exists in the tus application and thus the client reference can be re-used. Doing a HEAD request for the resource in the tus application would now result in a 404 (or 403 or 410 depending on how the server is implemented)*. The same applies if the file has expired or is deleted (either by the client or by the developer running the server). If you check the readme for tusdotnet it might be a bit clearer what I mean with different "applications" (even though they run in the same app): https://github.com/tusdotnet/tusdotnet#configure - Basically tusdotnet runs as a middleware and once the upload is completed it is handed of to the dev's custom code for processing. Since the file is never deleted from tusdotnet in the example, the client reference would still be locked for re-use. If the developer wants to release the client reference they could delete the file from tusdotnet inside the callback (this happens a lot in apps running tusdotnet once the file has been moved to some other system). This is in my understanding the same thing you are saying with the checksum of the file. Does this make sense?

*Techincally I suppose that the processing app could have the same location for the file but it's not a tus upload anymore, or at least it is out of scope for the tus protocol.

I think this is a really good idea. The server could respond with a 420 Policy Not Fulfilled response if a Reference doesn't meet its requirements. What do you think?

Do we want to use a custom http status here or should we go with something more generic? It seems that some companies (e.g. Twitter) use the 420 status code to indicate rate limiting. 400 Bad Request is already mentioned in the creation spec for when the Upload-Defer-Length header contains an invalid value. Maybe we can use the same here? What do you think?

However, I believe this is outside the scope of this extension, and feel that such a feature deserves an extension of its own (much like create-with-upload builds on top of create)… maybe as client-reference-persistence.

Kind of a side note as I think that we are talking about the same thing, but I'm really not a fan of having multiple small extensions that change the behavior of other extensions as it makes the protocol harder to use and the code much more complex to maintain. creation, creation-with-upload, termination etc all add functionality to the protocol while client-reference-persistence would change the behavior of the client-reference extension.

felix-schwarz commented 4 years ago

I think we are saying the same thing but with different semantics 😊

Thanks for the visualization as apps. This absolutely makes sense and I think we're on the same page there. 🙂

I think this is a really good idea. The server could respond with a 420 Policy Not Fulfilled response if a Reference doesn't meet its requirements. What do you think?

Do we want to use a custom http status here or should we go with something more generic? It seems that some companies (e.g. Twitter) use the 420 status code to indicate rate limiting. 400 Bad Request is already mentioned in the creation spec for when the Upload-Defer-Length header contains an invalid value. Maybe we can use the same here? What do you think?

Absolutely makes sense. I think 400 Bad Request is indeed a better fit here.

Kind of a side note as I think that we are talking about the same thing, but I'm really not a fan of having multiple small extensions that change the behavior of other extensions as it makes the protocol harder to use and the code much more complex to maintain. creation, creation-with-upload, termination etc all add functionality to the protocol while client-reference-persistence would change the behavior of the client-reference extension.

Thanks for highlighting this. It motivated me to look for a more elegant/powerful solution and I hope that with a new, optional Tus-Client-Reference-Validity header, this is covered in a better (and also more future-proof) way.

I've updated my PR with

the changes to terminology from Identifier to Reference
clarification around and protocol support for different Reference formats (and their rejection), including a new & optional Tus-Client-Reference-Format header.
a new Tus-Client-Reference-Validity header whereby servers can indicate how long a client reference will remain valid

Would love to hear your thoughts on this.

smatsson commented 4 years ago

Love these updates! Well done. 👍 They make it very clear for the client on how to use the extension and what to expect for different approaches taken by the server.

Some minor comments from my end:

Clients can then send a HEAD request with the same Upload-Client-Reference header to the Upload URL, which - upon success - MUST respond with the 200 OK or 204 No Content status, and the resource's URL in the response's Location header.

A successful response MUST also contain the Tus-Version header. An Upload-Offset header that indicates how many bytes have already been received by the server MAY also be returned.

Maybe we should link to the orinal HEAD spec (https://tus.io/protocols/resumable-upload.html#head) here to make it clear that the two responses are the same? Would it make sense to also add the Upload-Client-Reference header to the HEAD response?

If an Upload-Client-Reference is not or no longer known by the Server, it MUST respond to

Missing a word? "is not known"? "is unknown or no longer known"?

If an Upload-Client-Reference is already in use for a different, ongoing upload, the Server MUST respond to the POST request initiating the upload with a 409 Conflict status.

I think the word "ongoing" here locks this down a bit. Maybe change it to somehting like "use for a different upload (based on the Tus-Client-Reference-Validity header)..."?

Tus-Client-Reference-Format

This is very nice indeed! Love the fact that the client can know before sending the request 😊 I'm thinking that tusdotnet (at least when using disk storage) will return something like "tusdotnet.ValidDiskPath". Would this be the way you meant this to be used?

random:[len]: len number of random bytes, base-64-encoded

Not really sure what I'm reading here? Does it say that I can use a fixed length byte array of random data, encoded as base64? Something like:


// Assuming OPTIONS header is "random:10"

var array = new Uint32Array(10);
window.crypto.getRandomValues(array);

// TODO Convert array to base64 and include in request

Tus-Client-Reference-Validity

Kudos on this as well. Makes it super clear for the client what is going on. Is "Validity" a good word here? I think it might be confusing as to what kind of validity (e.g. format validity, length validity etc)? Would e.g Tus-Client-Reference-TTL or Tus-Client-Reference-Lifespan make it more clear?

seconds:[number of seconds]: the Upload-Client-Reference can be used for at least [number of seconds] after the last completed request uploading data to the Server.

What does "the last completed request" mean in this context? If I do multiple PATCH requests (i.e. using chunks), does this mean that I am guaranteed that the client reference exists for X seconds after my last request completed?

Acconut commented 4 years ago

Thank you very much for providing this amazing proposal and heavily discussing it, @felix-schwarz and @smatsson! You made a really good job at matching our writing style :) It took me some time to look into this, so apologies for my delayed response.

The problems with the Creation With Data extension, as pointed out by @felix-schwarz in the first comment, are very much real and do limit the use of this extension. Sadly, I only discovered them when I was already implementing support for Creation With Data in tus-js-client and haven't found a fitting solution for it. This proposal is an interesting idea and does indeed have potential to be merged into the specification.

Alternative solutions

I had two ideas on how to deal with POST responses that didn't make it to the client but each of them had a major flaw. Maybe we can still draw some inspiration from them:

Let the client specify the upload ID in the POST request (and let the server validate it). This way the client should be able to reconstruct the full upload URL and then to the usually HEAD request to find what the current status on the upload is. Of course, the problem is that in some server systems, the ID cannot be freely chosen. For example, the AWS S3 store for tusd uses an ID generated by AWS in its upload URLs.
Use idempotency tokens in the POST request to let the server figure out if a POST request is being retried (see https://stripe.com/docs/api/idempotent_requests). For example, the client includes the token ABC in a POST request but doesn't receive the response. After that, the client retries the exact same request with the same token. Now the server sees that the second POST request is the same as the first POST request and they also contain the same token ABC. Therefore, the server does not create a new upload but instead just returns the response for the already existing upload (i.e. same Location header). Of course, the problem is that when using Creation With Data, the client has to transmit all the data again, losing the benefit of resumability.

Complexity

To be honest with you, I think this extension is too complex. I am not talking about the Upload-Client-Reference and the new HEAD request itself but about all the possible customizations. We generally try to keep the specification lightweight in order to make it easier for implementations to follow the entire specification. Similar to the principal of convention over configuration. That being said, let me address my concerns with the Tus-Client-Reference-Validity and Tus-Client-Reference-Format headers:

If I understood correctly, Tus-Client-Reference-Format is intended to let the server specify what format it expects from the client. The server is free to reject any other upload reference. My major concern here is that it will make it really hard for generic clients, such as tus-js-client, to implement a set of reference generators which suits most tus servers. Instead, I propose to let the client reference simply be any ASCII string of 200 characters or less (we may need to base64 encode it and adjust the character limit, but those details are not relevant right now). Every server must accept these references.

What if the client wants to encode data into the upload reference which exceeds 200 chars? Simply run the data through MD5 (or your preferred hashing function) and you get a valid upload reference. Yes, collisions are possible but the chance of hitting one is rather low, I would argue.

What if the server is not able to handle 200 characters or is not able to handle specific characters (e.g. /)? Simply run the data through MD5 and hexadecimal encode it. Every server should be able to store a 32 alphanumeric string.

What if the server has concerns that the upload reference is not "unique enough" (whatever that means :)? IMO, this is not a concern of the server. If the client generates a reference, which will also be generated by another client, then the first client should generate a better, more unique reference. Note that I don't think using only the SHA-256 of a file is a good idea for generating an upload reference (what happens if another client uploads the same file?). An upload reference should always include a random part.

Moving on to Tus-Client-Reference-Validity: I am not sure if that header is actually needed. The tus specification is only concerned about the upload itself. Whatever happens after the upload is completed is not a concern of tus. The protocol does not make any requirements whether the upload URL should be available after the upload is complete. On the other hand, the protocol does also not require servers to keep upload URLs available as long as the upload is not complete (see the Expiration extension).

That being said, I am a fan of keeping the upload reference valid as long as the upload resource is available. Again, this reduces the complexity of the tus protocol and removes fragmenting options when implementing tus clients and servers. You want to keep upload references also valid after the upload is finished? That's great but tus does not care about what happens after the upload is done. Do whatever you want! :)

I think the original version of this PR didn't contain the Tus-Client-Reference-Validity and Tus-Client-Reference-Format headers, and instead laid out more precise rules, which I am a fan of.

Naming

Naming is though. We all know that. However, the header name Upload-Client-Reference for me always has a meaning of referencing the client that created an upload. Instead, the header should be a reference to the upload itself, which is assigned by the client. Therefore, I would like to propose another name: Upload-Tag. It's short, concise and is not used in the specification and can therefore not be confused easily.

In my understanding, a tag is a label which is assigned to an object. The tag is not the object's identity but can be used to reference it later. Of course, tags don't have to be unique in nature but we can enforce that for tus :) What do you think?

Sorry for the long comment :) Let me know what you think and thanks again for being so much involved in tus!

nigoroll commented 4 years ago

my 2¢:

I agree with @Acconut that indeed the Tus-Client-Reference-Format is a complication which does not appear to add much benefit. We would need a unique ID, but the format should not matter.
IMHO, this opens a new attack vector for hijacking uploads. If, by reverse engineering, adversaries found tus clients to choose weak ids, they could exploit this extension to poison other uploads. This could be mitigated by adding additional security measures like tokens (crypto/hash signatures) or session-ids (nonces), but in my mind, the protocol itself should already be robust and not open such attack vectors in the first place.
Quite frankly, I do not see the point. A normal creation POST request returns an upload path (which, by definition has to be unique and can/should be chosen by the server such that it cannot be guessed by an outsider). If the response to a creation POST gets lost, it can be repeated, leaving an empty upload to be garbage collected later. The additional round-trip for the separate POST should not matter for long uploads, and for short uploads, re-sending the data with a creation-with-upload should not matter.

So, in my mind, clients should just make a smart decision between creation and creation-with-upload depending on the size of the upload. The argument that one particular operating system offers a facility with a penalty for an extra HTTP request should, in my mind, not drive a protocol design:

issuing the creation POST synchronously from the application and only the PATCH from the background HTTP task should not add any relevant additional latency for cases where handing the upload to a background task matters in the first place.
the decision by the OS designers that multiple HTTP requests are bad and only single HTTP requests are good appears quite arbitrary

smatsson commented 4 years ago

@Acconut Very good feedback! I feel a bit like you are missing my use case about the client being able to "reconstruct" the tag without saving the location header. This is one of the most requested features I have received for tusdotnet and I think that this extension would be a great solution to that problem. I have seen many use cases where people could not (or did not want to) change database schemas or similar and they already had a unique ID that could be used during upload.

I do like the name change to Upload-Tag. The only concern I have is the same as the one you pointed out, i.e. that usually a tag can be applied to multiple things whereas here we would only allow a single tag for a single file. I think this is a fair enough limitation.

I agree with you that the extension is a bit complex at the moment and would definitely be open to scraping both Tus-Client-Reference-Format and Tus-Client-Reference-Validity`.

If I understood correctly, Tus-Client-Reference-Format is intended to let the server specify what format it expects from the client. The server is free to reject any other upload reference. My major concern here is that it will make it really hard for generic clients, such as tus-js-client, to implement a set of reference generators which suits most tus servers. Instead, I propose to let the client reference simply be any ASCII string of 200 characters or less (we may need to base64 encode it and adjust the character limit, but those details are not relevant right now). Every server must accept these references.

Perhaps we could force the reference to be a non-empty base64 encoded value (with a max length). Having it as base64 would allow the client to use pretty much anything as the input to the reference, be it a byte array, some other ID or a serialized object graph. I think this would be a very powerful feature and it would tie in nicely with how metadata is handled today.

Moving on to Tus-Client-Reference-Validity: I am not sure if that header is actually needed. The tus specification is only concerned about the upload itself. Whatever happens after the upload is completed is not a concern of tus. The protocol does not make any requirements whether the upload URL should be available after the upload is complete. On the other hand, the protocol does also not require servers to keep upload URLs available as long as the upload is not complete (see the Expiration extension).

I think our different views here are only semantics on what "completed" means. I tried to explain my point of view in this comment but I don't think I made it clear enough. Consider the following example on how to setup a request pipeline for tusdotnet:

app.UseTus(httpContext => new DefaultTusConfiguration
{
    Store = new TusDiskStore(@"C:\tusfiles\"),
    UrlPath = "/files",
    Events = new Events
    {
        OnFileCompleteAsync = async eventContext =>
        {
            ITusFile file = await eventContext.GetFileAsync();
            await DoSomeProcessing(file);
        }
    }
});

app.Use(async (httpContext, next) =>
{
    await httpContext.Response.WriteAsync("Hello World!");
});

tusdotnet runs as a middleware in the pipeline and is setup using the app.UseTus call. The middleware handles all requests for the tus protocol for the /files endpoint including all subpaths (e.g. /files/<fileid>). Requests that are not tus requests or requests for other paths than /files/* will result in a "Hello World!" response. What happens outside the /files endpoint is of no concern to tusdotnet nor the tus protocol. Once a file has been completely uploaded (i.e. the final bytes has been received by tusdotnet) then OnFileCompleteAsync will run. After that the file is still available using the tus protocol, e.g a HEAD request for /files/<fileid> will return something like:

cache-control: no-store
status: 200
tus-resumable: 1.0.0
upload-length: 300
upload-offset: 300

Note that the upload-length and upload-offset headers have the same value, i.e. the file is "completed". I think it's reasonable that the client reference is still valid and in use in this case (i.e. it cannot be re-used by another upload). My interpretation of what @felix-schwarz means is that once the upload-length and upload-offset are the same, then the client reference must be up for grabs again. I think this would cause bugs down the road that are very hard to track.

So just to be clear, I only argue that the client reference should be kept as long as the location header is still valid inside the tus context. There is no reason from the protocols perspective to force the server to keep the client reference forever and ever.

I think the original version of this PR didn't contain the Tus-Client-Reference-Validity and Tus-Client-Reference-Format headers, and instead laid out more precise rules, which I am a fan of.

The original version required the use of UUID which I think would limit what I'm trying to achieve with the client reference being left to the client to decide. Besides from this I'm also a fan of more precise rules and not so much customization.

@nigoroll

IMHO, this opens a new attack vector for hijacking uploads. If, by reverse engineering, adversaries found tus clients to choose weak ids, they could exploit this extension to poison other uploads. This could be mitigated by adding additional security measures like tokens (crypto/hash signatures) or session-ids (nonces), but in my mind, the protocol itself should already be robust and not open such attack vectors in the first place.

This is a fair argument. I really hope that it's already the case that servers do user management before accepting file uploads. The protocol already has this issue as there is nothing in the spec that specifies the uniqueness of the location header. It's implicit that the location header must be unique but that could just as well be achieved by an auto-incremented integer or similar (e.g. /files/1, /files/2 etc).

Quite frankly, I do not see the point. A normal creation POST request returns an upload path (which, by definition has to be unique and can/should be chosen by the server such that it cannot be guessed by an outsider). If the response to a creation POST gets lost, it can be repeated, leaving an empty upload to be garbage collected later. The additional round-trip for the separate POST should not matter for long uploads, and for short uploads, re-sending the data with a creation-with-upload should not matter.

I think you are completely missing my use case here with the client having a way of referencing the file without knowing the location header. For the creation vs creation-with-upload part, I agree with the quoted text, but I still believe this extension would give great value for the other use case.

nigoroll commented 4 years ago

Re @smatsson

(new attack vector to hijack uploads though insecure client identifiers)

I really hope that it's already the case that servers do user management before accepting file uploads

The question is: On which level. I would argue that TUS in itself to be safe as long as the initial POST to the upload-URL is protected in some way.

The protocol already has this issue as there is nothing in the spec that specifies the uniqueness of the location header

Correct. But, while server safety can be implemented at a single point (usually under the control of a service operator), clients cannot be enforced to generate good identifiers.

I think you are completely missing my use case here with the client having a way of referencing the file without knowing the location header.

I read the initial note again and still wonder what I might be missing. Can you help me understand the "great value for the other use case" please ?

smatsson commented 4 years ago

@nigoroll

The question is: On which level. I would argue that TUS in itself to be safe as long as the initial POST to the upload-URL is protected in some way.

How so? If the initial POST is protected but subsequent PATCH requests are not then you are open to attacks. See "Broken access control" in OWASP top 10. If this extension is implemented then the upload-tag and the location header must be protected in the same way, i.e. locked to a specific user/tenant/whatever.

Correct. But, while server safety can be implemented at a single point (usually under the control of a service operator), clients cannot be enforced to generate good identifiers.

This is absolutely true, but see my comment above. The server is still allowed to reject an upload-tag if it seems fit. The client should not have access to other client's files and this extension does not change that.

I read the initial note again and still wonder what I might be missing. Can you help me understand the "great value for the other use case" please ?

It's described in the second comment in this thread but here it goes again: I have received at least 15 feature requests for tusdotnet that all boiled down to some way for the client to use an existing ID for an upload and not having to save the location header returned by the POST request. For example the client wants to use an existing database row id, document id, page id or whatever to link the upload to the entity in the database without having to modify the database.

nigoroll commented 4 years ago

re @smatsson

If this extension is implemented then the upload-tag and the location header must be protected in the same way, i.e. locked to a specific user/tenant/whatever.

If the location cannot be predicted by an outsider, I would argue that the level of protection be equivalent to some other means.

Correct. But, while server safety can be implemented at a single point (usually under the control of a service operator), clients cannot be enforced to generate good identifiers.

This is absolutely true, but see my comment above. The server is still allowed to reject an upload-tag if it seems fit. The client should not have access to other client's files and this extension does not change that.

Yes, but the server cannot check if the upload-tag was generated in a secure (unpredictable) manner.

It's described in the second comment in this thread but here it goes again:

Thank you for the clarification. I cannot follow this argument either, not having to save a location info is not, in my mind, a good argument for a change which potentially implies an additional attack vector. In general, this sounds like a request to implement a protocol change for a minor convenience on the side of the developers.

For example the client wants to use an existing database row id, document id, page id or whatever to link the upload to the entity in the database without having to modify the database.

Those examples sound like very bad choices with regard to the aspect we just discussed in that they usually are low entropy and highly predictable.

felix-schwarz commented 4 years ago

Thanks everyone for your thoughtful and encouraging feedback! ❤️

@smatsson wrote:

Would it make sense to also add the Upload-Client-Reference header to the HEAD response?

I think if a Client already has the server-issued resource URL for an upload and sends a HEAD request to it, the Upload-Client-Reference provides no benefit, as the Client already has a way to address the resource.

If the Client already sent the Upload-Client-Reference as part of their HEAD request, the information would be redundant.

Also, a server may opt to just store or use a hashed version of the Upload-Client-Reference, in which case the value wouldn't be available for inclusion in the HEAD response.

However, I may miss something here, so if you see a use-case for this, please let me know.

If an Upload-Client-Reference is not or no longer known by the Server, it MUST respond to

Missing a word? "is not known"? "is unknown or no longer known"?

"is unknown or no longer known" is better and I changed that part accordingly.

This is very nice indeed! Love the fact that the client can know before sending the request 😊 I'm thinking that tusdotnet (at least when using disk storage) will return something like "tusdotnet.ValidDiskPath". Would this be the way you meant this to be used?

Yes, that would be exactly how it would be meant to be used.

random:[len]: len number of random bytes, base-64-encoded

Not really sure what I'm reading here? Does it say that I can use a fixed length byte array of random data, encoded as base64? Something like:
// Assuming OPTIONS header is "random:10"

var array = new Uint32Array(10);
window.crypto.getRandomValues(array);

// TODO Convert array to base64 and include in request 

Yes, that's exactly what is meant by that part.

Would e.g Tus-Client-Reference-TTL or Tus-Client-Reference-Lifespan make it more clear?

Thanks! I think TTL is much better and changed it accordingly.

[Edit: It's still in 5bf437c904cbba6d07044ec7e81f64ee9a0e9a47, but I subsequently removed that header]

seconds:[number of seconds]: the Upload-Client-Reference can be used for at least [number of seconds] after the last completed request uploading data to the Server.

What does "the last completed request" mean in this context? If I do multiple PATCH requests (i.e. using chunks), does this mean that I am guaranteed that the client reference exists for X seconds after my last request completed?

That's exactly what I meant with it. The thought here is that a server may not be able to see a request before it has been finished - or the connection has broken down. For example because a proxy is not relaying the request until then.

@Acconut wrote:

This proposal is an interesting idea and does indeed have potential to be merged into the specification.

Happy you like it! 🤗

If I understood correctly, Tus-Client-Reference-Format is intended to let the server specify what format it expects from the client. The server is free to reject any other upload reference. My major concern here is that it will make it really hard for generic clients, such as tus-js-client, to implement a set of reference generators which suits most tus servers.

Generic clients (and servers) would only need to support v4-uuid and random:[len]. If Tus-Client-Reference-Format would end up in the spec, that's a point I would clarify.

A main motivator for adding this header was to provide vendors with a standardized, open mechanism to announce support for additional, vendor-specific formats that might offer product-specific advantages. Formats other than v4-uuid and random:[len] would be entirely optional. Again, if Tus-Client-Reference-Format would end up in the spec, I'd add language to clarify this.

Instead, I propose to let the client reference simply be any ASCII string of 200 characters or less (we may need to base64 encode it and adjust the character limit, but those details are not relevant right now). Every server must accept these references.

I like the simplicity of this. And since Tus-Client-Reference-Format is not essential to this extension, I have removed it and added

A `Upload-Client-Reference` MUST consist of ASCII characters and MUST NOT exceed 200 bytes.

I also removed the part where servers can respond with a 400 Bad Request response if the Tag doesn't meet their format requirements.

Moving on to Tus-Client-Reference-Validity: I am not sure if that header is actually needed. The tus specification is only concerned about the upload itself. Whatever happens after the upload is completed is not a concern of tus. The protocol does not make any requirements whether the upload URL should be available after the upload is complete. On the other hand, the protocol does also not require servers to keep upload URLs available as long as the upload is not complete (see the Expiration extension).

That being said, I am a fan of keeping the upload reference valid as long as the upload resource is available. Again, this reduces the complexity of the tus protocol and removes fragmenting options when implementing tus clients and servers. You want to keep upload references also valid after the upload is finished? That's great but tus does not care about what happens after the upload is done. Do whatever you want! :)

Tus-Client-Reference-Validity was meant as a way for servers to give clients a hint at internal behaviour. Like Tus-Client-Reference-Format it is, however, not essential to this extension. I have therefore removed Tus-Client-Reference-Validity.

Therefore, I would like to propose another name: Upload-Tag. It's short, concise and is not used in the specification and can therefore not be confused easily.

In my understanding, a tag is a label which is assigned to an object. The tag is not the object's identity but can be used to reference it later. Of course, tags don't have to be unique in nature but we can enforce that for tus :) What do you think?

I like it! HTTP's ETag is also used to uniquely identify a particular version of a resource - and also in the HTTP header. This fits right in - and U have changed the spec proposal accordingly.

Sorry for the long comment :) Let me know what you think and thanks again for being so much involved in tus!

I appreciate you taking the time to respond at that length and provide feedback at that detail level. Thanks!

@nigoroll wrote:

I agree with @Acconut that indeed the Tus-Client-Reference-Format is a complication which does not appear to add much benefit. We would need a unique ID, but the format should not matter.

Thanks. I've removed Tus-Client-Reference-Format from the proposal.

IMHO, this opens a new attack vector for hijacking uploads. If, by reverse engineering, adversaries found tus clients to choose weak ids, they could exploit this extension to poison other uploads. This could be mitigated by adding additional security measures like tokens (crypto/hash signatures) or session-ids (nonces), but in my mind, the protocol itself should already be robust and not open such attack vectors in the first place.

The question of whether this extension could open an attack vector for poisening in case of a security issue like CVE-2008-0166 is an interesting one. Thanks for bringing it up!

There already are some safeguards at the Tus and HTTP level, which - on their own or combined with others - should provide effective protection against hijacking :

Tus:
- this proposal: the proposal itself recommends the use of version 4 UUID, which would typically be sourced from the operating system and should be cryptographically secure
- the protocol already supports checksums, which can be used to ensure the integrity of the upload
HTTP:
- servers can check the User-Agent header to detect unsafe client versions (and often also unsafe operating system versions). They can then either not indicate support for this extension to those clients in response to an OPTIONS request - and/or reject uploads from them altogether (which could also alert affected users to their use of an unsafe version)
- in most cases, uploads will require authentication via an Authorization header or another form

In that context, I'd also like to highlight that both the HEAD request with the Upload-Tag (to retrieve the URL of the created Resource) and the POST request for creation and creation-with-upload target the same URL, so that they should be on the same level when it comes to security through authentication.

That also means, however, that information on the authenticated user should be available to the server for both the initial POST and follow-up HEAD request. A server could therefore store the authenticated user that sent the Upload-Tag as part of the initial POST request alongside said Upload-Tag - and only provide access to the URL of the created Resource in response to HEAD requests made by the same user.

Then, even if Upload-Tags would be easily predicatable, hijacking uploads wouldn't be possible unless the attacker also can authenticate as the user (and FWIW the proposed extension only covers POST and HEAD requests, not PATCH requests).

If safe ID generation for unauthenticated uploads is a concern, an optional Upload-Tag-Generator header could be added to the proposal, allowing clients to indicate what was used to generate the tag (f.ex. Upload-Tag-Generator: openssl/1.1.1g). Servers could then opt to reject uploads from clients using an unsafe tag generator.

The additional round-trip for the separate POST should not matter for long uploads, and for short uploads, re-sending the data with a creation-with-upload should not matter.

Unfortunately, it does matter.

I had similar thoughts before making the proposal for this extension but noticed during implementation that short uploads are surprisingly hard to define.

In the end, what a short upload is in this context heavily depends on the available network speed (both throughput and latency), traffic cost (free or metered) and required performance.

Unreliable and slow connections do exist - and I firmly believe that a protocol for resumable uploads should definitely aim for maximum efficiency and avoid sending the same data more than once.

So, in my mind, clients should just make a smart decision between creation and creation-with-upload depending on the size of the upload. The argument that one particular operating system offers a facility with a penalty for an extra HTTP request should, in my mind, not drive a protocol design:

I used the situation with iOS as an example for the benefits of this extension, but as others here already pointed out, it is useful in other contexts, too. I may also add that similiar constraints may (and likely do) also exist on Android. Since I'm not developing for Android, however, that'd be a question for someone else to answer.

issuing the creation POST synchronously from the application and only the PATCH from the background HTTP task should not add any relevant additional latency for cases where handing the upload to a background task matters in the first place.

If I'm in a place without Internet connectivity and save a file in my cloud storage app of choice, that app - for lack of connectivity - can't make that synchronous creation POST request. Instead, I'd need to reopen that app when I'm back in a place with connectivity - so it could then make that requesst and start uploading.

Using the proposed extension, however, the app can schedule the creation-with-upload + client-tag request with the background queue of the operating system. The OS will then ensure the request gets sent as soon as the device has connectivity again. And the app can schedule file uploads of virtually any size without worry, because - using the Upload-Tag, it can be sure that even if the connection breaks down at 999 out of 1000 MB, it will be able to resume from the last byte that reached the server - and won't have to upload those 999 MB again.

the decision by the OS designers that multiple HTTP requests are bad and only single HTTP requests are good appears quite arbitrary

That decision - while extremely frustrating and limiting to mobile app development - is far from arbitrary. There are good reasons for it, actually:

once all scheduled background HTTP requests of an app finish, the app is briefly resumed (or launched) in the background so it can process the responses.
if an app was allowed to schedule the next HTTP request(s) at that point, the app could ensure that it gets run in the background on a regular basis, by scheduling fresh requests again and again
possible consequences of allowing such behaviour: high data usage, low battery life … and the possibility of a deep privacy invasion (apps could, after all, use their "periodic" time in the background to f.ex. collect a user's current location and send it out without the user noticing)

nigoroll commented 4 years ago

Re @felix-schwarz

A Upload-Client-Reference MUST consist of ASCII characters and MUST NOT exceed 200 bytes.

I would recommend to reference RFC7230 section 3.2.6.: Field Value Components

Regarding the security concern:

I agree that additional authentication mechanisms as laid our in your exemple can avoid the issue I raised - this is what I meant with This could be mitigated by adding additional security measures like tokens (crypto/hash signatures) or session-ids (nonces) in my initial comment.

Yet I stand by my point that the protocol should be secure in itself and not need to rely on additional measures in the first place. Yes, those can (and should) be used, but the protocol itself should still be safe.

Servers could then opt to reject uploads from clients using an unsafe tag generator.

I see no reason to assume that unsafe clients could be identified.

Regarding short (small?) uploads:

First of all, avoid sending the same data more than once will not work at any rate, this misses the reality of HTTP. Headers of a tus request with a (sha256) checksum will already be in the ballpark of 200 bytes, even without additional authentication and ignoring today's cookie madness. On top of that come TCP (including retransmits), TLS and (potentially) H2 overheads.

A main argument for this extention is that the response of a creation-with-upload may be missed for a long upload. In my mind, the solution is simple: Only send a small amount of data with the first request (like 2k). This will reduce both the overhead per request and the amount of data which needs retransmission when the request fails.

Regarding the specifics of certain OSes:

I understand what you are saying, but still I question if a protocol like TUS should be adjusted specifically for this situation. For example, when IP connectivity is missing, an app could still schedule the upload to be initiated at a later time, when the app is run in the foreground.

But I am not a mobile app developer, so, for lack of competence on my side, I would not want to engage in an in-depth discussion.

To summarize my points:

TUS already offers means to reduce the overhead implied by lost responses.
TUS is designed for long uploads, and for those the additional round-trip for a creation POST should not be relevant
I think that a protocol like TUS should not address specifics of how OS designers try to fence in malicious apps. While I understand the reasoning, the decisions of the limits for background HTTP on iOS might change any time. (and I do still think they are quire arbitrary, but do not want to stress that point).

One line summary:

The suggested extension comes with a new attack vector and, IMHO, addresses a problem for which other viable solutions exist.

I do not want to repeat my arguments, so unless new arguments are raised, I will refrain from any further discussion. The TUS maintainers know the arguments now and I trust they will make a good decision.

smatsson commented 4 years ago

@nigoroll

If the location cannot be predicted by an outsider, I would argue that the level of protection be equivalent to some other means.

Well I mean maybe, it all depends on the level of security you need. Some would call this security by obscurity. :)

Yes, but the server cannot check if the upload-tag was generated in a secure (unpredictable) manner.

No but it can verify that the upload-tag was created by the same user that is now trying to access it.

Thank you for the clarification. I cannot follow this argument either, not having to save a location info is not, in my mind, a good argument for a change which potentially implies an additional attack vector. In general, this sounds like a request to implement a protocol change for a minor convenience on the side of the developers.

I'm not really proposing a protocol change rather than piggy backing on something that could be slightly expanded to work with multiple use cases. I'm guessing that you don't work much in the enterprise world where this kind of old, "cannot change" systems are pretty common. Linking IDs between systems like this are fairly common where you want one ID in system A to be referenced from system B without modifying storage for any of the two.

Those examples sound like very bad choices with regard to the aspect we just discussed in that they usually are low entropy and highly predictable.

Without checking that they belong to the user that created them? Absolutely. With a security check that the tag was created by the same user that is now accessing it? No. Would adding a disclaimer to the extension warning about this help? The same would need to apply to the creation/creation-with-upload extensions to not use predictable IDs or to limit IDs to a single client.

Side note: I get your whole argument in regards to the protocol being safe "as is" and not having extra measurements to protect the server. I'm just saying that in reality it's not that easy and servers must still implement additional security. E.g. it needs to guarantee that the location header from creation is unique, which is not specified in the protocol.

nigoroll commented 4 years ago

@smatsson

If the location cannot be predicted by an outsider, I would argue that the level of protection be equivalent to some other means.

Well I mean maybe, it all depends on the level of security you need. Some would call this security by obscurity. :)

Security by obscurity has got nothing to do with the fact that an outsider can neither possibly guess not brute force a sufficiently long, sufficiently good random number. Otherwise you would be implying that basically all crypto in use today was broken (where an asymmetrically encrypted nonce is used for symmetric algorithms).

Yes, but the server cannot check if the upload-tag was generated in a secure (unpredictable) manner.

No but it can verify that the upload-tag was created by the same user that is now trying to access it.

Again, only outside the protocol.

Otherwise I am glad to hear that we agree. Regarding the practical aspects you mentioned (and, FYI, you guessed wrongly), again a modern protocol really should not be driven by bad legacy implementations.

Maybe you and @felix-schwarz should just add the feature to your implementations (and, please, mark them as potentially dangerous), there is nothing that should prevent you from doing it (in fact, in by soon to be released varnish implementation I added a very simple option to use an external id solely for testing purposes). But I still doubt this should become an official extentsion.

felix-schwarz commented 4 years ago

@nigoroll

Maybe you and @felix-schwarz should just add the feature to your implementations (and, please, mark them as potentially dangerous), there is nothing that should prevent you from doing it (in fact, in by soon to be released varnish implementation I added a very simple option to use an external id solely for testing purposes). But I still doubt this should become an official extentsion.

If anything, the discussion on this proposal has shown that it fills a gap in the protocol and can be used to solve a variety of technical problems, allowing the protocol to live up to its full potential on all platforms and to become suitable for more use cases.

I see no value in everybody in need of a solution to the problems solved by this proposal to think up and implement their own, custom and incompatible solutions. I much prefer an ecosystem of well-tested, community-supported libraries, clients and servers based on a standardized protocol extension.

That said, I have made further changes to the proposal that should address the concerns you brought forward, explicitly requiring authentication OR secret + challenge headers.

If you still believe the extension to be "dangerous" with these changes, I'd appreciate an example to base further discussion on.

@Acconut @smatsson What do you think of the latest changes? I'd love to hear your perspective!

felix-schwarz commented 4 years ago

@nigoroll Thanks for your feedback! I've made a couple of changes detailed in my review responses.

In general, we still need to rely on the client to actually use a good entropy source for Upload-Tag-Challenge and there is nothing we can do about it. If clients do not, they still risk their upload being hijacked. The proposed protocol text change clearly requires high entropy, so from the protocol perspective I would agree that all is fine, but in practice this might remain an issue.

I believe this is true for all protocols. Even the most well-designed and well-written client or server eventually can't protect against OS vendor caused bugs like CVE-2008-0166 or goto fail.

Acconut commented 4 years ago

Thank you very much for the vivid discussion in here, which is vital for the development of such an open protocol! A lot of things have been said, but I would still like to add a few points which might bring some topics to an end:

Some comments ago, @nigoroll said that "TUS is designed for long uploads", which is not directly related to this PR but I would still like to address it. Yes, the current version of tus is most efficient for longer uploads, but we don't want tus to be focused on long uploads only. First of all, the definition of a "long upload" depends a lot on the environment which is not always known in advance. Furthermore, using tus also for "short uploads" has the benefit of offering a unified interface for all file upload transfers which is very important. In summary, I also want to the be useful and beneficial for short uploads, which this PR helps with.
The Upload-Tag-Secret and Upload-Tag-Challenge are an interesting idea on their own and I think they would better live in their own extension. I don't see a reason why the secret/challenge mechanism should be bound to the use of upload tags. Instead, it could be a standalone extension.
Since the discussion is somewhat coming to a coherent end, let's talk about the next steps a bit: Before merging and "releasing" this extension, I would like to have this extension implemented in at least one server and one client. This would help us to spot any problems and hopefully address them before release. A lot has been said in the thread about better efficiency and better resumablity, so let's make sure that this assumptions still hold true. :)

What do you think about these rather general points?

nigoroll commented 4 years ago

re @Acconut

short vs. long uploads: Yes, sure, tus should be efficient for all cases.
I agree that the challenge protocol extension could be generally useful, e.g. for cases where the upload path is weak or otherwise predictable. Yet it should be noted that, while a server can ensure the upload path is secure, again, this extension relies on the client, so we should at least point out that secure upload path generation is probably the better option to protect against hijacking.
:+1: for a prototype implementation! I volunteer to add it to the varnish module, but would need a client implementation.

Acconut commented 4 years ago

for a prototype implementation! I volunteer to add it to the varnish module, but would need a client implementation.

I will try to implement a prototype into tus-js-client, so we have a client to test it out!

felix-schwarz commented 4 years ago

@Acconut

The Upload-Tag-Secret and Upload-Tag-Challenge are an interesting idea on their own and I think they would better live in their own extension. I don't see a reason why the secret/challenge mechanism should be bound to the use of upload tags. Instead, it could be a standalone extension.

I've revised the document to factor this part out to a new, standalone challenge extension and also dropped the Tag-part from the headers (it's just Upload-Secret and Upload-Challenge now).

Since the discussion is somewhat coming to a coherent end, let's talk about the next steps a bit: Before merging and "releasing" this extension, I would like to have this extension implemented in at least one server and one client. This would help us to spot any problems and hopefully address them before release. A lot has been said in the thread about better efficiency and better resumablity, so let's make sure that this assumptions still hold true. :)

If time permits, I plan to also implement client support in the ownCloud iOS SDK.

smatsson commented 4 years ago

I've been away on vacation and haven't had a chance to read it through. Read it quickly now and it seems that you worked out all our differences. Well done!

Will read it more thoroughly next week and comment then.

smatsson commented 4 years ago

Client Tag

I think the security consideration section added is just spot on. Well done!

Challenge

In general I think this is a nice addition to the spec that can be used in a wide range of use cases and not only with the upload tag extension.

I think it would be more clear and consistent if we required the Upload-Challenge header for ALL requests targeting that upload resource if a Upload-Secret was used to create the resource (this would also cover DELETE and POST with concatenation). From a client/lib perspective it would be less ambiguous when to include the header and when not to. It would also have the benefit of being future proof when new extensions are added as we do not need to figure out how each extension should work with this extension.

Speaking of concatenation... If I create two files providing an Upload-Secret for each, should I be able to concatenate these two files? If so, how? I would argue that to concatenating the two files, I would need to be able to read the file info and thus knowing that the files exist, which would require an Upload-Challenge. So when using concatenation I would also be required to include the Upload-Challenge but for multiple files, which I cannot seem to do right now.

Some examples:

POST /files HTTP/1.1
Upload-Concat: partial
Upload-Length: 5
Upload-Secret: <secretA>

POST /files HTTP/1.1
Upload-Concat: partial
Upload-Length: 5
Upload-Secret: <secretB>

POST /files HTTP/1.1
Upload-Concat: final;/files/a /files/b
Upload-Challenge: <what to put here?>

A simple solution would be to allow multiple Upload-Challenge headers and then define a way for it to match each file, e.g.:

POST /files HTTP/1.1
Upload-Concat: final;/files/a /files/b
Upload-Challenge: <challenge for secret a>
Upload-Challenge: <challenge for secret b>

Another way would be to allow for, but not require, the location to be included in the Upload-Challenge header like so:

POST /files HTTP/1.1
Upload-Concat: final;/files/a /files/b
Upload-Challenge: /files/a <challenge for secret a>,/files/b <challenge for secret b>

What do you think?

Acconut commented 4 years ago

Thanks for factoring it out, @felix-schwarz, and thanks for the feedback, @smatsson. I think we should move this proposal and its discussion to a new PR, so we can finish this one for good :) Do you agree?

smatsson commented 4 years ago

@Acconut Agree 👍 Only problem I'm seeing there is that you cannot implement Client-Tag as it is written now unless you also implement Challenge. Maybe remove the part about Challenge from Client-Tag and just say that the server is responsible for verifying that the client has access to the Client-Tag. In the future, once the challenge extension is complete, we could add that as a hint on how to verify access.

Acconut commented 4 years ago

Maybe remove the part about Challenge from Client-Tag and just say that the server is responsible for verifying that the client has access to the Client-Tag. In the future, once the challenge extension is complete, we could add that as a hint on how to verify access.

Yes, good idea :+1:

felix-schwarz commented 3 years ago

Thanks everyone for the feedback!

As suggested, I've factored out the two extensions into their own PRs:

an update to the Challenge extension is now tracked in #162. It adds support for requests beyond just HEAD and PATCH, adds additional safeguards for that - and also includes a proposal for how the challenge could be computed when a request references more than one upload resource
an update to the Client-Tag extension is now tracked in #163. As suggested, it no longer references the Challenge extension, but instead effectively limits the availability of the Client-Tag extension to scenarios where authentication is available through other means.

I'm closing this PR and am looking forward to your feedback in the other PRs! @Acconut @butonic @nigoroll @smatsson