tus / tus-resumable-upload-protocol

Open Protocol for Resumable File Uploads
https://tus.io
MIT License
1.48k stars 103 forks source link

Upload Post-Processing #190

Open Acconut opened 1 year ago

Acconut commented 1 year ago

Most files will be processed after their upload in some form, for example:

While these post-processing steps are application-specific, tus does not have any special support for communication the post-processing or its results back to the client. Originally, the decision to exclude this was intentional because tus should focus on the uploading itself, not any of the subsequent steps. However, this also leads to issues such as https://github.com/tus/tus-js-client/issues/557. Hence, I think tus should have some capability to communicate this post-processing back to the client.

The requirements of such an extension are:

Acconut commented 1 year ago

Possible approaches:

smatsson commented 12 months ago

Thanks for bringing this up. I've been meaning to answer this for a couple of weeks now but haven't had the time. It's been a long day, so please forgive any ramblings. If it's completely unclear, let me know and I'll explain better. I know we are in the tus repo but some of these issues also apply to RUFH. If we in the tus community agree on them I would be happy to open a similar issue on the IETF repo. Anyhow...

While both adding an additional header and and additional link to a status endpoint is viable, I'm not sure that this would solve the issue completely. Either of these would be fairly specific to the domain handling the upload, i.e. a video processing app would need to return other data than say a file sharing app. So let's assume we go with the specific URL. This URL would need to be specified by the app running tus, i.e. in user code (for the previously mentioned reason). We could also chose to go with a well known URL, but then the actual endpoint would still need to be app specific. This would work in some cases but not all and I think it's a very complex task to get some kind of automatic behavior here. This would also not cover the case where the actual user code handling the upload fails for some reason.

To try clarify, this is the flow that might (and probably) will happen. This issue exists in both tus and RUFH. I wouldn't really call it a protocol issue but every server implementation I've seen of both tus and RUFH have this issue. It is caused by separation of concern in the app, where the tus part is handling the upload and once the upload is completed, some kind of user code is executed, be it events/hooks or some other type of code (e.g. modelbinding in tusdotnet or the handler in SwiftNIO for RUFH).

  1. Client uploads the data
  2. The tus part of the server handles the upload
  3. The file is completely uploaded (all data received from the client)
  4. The tus part calls the user code
  5. The user code throws an exception (null ref or whatever)
  6. A 500 internal server error is returned to the client
  7. The client does a HEAD request to figure out how much data was uploaded
  8. The tus part of the server responds that the entire file has been uploaded
  9. NO profit, as the client know thinks everything is good...

This is the issue described in the tus-js-client issue linked in the first post. I somewhat agree with the OP that Accept that we can't do post-processing and change the hooks accordingly. is the best approach for now. This is the recommendation in tusdotnet, to add the file id to a queue or similar from OnFileComplete and let some other system handle the processing. This is good for a number of reasons, mainly resilience (retries etc). That appoach, while technically valid, does not sit well in the modelbinding/SwiftNIO case as it's not entirely clear for the dev that this is how it's supposed to be done. In the modelbinding POC I opted for also running the user code on HEAD requests, given that the file was actually completed. This circumvents the issue but is, of course, a bit unorthodox and honestly smells like SOAP/XML-RPC (which is horrible). Another approach I entertained is to monitor the user code for exceptions and return a different (smaller) offset on the next HEAD so that the PATCH will trigger again (hacky AF).

So questions from me:

  1. If the last PATCH returns an error, how can the client be sure that the header/URL returned is actually valid? While we could put this in the spec I think it's against "common knowledge" that we should trust the headers returned from a server that just had an internal server error.
  2. If we get a URL, what is the client supposed to do with it? We could take an easy route here and just hand the URL of to the user code in the client, but you could already do that today, so I'm not sure if it adds that much value (as you can see by my argumentation in https://github.com/tus/tus-resumable-upload-protocol/pull/159)
  3. EDIT: Maybe we should split upload and processing into two? When the file is uploaded a new URL is returned and the client needs to do a POST to this URL to start the processing? It would break every single client out there though :D
Acconut commented 9 months ago

Another approach I entertained is to monitor the user code for exceptions and return a different (smaller) offset on the next HEAD so that the PATCH will trigger again (hacky AF).

Wow, that is an incredible idea you came up with. It is the epitheny of hackiness, but also extremely ingenious!

  1. If the last PATCH returns an error, how can the client be sure that the header/URL returned is actually valid? While we could put this in the spec I think it's against "common knowledge" that we should trust the headers returned from a server that just had an internal server error.

That's a good point. Just because the client received a 500 with the Tus-Resumable header, doesn't mean that something specific to handling uploads failed and is the cause for this 500 error.

2. If we get a URL, what is the client supposed to do with it? We could take an easy route here and just hand the URL of to the user code in the client, but you could already do that today, so I'm not sure if it adds that much value (as you can see by my argumentation in

This is actually the solution I am preferring now, also because of the feedback we got from the IETF regarding https://github.com/httpwg/http-extensions/issues/2312. The server could include the Content-Location header in a response to POST, HEAD, and PATCH requests. Clients can then GET this URL to retrieve the post-processing result from the uploaded file. If the processing takes a longer time, the server can respond with a 503 status code and the Retry-After header to indicate that the client should retry later.

A tus client could then be configured to other pass this URL to the application and let the application handle fetch the processing results. Alternatively, tus clients may include functionality to fetch the results on their own and then make the results directly available to the application.

3. EDIT: Maybe we should split upload and processing into two? When the file is uploaded a new URL is returned and the client needs to do a POST to this URL to start the processing? It would break every single client out there though :D

I think using Content-Location would mean a separation between uploading and post-processing. The client will have one upload URL for performing the upload and will then get another URL for fetching the post-processing results. The client does not have to manually start the post-processing, but it will use another endpoint for querying the results.

Does that make sense? What do you think about using Content-Location for this purpose?

smatsson commented 9 months ago

Wow, that is an incredible idea you came up with. It is the epitheny of hackiness, but also extremely ingenious!

Thank you! That's one of the best compliments I've gotten in life so far ;)

Does that make sense? What do you think about using Content-Location for this purpose?

After reading http-extensions#2312 I think this would be a good approach! Reading the RFC it seems like just the use case for it (see below):

Otherwise, such a Content-Location indicates that this content is a representation reporting on the requested action's status and that the same report is available (for future access with GET) at the given URI. For example, a purchase transaction made via a POST request might include a receipt document as the content of the 200 (OK) response; the Content-Location field value provides an identifier for retrieving a copy of that same receipt in the future. - https://datatracker.ietf.org/doc/html/rfc9110.html#field.content-location

Are we allowed to use content-location with a 204 No content status code? In RUFH we don't have this problem but in tus we require the server to return 204 on successful uploads. Might be OK or we could ease that requirement into a SHOULD or something and allow 200 OK as well? I know that this is a fairly common question raised in the tusdotnet repo where people wish to return some data to the client on successful upload.

A tus client could then be configured to other pass this URL to the application and let the application handle fetch the processing results. Alternatively, tus clients may include functionality to fetch the results on their own and then make the results directly available to the application.

I think passing the URL is probably a better thing as the processing of the file can take a long time. The response might need to be polled several times before the processing is done. To solve the issue with the processing failing the server could chose to include a "retry processing" url in the response in the content-location header, which the app specific code in the client could then request.

Acconut commented 9 months ago

Are we allowed to use content-location with a 204 No content status code? In RUFH we don't have this problem but in tus we require the server to return 204 on successful uploads. Might be OK or we could ease that requirement into a SHOULD or something and allow 200 OK as well? I know that this is a fairly common question raised in the tusdotnet repo where people wish to return some data to the client on successful upload.

From reading RFC 9110, I don't see anything speaking against using Content-Location in a 204 response. The RFC commonly refers to cases where the header field is included in 2XX responses, which also counts for 204. So we should be good to go.

I am also open to being less restrictive about the response code in responses to PATCH requests. A 200 should also be fine.

To solve the issue with the processing failing the server could chose to include a "retry processing" url in the response in the content-location header, which the app specific code in the client could then request.

The decision whether the post-processing has to be retried should probably be made by the server-side, for example by adding the job again to the queue. If an application wants to let the client decide this, they should provide an application-specific method for doing so. I don't think that this really fits the scope of upload handling :)

smatsson commented 9 months ago

From reading RFC 9110, I don't see anything speaking against using Content-Location in a 204 response. The RFC commonly refers to cases where the header field is included in 2XX responses, which also counts for 204. So we should be good to go.

I agree. It just seems odd to return a status code saying that there is no content and at the same time returning a URL to the content that does not exist :) I'm sure it's fine though, so let's go ahead with it.

The decision whether the post-processing has to be retried should probably be made by the server-side, for example by adding the job again to the queue. If an application wants to let the client decide this, they should provide an application-specific method for doing so. I don't think that this really fits the scope of upload handling :)

We mean the same thing here :) The content of the response from the Content-Location is application specific so the server can return whatever data in whatever format they wish. My point was just that the client-retry-thing is solvable if we go with this solution which is good. I agree with you that a queue or similar is better but people don't always use it that way. See for instance the linked issue in tus-js-client where the OP wants to do post-processing in a hook.

Acconut commented 9 months ago

Great, then we are on the same page! Maybe we can revamp https://github.com/tus/tus-resumable-upload-protocol/pull/159 and bring it across the finish line!

smatsson commented 9 months ago

Yes let's do that! :)

smatsson commented 8 months ago

After thinking some more on this, maybe we should wait for the RUFH spec to be finalized on this subject and then just backport that to tus? It might be that someone in the httpbis group comes up with something more sophisticated?

Acconut commented 8 months ago

We can wait with formalizing this into tus, yes. That being said, I would like to implement a prototype into tusd anyways to try it out and gather experience with it. The feedback can then be used for RUFH and tus together. Of course, maybe in the end we find out that Content-Location is not suitable at all.

smatsson commented 8 months ago

I updated my tusdotnet RUFH POC to include Content-Location and have written about it on the httpbis' github: https://github.com/httpwg/http-extensions/issues/2312#issuecomment-1871288007

For tus, there are some more specific things that we probably need to sort out:

Acconut commented 5 months ago

I updated my tusdotnet RUFH POC to include Content-Location

That's great, thanks for working on this! I wasn't able to try it out in tusd yet.

  • Should this be another extension? If so, what should the name be?

I wonder whether Content-Location will every be part of tus v1 or if we should rather wait until the IETF work has completed and then publish tus v2 on top of RUFH with mentions of Content-Location.

  • How should the client handle the Content-Location header? Simplest for is probably just to hand it over to the user code for further processing as the contract for the CL URI would be domain specific anyway?

That's up to the client in my opinion. Simple clients might simply expose the URL to the user, but more sophisticated clients could also fetch the content of the provided URL (with retries etc). There might be use cases where the user is interested in the content served at that URL, but they might also only care about the URL to pass it to another component for further processing.

smatsson commented 3 months ago

Been meaning to answer this for a while but it slipped my mind...

I wonder whether Content-Location will every be part of tus v1 or if we should rather wait until the IETF work has completed and then publish tus v2 on top of RUFH with mentions of Content-Location.

Yeah I think this is the way to go 👍