tus / tusd

Reference server implementation in Go of tus: the open protocol for resumable file uploads
https://tus.github.io/tusd
MIT License

Azure storage does not support concatenation #843

Open boris-ning-usds opened 1 year ago

boris-ning-usds commented 1 year ago

Technical Knowledge of User: I'm new to using tusd and the tus-js-client, so I'd be happy if anybody could point out problems in my test setup.

Describe the bug: While running tusd with an Azure Storage backend, I found that the transfer would always fail on the last HTTP POST with a missing or invalid Upload-Length header. This only occurs when the number of parallel upload connections is set to at least 2.

Snippet of the tusd logs at the end of a transfer:

[tusd] 2022/10/31 04:54:05.053405 event="ResponseOutgoing" status="201" method="POST" path="" requestId="" body="" 
[tusd] 2022/10/31 04:54:05.063695 event="RequestIncoming" method="PATCH" path="14ef5595737703227cbbe4c006cc7920" requestId="" 
[tusd] 2022/10/31 04:54:05.064181 event="RequestIncoming" method="PATCH" path="9d5eb446624ade5a25d719f07865fdd8" requestId="" 
[tusd] 2022/10/31 04:54:05.105852 event="ChunkWriteStart" id="9d5eb446624ade5a25d719f07865fdd8" maxSize="57738292" offset="0" 
[tusd] 2022/10/31 04:54:05.107953 event="ChunkWriteStart" id="14ef5595737703227cbbe4c006cc7920" maxSize="57738292" offset="0" 
[tusd] 2022/10/31 04:54:13.957541 event="ChunkWriteComplete" id="14ef5595737703227cbbe4c006cc7920" bytesWritten="57738292" 
[tusd] 2022/10/31 04:54:13.976012 event="ResponseOutgoing" status="204" method="PATCH" path="14ef5595737703227cbbe4c006cc7920" requestId="" body="" 
[tusd] 2022/10/31 04:54:13.976043 event="UploadFinished" id="14ef5595737703227cbbe4c006cc7920" size="57738292" 
[tusd] 2022/10/31 04:54:17.708767 event="ChunkWriteComplete" id="9d5eb446624ade5a25d719f07865fdd8" bytesWritten="57738292" 
[tusd] 2022/10/31 04:54:17.740995 event="ResponseOutgoing" status="204" method="PATCH" path="9d5eb446624ade5a25d719f07865fdd8" requestId="" body="" 
[tusd] 2022/10/31 04:54:17.741020 event="UploadFinished" id="9d5eb446624ade5a25d719f07865fdd8" size="57738292" 
[tusd] 2022/10/31 04:54:17.742362 event="RequestIncoming" method="POST" path="" requestId="" 
[tusd] 2022/10/31 04:54:17.742385 event="ResponseOutgoing" status="400" method="POST" path="" requestId="" body="ERR_INVALID_UPLOAD_LENGTH: missing or invalid Upload-Length header

Client-side output from Node.js using tus-js-client:

0 115476584 0.00%
0 115476584 0.00%
65536 115476584 0.06%
131072 115476584 0.11%
31260672 115476584 27.07%
63569920 115476584 55.05%
90113076 115476584 78.04%
115476584 115476584 100.00%
Failed because: Error: tus: unexpected response while creating upload, originated from request (method: POST, url: http://127.0.0.1:47280/files/, response code: 400, response text: ERR_INVALID_UPLOAD_LENGTH: missing or invalid Upload-Length header
, request id: n/a)

To Reproduce:

  1. Start up tusd server with Azure backend storage.
    podman run -p 47280:47280 -v /tmp/host/tus:/srv/ --env AZURE_STORAGE_ACCOUNT=<storage account name> --env AZURE_STORAGE_KEY=<azure storage key> docker.io/tusproject/tusd:v1.10.0 -port=47280 -azure-blob-access-tier=hot -azure-object-prefix=tus-prefix -azure-storage=tus-file-container -azure-endpoint=https://<storage account name>.blob.core.windows.net
  2. Run the client demo or a Node.js client with parallelUploads set to 2.
  3. Upload a file.
  4. Watch the last POST request fail and the file fail to combine, even though all the parts have been sent successfully.

Expected behavior: I expect parallel uploads to work.

Setup details:

Node.js client code:

const fs = require('fs');
const tus = require("tus-js-client");

const path = "file.txt";
const file = fs.createReadStream(path);

const upload = new tus.Upload(file, {
  endpoint: 'http://127.0.0.1:47280/files/',
  parallelUploads: 2,

  // Callback for errors which cannot be fixed using retries
  onError: function(error) {
    console.log("Failed because: " + error)
  },

  // Callback for reporting upload progress
  onProgress: function(bytesUploaded, bytesTotal) {
    const percentage = (bytesUploaded / bytesTotal * 100).toFixed(2)
    console.log(bytesUploaded, bytesTotal, percentage + "%")
  },

  // Callback for once the upload is completed
  onSuccess: function() {
    console.log("Download %s from %s", upload.file.name, upload.url)
  }
});

// Check if there are any previous uploads to continue.
upload.findPreviousUploads()
  .then((previousUploads) => {
    // Found previous uploads so we select the first one.
    if (previousUploads.length) {
      upload.resumeFromPreviousUpload(previousUploads[0])
    }

    // Start the upload
    upload.start()
  })
boris-ning-usds commented 1 year ago

Ah, I can't tell if this is a bug or if it requires a feature request. I noticed that tusd with the Azure backend does not support the concatenation extension, which I assume is required for parallel uploads to work.

[tusd] 2022/10/31 21:38:36.132883 Supported tus extensions: creation,creation-with-upload,termination,creation-defer-length
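
For anyone else checking their deployment: the advertised extensions can also be queried with an OPTIONS request against the tus endpoint. A minimal TypeScript sketch (Node 18+ built-in fetch, endpoint taken from my setup above):

// Query the tus endpoint for the extensions it advertises.
async function checkExtensions(endpoint: string): Promise<void> {
  const res = await fetch(endpoint, { method: "OPTIONS" });
  const extensions = res.headers.get("Tus-Extension") ?? "";
  console.log("Supported extensions:", extensions);
  // parallelUploads in tus-js-client requires the concatenation extension.
  console.log("Concatenation supported:", extensions.includes("concatenation"));
}

checkExtensions("http://127.0.0.1:47280/files/").catch(console.error);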
Acconut commented 1 year ago

I noticed that tusd with Azure backend do not support the concatenation extension

Exactly! The azurestore does not yet support this extension, so parallelUploads from tus-js-client cannot be used with azurestore. I do not know how much effort it would take to implement this. Maybe @omBratteng knows more about this?

omBratteng commented 1 year ago

I was looking at implementing it, but I was a bit unsure how exactly to do it. Theoretically, you can upload the chunks in any order you like, as long as you know their original order before you commit the blob. So I chose not to implement it at the time.

OlafHaalstra commented 1 year ago

Alternatively, is there a way to let the Azure upload implementation take care of this, just like azcopy utilizes concurrent connections to speed up downloads/uploads?

https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-optimize#increase-the-number-of-concurrent-requests

omBratteng commented 1 year ago

I'm not sure how azcopy does it in the backend, but I would assume it knows the order of the chunks it breaks the file into, so when it comes time to commit the blob, it knows the order in which to commit them. Perhaps if there were a way for the tus client to tell the server the order of the chunks, we could pass that on to Azure when committing the block list.

So theoretically, if you have a file you split into 100 chunks, name them chunk 0 through 99. You could upload every odd chunk from 50 to 99, then every even chunk from 0 to 50, then every even chunk from 50 to 99, and finally every odd chunk from 0 to 50. Then just declare the order the chunks belong in when committing, and it should all be good to go.

I'm not sure if that is possible? @Acconut? Perhaps it's something to consider for the v2 spec? Where the client pre-determines the list of chunks?

Acconut commented 1 year ago

I'm not sure if that is possible? @Acconut? Perhaps it's something to consider for the v2 spec? Where the client pre-determines the list of chunks?

I am not sure I understand the question fully. Are you looking for a way to tell the server which chunks will be concatenated before a single chunk is uploaded? Because right now, we do the reverse: first upload the chunks and then tell the server in which order to concatenate them.

OlafHaalstra commented 1 year ago

I'm not sure if that is possible? @Acconut? Perhaps it's something to consider for the v2 spec? Where the client pre-determines the list of chunks?

I am not sure I understand the question fully. Are you looking for a way to tell the server which chunks will be concatenated before a single chunk is uploaded? Because right now, we do the reverse: first upload the chunks and then tell the server in which order to concatenate them.

Shouldn't it in that case already work to just push the file chunks in parallel (allowing parallelUploads) and 'commit' by telling Azure how to concatenate the chunks (from one specific file, once completed)?

Related to my use case: how easy would it be to run tusd (in combination with the tus-js-client) with support for multiple HTTP channels? I noticed that locally you can easily reach speeds of 100+ MB/s, but as soon as you try to reach Azure production it caps at 15 MB/s, even though my local bandwidth is in the range of 100 MB/s.

Acconut commented 1 year ago

Shouldn't it in that case already work to just push the file chunks in parallel (allowing parallelUploads) and 'commit' by telling Azure how to concatenate the chunks (from one specific file, once completed)?

I have no experience with Azure, but the procedure you described is exactly how it is performed for S3 and GCS. I don't know Azure's APIs but I assume it would work similarly, yes.

omBratteng commented 1 year ago

Because right now, we do the reverse: first upload the chunks and then tell the server in which order to concatenate them.

Does the client know which order it uploaded the chunks in? I haven't looked at tus-js-client in any detail, as our uploader was already written. But as I am refactoring our entire uploader, I will probably learn more myself.

Acconut commented 1 year ago

Does the client know which order it uploaded the chunks in?

The order should not matter in that way: chunks might be uploaded in parallel, so there is no temporal order. However, the client knows in which order the chunks should be concatenated and tells this order to the server in the POST request for the concatenation. Does this make sense? If not, I believe we are talking about different things here.
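
To make this concrete, here is a rough TypeScript sketch of that final request, following the concatenation extension of the tus 1.0 spec (the function name and error handling are just illustrative):

// Final concatenation request: partialUrls are the upload URLs of the parts,
// listed in the order the parts appear in the original file (not the order
// in which they finished uploading). Note this POST carries no Upload-Length.
async function concatenate(endpoint: string, partialUrls: string[]): Promise<string | null> {
  const res = await fetch(endpoint, {
    method: "POST",
    headers: {
      "Tus-Resumable": "1.0.0",
      // "final;" followed by the space-separated partial upload URLs.
      "Upload-Concat": `final;${partialUrls.join(" ")}`,
    },
  });
  // On success the server answers 201 Created with the final upload's URL.
  return res.status === 201 ? res.headers.get("Location") : null;
}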

omBratteng commented 1 year ago

I think I understand. And I think I need to do some testing/debugging to fully understand how this maps onto the way Azure Blob Storage does it.

Could you link to where in the js client it tells the server the order of chunks to concatenate?

Acconut commented 1 year ago

Sure, the entire relevant code for parallel uploads and concatenation is at https://github.com/tus/tus-js-client/blob/a456406f8e4232db416705952da5ef49a661c7ef/lib/upload.js#L255-L364.

It begins by slicing the entire file into multiple parts and uploading each part concurrently. Once the upload URL is available for a single part, it is stored in an array while preserving the position of the part within the entire file: https://github.com/tus/tus-js-client/blob/a456406f8e4232db416705952da5ef49a661c7ef/lib/upload.js#L315

Finally, these URLs are POSTed to the tus server for concatenation: https://github.com/tus/tus-js-client/blob/a456406f8e4232db416705952da5ef49a661c7ef/lib/upload.js#L335-L336

abdulmomin703 commented 1 year ago

Shouldn't it in that case already work to just push the file chunks in parallel (allowing parallelUploads) and 'commit' by telling Azure how to concatenate the chunks (from one specific file, once completed)?

I have no experience with Azure, but the procedure you described is exactly how it is performed for S3 and GCS. I don't know Azure's APIs but I assume it would work similarly, yes.

Azure works in a similar manner.

Azure provides the stageBlock method for uploading a single chunk; among its parameters, you have to specify a unique chunk/block ID. https://learn.microsoft.com/en-us/javascript/api/@azure/storage-blob/blockblobclient?view=azure-node-latest#@azure-storage-blob-blockblobclient-stageblock

After uploading all the blocks/chunks, the commitBlockList method is used to concatenate the blocks. This method takes a list of block IDs as a parameter, specifying the order in which the blocks will be concatenated. Hence, the order in which the blocks are uploaded does not matter. https://learn.microsoft.com/en-us/javascript/api/@azure/storage-blob/blockblobclient?view=azure-node-latest#@azure-storage-blob-blockblobclient-commitblocklist
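
Roughly, the flow looks like this (a minimal TypeScript sketch; the helper names are illustrative):

import { BlockBlobClient } from "@azure/storage-blob";

// Block IDs must be base64-encoded and all of the same length.
const blockId = (i: number) =>
  Buffer.from(i.toString().padStart(6, "0")).toString("base64");

async function uploadInParallel(client: BlockBlobClient, chunks: Buffer[]): Promise<void> {
  // Stage all chunks concurrently; completion order is irrelevant.
  await Promise.all(
    chunks.map((chunk, i) => client.stageBlock(blockId(i), chunk, chunk.length))
  );
  // The ordered list of IDs passed here determines the final blob's layout.
  await client.commitBlockList(chunks.map((_, i) => blockId(i)));
}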

Acconut commented 1 year ago

That's great. Is there also a method to list these blocks? If so, a lot of code from the s3store could be reused for this.

abdulmomin703 commented 1 year ago

That's great. Is there also a method to list these blocks? If so, a lot of code from the s3store could be reused for this.

Yes, there is. https://learn.microsoft.com/en-us/javascript/api/@azure/storage-blob/blockblobclient?view=azure-node-latest#@azure-storage-blob-blockblobclient-getblocklist
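
For example, to inspect blocks that have been staged but not yet committed (again just a sketch, reusing the BlockBlobClient from above):

// List staged-but-uncommitted blocks, e.g. to verify all parts arrived
// before committing, or to resume an interrupted upload.
async function listUncommitted(client: BlockBlobClient): Promise<void> {
  const { uncommittedBlocks } = await client.getBlockList("uncommitted");
  for (const block of uncommittedBlocks ?? []) {
    console.log(block.name, block.size);
  }
}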

Acconut commented 1 year ago

Great! In that case, it seems easy to implement a concatenation function similar to the one that already exists in the s3store: https://github.com/tus/tusd/blob/9cf626bf0ca5109815450eac49d8ea34072edc3e/pkg/s3store/s3store.go#L695-L819 It has two methods for concatenation because S3 has some quirks with small parts, so it might be simpler for Azure.

However, I think this could be implemented easily. Let me know if someone is interested in doing so :)
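
tusd itself is written in Go, but the Azure mechanics can be sketched in TypeScript with the same SDK methods discussed above. Assuming each partial upload ends up as its own committed blob, the final blob could be assembled server-side with stageBlockFromURL; all names here are hypothetical:

import { ContainerClient } from "@azure/storage-blob";

// Hypothetical concatenation: stage one block per partial blob by copying
// its content from the source URL, then commit the blocks in order.
async function concatUploads(container: ContainerClient, finalKey: string, partialKeys: string[]): Promise<void> {
  const dest = container.getBlockBlobClient(finalKey);
  const ids: string[] = [];
  for (const [i, key] of partialKeys.entries()) {
    const id = Buffer.from(i.toString().padStart(6, "0")).toString("base64");
    // The source blob must be readable by the storage service,
    // e.g. via a SAS URL; the plain .url is used here for brevity.
    await dest.stageBlockFromURL(id, container.getBlockBlobClient(key).url);
    ids.push(id);
  }
  // The order of IDs here defines the layout of the final file.
  await dest.commitBlockList(ids);
}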

omBratteng commented 1 year ago

The Azure store does already use block lists:

https://github.com/tus/tusd/blob/7225439860d8675b231f408f8e9a26b3fadcf1e2/pkg/azurestore/azureservice.go#L206-L213

https://github.com/tus/tusd/blob/7225439860d8675b231f408f8e9a26b3fadcf1e2/pkg/azurestore/azureservice.go#L235-L244

I guess I could take a run at it, but I can't promise any timeframe as I am a bit busy with selling my apartment and moving at the moment.

Acconut commented 1 year ago

I guess I could take a run at it, but I can't promise any timeframe as I am a bit busy with selling my apartment and moving at the moment.

No worries, good luck with your move!

sebasptsch commented 1 year ago

Hi, I was wondering if there has been any follow-up discussion behind the scenes on concatenation with Azure?

Acconut commented 1 year ago

Every discussion on this topic has happened in this issue, so there is nothing hidden from the public :) We all agree that we would like to accept support for concatenation in the Azure storage. Somebody just needs to have the time to implement it. Let me know if you are interested in doing so!