boris-ning-usds opened 1 year ago
Ah, can't tell if this is a bug or if this requires a feature request. I noticed that tusd with the Azure backend does not support the concatenation extension, which I assume is required for parallel uploads to work.
```
[tusd] 2022/10/31 21:38:36.132883 Supported tus extensions: creation,creation-with-upload,termination,creation-defer-length
```
> I noticed that tusd with the Azure backend does not support the concatenation extension
Exactly! The azurestore does not yet support this extension, so parallelUploads from tus-js-client cannot be used with azurestore. I do not know how much effort it would be to implement this. Maybe @omBratteng knows more about this?
I was looking at implementing it, but I was a bit unsure how to exactly do it. Theoretically, you could upload the chunks in any order you'd like, as long as you know their original order before you commit the blob. So I chose to not implement it at the time.
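The idea described here, that chunks can be uploaded in any order as long as their original order is known before the blob is committed, can be sketched with a small helper. This is only an illustration; `planChunks` and `ChunkRange` are made-up names, not part of tusd or the Azure SDK:

```typescript
// A chunk's position in the original file, tracked independently of upload order.
interface ChunkRange {
  index: number; // position in the final blob
  start: number; // byte offset, inclusive
  end: number;   // byte offset, exclusive
}

// Split a file of totalSize bytes into numbered ranges of at most chunkSize bytes.
// The index is what matters at commit time; upload order is free.
function planChunks(totalSize: number, chunkSize: number): ChunkRange[] {
  const chunks: ChunkRange[] = [];
  for (let start = 0, index = 0; start < totalSize; start += chunkSize, index++) {
    chunks.push({ index, start, end: Math.min(start + chunkSize, totalSize) });
  }
  return chunks;
}
```

A 100-byte file with 30-byte chunks yields four ranges, indexed 0 through 3; those indices, not the upload timestamps, define the commit order.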
Is there alternatively a way to let the Azure upload implementation take care of this, just like azcopy utilizes concurrent connections to speed up downloads/uploads?
I'm not sure how azcopy does it in the backend, but I would assume it would know the order of the chunks it breaks the file into. So when it comes to commit the blob, they know the order of the chunks to commit. Perhaps if there was a way for the tus client to tell the server the order of the chunks, we could then pass that on to Azure when committing the chunk list.
So, theoretically, if you have a file split into 100 chunks, name them chunk 0 through 99. You could upload every odd chunk from 50 to 99, then every even chunk from 0 to 49, then every even chunk from 50 to 99, and finally every odd chunk from 0 to 49. Then just declare that the chunks belong in this order when committing, and it should be all good to go.
I'm not sure if that is possible? @Acconut? Perhaps it's something to consider for the v2 spec, where the client pre-determines the list of chunks?
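The odd/even thought experiment above can be modeled in a few lines, assuming a stage-then-commit API. The `stageBlock`/`commitBlockList` names mirror Azure's block blob API, but this is an in-memory stand-in, not the SDK:

```typescript
// Toy model of "upload in any order, commit in order": staged blocks are keyed
// by ID, and only the commit list determines the final byte order.
const staged = new Map<string, string>();

function stageBlock(id: string, data: string): void {
  staged.set(id, data);
}

function commitBlockList(ids: string[]): string {
  return ids
    .map((id) => {
      const data = staged.get(id);
      if (data === undefined) throw new Error(`block ${id} was never staged`);
      return data;
    })
    .join("");
}

// Stage chunks out of order (odd indices first, then even):
stageBlock("chunk-1", "BB");
stageBlock("chunk-3", "DD");
stageBlock("chunk-0", "AA");
stageBlock("chunk-2", "CC");

// Commit in the original order; the staging order is irrelevant.
const blob = commitBlockList(["chunk-0", "chunk-1", "chunk-2", "chunk-3"]);
```

Here `blob` comes out as `"AABBCCDD"` no matter which order the chunks were staged in, which is exactly the property parallel uploads rely on.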
> I'm not sure if that is possible? @Acconut? Perhaps it's something to consider for the v2 spec, where the client pre-determines the list of chunks?
I am not sure I understand the question fully. Are you looking for a way to tell the server which chunks will be concatenated before a single chunk is uploaded? Because right now we do the reverse: first upload the chunks and then tell the server in which order to concatenate them.
> I'm not sure if that is possible? @Acconut? Perhaps it's something to consider for the v2 spec, where the client pre-determines the list of chunks?

> I am not sure I understand the question fully. Are you looking for a way to tell the server which chunks will be concatenated before a single chunk is uploaded? Because right now we do the reverse: first upload the chunks and then tell the server in which order to concatenate them.
Shouldn't it in that case already work to just push the file chunks in parallel (allowing parallelUploads) and 'commit' by telling Azure how to concatenate the chunks (from one specific file, once completed)?
Related to this, for my use case: how easy would it be to run tusd (in combination with the tus-js-client) with support for multiple HTTP channels? I noticed that locally you can easily reach speeds of 100+ MB/s, but as soon as you try to reach Azure production it caps at 15 MB/s, even though my local bandwidth is in the range of 100 MB/s.
> Shouldn't it in that case already work to just push the file chunks in parallel (allowing parallelUploads) and 'commit' by telling Azure how to concatenate the chunks (from one specific file, once completed)?
I have no experience with Azure, but the procedure you described is exactly how it is performed for S3 and GCS. I don't know Azure's APIs but I assume it would work similarly, yes.
> Because right now we do the reverse: first upload the chunks and then tell the server in which order to concatenate them.
Does the client know which order it uploaded the chunks in? I haven't looked at tus-js-client in any detail, as our uploader was already created. But as I am refactoring our entire uploader, I will probably learn more myself.
> Does the client know which order it uploaded the chunks in?
The order should not matter in this way. Chunks might be uploaded in parallel, so there is no temporal order. However, the client knows in which order the chunks should be concatenated and communicates this order to the server in the POST request for the concatenation. Does this make sense? If not, I believe we are talking about different things here.
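For reference, the tus concatenation extension expresses that order as an `Upload-Concat` header on the final POST: the value is `final;` followed by the partial uploads' URLs, space-separated, in the desired order. A minimal sketch (the URLs below are placeholders, and `buildUploadConcatHeader` is an illustrative name, not from tus-js-client):

```typescript
// Build the Upload-Concat header value for the final concatenation request.
// The order of partUrls is the order in which the server will concatenate them.
function buildUploadConcatHeader(partUrls: string[]): string {
  return `final;${partUrls.join(" ")}`;
}

const header = buildUploadConcatHeader([
  "https://tus.example/files/part-a",
  "https://tus.example/files/part-b",
]);
```

So even if `part-b` finished uploading first, listing `part-a` before it in this header is all the server needs to assemble the file correctly.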
I think I understand. And I think I need to do some testing/debugging to fully understand in the concept of how Azure Blob Storage does it.
Could you link to where in the js client it tells the server the order of chunks to concatenate?
Sure, the entire relevant code for parallel uploads and concatenation is at https://github.com/tus/tus-js-client/blob/a456406f8e4232db416705952da5ef49a661c7ef/lib/upload.js#L255-L364.
It begins by slicing the entire file into multiple parts and uploading each part concurrently. Once the upload URL is available for a single part, it is stored in an array while preserving the position of the part in the entire file: https://github.com/tus/tus-js-client/blob/a456406f8e4232db416705952da5ef49a661c7ef/lib/upload.js#L315
Finally, these URLs are POSTed to the tus server for concatenation: https://github.com/tus/tus-js-client/blob/a456406f8e4232db416705952da5ef49a661c7ef/lib/upload.js#L335-L336
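That position-preserving bookkeeping reduces to a tiny pattern: parts may complete in any order, but each writes its URL into the slot reserved for its index. A sketch (`collectPartUrls` is a made-up name, not from tus-js-client):

```typescript
// Each completion reports which part it was; the array is filled by index,
// so it ends up in file order regardless of completion order.
function collectPartUrls(
  completions: Array<{ index: number; url: string }>,
  partCount: number,
): string[] {
  const urls = new Array<string>(partCount);
  for (const { index, url } of completions) {
    urls[index] = url;
  }
  return urls;
}

// Parts finish as 2, 0, 1 — but the result is still in file order:
const urls = collectPartUrls(
  [
    { index: 2, url: "/files/part-c" },
    { index: 0, url: "/files/part-a" },
    { index: 1, url: "/files/part-b" },
  ],
  3,
);
```

The ordered `urls` array is then exactly what gets sent in the concatenation POST.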
> Shouldn't it in that case already work to just push the file chunks in parallel (allowing parallelUploads) and 'commit' by telling Azure how to concatenate the chunks (from one specific file, once completed)?

> I have no experience with Azure, but the procedure you described is exactly how it is performed for S3 and GCS. I don't know Azure's APIs but I assume it would work similarly, yes.
Azure works in a similar manner.
Azure provides the stageBlock method for uploading a single chunk; in its parameters you have to specify a unique chunk/block ID. https://learn.microsoft.com/en-us/javascript/api/@azure/storage-blob/blockblobclient?view=azure-node-latest#@azure-storage-blob-blockblobclient-stageblock
After uploading all the blocks/chunks, the commitBlockList method is used to concatenate them. This method takes a list of block IDs as a parameter to specify the order in which the blocks will be concatenated. Hence, the order in which the blocks are uploaded does not matter. https://learn.microsoft.com/en-us/javascript/api/@azure/storage-blob/blockblobclient?view=azure-node-latest#@azure-storage-blob-blockblobclient-commitblocklist
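One practical detail when implementing this: Azure requires block IDs to be base64-encoded, and all block IDs within a blob must have the same length, so zero-padding an index before encoding is a common approach. A minimal Node sketch (`blockId` is an illustrative helper, not part of the Azure SDK; `Buffer` is Node's built-in base64 utility):

```typescript
// Generate a base64 block ID from a chunk index. Padding to a fixed width
// ensures every ID in the blob has the same length, as Azure requires.
function blockId(index: number): string {
  const padded = String(index).padStart(6, "0");
  return Buffer.from(padded, "utf8").toString("base64");
}

// These IDs would be passed to stageBlock() during upload and then, in
// file order, to commitBlockList() to assemble the final blob.
const commitList = [0, 1, 2, 3].map(blockId);
```

Because the ID encodes the chunk's position, the commit list can always be reconstructed in the right order even if the blocks were staged out of order.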
That's great. Is there also a method to list these blocks? If so, a lot of code from the s3store could be reused for this.
> That's great. Is there also a method to list these blocks? If so, a lot of code from the s3store could be reused for this.
Great! In that case, it seems easy to implement a similar function for concatenation as already exists in the s3store: https://github.com/tus/tusd/blob/9cf626bf0ca5109815450eac49d8ea34072edc3e/pkg/s3store/s3store.go#L695-L819 It has two methods for concatenation because S3 has some quirks with small parts, so maybe it would be simpler for Azure.
However, I think this could be implemented easily. Let me know if someone is interested in doing so :)
The Azure store does use a block list.
I guess I could take a run at it, but I can't promise any timeframe as I am a bit busy with selling my apartment and moving at the moment.
> I guess I could take a run at it, but I can't promise any timeframe as I am a bit busy with selling my apartment and moving at the moment.
No worries, good luck with your move!
Hi, I was wondering if there was any follow-up discussion behind the scenes on concatenation with Azure?
Every discussion on this topic has happened in this issue, so there is nothing hidden from the public :) We all agree that we would like to accept support for concatenation in the Azure storage. Somebody just needs to have the time to implement it. Let me know if you are interested in doing so!
**Technical knowledge of user**
I'm new to using tusd and the tus-js-client, so I'll be happy if anybody can point out any problems in my setup for testing.
**Describe the bug**
While running tusd with an Azure Storage backend, I found that the transfer would always fail on the last HTTP POST with a missing or invalid Upload-Length header. This only occurs when parallel uploads (connections) are set to at least 2.
Snippet of tusd logs during the end of a transfer:
Client-side NodeJS using tus-js-client:
**To Reproduce**
Steps to reproduce the behavior:
```shell
podman run -p 47280:47280 -v /tmp/host/tus:/srv/ --env AZURE_STORAGE_ACCOUNT=<storage account name> --env AZURE_STORAGE_KEY=<azure storage key> docker.io/tusproject/tusd:v1.10.0 -port=47280 -azure-blob-access-tier=hot -azure-object-prefix=tus-prefix -azure-storage=tus-file-container -azure-endpoint=https://<storage account name>.blob.core.windows.net
```
**Expected behavior**
I expect parallel uploads to work.
**Setup details**
Please provide the following details, if applicable to your situation:
```
-port=47280 -azure-blob-access-tier=hot -azure-object-prefix=tus-prefix -azure-storage=tus-file-container -azure-endpoint=https://<storage account name>.blob.core.windows.net
```
NodeJS client code