microsoft / AzureStorageExplorer

Easily manage the contents of your storage account with Azure Storage Explorer. Upload, download, and manage blobs, files, queues, tables, and Cosmos DB entities. Gain easy access to manage your virtual machine disks. Work with either Azure Resource Manager or classic storage accounts, plus manage and configure cross-origin resource sharing (CORS) rules.

Uploading large files: files get "stuck" partially uploaded and never complete #8200

Open kmelmon opened 2 months ago

kmelmon commented 2 months ago

Preflight Checklist

Storage Explorer Version

1.35.0

Regression From

1.19.1

Architecture

x86

Storage Explorer Build Number

No response

Platform

Windows

OS Version

Windows 11

Bug Description

When uploading large files, the file gets "stuck" after partially uploading. Progress never goes beyond a certain point and the file never finishes uploading. In my most recent scenario the file was around 750 GB, but I have seen this sporadically in the past for files as small as 1 GB.

The issue seems likely to be related to the use of OAuth for authentication, which was added relatively recently. If I roll back to version 1.19.1, which uses Active Directory authentication, I don't see the issue.

The repro rate is 100% with the 750 GB file.

I've attached an AzCopy log file from the failure case: 9978b3ad-4aa4-3944-7646-d12ca3c1bcdc.log

Steps to Reproduce

  1. Launch Azure Storage Explorer
  2. Connect to a blob container using OAuth
  3. Upload a 750 GB file to the container

Actual Experience

The upload partially completes then gets "stuck" and never makes any further progress.

Expected Experience

The file should upload completely.

Additional Context

No response

craxal commented 2 months ago

Can you also share your Storage Explorer app logs? How long does it take for the hang to occur?

I can't seem to find the links right now, but other users have opened a few issues related to this.

We have an ongoing discussion with the AzCopy developers about an authentication question. Essentially, AzCopy clients are expected to insert an authentication token into the system credential store, and AzCopy reads that token to do its work. For long-running operations, clients also need to refresh the token. As far as we can tell, Storage Explorer does all of this as expected. What remains to be determined is whether AzCopy is checking the credential store for refreshed tokens at regular intervals.
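To make the expected flow concrete, here's a rough sketch of the handoff (not Storage Explorer's actual code; the credential-store names, the refresh interval, and the refreshAccessToken helper are all placeholders; AZCOPY_CRED_TYPE is the variable Storage Explorer passes to AzCopy):

```typescript
import { spawn } from "child_process";
import * as keytar from "keytar"; // system credential store wrapper (illustrative choice)

const SERVICE = "StorageExplorer";     // hypothetical credential-store service name
const ACCOUNT = "azcopy-oauth-token";  // hypothetical entry name AzCopy would read

async function uploadWithAzCopy(args: string[], refreshAccessToken: () => Promise<string>) {
  // 1. The client seeds the credential store with a fresh token before launching AzCopy.
  await keytar.setPassword(SERVICE, ACCOUNT, await refreshAccessToken());

  // 2. AzCopy is launched and told to read its OAuth token from the store.
  const azcopy = spawn("azcopy", args, {
    env: { ...process.env, AZCOPY_CRED_TYPE: "OAuthToken" },
  });

  // 3. For long-running jobs, the client refreshes the stored token well before
  //    the ~1 hour expiry. The open question is whether AzCopy re-reads the
  //    store after it starts.
  const refresher = setInterval(async () => {
    await keytar.setPassword(SERVICE, ACCOUNT, await refreshAccessToken());
  }, 30 * 60 * 1000);

  azcopy.on("exit", () => clearInterval(refresher));
}
```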

kmelmon commented 1 month ago

Here are the app logs. I don't know exactly how long it takes for the hang to occur, but you should be able to work it out from the AzCopy log; from rough observation, it seems to be under an hour. AppLogs.zip

craxal commented 1 month ago

Yeah, that sounds about right. Tokens are normally good for about an hour before they need to be refreshed. So, it definitely sounds like AzCopy is not getting the refreshed token.
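If you want to confirm the timing on your side, Azure AD access tokens are standard JWTs, so you can decode the exp claim locally with a few lines of Node (a quick sketch, not an official tool; run it only on your own machine and never paste tokens anywhere public):

```typescript
// Decode the "exp" claim of a JWT access token to see exactly when it expires.
function tokenExpiry(jwt: string): Date {
  const payloadB64 = jwt.split(".")[1];
  const payload = JSON.parse(Buffer.from(payloadB64, "base64url").toString("utf8"));
  return new Date(payload.exp * 1000); // "exp" is seconds since the Unix epoch
}

console.log(tokenExpiry(process.argv[2]));
```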

One way to test this is by running AzCopy directly yourself. You can copy the command Storage Explorer uses from the link in the Activity Log. If AzCopy can't complete the transfer on its own, then the problem is in AzCopy itself.
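For example, something like this (placeholder destination; the exact command and any extra flags are in the Activity Log):

```
azcopy login
azcopy copy "Z:\downsample_text_bilinear.hdf5" "https://<account>.blob.core.windows.net/<container>/<path>" --blob-type=BlockBlob
```

If the standalone run fails the same way after about an hour, that points at AzCopy's token handling rather than Storage Explorer.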

kmelmon commented 1 month ago

```
2024/09/25 19:05:28 PERF: primary performance constraint is Disk. States: X: 0, O: 0, M: 0, L: 0, R: 0, D: 1, W: 0, F: 0, B: 1, E: 0, T: 2, GRs: 5
2024/09/25 19:05:28 10.7 %, 0 Done, 0 Failed, 1 Pending, 0 Skipped, 1 Total, 2-sec Throughput (Mb/s): 134.1982 (Disk may be limiting speed)
2024/09/25 19:05:28 ==> REQUEST/RESPONSE (Try=1/306.3515ms, OpTime=469.0319ms) -- RESPONSE SUCCESSFULLY RECEIVED PUT https://gfxmltrainingstore1.blob.core.windows.net/cadmus/TestLargeUpload%2Ftest2%2Fdownsample_text_bilinear.hdf5?blockid=MDAwMDBsRMGle8NE2GLVLVZQMZkSMDAwMDAwMDAwMDA0OTUy&comp=block X-Ms-Request-Id: [5f0b258a-c01e-0111-037d-0f3e52000000]

2024/09/25 19:05:29 ==> REQUEST/RESPONSE (Try=1/290.7196ms, OpTime=564.8591ms) -- RESPONSE SUCCESSFULLY RECEIVED PUT https://gfxmltrainingstore1.blob.core.windows.net/cadmus/TestLargeUpload%2Ftest2%2Fdownsample_text_bilinear.hdf5?blockid=MDAwMDBsRMGle8NE2GLVLVZQMZkSMDAwMDAwMDAwMDA0OTUz&comp=block X-Ms-Request-Id: [5f0b2750-c01e-0111-157d-0f3e52000000]

2024/09/25 19:05:30 PERF: primary performance constraint is Disk. States: X: 0, O: 0, M: 0, L: 0, R: 0, D: 1, W: 1, F: 0, B: 0, E: 0, T: 2, GRs: 5
2024/09/25 19:05:30 10.7 %, 0 Done, 0 Failed, 1 Pending, 0 Skipped, 1 Total, 2-sec Throughput (Mb/s): 66.8998 (Disk may be limiting speed)
2024/09/25 19:05:31 ==> REQUEST/RESPONSE (Try=1/228.61ms, OpTime=525.8384ms) -- RESPONSE SUCCESSFULLY RECEIVED PUT https://gfxmltrainingstore1.blob.core.windows.net/cadmus/TestLargeUpload%2Ftest2%2Fdownsample_text_bilinear.hdf5?blockid=MDAwMDBsRMGle8NE2GLVLVZQMZkSMDAwMDAwMDAwMDA0OTU0&comp=block X-Ms-Request-Id: [5f0b28e6-c01e-0111-7f7d-0f3e52000000]

2024/09/25 19:05:32 PERF: primary performance constraint is Disk. States: X: 0, O: 0, M: 0, L: 0, R: 0, D: 1, W: 0, F: 0, B: 1, E: 0, T: 2, GRs: 5
2024/09/25 19:05:32 10.7 %, 0 Done, 0 Failed, 1 Pending, 0 Skipped, 1 Total, 2-sec Throughput (Mb/s): 129.4926 (Disk may be limiting speed)
2024/09/25 19:05:32 ==> REQUEST/RESPONSE (Try=1/434.9037ms, OpTime=693.251ms) -- RESPONSE SUCCESSFULLY RECEIVED PUT https://gfxmltrainingstore1.blob.core.windows.net/cadmus/TestLargeUpload%2Ftest2%2Fdownsample_text_bilinear.hdf5?blockid=MDAwMDBsRMGle8NE2GLVLVZQMZkSMDAwMDAwMDAwMDA0OTU1&comp=block X-Ms-Request-Id: [5f0b2b1a-c01e-0111-777d-0f3e52000000]

2024/09/25 19:05:33 ==> REQUEST/RESPONSE (Try=1/307.093ms, OpTime=551.3149ms) -- RESPONSE SUCCESSFULLY RECEIVED PUT https://gfxmltrainingstore1.blob.core.windows.net/cadmus/TestLargeUpload%2Ftest2%2Fdownsample_text_bilinear.hdf5?blockid=MDAwMDBsRMGle8NE2GLVLVZQMZkSMDAwMDAwMDAwMDA0OTU2&comp=block X-Ms-Request-Id: [5f0b2cbd-c01e-0111-797d-0f3e52000000]

2024/09/25 19:05:34 PERF: primary performance constraint is Disk. States: X: 0, O: 0, M: 0, L: 0, R: 0, D: 1, W: 0, F: 0, B: 1, E: 0, T: 2, GRs: 5
2024/09/25 19:05:34 10.7 %, 0 Done, 0 Failed, 1 Pending, 0 Skipped, 1 Total, 2-sec Throughput (Mb/s): 71.2911 (Disk may be limiting speed)
2024/09/25 19:05:35 ==> REQUEST/RESPONSE (Try=1/324.4066ms, OpTime=623.817ms) -- RESPONSE SUCCESSFULLY RECEIVED PUT https://gfxmltrainingstore1.blob.core.windows.net/cadmus/TestLargeUpload%2Ftest2%2Fdownsample_text_bilinear.hdf5?blockid=MDAwMDBsRMGle8NE2GLVLVZQMZkSMDAwMDAwMDAwMDA0OTU3&comp=block X-Ms-Request-Id: [5f0b2e6a-c01e-0111-707d-0f3e52000000]

2024/09/25 19:05:35 ==> REQUEST/RESPONSE (Try=1/0s, OpTime=62.7333ms) -- RESPONSE STATUS CODE ERROR PUT https://gfxmltrainingstore1.blob.core.windows.net/cadmus/TestLargeUpload%2Ftest2%2Fdownsample_text_bilinear.hdf5?blockid=MDAwMDBsRMGle8NE2GLVLVZQMZkSMDAwMDAwMDAwMDA0OTU4&comp=block Accept: application/xml Authorization: REDACTED Content-Length: 16777216 Content-Type: application/octet-stream User-Agent: AzCopy/10.26.0 azsdk-go-azblob/v1.4.0 (go1.22.5; Windows_NT) X-Ms-Client-Request-Id: d4462e6d-5baf-4df7-6977-1171eaf0106e x-ms-version: 2023-08-03

RESPONSE Status: 401 Server failed to authenticate the request. Please refer to the information in the www-authenticate header. Content-Length: 404 Content-Type: application/xml Date: Wed, 25 Sep 2024 19:05:37 GMT Server: Microsoft-HTTPAPI/2.0 Www-Authenticate: Bearer authorization_uri=https://login.microsoftonline.com/72f988bf-86f1-41af-91ab-2d7cd011db47/oauth2/authorize resource_id=https://storage.azure.com X-Ms-Error-Code: InvalidAuthenticationInfo X-Ms-Request-Id: 5f0b3046-c01e-0111-327d-0f3e52000000 Response Details: InvalidAuthenticationInfoServer failed to authenticate the request. Please refer to the information in the www-authenticate header. Lifetime validation failed. The token is expired.

2024/09/25 19:05:36 ==> REQUEST/RESPONSE (Try=1/66.8347ms, OpTime=66.8347ms) -- RESPONSE STATUS CODE ERROR PUT https://gfxmltrainingstore1.blob.core.windows.net/cadmus/TestLargeUpload%2Ftest2%2Fdownsample_text_bilinear.hdf5?blockid=MDAwMDBsRMGle8NE2GLVLVZQMZkSMDAwMDAwMDAwMDA0OTU4&comp=block Accept: application/xml Authorization: REDACTED Content-Length: 16777216 Content-Type: application/octet-stream User-Agent: AzCopy/10.26.0 azsdk-go-azblob/v1.4.0 (go1.22.5; Windows_NT) X-Ms-Client-Request-Id: d4462e6d-5baf-4df7-6977-1171eaf0106e x-ms-version: 2023-08-03

RESPONSE Status: 401 Server failed to authenticate the request. Please refer to the information in the www-authenticate header. Content-Length: 404 Content-Type: application/xml Date: Wed, 25 Sep 2024 19:05:37 GMT Server: Microsoft-HTTPAPI/2.0 Www-Authenticate: Bearer authorization_uri=https://login.microsoftonline.com/72f988bf-86f1-41af-91ab-2d7cd011db47/oauth2/authorize resource_id=https://storage.azure.com X-Ms-Error-Code: InvalidAuthenticationInfo X-Ms-Request-Id: 0ebdb9b9-301e-0047-6a7d-0f31b0000000 Response Details: InvalidAuthenticationInfoServer failed to authenticate the request. Please refer to the information in the www-authenticate header. Lifetime validation failed. The token is expired.

2024/09/25 19:05:36 INFO: [P#0-T#0] Early close of chunk in singleChunkReader with context still active
2024/09/25 19:05:36 ERR: [P#0-T#0] UPLOADFAILED: \\?\Z:\downsample_text_bilinear.hdf5 : 401 : 401 Server failed to authenticate the request. Please refer to the information in the www-authenticate header.. When Staging block. X-Ms-Request-Id: 0ebdb9b9-301e-0047-6a7d-0f31b0000000

Dst: https://gfxmltrainingstore1.blob.core.windows.net/cadmus/TestLargeUpload/test2/downsample_text_bilinear.hdf5
2024/09/25 19:05:36 PERF: primary performance constraint is Disk. States: X: 0, O: 0, M: 0, L: 0, R: 0, D: 1, W: 0, F: 0, B: 0, E: 0, T: 1, GRs: 5
2024/09/25 19:05:36 10.7 %, 0 Done, 0 Failed, 1 Pending, 0 Skipped, 1 Total, 2-sec Throughput (Mb/s): 130.1875 (Disk may be limiting speed)
2024/09/25 19:05:37 ==> REQUEST/RESPONSE (Try=1/10.7782ms, OpTime=76.9684ms) -- RESPONSE STATUS CODE ERROR GET https://gfxmltrainingstore1.blob.core.windows.net/cadmus/TestLargeUpload%2Ftest2%2Fdownsample_text_bilinear.hdf5?blocklisttype=all&comp=blocklist Accept: application/xml Authorization: REDACTED User-Agent: AzCopy/10.26.0 azsdk-go-azblob/v1.4.0 (go1.22.5; Windows_NT) X-Ms-Client-Request-Id: 795c7943-17e5-4c71-62f9-e46512d1b621 x-ms-version: 2023-08-03

RESPONSE Status: 401 Server failed to authenticate the request. Please refer to the information in the www-authenticate header. Content-Length: 404 Content-Type: application/xml Date: Wed, 25 Sep 2024 19:05:39 GMT Server: Microsoft-HTTPAPI/2.0 Vary: Origin Www-Authenticate: Bearer authorization_uri=https://login.microsoftonline.com/72f988bf-86f1-41af-91ab-2d7cd011db47/oauth2/authorize resource_id=https://storage.azure.com X-Ms-Error-Code: InvalidAuthenticationInfo X-Ms-Request-Id: dc134f59-401e-0000-597d-0f5aeb000000 Response Details: InvalidAuthenticationInfoServer failed to authenticate the request. Please refer to the information in the www-authenticate header. Lifetime validation failed. The token is expired.

2024/09/25 19:05:37 ==> REQUEST/RESPONSE (Try=1/502.9µs, OpTime=502.9µs) -- RESPONSE STATUS CODE ERROR GET https://gfxmltrainingstore1.blob.core.windows.net/cadmus/TestLargeUpload%2Ftest2%2Fdownsample_text_bilinear.hdf5?blocklisttype=all&comp=blocklist Accept: application/xml Authorization: REDACTED User-Agent: AzCopy/10.26.0 azsdk-go-azblob/v1.4.0 (go1.22.5; Windows_NT) X-Ms-Client-Request-Id: 795c7943-17e5-4c71-62f9-e46512d1b621 x-ms-version: 2023-08-03

RESPONSE Status: 401 Server failed to authenticate the request. Please refer to the information in the www-authenticate header. Content-Length: 404 Content-Type: application/xml Date: Wed, 25 Sep 2024 19:05:39 GMT Server: Microsoft-HTTPAPI/2.0 Vary: Origin Www-Authenticate: Bearer authorization_uri=https://login.microsoftonline.com/72f988bf-86f1-41af-91ab-2d7cd011db47/oauth2/authorize resource_id=https://storage.azure.com X-Ms-Error-Code: InvalidAuthenticationInfo X-Ms-Request-Id: dc134f5f-401e-0000-5e7d-0f5aeb000000 Response Details: InvalidAuthenticationInfoServer failed to authenticate the request. Please refer to the information in the www-authenticate header. Lifetime validation failed. The token is expired.

2024/09/25 19:05:37 JobID=a5c1446c-c37b-d844-62d5-2d5650319912, Part#=0, TransfersDone=1 of 1
2024/09/25 19:05:37 all parts of entire Job a5c1446c-c37b-d844-62d5-2d5650319912 successfully completed, cancelled or paused
2024/09/25 19:05:37 is part of Job which 1 total number of parts done
2024/09/25 19:05:38 PERF: primary performance constraint is Unknown. States: X: 0, O: 0, M: 0, L: 0, R: 0, D: 0, W: 0, F: 0, B: 0, E: 0, T: 0, GRs: 5
2024/09/25 19:05:38 0.0 %, 0 Done, 1 Failed, 0 Pending, 0 Skipped, 1 Total,

2024/09/25 19:05:38
Diagnostic stats:
IOPS: 1
End-to-end ms per request: 669
Network Errors: 0.00%
Server Busy: 0.00%

Job a5c1446c-c37b-d844-62d5-2d5650319912 summary
Elapsed Time (Minutes): 71.5707
Number of File Transfers: 1
Number of Folder Property Transfers: 0
Number of Symlink Transfers: 0
Total Number of Transfers: 1
Number of File Transfers Completed: 0
Number of Folder Transfers Completed: 0
Number of File Transfers Failed: 1
Number of Folder Transfers Failed: 0
Number of File Transfers Skipped: 0
Number of Folder Transfers Skipped: 0
Total Number of Bytes Transferred: 0
Final Job Status: Failed

2024/09/25 19:05:38 Closing Log
```

craxal commented 1 month ago

Yep, the error message says the token is expired, so I highly suspect that's why the job is hanging. I'm pretty sure something in AzCopy isn't checking for a refreshed token. I'll keep pressing the AzCopy team to get something moving.

kmelmon commented 1 month ago

Any update on this? Our team would really benefit from a fix as we use Azure Storage for all our ML training data.

craxal commented 1 month ago

We have no significant updates. The AzCopy team is aware of the issue and is currently looking into it.

jarkko-siitonen commented 1 week ago

My team is facing the same issue with large files in a major customer project. Can you point me to the AzCopy issue that tracks this bug?

To some extent, we were able to work around this OAuth expiration issue by using SAS tokens for authentication. Unfortunately, Storage Explorer appears to default to SAS token authentication only if the user has the storage-account-level Microsoft.Storage/storageAccounts/listkeys/action permission, which we can't grant to our end users. A user can create a SAS token manually without that permission and use it to complete large transfers successfully, but walking every end user through that manual method isn't feasible. Maybe Storage Explorer could default to SAS tokens even without the listkeys permission?

kmelmon commented 1 week ago

We've been using version 1.19.1 as a workaround for now: https://github.com/microsoft/AzureStorageExplorer/releases/tag/V1.19.1

craxal commented 1 week ago

@jarkko-siitonen Keep in mind there are two different kinds of SAS: key-based SAS and user delegation SAS. Storage Explorer supports both, so you do not need to give out List Keys permissions. If the user has permission to generate a user delegation SAS, Storage Explorer should be able to use that method instead.

@JasonYeMSFT Correct me if I'm wrong?

jarkko-siitonen commented 1 week ago

@craxal Yes, but how do you make Storage Explorer actually use SAS tokens without either (a) generating the SAS token manually and connecting Storage Explorer to the shared access signature URL, or (b) granting the listkeys action?

My testing seems to indicate that Storage Explorer passes AZCOPY_CRED_TYPE = "OAuthToken" to AzCopy unless one of those two conditions is true. I would be delighted if there were a third way to make Storage Explorer use SAS.

JasonYeMSFT commented 1 week ago

@jarkko-siitonen What Craig said is right. Storage Explorer can generate a user delegation SAS, but it can only do so after you are signed in, because it needs your Azure AD credential. Storage Explorer is implemented to prefer user delegation SAS because it is more flexible. If your users can already sign in, get an OAuth token, and use it for data transfers, there is a good chance you can take advantage of user delegation SAS for easier access control. If you can tell us more about your scenario, we may be able to give better recommendations.

jarkko-siitonen commented 1 week ago

@JasonYeMSFT Short description: as part of a larger project, we are replacing a legacy SFTP server with Azure storage account(s) to enforce Entra ID authentication with MFA and geoblocking. The service is used to transfer files with sizes in the hundreds of GB, so transfer times can be quite long (longer than an OAuth token lifetime).

How do your users connect to their storage in Storage Explorer?

What operations do your users need?

How do you distribute permissions to your users?

Is the AzCopy team just aware of the problem, or is fixing this bug on their roadmap?

craxal commented 1 week ago

@jarkko-siitonen They are actively working on a fix for the issue. They've given us an estimated release date in January.

@JasonYeMSFT

> For user with contributor role Storage explorer uses account key-based SAS in uploads and downloads

To my knowledge, assuming the user also has permission to create a user delegation SAS, this should not be happening. Any idea why it would?

JasonYeMSFT commented 1 week ago

@jarkko-siitonen As Craig said, AzCopy is working on a fix, and once it's done, direct OAuth access-token-based transfers should no longer have any issues. However, you may be able to stop relying on account key access in Storage Explorer today.

For your read-only case, I think you should keep giving your users the Storage Blob Data Reader role. For your read-write case, have you considered giving your users the Storage Blob Data Contributor role instead and turning off storage account key access for your storage accounts? For users with the Storage Blob Data Contributor role, Storage Explorer can generate a user delegation SAS, which can have a lifetime much longer than a regular OAuth access token (up to 7 days). It does have a maximum lifetime, though, so it would become an issue if a single transfer takes longer than that to complete. You can try setting this up on a new test storage account and uploading a blob to see how it works. We would like to help you stop handing out storage account key access to your users, so if you run into any issues, please let us know.
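For reference, here's roughly what generating a user delegation SAS looks like with the Azure SDK for JavaScript (a sketch with placeholder account and container names, not Storage Explorer's actual code):

```typescript
import { DefaultAzureCredential } from "@azure/identity";
import {
  BlobServiceClient,
  ContainerSASPermissions,
  generateBlobSASQueryParameters,
} from "@azure/storage-blob";

async function makeUserDelegationSas(): Promise<string> {
  const accountName = "<account>"; // placeholder
  const service = new BlobServiceClient(
    `https://${accountName}.blob.core.windows.net`,
    new DefaultAzureCredential() // your signed-in Azure AD identity
  );

  const startsOn = new Date();
  const expiresOn = new Date(startsOn.getTime() + 7 * 24 * 60 * 60 * 1000); // 7-day max

  // Requires Microsoft.Storage/storageAccounts/blobServices/generateUserDelegationKey/action,
  // which the Storage Blob Data Contributor role includes.
  const key = await service.getUserDelegationKey(startsOn, expiresOn);

  const sas = generateBlobSASQueryParameters(
    {
      containerName: "<container>", // placeholder
      permissions: ContainerSASPermissions.parse("racwl"), // read/add/create/write/list
      startsOn,
      expiresOn,
    },
    key,
    accountName
  );
  return sas.toString();
}
```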

jarkko-siitonen commented 1 week ago

Thank you @craxal and @JasonYeMSFT for the bug-fix release date estimate. It's a lot easier to decide what to do next when it's clear the bug is actively being worked on!

Seven days maximum duration for OAuth file transfers should not be an issue for us.

And yes, we would indeed prefer not to give users storage account key access. However, a new test did not yield any better results than before. The guest user has the Storage Blob Data Contributor role at the container level and the Reader role at the storage account level, plus an additional custom role containing the "Generate a user delegation key" permission at the storage account level. The user can generate a user delegation SAS manually, but Storage Explorer's upload, download, and delete functions all use OAuth authentication, which would be just fine for us if this token renewal bug didn't exist.

If this behavior is not the expected/desired outcome with the Storage Blob Data Contributor role, maybe I should create a new issue for the Storage Explorer team to investigate?

JasonYeMSFT commented 1 week ago

I read the code and found that our upload code doesn't use user delegation SAS, because uploads don't have the service-to-service copy limitation (where the source and destination cannot use different OAuth credentials) that originally motivated it. I've filed https://github.com/microsoft/AzureStorageExplorer/issues/8299 to track adding user delegation SAS support to our upload/download code.