tus / tusd

Reference server implementation in Go of tus: the open protocol for resumable file uploads
https://tus.github.io/tusd
MIT License

When using S3 storage sometimes files are shown as uploaded but a zero byte object is stored #497

Open OKaluza opened 3 years ago

OKaluza commented 3 years ago

Describe the bug

I'm not sure if this is an issue with Uppy or tusd, but I have a tusd server connected to an S3-compatible store (on OpenStack Swift) which I'm uploading files to with the Uppy React Dashboard element. Everything works well most of the time, but when uploading large numbers of files (e.g. 907 JPG images in my test case), all of the files are marked as uploaded successfully and all appear as entries in the S3 bucket, but some of the files are zero bytes (~5 out of 907 images in this case).

The metadata stored for the invalid images has the correct file sizes and the previews appear fine in Uppy.

It is then impossible to upload these images again without manually deleting them from the bucket, as it always sees the existing uploads as valid, even though the data is missing. If I disable S3 storage all the files are uploaded fine.

To Reproduce

Upload a large number of images (100s) from the Uppy React Dashboard to an S3-backed tusd server.

Expected behavior

All the images should be stored in the S3 bucket.

Setup details

I am using this Helm chart to run the server: https://github.com/sagikazarmark/helm-charts/tree/master/charts/tusd

Acconut commented 3 years ago

Interesting report, thank you! My first guess is that OpenStack Swift may be behaving differently from S3 and confusing tusd. Can you please record the tusd logs for all uploads and send them over? If the file is too large for GitHub, you can also mail it to marius@transloadit.com.

It is then impossible to upload these images again without manually deleting them from the bucket, as it always sees the existing uploads as valid, even though the data is missing.

Yes, that is because tus-js-client and Uppy store the upload URL associated with your files in localStorage. Either clearing localStorage or setting resume: false for Uppy's tus plugin should help there.

OKaluza commented 3 years ago

Thanks @Acconut, I tried to find a log file but it doesn't seem to exist anywhere; where is it kept? Do I need to pass a flag to enable logging? I tried adding "-verbose" but it didn't seem to make a difference, and according to https://github.com/tus/tusd/blob/master/docs/usage-binary.md it is on by default anyway.

Acconut commented 2 years ago

Apologies for my late reply, @OKaluza. I was out of the office for some time. Let me know if this behavior is still occurring.

I tried to find a log file but it doesn't seem to be there, where is it kept?

tusd does not create a log file but simply writes its output to stdout/stderr. It is then the user's responsibility to redirect the output to a file if so desired. We decided not to include file-logging functionality in tusd to reduce complexity.

OKaluza commented 2 years ago

Hi @Acconut, no worries. I stopped using the S3 interface as I couldn't find enough information to debug whether it was an issue with our storage backend or with tusd, and I was unable to make it 100% reliable. Thanks for the info, I did end up realising the logs were all going to stdout/stderr. I'll reopen this if I get a chance to test it again on the OpenStack S3 interface.

OKaluza commented 1 year ago

Hi @Acconut, I am trying to use the S3 interface again and am running into the same issue. An example is this upload:

https://tusd.dev.asdc.cloud.edu.au/files/004dd2ebad8322404987399b7922a280+MzJmYjVmOTgtZjk3Mi00YzhlLWFkNjItYmYzYzEyOThhMmY5

If I look at the file in our object storage it is zero bytes, although the correct file size is listed in the .info file. If I request the URL with wget, it responds with 200 and attempts to download, but the transfer fails and keeps retrying in a loop. Here are the logs: https://gist.github.com/OKaluza/30ef62c77935ccdc65d80e6242f6baed. I can't see any issues from when the file was uploaded; everything seemed to go fine.

Acconut commented 1 year ago

The logs contain multiple errors of the same type:

[tusd] 2023/05/11 00:47:44.977929 event="ResponseOutgoing" status="500" method="PATCH" path="1dd661d49f35c3947a46654e48952ef2+MjY2ZjljYmQtYWNjYi00ZTZkLWI5ZDItODMyMTQ2MTcwMjcx" error="EntityTooSmall: Your proposed upload is smaller than the minimum allowed object size. status code: 400, request id: tx85a5a2e141b242a3811ab-00645c3b2f, host id: tx85a5a2e141b242a3811ab-00645c3b2f" requestId="bc11b2865c89a4cea657fa44d4f16ed7"

Each upload to S3 is split into multiple parts, where each part is uploaded on its own (this is called S3 Multipart Upload). Each part must meet a minimum part size requirement; for AWS S3, this is 5MB. But in this case a few parts did not meet this minimum. Maybe OpenStack has a different minimum size? Did you change any configuration of tusd?
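
As an illustration of that constraint, here is a minimal, hypothetical sketch (not tusd's actual implementation) of choosing a part size so that every part except the last stays at or above the 5 MiB minimum while the number of parts stays within S3's limit of 10,000 parts per upload:

package main

import "fmt"

const (
    minPartSize = 5 * 1024 * 1024 // S3 minimum size for every part except the last
    maxParts    = 10000           // S3 maximum number of parts per multipart upload
)

// choosePartSize picks a part size large enough that the whole object fits
// into at most maxParts parts, never going below minPartSize.
func choosePartSize(totalSize int64) int64 {
    partSize := int64(minPartSize)
    for totalSize/partSize >= maxParts {
        partSize *= 2
    }
    return partSize
}

func main() {
    fmt.Println(choosePartSize(3744012))                  // ~3.7 MB image: stays at 5 MiB; the single (last) part may be smaller
    fmt.Println(choosePartSize(200 * 1024 * 1024 * 1024)) // 200 GiB object: parts grow to 40 MiB to stay under 10,000 parts
}

A part smaller than the minimum is only accepted when it is the final (or only) part of the multipart upload, which is why a stray small part produces the EntityTooSmall error seen in the log above.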

Also, I am noticing multiple PATCH requests for the upload 1dd661d49f35c3947a46654e48952ef2. That looks suspicious.

OKaluza commented 1 year ago

Thanks for looking at it, I haven't changed anything in the tusd config apart from setting s3-bucket, s3-endpoint, and behind-proxy.

Is it worth playing with the "partSize" option?

OpenStack should have the same 5MB minimum; it aims to be completely compatible with the basic S3 interface, and I haven't had any issues using it with other applications that support S3, or in Python with boto3, etc. I thought it might only be occurring with files whose total size was < 5MB, but it appears some affected files are under 5MB and some are over. The frequency is quite low; at most ~5% of the uploads seem to end up as zero byte objects.

My attempts to debug the issue today haven't come up with anything very helpful. There are no 500 errors or any other obvious errors in the logs this time, and the uploads appear to complete fine, but accessing the uploads that are stored as zero bytes returns 204 No Content responses. For example, this upload:

$ wget https://tusd.dev.asdc.cloud.edu.au/files/9eec7f3c271cb983cc4f883c97289f23+MTc5YjIxZjktMDhjYi00OGZjLWI4MjItYTkyYjFhZDE1NWUz
--2023-05-17 15:46:15--  https://tusd.dev.asdc.cloud.edu.au/files/9eec7f3c271cb983cc4f883c97289f23+MTc5YjIxZjktMDhjYi00OGZjLWI4MjItYTkyYjFhZDE1NWUz
Resolving tusd.dev.asdc.cloud.edu.au (tusd.dev.asdc.cloud.edu.au)... 118.138.249.245
Connecting to tusd.dev.asdc.cloud.edu.au (tusd.dev.asdc.cloud.edu.au)|118.138.249.245|:443... connected.
HTTP request sent, awaiting response... 204 No Content

The upload log entries:

[tusd] 2023/05/17 05:18:00.848108 event="UploadCreated" id="9eec7f3c271cb983cc4f883c97289f23+MTc5YjIxZjktMDhjYi00OGZjLWI4MjItYTkyYjFhZDE1NWUz" size="3744012" url="https://tusd.dev.asdc.cloud.edu.au/files/9eec7f3c271cb983cc4f883c97289f23+MTc5YjIxZjktMDhjYi00OGZjLWI4MjItYTkyYjFhZDE1NWUz" 
[tusd] 2023/05/17 05:18:01.240267 event="RequestIncoming" method="PATCH" path="9eec7f3c271cb983cc4f883c97289f23+MTc5YjIxZjktMDhjYi00OGZjLWI4MjItYTkyYjFhZDE1NWUz" requestId="b10a374902da0f56c499fd63e46f7115" 
[tusd] 2023/05/17 05:18:01.408458 event="ChunkWriteStart" id="9eec7f3c271cb983cc4f883c97289f23+MTc5YjIxZjktMDhjYi00OGZjLWI4MjItYTkyYjFhZDE1NWUz" maxSize="3744012" offset="0" 
[tusd] 2023/05/17 05:18:02.919309 event="ChunkWriteComplete" id="9eec7f3c271cb983cc4f883c97289f23+MTc5YjIxZjktMDhjYi00OGZjLWI4MjItYTkyYjFhZDE1NWUz" bytesWritten="3744012" 
[tusd] 2023/05/17 05:18:04.528701 event="ResponseOutgoing" status="204" method="PATCH" path="9eec7f3c271cb983cc4f883c97289f23+MTc5YjIxZjktMDhjYi00OGZjLWI4MjItYTkyYjFhZDE1NWUz" requestId="b10a374902da0f56c499fd63e46f7115" 
[tusd] 2023/05/17 05:18:04.528723 event="UploadFinished" id="9eec7f3c271cb983cc4f883c97289f23+MTc5YjIxZjktMDhjYi00OGZjLWI4MjItYTkyYjFhZDE1NWUz" size="3744012" 

The download log entries:

[tusd] 2023/05/17 05:18:34.182769 event="RequestIncoming" method="GET" path="9eec7f3c271cb983cc4f883c97289f23+MTc5YjIxZjktMDhjYi00OGZjLWI4MjItYTkyYjFhZDE1NWUz" requestId="ab0d0994b9ab05eb1597c387d80afdbb" 
[tusd] 2023/05/17 05:18:34.518524 event="ResponseOutgoing" status="204" method="GET" path="9eec7f3c271cb983cc4f883c97289f23+MTc5YjIxZjktMDhjYi00OGZjLWI4MjItYTkyYjFhZDE1NWUz" requestId="ab0d0994b9ab05eb1597c387d80afdbb" 

The 9eec7f3c271cb983cc4f883c97289f23.info file:

{"ID":"9eec7f3c271cb983cc4f883c97289f23+MTc5YjIxZjktMDhjYi00OGZjLWI4MjItYTkyYjFhZDE1NWUz","Size":3744012,"SizeIsDeferred":false,"Offset":0,"MetaData":{"filename":"DSC_0425.JPG","filetype":"image/jpeg","name":"DSC_0425.JPG","relativePath":"null","type":"image/jpeg"},"IsPartial":false,"IsFinal":false,"PartialUploads":null,"Storage":{"Bucket":"asdc-uploads","Key":"9eec7f3c271cb983cc4f883c97289f23","Type":"s3store"}}

dylanmcculloch commented 1 year ago

Hi @Acconut, I help administer the Openstack Swift cluster that @OKaluza is using. We've reproduced the issue in our test environment.

Operating System: Linux
Used tusd version: Built from git main
Used tusd data storage: AWS S3 interface to OpenStack Swift
Used tusd configuration: -s3-bucket xxx -s3-endpoint yyy -hooks-dir=./examples/hooks/
Used tus client library: tuspy 1.0.0

For an upload that results in a 0 byte object, it looks like the tusd/aws client initially sends the correct object to Swift/S3, but then the client almost immediately overwrites the object with a 0 byte object.

The main issue appears to be that after the tusd/aws client sends the UploadPart request, it immediately performs a subsequent ListParts call. It looks like there is a race condition due to the "eventually consistent" nature of OpenStack Swift/S3, which means the response from the ListParts call sometimes does not include the part that was just sent. In that circumstance it performs another upload, but at this point it only sends a 0 byte object. I patched tusd to enable aws client debug logging [1], which revealed the following.

The normal good sequence is:

s3/CreateMultipartUpload
s3/PutObject
s3/GetObject
s3/ListParts
s3/GetObject
s3/ListParts
s3/GetObject
s3/UploadPart
s3/ListParts                  <---- response includes both PartNumber 0 & 1
s3/CompleteMultipartUpload

But for the cases in which we see the problematic 0 byte objects, the sequence is:

s3/CreateMultipartUpload
s3/PutObject
s3/GetObject
s3/ListParts
s3/GetObject
s3/ListParts
s3/GetObject
s3/UploadPart
s3/ListParts                  <---- response only includes PartNumber 0, does not include PartNumber 1
s3/UploadPart                 <---- extra step, uploads 0 byte file for PartNumber=1
s3/CompleteMultipartUpload

For that extra step, the UploadPart request has:

X-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 (sha256 of a 0 byte file)

and the response shows:

Etag: "d41d8cd98f00b204e9800998ecf8427e" (md5 of a 0 byte file)
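
Assuming this reading of the traces is correct, the failure mode can be sketched roughly as follows. This is a simplified illustration only, not the actual tusd or AWS SDK code: a stale part listing hides the part that was just uploaded, so the fallback path adds an empty part and the multipart upload is completed with nothing but that empty part.

package main

import "fmt"

// backend models an object store whose part listing can lag behind writes.
type backend struct {
    parts      []int64 // sizes of the parts that have actually been uploaded
    listingLag int     // how many ListParts calls still return a stale view
}

func (b *backend) UploadPart(size int64) { b.parts = append(b.parts, size) }

func (b *backend) ListParts() []int64 {
    if b.listingLag > 0 {
        b.listingLag--
        return nil // stale view: the just-uploaded part is not visible yet
    }
    return b.parts
}

// finishUpload mimics the logic around tusd's "len(parts) == 0" fallback [3].
func finishUpload(b *backend, size int64) int64 {
    b.UploadPart(size)     // upload the real data
    parts := b.ListParts() // immediately list parts to complete the upload
    if len(parts) == 0 {
        // The listing looks empty, so an empty part is uploaded to satisfy
        // the requirement of at least one part per multipart upload.
        b.UploadPart(0)
        parts = []int64{0}
    }
    var total int64
    for _, p := range parts {
        total += p // size of the completed object
    }
    return total
}

func main() {
    consistent := &backend{}
    fmt.Println(finishUpload(consistent, 3744012)) // 3744012: object is intact

    lagging := &backend{listingLag: 1}
    fmt.Println(finishUpload(lagging, 3744012)) // 0: the zero-byte object from this report
}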

I've tested adding a 1 second sleep before the ListParts call [2] and I haven't been able to reproduce the issue since, which seems to support the race condition theory. This appears to be the relevant code that sends an empty part when the bug is triggered [3]:

if len(parts) == 0 {
    // AWS expects at least one part to be present when completing the multipart
    // upload. So if the tus upload has a size of 0, we create an empty part
    // and use that for completing the multipart upload.

I'm assuming this issue wouldn't be apparent when using AWS itself as a backend, as their S3 is now strongly consistent [4].

I've attached a couple of logs: a good run [5] and a bad run [6].


[1] https://gist.github.com/dylanmcculloch/41f3e0e0254c990e39d0abfc6137a4bb#file-tusd_aws_debug-patch
[2] https://gist.github.com/dylanmcculloch/809e812e223c5bd3c05b7429cd1d645c#file-tusd_sleep_before_listallparts-patch
[3] https://github.com/tus/tusd/blob/318aab451be20a90e9d4c0c4287c2e5264cc9542/pkg/s3store/s3store.go#L648-L651
[4] https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/
[5] https://gist.github.com/dylanmcculloch/a07b09568b5870f6bd8dde7c47a5d41e#file-tusd_s3_swift_debug_good-log
[6] https://gist.github.com/dylanmcculloch/a72ecaf9088029f6cedc2a558f429d1f#file-tusd_s3_swift_debug_bad-log

Acconut commented 1 year ago

Wow, thank you very much for this investigation! Given the eventual consistency of ListParts on OpenStack, the error now makes sense. tusd's reliance on ListParts was questioned in the past (see https://github.com/tus/tusd/issues/204), but since we never ran into issues with AWS S3, we didn't see a reason to drop it. Now I am wondering if the issues reported there might be caused by this reliance as well.

I would expect this to improve in the v2 release, because we cache the ListParts results there. Instead of calling ListParts whenever we need a list of parts, we fetch the list at the beginning of the request and then update and use our internal cache for everything else. So if you upload your file in a single PATCH request, there should not be a race condition between UploadPart and ListParts. Could you try one of the latest release candidates for v2?
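
For reference, a rough sketch of that caching idea (the shape here is assumed and is not the real v2 code): the parts list is fetched once when the PATCH request starts and is then maintained in memory, so completing the upload no longer depends on a second, possibly stale, ListParts response.

package main

import "fmt"

type part struct {
    number int
    size   int64
}

type upload struct {
    cachedParts []part // fetched once per request, then maintained locally
}

// fetchParts stands in for the single ListParts call at the start of the request.
func fetchParts() []part {
    return nil // assume no parts exist yet for a fresh upload
}

// writeChunk uploads a part and records it in the cache instead of re-listing.
func (u *upload) writeChunk(number int, size int64) {
    // ... UploadPart call to the S3 backend would go here ...
    u.cachedParts = append(u.cachedParts, part{number: number, size: size})
}

// finish completes the multipart upload using the cached list, which already
// contains every part written during this request.
func (u *upload) finish() int {
    return len(u.cachedParts)
}

func main() {
    u := &upload{cachedParts: fetchParts()}
    u.writeChunk(1, 3744012)
    fmt.Println("parts known at completion:", u.finish()) // 1, even if the backend listing lags
}

This trades a small amount of per-request memory for independence from the backend's listing consistency within a single request.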

OKaluza commented 1 year ago

Thanks heaps @dylanmcculloch for debugging and updating this!

@Acconut I've tried v2.0.0-rc19 and have not seen the issue so far; I'll keep testing. Is there anything we should be aware of when using v2? Is it production-ready?

Acconut commented 1 year ago

That's great to hear! v2 incorporates many changes, but we have also been running it in production for some time without issues, so it should be good. We hope to release it properly soon!