payloadcms / payload

Payload is the open-source, fullstack Next.js framework, giving you instant backend superpowers. Get a full TypeScript backend and admin panel instantly. Use Payload as a headless CMS or for building powerful applications.
https://payloadcms.com
MIT License

"Large" file uploads results in: MongoServerError: Transaction with { txnNumber: 31 } has been aborted. #5148

Open zazz-ops opened 5 months ago

zazz-ops commented 5 months ago

Link to reproduction

No response

Describe the Bug

I've successfully created, configured, and tested a media collection that has no problem uploading images, resizing them, and forwarding them to my S3 bucket using the @payloadcms/plugin-cloud-storage package. There's no issue with how the collection functions, unless I try to upload a video with a filesize of ~180MB.

[screenshot: the upload request code]

The nova function is just a thin sugar wrapper around my fetch request.

The request hangs for about 3 minutes and then I see the following in the Payload console logs:

[screenshot: Payload console logs showing `MongoServerError: Transaction with { txnNumber: 31 } has been aborted`]

And the very helpful error message in the response from the request:

[screenshot: the error message returned in the response]

This is obviously some sort of MongoDB timeout, but I'm confused as to why a MongoDB timeout would have any impact on a file upload at all. Is there a MongoDB transaction that starts at the onset of the file upload and doesn't commit until the file upload is done?

Nonetheless, I tried to increase the upload.limits.fileSize value in my payload.config.ts file, but the issue persists.

[screenshot: the upload limits section of payload.config.ts]
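For reference, the config change looked roughly like this (the exact limit value is illustrative; `limits` is passed through to express-fileupload, so `fileSize` is in bytes):

```ts
// payload.config.ts (excerpt) -- the 5 GB limit here is illustrative
import { buildConfig } from 'payload/config'

export default buildConfig({
  // ...collections, plugins, etc.
  upload: {
    limits: {
      fileSize: 5 * 1024 * 1024 * 1024, // 5 GB, in bytes
    },
  },
})
```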

I'm looking to use Payload to handle uploads of up to 5GB. Is this possible with Payload?

To Reproduce

Create a media collection and attempt to upload a large file via the REST API.

Payload Version

2.8.1

Adapters and Plugins

No response

BrianJM commented 5 months ago

This appears to be related to #4350

BrianJM commented 5 months ago

@zazz-ops Can you try disabling transactions in MongoDB?

transactionOptions: false

https://payloadcms.com/docs/database/mongodb#options
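Roughly, that looks like this in the database adapter config (the connection-string handling here is illustrative):

```ts
// payload.config.ts (excerpt) -- disabling transactions in the MongoDB adapter
import { buildConfig } from 'payload/config'
import { mongooseAdapter } from '@payloadcms/db-mongodb'

export default buildConfig({
  // ...
  db: mongooseAdapter({
    url: process.env.DATABASE_URI,
    transactionOptions: false, // don't wrap requests in a MongoDB transaction
  }),
})
```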

You are connected to a replica set, correct?

zazz-ops commented 5 months ago

@BrianJM Disabled transactions and still getting the same error:

[screenshot: the same MongoServerError after disabling transactions]

Interestingly enough, the file actually does upload successfully to my S3 bucket:

[screenshot: the uploaded file present in the S3 bucket]

However, I'm still getting the mongo error and there is no doc for the uploaded file in the DB.

zazz-ops commented 5 months ago

@BrianJM And yes, connecting to a replica set. I'm just using the free-tier MongoDB Atlas M0 deployment.

BrianJM commented 5 months ago

@zazz-ops I started getting transaction errors after moving from a local instance to an Atlas (M0) replica set as well.

It's interesting that disabling transactions still produces a transaction error; in my case, it does resolve the issue but I'm not doing large uploads.

~Do you see a pattern with the timing of transaction errors? I realize this may depend on network conditions, but does it occur after 60 seconds (as an example) from the start of the upload?~ I see you noted approximately 3 minutes.

zazz-ops commented 5 months ago

@BrianJM Yes, the error typically fires after ~3 minutes. But the interesting thing is that it always fires (immediately?) following the completion of the file upload. The video file I'm using for testing (~180MB) takes ~3 minutes to upload on average.

So the timing of the error seems to be directly related to the duration of the upload.

zazz-ops commented 5 months ago

@BrianJM I just did some more testing with a 21MB file and an 80MB file.

The 21MB file uploaded fine and produced no errors. Payload performed as expected.

The 80MB file produced the same errors as the original 180MB file. The file itself landed fine in my S3 bucket, but the MongoDB error was thrown, and there's no correlating doc in my media collection.

zazz-ops commented 5 months ago

How do we figure out what partSize Payload is using for its S3 uploads? It seems to me that it might be something like 50MB: the 21MB file is fine, but the 80MB file is not because it needs to move on to the next part of the upload, and that (for some unknown reason) ends up throwing this MongoDB error.

BrianJM commented 5 months ago

> How do we figure out what partSize Payload is using for its S3 uploads? It seems to me that it might be something like 50MB: the 21MB file is fine, but the 80MB file is not because it needs to move on to the next part of the upload, and that (for some unknown reason) ends up throwing this MongoDB error.

@zazz-ops You beat me to it. That's part of the problem, I think. The partSize is 50 MB and the queueSize is 4. Each part must finish within 2 minutes, or the upload will time out.

What's your upload speed? Are you hosting locally or in a data center? You will need at least 15 Mbps upload to finish a 200 MB upload in 2 minutes.

The partSize definitely needs to be reduced with a queueSize of 4.

https://github.com/payloadcms/payload/blob/main/packages%2Fplugin-cloud-storage%2Fsrc%2Fadapters%2Fs3%2FhandleUpload.ts

DanRibbens commented 5 months ago

Could it be the transactionLifetimeLimitSeconds?

The transaction will initialize at the start of the request, but it won't commit until after the file completes. I think this is the underlying issue and something we need to consider.

The upload code is written this way because, if an upload fails, we wouldn't want it to leave behind a collection document. It sounds like we need to rethink this approach and introduce an intermediate DB commit with a delete cleanup for failed uploads.
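Roughly, the idea looks like this (the `db`/`storage` interfaces below are illustrative stand-ins, not Payload's actual internals). MongoDB's default transactionLifetimeLimitSeconds is 60 seconds, which lines up with long uploads aborting a transaction held open for the whole request:

```ts
// Illustrative sketch of the intermediate-commit idea; these interfaces are
// hypothetical stand-ins, not Payload's real APIs.
type MediaDoc = { id: string }

interface Db {
  createDocument(args: { collection: string; data: Record<string, unknown> }): Promise<MediaDoc>
  deleteDocument(args: { collection: string; id: string }): Promise<void>
}

interface Storage {
  uploadFile(args: { doc: MediaDoc; file: Buffer }): Promise<void>
}

export async function createWithUpload(
  db: Db,
  storage: Storage,
  data: Record<string, unknown>,
  file: Buffer,
): Promise<MediaDoc> {
  // 1. Create and commit the document up front in a short-lived transaction,
  //    instead of holding one transaction open for the whole upload.
  const doc = await db.createDocument({ collection: 'media', data })

  try {
    // 2. Transfer the file to storage outside of any transaction; this can take
    //    minutes for a large file on a slow connection.
    await storage.uploadFile({ doc, file })
    return doc
  } catch (err) {
    // 3. The transfer failed, so delete the now-orphaned document.
    await db.deleteDocument({ collection: 'media', id: doc.id })
    throw err
  }
}
```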

BrianJM commented 5 months ago

I thought it may be related to this, depending on the upload speed.

I am testing large uploads with the recent fixes on main, and trying to replicate in 2.11.1. I think you're right about transactionLifetimeLimitSeconds - I'll test that as well.

BrianJM commented 5 months ago

@DanRibbens I can reproduce an error with large uploads, with and without transactions.

I think you're right about re-thinking the approach regarding transactions - that will also be necessary - but there is a deeper issue here with S3 timeouts.

Testing

Reproduction

Here is how to reproduce the issue, with or without transactions:

  1. In the Network tab of dev tools, create a 5 Mbps profile.
  2. Upload a file > 200 MB.
  3. Wait 5 minutes and review the request in the Network tab.

    [screenshot: the failed request in the Network tab]

Note: 6 Mbps and 200 MB will not cause an error. The upload will succeed as shown below.

[screenshot: the successful upload at 6 Mbps]

Resolution?

I believe the resolution is to allow the S3 timeout to be configured.

Allowing the multipart partSize and queueSize to be configured (per file) may also be useful. The maximum memory consumption is queueSize * partSize, so it may be better to have smaller parts spread across more queue slots in some environments (e.g., 4 * 50 MB = 200 MB vs. 10 * 5 MB = 50 MB).
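As a sketch, the adapter could pass these straight through to @aws-sdk/lib-storage's Upload options (the wrapper signature here is hypothetical; only partSize and queueSize are real Upload options):

```ts
// Sketch of a configurable multipart upload using @aws-sdk/lib-storage.
import { S3Client } from '@aws-sdk/client-s3'
import { Upload } from '@aws-sdk/lib-storage'

type UploadArgs = {
  client: S3Client
  bucket: string
  key: string
  body: Buffer | NodeJS.ReadableStream
  contentType: string
  partSize?: number
  queueSize?: number
}

export const uploadToS3 = async ({
  client,
  bucket,
  key,
  body,
  contentType,
  partSize = 5 * 1024 * 1024, // smaller parts finish sooner on slow links (S3 minimum is 5 MB)
  queueSize = 4,              // concurrent parts; peak memory is roughly partSize * queueSize
}: UploadArgs): Promise<void> => {
  const upload = new Upload({
    client,
    params: { Bucket: bucket, Key: key, Body: body, ContentType: contentType },
    partSize,
    queueSize,
  })

  await upload.done()
}
```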

Changing the multipart partSize to 5 MB (from 50 MB) does not resolve the issue; I still receive an error ([00:37:40] ERROR (payload): Error: No files were uploaded.).

I did not get as far as testing extending or removing the timeout.

Update: fetch appears to time out after 5 minutes (waiting for response headers). This explains the longer timeouts seen without transactions.

zazz-ops commented 5 months ago

@BrianJM My upload speed is slow, typically about 5 Mbps. So that's likely contributing to the failure of the upload, but it shouldn't matter. The whole point of multipart uploading is resiliency at any upload speed. So Payload absolutely needs to tackle this. It's not optional for a CMS — even one that's so much more than just a CMS.

@DanRibbens I suspected that Payload was starting a DB op at the start of the upload and then waiting for the upload to complete. And it seems you've already figured out that this is not a tenable strategy for an uploader. A 5 GB file could take hours (or even days) to upload depending on the user's upload speed, so keeping a DB op running for that long is ... you already know.

So ... fun fact: I've built the uploader I need 5 times already for previous projects. So I'll likely reach for the latest bespoke version I wrote and see if I can make Payload work with it by using a media collection that does not have uploads enabled.

Here's a quick sketch of how I've implemented a highly-resilient "large" file uploader before:
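(Illustrative outline only; the endpoint paths, credential handling, and collection shape are placeholders rather than any of the actual implementations:)

```ts
// Direct-to-S3 multipart upload from the client using @aws-sdk/lib-storage,
// with the CMS involved only before and after the transfer.
// The '/api/uploads/credentials' and '/api/media' endpoints are hypothetical.
import { S3Client } from '@aws-sdk/client-s3'
import { Upload } from '@aws-sdk/lib-storage'

export async function uploadLargeFile(file: File): Promise<void> {
  // 1. Ask the API for short-lived credentials and an object key.
  const { region, bucket, key, credentials } = await fetch('/api/uploads/credentials', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ filename: file.name, mimeType: file.type, size: file.size }),
  }).then((res) => res.json())

  // 2. Stream the file straight to S3 in parts; no CMS request stays open.
  const upload = new Upload({
    client: new S3Client({ region, credentials }),
    params: { Bucket: bucket, Key: key, Body: file, ContentType: file.type },
    partSize: 5 * 1024 * 1024,
    queueSize: 4,
  })
  upload.on('httpUploadProgress', ({ loaded, total }) => {
    // drive a progress bar / resumability bookkeeping here
  })
  await upload.done()

  // 3. Only now create the document in a (non-upload-enabled) media collection.
  await fetch('/api/media', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ key, filename: file.name, mimeType: file.type, filesize: file.size }),
  })
}
```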

This is obviously a high-level sketch, but it's the gist of what I've implemented in the past for the bespoke CMS or DAM systems I've built.

Payload handles CRUD, ACL, Hooks, Globals, and Plugins better than what I've built, so for this new project I'm working on I'm currently evaluating whether I should modify the codebase I've already written to suit the project, or if I should reach for Payload and leverage all of its amazing capabilities.

I was hoping Payload's DAM functionality would Just Work™ as I need it to ... not the case it seems.

BrianJM commented 5 months ago

@zazz-ops I don't think this can be resolved without using WebSockets.

Network request timeouts (waiting for server response headers) are defined by the browser. Chrome is 300 seconds. Firefox is 90 seconds.

With the current implementation, the browser imposes an upper limit.

zazz-ops commented 5 months ago

@BrianJM I'm not entirely sure what you mean by "this can be resolved", but the uploader I described in my previous comment very much works. It's in production on 5 systems I've built and has been for years. And it handles "large" file uploads (typically videos) every day.

You're mentioning WebSockets, so I'm assuming that you mean a socket connection between the UI and the Payload API. But that's not how the uploader I've described works. It bypasses Payload entirely and uploads directly to S3 from the browser, or CLI, or any runtime.

This requires that Payload create a client that leverages the @aws-sdk/lib-storage Upload library. And a client would need to be created for many contexts: browser (JS), CLI (JS, TS, GO, PYTHON, RUST, etc...), runtime (JS, TS, GO, PYTHON, RUST, etc...).

In this case the Payload API gives up all functionality/responsibility for uploads. It's the client that coordinates the steps. My Payload media collection is just any old data collection. All upload, resizing, or transcoding/transformatting functionality is lost and alternatives (like AWS Elemental MediaConvert and Blitline) need to be leveraged.

As I mentioned in my last comment, the uploader I sketched is very much a sketch. When you factor in digital asset access/security, transcoding/transformatting, CDN caching/security, licence-based rights access, and more ... it gets pretty hairy pretty fast.

At this point, from Payload's perspective, it's very safe for you to assume you're nowhere near where you need to be with uploads and DAM. You actually have a long way to go to catch up to WordPress.

I'm going to work on implementing my uploader as a separate entity in my system that simply fires against the Payload API and keep evaluating Payload only because the core functionality is so elegant.

But I can tell you with certainty right now, the current Upload and DAM functionality of Payload is essentially useless for anything other than a brochure site. Maybe even not that.

BrianJM commented 5 months ago

> @BrianJM I'm not entirely sure what you mean by "this can be resolved", but the uploader I described in my previous comment very much works. It's in production on 5 systems I've built and has been for years. And it handles "large" file uploads (typically videos) every day.

@zazz-ops Which uploader is this and what type of connection is established?

> You're mentioning WebSockets, so I'm assuming that you mean a socket connection between the UI and the Payload API.

That's one way, but I assumed direct uploads from the browser to S3 uses a socket connection as well. I believe this is generally how large uploads work.

> But that's not how the uploader I've described works. It bypasses Payload entirely and uploads directly to S3 from the browser, or CLI, or any runtime.

> This requires that Payload create a client that leverages the @aws-sdk/lib-storage Upload library. And a client would need to be created for many contexts: browser (JS), CLI (JS, TS, GO, PYTHON, RUST, etc...), runtime (JS, TS, GO, PYTHON, RUST, etc...).

I understand the concepts.

What I do not know is the method the browser uses to upload (with lib-storage). Do you? Is it a socket connection?

Can you share a repo that implements direct uploads from the browser using the library?

> At this point, from Payload's perspective, it's very safe for you to assume you're nowhere near where you need to be with uploads and DAM. You actually have a long way to go to catch up to WordPress.

WordPress media uploads are also bound to server timeouts. Have you been using the WordPress media library to upload and host 5 GB files?

> But I can tell you with certainty right now, the current Upload and DAM functionality of Payload is essentially useless for anything other than a brochure site. Maybe even not that.

That's an interesting opinion. You are welcome to contribute.

zazz-ops commented 5 months ago

@BrianJM

> @zazz-ops Which uploader is this and what type of connection is established?

https://github.com/aws/aws-sdk-js-v3/tree/main/lib/lib-storage

> That's one way, but I assumed direct uploads from the browser to S3 uses a socket connection as well. I believe this is generally how large uploads work.

I'm not really aware of any uploaders that do so via a socket connection. The AWS lib-storage Upload class linked above uses an XHR PUT request so far as I can tell:

[screenshot: the relevant lib-storage source code]

> What I do not know is the method the browser uses to upload (with lib-storage). Do you? Is it a socket connection?

See above.

> Can you share a repo that implements direct uploads from the browser using the library?

I can't share any of the ones I've built as they are proprietary, but the Upload class of the lib-storage package is an example itself. The uploaders I've built all have lib-storage as a hard dependency. It (obviously) handles the actual transfer of bytes to S3, and then everything else I wrote is the "meta logic" that goes with an upload: filename cleansing, status tracking, transcoding/transformatting, etc.

> WordPress media uploads are also bound to server timeouts. Have you been using the WordPress media library to upload and host 5 GB files?

No, the largest files I've had to push up to WordPress capped out at about 500MB, but it's not outside reason to think that WP could handle 5GB with some tweaking of the PHP and Nginx or Apache config. But generally WP offers DAM functionality that is far superior to where Payload is at right now. The point being: even WP is doing a better job at this than Payload, so perhaps it should be an area of special focus to get this sorted out, given that DAM is a critical part of a CMS.

BrianJM commented 5 months ago

@zazz-ops I believe the AWS SDK uses a socket connection for multipart uploads.

I agree that it is logical for S3 uploads to bypass Payload / Node. I planned to develop a component or plugin to do this (to save egress fees), but this limitation may result in core logic changes (so I won't need to do that).

The WP Media Library is subject to the same limitation you're seeing today in Payload. The maximum upload is bound by the server configuration and the browser timeout.

Do you have open discussions or issues regarding the areas you feel are lacking in DAM?

denolfe commented 3 months ago

Hey @zazz-ops , it sounds like you have a pretty good idea of the problem you're facing. We'd be open to a PR that would handle these scenarios if you feel up to it. Let me know.

The alternative is to create a custom adapter that has your desired functionality and pass that in. Depending on your needs, this might be a viable option.

janus-reith commented 3 months ago

I second the suggestion of uploading directly to S3, and I have built file uploads in a similar way before as well.

When initiating a file upload, the API would call S3 to create a presigned URL for the upload and return that to the client, which would then perform a multipart upload directly against it. Neither a request to Payload nor a DB transaction should need to stay open during that time; rather, as @zazz-ops pointed out, a separate call at the end should mark the upload as completed.
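As a minimal sketch of that flow, using @aws-sdk/s3-request-presigner on the server (bucket, key, and expiry values are placeholders); a single presigned PUT covers the simple case, and the same idea extends to presigning each part of a multipart upload for very large files:

```ts
// Server side: issue a short-lived presigned PUT URL instead of proxying the bytes.
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3'
import { getSignedUrl } from '@aws-sdk/s3-request-presigner'

const s3 = new S3Client({ region: process.env.AWS_REGION })

export const createUploadUrl = async (key: string, contentType: string): Promise<string> =>
  getSignedUrl(
    s3,
    new PutObjectCommand({ Bucket: process.env.S3_BUCKET, Key: key, ContentType: contentType }),
    { expiresIn: 15 * 60 }, // 15 minutes
  )

// Client side: upload straight to S3, then tell the API the upload is complete
// so it can create/update the media document.
export const uploadViaPresignedUrl = async (url: string, file: File): Promise<void> => {
  await fetch(url, {
    method: 'PUT',
    body: file,
    headers: { 'Content-Type': file.type },
  })
}
```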

Besides handling of large file sizes and avoiding a roundtrip to the API, another main factor would be to keep the Payload part of this stateless for serverless or otherwise distributed environments.

I think this pattern of requesting a file upload and being returned a URL to upload to should be the default implementation for file uploads, not something specific to S3, even for basic storage options where the same host that receives the upload request also handles the actual upload, so that this flexibility is baked into the Admin UI and API.

janus-reith commented 3 months ago

If I'm not missing something, just to make sure we're on the same page: the client/browser-facing part shouldn't require an AWS-specific (or any other) library to handle the upload, and usually wouldn't need to know the implementation details of the storage provider, although there might be other storage backends where that could matter.

DanRibbens commented 3 months ago

Creating a presigned upload URL could be built on top of our current uploads implementation under a new feature flag, or perhaps as an alternate adapter in the plugin. I see this as a valid request. It would be of great help if this was built by the community.

If the upload request could be redirected from Payload to the presigned address, then a feature flag wouldn't be necessary; it would just happen automatically from the browser to the S3 URL. I would have to do more digging on this pattern to know the details of what's possible here.

janus-reith commented 3 months ago

> It would be of great help if this was built by the community.

Happy to help with this, I just need to find the time to get more familiar with the Payload plugin structure, but that's on my to-do list anyway :)

> If the upload request could be redirected from Payload to the presigned address, then a feature flag wouldn't be necessary; it would just happen automatically from the browser to the S3 URL.

Could you elaborate further on that? I'm not sure yet I understand under which circumstances a feature flag would be required and when not. But yes, the implementation I'd intend to build would let the API call S3 to create a presigned URL and return that to the browser, which would upload the files directly to that URL. Depending on whether the existing behaviour should also be kept intact, a feature flag might still make sense. However, I guess that will all be clearer once we have a PR ready, and a decision can be made at that point.

58bits commented 1 month ago

A big +1 for the ability to use AWS S3 pre-signed upload URLs, bypassing the API and server for the actual upload. We have a fast connection to our server, and a reverse proxy behind Nginx. Large uploads are working for us at the moment (with a serverless Atlas MongoDB). But... if we wanted to deploy to Vercel we would hit their hard upload size limit of about 4 MB (IIRC). In either case, a direct client upload via a signed URL would be more efficient (with API/DB calls only responsible for updating the record/document).

We'd like to deploy to Vercel, but keep our CDN on AWS - and so AWS S3 pre-signed URLs are pretty high on our list of 'would be VERY nice to have'.

thijssmudde commented 3 weeks ago

I get a similar issue when uploading videos of around 70 MB to the videos collection using the cloud storage plugin in combination with Azure Blob Storage. Videos less than 15 MB are working fine.

[09:39:21] ERROR (payload): There was an error while uploading files corresponding to the collection videos with filename Roma_2024-(1080p).mp4:
[09:39:21] ERROR (payload): Cannot read properties of undefined (reading 'timeout')
    err: {
      "type": "TypeError",
      "message": "Cannot read properties of undefined (reading 'timeout')",
      "stack":
          TypeError: Cannot read properties of undefined (reading 'timeout')
              at Object.handleUpload (/node_modules/@payloadcms/plugin-cloud-storage/src/adapters/azure/handleUpload.ts:38:36)
              at map (/node_modules/@payloadcms/plugin-cloud-storage/src/hooks/beforeChange.ts:48:25)
              at Array.map (<anonymous>)
              at hook (/node_modules/@payloadcms/plugin-cloud-storage/src/hooks/beforeChange.ts:47:32)
              at /node_modules/payload/src/collections/operations/create.ts:185:16
              at create (/node_modules/payload/src/collections/operations/create.ts:181:5)
              at createHandler (/node_modules/payload/src/collections/requestHandlers/create.ts:26:17)
    }