stevejalim opened 4 years ago
I've done some work to do direct S3-to-S3 copying using a custom field adapter, which - while not yet in production - seems to be pretty reliable within some known constraints (eg only works for data with a public-read policy). If anyone's interested, the code is open source and I can point you at the relevant bits of the implementation.
If there's appetite for making this part of WT, @jacobtoppm, I'd be happy to do that when I have time.
@stevejalim I'd be interested in taking a look if it still works and is up somewhere!
Hi @easherma-truth I've moved on from the org where I was using it but looks like the code is still there: https://github.com/uktrade/great-cms/blob/develop/core/wagtail_hooks.py#L135
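For anyone skimming rather than reading the linked hook code: the approach relies on asking S3 to copy the object server-side, presumably along these lines. This is a minimal sketch with boto3, assuming the destination credentials can read the source object (eg it is public-read, per the constraint mentioned above); the bucket names, key, and function name are hypothetical.

```python
import boto3

s3 = boto3.client("s3")

def copy_media_object(source_bucket: str, dest_bucket: str, key: str) -> None:
    """S3 performs the copy itself, so the file bytes never travel
    through the Django process handling the transfer request."""
    s3.copy_object(
        Bucket=dest_bucket,
        Key=key,
        CopySource={"Bucket": source_bucket, "Key": key},
    )

copy_media_object("source-media", "dest-media", "media/videos/example.mp4")
```

One caveat: `copy_object` only handles objects up to 5 GB; beyond that, boto3's managed `copy` (which performs a multipart copy) would be needed.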
(I've discussed this loosely with Matt and Jacob in Slack, but I'm writing it up here)
When a site is hosted on a platform with a hard, non-configurable limit on how long an HTTP request can take (eg 30 seconds), a transfer that involves a sizeable video, or a number of other media files, can easily exceed that limit. This kills the transfer, leaving pages rolled back, but third-party models (eg `wagtailmedia`) can be left in an indeterminate state in terms of files on disk, somewhere. The timeout happens because the overall WT import takes place over a single HTTP request, and transferring an asset file as part of the request-response cycle adds the time taken to copy the file.
This problem is exacerbated when media files are stored in cloud storage, which is common for many PaaS setups.
eg:
Destination Server -> asks Source Server -> asks Source's Storage for file -> Source's Storage returns file to Source Server -> Source Server sends file to Destination Server -> Destination Server stores file in Destination's Storage.
So that's the same file being processed (read or written) 3 or 4 times, depending on upload spooling.
**Possible solutions**
- **Temporary workaround:** Use field-based identification of the problematic model types via `WAGTAILTRANSFER_LOOKUP_FIELDS` and manually pre-copy the data to the Destination server, so that the Source Server does not have to send it. This works for `wagtailmedia.media` and its MTI subclasses. (A settings sketch follows this list.)
- **A solution:** Move the Transfer process to multiple AJAX calls (as suggested by Matt), so that we reduce the risk of a timeout. However, a single large file could still exceed the limit.
- **Alternative solution:** Detect if files are in cloud storage and support direct cloud-to-cloud syncing of those files, if possible. (However, this could also get complex, especially if a cloud-based function is needed to do the copy.) S3's `CopyObject` [docs] is promising here.

More to come - and more welcome
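For reference, the temporary workaround above would look something like this in the destination project's settings. This is a sketch only: the lookup field is an assumption, and needs to be something that uniquely identifies each media object on both servers.

```python
# settings.py (sketch): tell Wagtail Transfer to match wagtailmedia objects
# by a natural key instead of creating new ones, so records that were
# manually pre-copied to the Destination are reused rather than re-imported.
# "title" is an assumption; pick whatever field(s) uniquely identify the
# objects on both servers.
WAGTAILTRANSFER_LOOKUP_FIELDS = {
    "wagtailmedia.media": ["title"],
}
```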
(Separate from all the above, it would be nice to have a pre-flight check before a transfer to warn about large files that will be sent over)
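Something like this could back that pre-flight check. It's a hypothetical helper, not existing WT code, using the standard Django storage API to flag large files before a transfer starts; the threshold and function name are made up.

```python
# Hypothetical pre-flight helper: report any files above a size threshold
# so the user is warned before the transfer kicks off a long request.
from django.core.files.storage import default_storage

LARGE_FILE_THRESHOLD = 50 * 1024 * 1024  # 50 MB; arbitrary cut-off

def find_large_files(file_names):
    """Yield (name, size_in_bytes) for files bigger than the threshold."""
    for name in file_names:
        size = default_storage.size(name)  # works for local and cloud storages
        if size > LARGE_FILE_THRESHOLD:
            yield name, size
```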