silverstripe / silverstripe-s3

Silverstripe module to store assets in S3 rather than on the local filesystem (SS4/SS5 only)
BSD 3-Clause "New" or "Revised" License

Uploading many files through to S3 - some fail, most don't #51

Closed - samandeggs closed this issue 1 year ago

samandeggs commented 2 years ago

Hi, happy to provide more information - please ask how I can be more helpful. Essentially, when uploading a large batch of images to S3, either via a has_many images relation on a DataObject or directly in the AssetAdmin, most upload successfully. However, some do not.

But the failed files do actually get uploaded to S3, and the database records appear to be mostly (if not entirely, to my naked eye) correct.

The Image exists with the correct details on its associated File and the pathname leads to an image that does indeed exist on S3.

Checking if $this->Image->exists() causes a range of warnings such as:

[Warning] ftell() expects parameter 1 to be resource, bool given
[Warning] stream_get_meta_data() expects parameter 1 to be resource, bool given
[Notice] Trying to access array offset on value of type null
[Warning] hash_update_stream() expects parameter 2 to be resource, bool given

(all seemingly linked to Flysystem, but something fishy is happening between SS and S3 imo)

For the images that don't "exist", there's likely some disconnect between the Image/File and the DBFile that makes the exists() method fall over so badly. It fails to find the image on S3 even though the file does exist there - presumably some part of the hashing isn't lining up, or it would see it.

Most images work fine, in both draft and published state. However, my client is uploading an awful lot of images, and to be honest it would be tolerable if a couple failed - but the warnings are bad, and the failed uploads in general aren't great. The behaviour is identical on our live server hosted in AWS and in my local testing.

If I can't call exists(), I can't tell them which images uploaded to a DataObject have failed; but if I do use exists(), we get those awful warnings and their editing experience deteriorates. And because the files fail the exists() check, they also appear as "File cannot be found" on the DataObject itself.

Using SS 4.10 and the latest release of this module.

samandeggs commented 2 years ago

Okay - this appears to be related to #1026 on silverstripe/silverstripe-asset-admin, since if I clear the cache (actually deleting the contents of silverstripe-cache, not just ?flush), the files are found correctly. However, with cache_bust set to false by default, one would assume this issue shouldn't occur.
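For anyone trying to reproduce this, the distinction matters: ?flush regenerates manifests but leaves other cached files in place, while a hard clear of the cache directory also removes this module's cached metadata. A minimal sketch - the directory name is an assumption, adjust to wherever your TEMP_PATH points:

```shell
# Hard-clear the Silverstripe cache directory, including the cached
# S3 metadata files, rather than relying on ?flush alone.
# "silverstripe-cache" in the project root is an assumed location.
rm -rf silverstripe-cache/*
```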

samandeggs commented 2 years ago

The more I've looked into this, the more it appears linked to a variety of issues I see on this repo and on asset-admin around how the Flysystem adapter checks whether a file exists, plus some form of hashing between the files in the DB and what's on S3 - it clears up each time the silverstripe-cache folder is removed and rebuilt through a ?flush.

My first instinct is that when a file comes through as non-existent via the current checks, the module needs to hard-check S3 again and rewrite that portion of the cache(?). All files appear to upload successfully and return the right information to the CMS front end to confirm this - it's only on a refetch that exists() fails for some of the files (specifically images). I'm not so concerned if asset-admin itself fails to show these files, but when my client is uploading en masse to DataObjects and some are failing, it limits their workflow - though I presume these problems are interlinked.

phptek commented 2 years ago

@samandeggs We've recently launched a site on AWS with very similar sounding issues. We also see many instances of the error: Trying to access array offset on value of type null.

We're using versions:

silverstripe/s3 1.1.1
silverstripe/framework 4.9.1

Interestingly, the issue doesn't appear to affect one environment (UAT) but does affect PROD - they are entirely separate AWS accounts btw, but both managed in Terraform.

Symptoms include:

[screenshot: Screen Shot 2022-05-03 at 08:15:26]

samandeggs commented 2 years ago

@phptek We're also using 1.1.1, and we're also using Terraform to deploy prod and staging. Interestingly, we're also using focuspoint - likely a very similar setup. However, my local setup, which plugs into the same AWS account, is a simple docker-compose environment entirely outside the Terraform world, and it has the exact same issues - so I can replicate the problem 1:1 locally.

I'm not having issues with previews consistently - most preview images load fine in either draft or published. The bigger issue on our side is that if we upload, say, 20 images, 4 will flat out say "File cannot be found" (failing the exists() check) and 2 will have their previews missing but will at least have passed exists(), so we can deal with those.

All images in question are in the database with seemingly correct information, and all of them exist on S3. Clearing silverstripe-cache corrects the issue until the next time someone uploads images. My suspicion is aimed entirely at some form of hashing/caching check: there seems to be a race going on, and when it trips up on an image, that failure is stored in the cache and cannot be rectified until the cache is cleared. Presumably multiple parts affect the preview version and the actual image separately, hence the two symptoms.

samandeggs commented 2 years ago

As an example on our end in asset-admin - seemingly not quite the same issue as the preview image, more to do with the thumbnail:

[screenshot: broken thumbnail in asset-admin]

phptek commented 2 years ago

The first thing that occurred to me when I first spotted this is that both league dependencies have had several updates made to them since the last release of this module:

league/flysystem-aws-s3-v3: installed 1.0.29, available 3.0.13
league/flysystem-cached-adapter: installed 1.0.2, available 1.1.10

If we forked the module and gradually upgraded the deps (league/flysystem-aws-s3-v3 to say ~2.0, and league/flysystem-cached-adapter to 1.1.10), that might sort it. I remember a couple of weeks ago tracing the "Trying to access array offset on value of type null" error back to one of those deps, and finding that the relevant line of code had a guard added in a subsequent release.

samandeggs commented 2 years ago

This was the same thought I had last night. My senior dev recommended against me attempting the fork, and instead suggested getting someone from Silverstripe, or a previous contributor to this package, to investigate - but I agree this is likely one cause of the issue. With so much progress on the base modules, this package is surely going to start leaking issues. His concern was how much of this package is dependent on older functions behaving as expected - it could just be a rabbit hole. That said, this package isn't so large; it could turn out to be a fairly trivial process.

I also did some loose browsing of the package releases, and there appeared to be a fair amount of work on caching specifically. If you don't fork this today, I will attempt it later today and ping you with how it goes; if you attempt it before me, let me know and I can test your changes on my end and report back.

phptek commented 2 years ago

Your developer is of course correct. But in order to test any assumption and have a crack at fixing it, one first needs to fork --> modify --> test. The patched package can then be issued as a PR for those changes to be propagated into the main branch. Fortunately, the maintainer works in my studio :-)

madmatt commented 2 years ago

Hey @samandeggs, when I first wrote this module one of the issues I found was around the preview images not loading. The problem there is that Silverstripe CMS is quite heavily based on the assumption that when an upload process is finished, the file will be immediately available. In any distributed system like S3, this isn't the case. I don't know if this has been resolved over the years or not, but I suspect not given this is quite a similar issue.

Compounding that, for performance reasons this module includes cached adapters, as you've found. This was originally because S3 is slow to respond to many API requests (for example, Silverstripe has to load previews for all 30 images when you open a single folder). So my guess is that any case where we can't immediately retrieve a file gets cached as a failure - hence why it works after you manually clear the silverstripe-cache folder. It looks like we create s3metadata files to store the cached data, and we seem to be missing a flush method that clears those out, which might be why the problem doesn't resolve unless you fully delete the directory itself.
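Until such a flush hook exists, a targeted workaround is to delete only the cached metadata rather than the whole cache directory. A sketch, assuming the s3metadata/public and s3metadata/protected cache files (as configured earlier in the thread) live somewhere under a silverstripe-cache directory - adjust the path to your setup:

```shell
# Delete only this module's cached S3 metadata files so they are
# rebuilt from S3 on the next request, leaving the rest of the
# Silverstripe cache intact. The cache path is an assumption.
mkdir -p silverstripe-cache
find silverstripe-cache -type f -path '*/s3metadata/*' -delete
```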

I'm more than happy to review any PRs, or help work toward a resolution for these issues however I can within my limited time - but this is what I have seen in the past. Hopefully having some historic context is useful!

samandeggs commented 2 years ago

Thanks @madmatt - that context helps a lot.

I think a good first step is to see whether the dependency upgrades can mostly just plug in, however removed from Silverstripe itself they are. Separately, if I can figure out where a flush method could live within this S3 module, that seems the obvious next step - I think it's the real crux of the issue here, and you've hit the nail on the head in that regard.

samandeggs commented 2 years ago

@phptek @madmatt I'm hitting the limits of what I know how to do here. What I did find is that we cannot update league/flysystem-aws-s3-v3 beyond 2.4.3, because later versions require PHP 8. 2.4.3 itself requires league/flysystem ^2.0.0, which is locked by silverstripe/assets - I suspect that will be a significant rabbit hole for me, and unlikely to pass any sort of QC. A basic update of league/flysystem-cached-adapter up to 1.1.0 breaks basic S3 functionality, for reasons I don't know where to begin investigating other than diffing the files - which is likely the place to start if that's the direction. That is the more likely upgrade to land; I just don't know my way around this repo well enough yet. Could be simple, could be a big problem.

If I'm the one who needs to do more work here, then I think writing some form of flush is my best option, as it would be most directly useful both to my immediate needs and to this issue - I just need to know a bit more accurately where to aim. Let me know if you have any details that could help me here.

Edit: On second review, I was having an internal issue that made me think I couldn't update league/flysystem-cached-adapter up to 1.1.0 - it is actually coming in fine. I suspect most people were already pulling it in via the ^1.0 constraint as it stands.

samandeggs commented 2 years ago

As a cheap patch, if it holds: I've reduced the expiry on the protected and public s3metadata caches to 1 second. Since we're running ECS <> S3, it's close enough to the metal that it appears to run fine essentially without a cache. The main site is cached separately, so the wider public won't be hitting the server anyway. Done via our own s3.yml:

---
Name: project-s3-config
After:
  - "#silverstripes3-flysystem"
---
SilverStripe\Core\Injector\Injector:
  League\Flysystem\Cached\Storage\Adapter.public:
    class: League\Flysystem\Cached\Storage\Adapter
    constructor:
      adapter: '%$League\Flysystem\Adapter\Local'
      file: "s3metadata/public"
      expire: 1
  League\Flysystem\Cached\Storage\Adapter.protected:
    class: League\Flysystem\Cached\Storage\Adapter
    constructor:
      adapter: '%$League\Flysystem\Adapter\Local'
      file: "s3metadata/protected"
      expire: 1

Not a good solution, but it appears to be doing the job for now, and I'm not in a position to untangle upgrades or issues across what seems to be three packages (s3, assets, asset-admin).

obj63mc commented 2 years ago

@samandeggs - SS is using flysystem v1, so upgrading to a different major version would be a core change to silverstripe/assets, which would then cascade down to these various adapters.

The issue I'm guessing you're having is more of an I/O write issue. Basically, when the file metadata is read from AWS, it is stored in one giant file each for s3metadata/public and s3metadata/protected. When uploading many files at once, the process has to open and read/write that same file over and over, and given the asynchronous nature of uploading to S3, I'd guess there are random race conditions or write locks while trying to keep the file up to date and accurate.
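That lost-update pattern is easy to demonstrate with a toy script (plain shell, nothing from this module): five writers each do a read-modify-write on one shared file with a pause in between, the same shape of contention that concurrent uploads could have on a single s3metadata file:

```shell
# Five concurrent read-modify-write cycles against one shared file.
# Each writer reads the current contents, pauses, then writes them
# back with its own entry appended - clobbering whatever landed in
# the meantime, so most entries are lost.
f=$(mktemp)
for i in 1 2 3 4 5; do
  ( cur=$(cat "$f"); sleep 0.2; printf '%sfile-%s\n' "$cur" "$i" > "$f" ) &
done
wait
wc -l < "$f"   # fewer than 5 entries survive: lost updates
```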

When you set the expiry to 1 second, you are effectively telling it not to use the cache at all. You can configure your site to skip the cache entirely with the following YML:

SilverStripe\Core\Injector\Injector:
  SilverStripe\Assets\Flysystem\PublicAdapter: '%$SilverStripe\S3\Adapter\PublicAdapter'
  SilverStripe\Assets\Flysystem\ProtectedAdapter: '%$SilverStripe\S3\Adapter\ProtectedAdapter'

Note: I haven't tested the above, but it simply points Silverstripe's main adapters at the non-cached ones.

Additionally, the fact that flysystem's caching stores everything in one file is a known issue: https://github.com/thephpleague/flysystem-cached-adapter/issues/23

Other projects, like Drupal and the v2 cache adapter (https://github.com/Lustmored/flysystem-v2-simple-cache-adapter), have moved to a separate cache key per file and stopped caching directory listings. For a Silverstripe site, that loss really only impacts Asset Admin, and only when not hosted on AWS infrastructure, since the API calls are usually fast enough when running close to S3.

I have added a commit to my fork for dev at https://github.com/obj63mc/silverstripe-s3/tree/custom-cache which implements the caching strategy those other projects use, backed by the default Silverstripe caching: https://docs.silverstripe.org/en/4/developer_guides/performance/caching/

I have tested this on one of our really large sites with thousands of images and am not seeing any noticeable performance difference, but would love to hear what others think. This change also drops a dependency we couldn't keep anyway: if Silverstripe switches to flysystem v2+ in the future, the stock cached adapters won't be usable.

madmatt commented 1 year ago

Not using a cached adapter at all is horrific for performance - CMS requests can time out trying to load just 5-10 images in one folder. This is because Flysystem has no caching unless you use an adapter that provides it, and it very often chains many calls together (e.g. every call to get a file first checks whether the file exists, then whether the filesystem 'has' the file, then whether the file is readable - each resulting in one or more separate calls to the S3 API).

What I ended up doing is creating a separate temporary branch here: pulls/tmp/use-in-memory-cache. It's part of this repo, and it splits the difference - there is no long-lived cache that can be poisoned in weird ways; instead, each CMS request has its own in-memory cache that does not persist beyond the request itself.

Unfortunately, I think there's a bug in the Config system in Silverstripe - I wasn't able to do this with YML config alone (to override the cached adapter), hence I needed to modify this repository directly. The bonus is that it's easy for others to switch to if they need it - just set the composer require line for this module to:

{
    "require": {
        "silverstripe/s3": "dev-pulls/tmp/use-in-memory-cache as 1.1.1"
    }
}

I've made a PR but would like another maintainer to review, as I'm not entirely sure of everything here, and it seems many people don't have this issue. For what it's worth, we found it got markedly worse the more servers you ran: because Silverstripe fans requests out to multiple servers, immediately after an upload you might 'poison' several of them with bad cache data.

Hopefully this helps someone else! Please comment here if you try this solution and it fixes the problem for you!

samandeggs commented 1 year ago

Thank you @madmatt - really appreciate your work on this.