thumbor-community / aws

Thumbor AWS extensions
MIT License

S3 Results storage fails when using gifv #86

Closed ghost closed 7 years ago

ghost commented 7 years ago

I'm not entirely sure whether this belongs to Thumbor core or to the AWS plugins, but Thumbor fails to load GIFv results from S3. The first request works perfectly, and regular images such as JPEGs work as expected.

An example request: curl -I "http://localhost:8000/unsafe/filters:gifv(mp4)/https://somedomain.com/big.gif"

First response:

2017-01-05 06:19:15 tornado.access:INFO 200 HEAD /unsafe/filters:gifv(mp4)//https://somedomain.com/big.gif (172.17.0.1) 10220.31ms

All subsequent responses:

2017-01-05 06:28:47 tornado.application:ERROR Future exception was never retrieved: Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/tornado/gen.py", line 1024, in run
    yielded = self.gen.send(value)
  File "/usr/local/lib/python2.7/site-packages/thumbor/handlers/__init__.py", line 117, in execute_image_operations
    self.context.request.engine.load(buffer, EXTENSION.get(mime, '.jpg'))
  File "/usr/local/lib/python2.7/site-packages/thumbor/engines/pil.py", line 312, in load
    super(Engine, self).load(buffer, self.extension)
  File "/usr/local/lib/python2.7/site-packages/thumbor/engines/__init__.py", line 162, in load
    image_or_frames = self.create_image(buffer)
  File "/usr/local/lib/python2.7/site-packages/thumbor/engines/pil.py", line 70, in create_image
    img = Image.open(BytesIO(buffer))
  File "/usr/local/lib/python2.7/site-packages/PIL/Image.py", line 2319, in open
    % (filename if filename else fp))
IOError: cannot identify image file <_io.BytesIO object at 0x7fd77b692710>

Example config here:

DETECTORS=['thumbor.detectors.face_detector','thumbor.detectors.profile_detector','thumbor.detectors.feature_detector']
FILTERS=['thumbor.filters.format','thumbor.filters.extract_focal','thumbor.filters.no_upscale']
HTTP_LOADER_REQUEST_TIMEOUT=120
MAX_AGE=31557600
OPTIMIZERS=['thumbor.optimizers.gifv']
STORAGE=thumbor.storages.no_storage
THUMBOR_WORKER_COUNT=1
USE_GIFSICLE_ENGINE=True
WEBP_QUALITY=90
LOG_PARAMETER=-l info
AWS_ACCESS_KEY_ID=AN_ID
AWS_SECRET_ACCESS_KEY=A_KEY
TC_AWS_REGION=A_REGION
TC_AWS_ENDPOINT=AN_ENDPOINT
TC_AWS_RESULT_STORAGE_BUCKET=A_BUCKET
TC_AWS_RESULT_STORAGE_ROOT_PATH=A_DIRECTORY
RESULT_STORAGE=tc_aws.result_storages.s3_storage
RESULT_STORAGE_STORES_UNSAFE=True

I'm using the APSL docker image which you can easily run with: docker run -p 8000:8000 --env-file .env "apsl/thumbor"

^ You just need to create a .env file, adding valid credentials for an S3 bucket. Unsafe is optional.

Note that the transformed result (an mp4) is in fact sitting in S3; it was stored correctly. Thumbor just fails to recognize the format upon retrieval.
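For readers following the traceback: one plausible reading is that the cached mp4 bytes are handed to PIL's Image.open, which only recognizes image formats. The sketch below shows one way to tell such payloads apart by their leading bytes. The names sniff_kind and should_decode_as_image are hypothetical and are not part of thumbor, and this is not necessarily how the issue was eventually fixed.

```python
def sniff_kind(buffer: bytes) -> str:
    """Guess a coarse content kind from leading magic bytes (illustrative only)."""
    if buffer[:6] in (b"GIF87a", b"GIF89a"):
        return "gif"
    if buffer[:3] == b"\xff\xd8\xff":
        return "jpeg"
    if buffer[:8] == b"\x89PNG\r\n\x1a\n":
        return "png"
    # ISO base media files (mp4 among them) carry the box type "ftyp"
    # at byte offsets 4..8, right after the 4-byte box size.
    if buffer[4:8] == b"ftyp":
        return "mp4"
    return "unknown"


def should_decode_as_image(buffer: bytes) -> bool:
    """Hypothetical guard: only image payloads go to an image engine."""
    return sniff_kind(buffer) in {"gif", "jpeg", "png"}
```

With a guard like this, a cached mp4 could be returned as-is instead of being passed to an image decoder.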

Bladrak commented 7 years ago

Given the stack trace, I'd say this is more of a thumbor core issue: you're only using the result storage and, if I recall correctly, the image is processed before it is stored in the result storage. It might be worth creating an issue on thumbor's repo.

ghost commented 7 years ago

I'll submit a ticket with the core project then, and close this. Thank you.

P.S. I was just poking around the error and the thumbor code, and I am somewhat confused as to why it would ever try to determine the "engine" for the target format if the result can be fetched from storage. I would think you'd just want to do something like:

# pseudocode: return the cached result before doing any other work
response = get_from_resultstore(url)
if response is not None:
    return response
get_engine(url)
....

Like, right at the very beginning of the request, to avoid unnecessary processing overhead. But then I find myself wondering what the point of "results storage" is in comparison to, say, a local varnish/nginx. The advantage for me in this case is that nginx and varnish depend on finite resources (disk or memory), both of which are in short supply on a container setup. Sure, I could add a shared disk or whatever, but on Elastic Beanstalk you then have to switch to their more complex multi-image Docker thingy. So the appeal of S3 results storage was pretty high.

I only bring that up here because you have obviously worked with result storages before. Curious if you have thoughts. Thanks again.

Bladrak commented 7 years ago

If I recall correctly, thumbor first checks whether the image is in the result storage. If it is not, thumbor fetches the source from the storage (if one is enabled and contains the image) and otherwise loads it from the source. It then stores the source in the storage, processes it, stores the result in the result storage and returns the image (those steps are parallelized).
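The lookup order described above can be sketched roughly as follows. This is a simplified, sequential illustration and not thumbor's actual code: serve, fetch_source and process are hypothetical names, and plain dicts stand in for the result storage and the storage.

```python
def serve(request_path, image_url, result_storage, storage, fetch_source, process):
    """Return the processed image for request_path, consulting caches first."""
    # 1. A previously processed result wins outright.
    cached = result_storage.get(request_path)
    if cached is not None:
        return cached

    # 2. Otherwise look for the unprocessed source in the storage,
    # 3. and fall back to loading it from its origin.
    source = storage.get(image_url)
    if source is None:
        source = fetch_source(image_url)
        storage[image_url] = source

    # 4. Process, remember the result, and return it.
    result = process(source)
    result_storage[request_path] = result
    return result
```

In this sketch the result storage is keyed by the full request path (filters included) while the storage is keyed by the source URL, so a second identical request never reaches fetch_source or process.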

Regarding the usage of the result storage, S3 is indeed a good solution, as you can store a virtually unlimited amount of data. If you want performance, you will indeed need a varnish / nginx / CDN in front of thumbor, so the data is returned faster (and it avoids using resources on your thumbor instance). You should also handle the If-Modified-Since header, which is fully compatible with the AWS result storage.
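For anyone unfamiliar with If-Modified-Since, here is a minimal sketch of the comparison a server makes, using only the Python standard library. It is not taken from thumbor or tc_aws; conditional_status is a hypothetical helper, and last_modified is assumed to be a timezone-aware datetime such as the stored object's modification time.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_status(if_modified_since, last_modified):
    """Return 304 if the client's copy is still current, else 200."""
    if not if_modified_since:
        return 200
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        # An unparseable header is ignored and the full response is sent.
        return 200
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds.
    if last_modified.replace(microsecond=0) <= since:
        return 304
    return 200
```

A cache in front of thumbor can then revalidate cheaply: a 304 carries no body, so the stored result does not have to be sent again.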

ghost commented 7 years ago

Fixed in thumbor version 6.2.1 for any googlers.