thumbor-community / aws

Thumbor AWS extensions
MIT License

[Question] Saving images without HMAC in the path? #118

Closed tstrijdhorst closed 6 years ago

tstrijdhorst commented 6 years ago

Currently, with the default configuration, the HMAC is part of the path. This means that if I upload an image and request a transformation like the following: https://example.com/w1_S7wKnOTg5zeqddi5QWS-83mI=/example-image-upload/be70c78afca44d46916fa455828179ed.jpg

It will be stored in S3 under the following path: example-image-result/w1_S7wKnOTg5zeqddi5QWS-83mI%3D/example-image-upload/be70c78afca44d46916fa455828179ed.jpg

We want to sync our images from production to staging, and this causes a problem for us: staging does not have access to the production key, so it cannot calculate the same HMAC and therefore cannot generate the right path. To work around this, we would have to recalculate the HMAC with the staging key and replace it on sync.

I can't imagine our situation is unique. Is there a way to configure this so that the HMAC is not used in the path? If not, what would be another option?

P.S.: This also leads to a very odd directory structure, since every image ends up under its own path (the HMAC is unique and is the first component of the path). Your file tree becomes N wide for N files, which makes it harder to work with at the filesystem level.

Bladrak commented 6 years ago

Hello :)

If I understand your issue correctly, the hashing of the storage paths when TC_AWS_RANDOMIZE_KEYS is set to true is a problem for you when working with HMAC URLs and synchronizing images between staging and prod? If that's the case, you can disable the randomization by setting the parameter to false.

Let me know if this does the trick.

Regarding the choice of directory structure, this is based on AWS S3 best practices for performant access to the objects.

tstrijdhorst commented 6 years ago

I tried your proposed solution, but it doesn't provide the functionality we require. In our settings, TC_AWS_RANDOMIZE_KEYS was already set to false. I tried enabling it in case I had misunderstood you, but then it adds another random key to the path, which is exactly what I don't want.

Let me restate our problem: Given the following url: https://example.com/w1_S7wKnOTg5zeqddi5QWS-83mI=/example-image-upload/be70c78afca44d46916fa455828179ed.jpg

There will be a file in our production result bucket: example-image-result-production/w1_S7wKnOTg5zeqddi5QWS-83mI%3D/example-image-upload/be70c78afca44d46916fa455828179ed.jpg

The problem is the HMAC part, w1_S7wKnOTg5zeqddi5QWS-83mI%3D. It is generated by calculating the SHA-1 HMAC of the URL with a secret key, and this key is only available in the production environment.
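To make the dependency on the key concrete, here is a minimal sketch of how such a signature can be computed. It assumes the signature is the URL-safe base64 encoding of an HMAC-SHA1 over the part of the URL that follows the signature; the `sign` helper and both key values are hypothetical and only for illustration.

```python
import base64
import hashlib
import hmac


def sign(url_path: str, security_key: str) -> str:
    """Return the URL-safe base64 HMAC-SHA1 of url_path (assumed scheme)."""
    digest = hmac.new(
        security_key.encode("utf-8"),
        url_path.encode("utf-8"),
        hashlib.sha1,
    ).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii")


path = "example-image-upload/be70c78afca44d46916fa455828179ed.jpg"

# Hypothetical keys: the real ones live only in each environment's config.
prod_sig = sign(path, "hypothetical-production-key")
staging_sig = sign(path, "hypothetical-staging-key")

# Same image path, different key, different first path segment.
print(prod_sig != staging_sig)
```

Because the signature becomes the first path segment of the stored object, two environments with different keys will never agree on where the same result image lives.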

We want exactly the same images in our staging environment, but since the key is different, we cannot use the same HMACs. If we just kept the buckets in sync, staging would not be able to read the images: the URLs it generates carry a different HMAC, and since the HMAC is part of the file path, it would not be able to find them.

What I would like is to be able to save the result image without the HMAC in the path. Something like: example-image-result-production/example-image-upload/be70c78afca44d46916fa455828179ed.jpg

If that were the case, we could simply sync the production S3 bucket to our staging S3 bucket, which would result in something like: example-image-result-staging/example-image-upload/be70c78afca44d46916fa455828179ed.jpg
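As a rough illustration of the key layout I am asking for, the mapping from the current object key to the desired one could look like the sketch below. `strip_signature` is a hypothetical helper, not part of thumbor or tc_aws, and it assumes the first path segment is a percent-encoded, 28-character base64 HMAC-SHA1 signature as in the example above. It only shows the mapping; thumbor would still need to look results up under the unsigned key for this to be useful.

```python
from urllib.parse import unquote


def strip_signature(key: str) -> str:
    """Drop the leading HMAC segment from a result-storage key, if present."""
    first, _, rest = key.partition("/")
    candidate = unquote(first)  # "%3D" in the stored key decodes to "="
    # Assumption: a base64-encoded SHA-1 digest is 28 characters ending in "=".
    if rest and len(candidate) == 28 and candidate.endswith("="):
        return rest
    return key


signed_key = (
    "w1_S7wKnOTg5zeqddi5QWS-83mI%3D/"
    "example-image-upload/be70c78afca44d46916fa455828179ed.jpg"
)
print(strip_signature(signed_key))
```

Keys that do not start with something that looks like a signature are returned unchanged, so running the mapping twice is harmless.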

Bladrak commented 6 years ago

Ok, I see what you mean now, thanks for explaining :) We store the image in the result storage based on the requested URL. This is done because we want to adopt the same behavior as thumbor's other storages, which also store based on the same URL (see: https://github.com/thumbor/thumbor/blob/master/thumbor/result_storages/file_storage.py#L34). I'm not sure that we can get the URL without the HMAC by default from the request in the context, but even if it were possible, I'm not sure we should diverge from the logic in Thumbor loaders at the moment. So changing the logic might be something to discuss with thumbor's team as well.

What precisely is the use case that requires you to synchronize the cache (i.e. the result storage) between your staging and production environments? Maybe there is another option there?

Bladrak commented 6 years ago

Closing for inactivity, feel free to re-open with more details.