Closed brendanheywood closed 8 years ago
I think this is a priority. I want to move everybody onto virtual servers on the Internet backbone rather my LAN, and I want to have the image issues resolved before I do this.
Couple more task items:
I've very conscious that sorting our image serving is by far the # 1 performance bottle neck and eclipses anything else we could do anywhere else in the stack. That said we have a very large amount of code not yet shipped which I think we should release first. When you say 'everyone' are you just talking about us dev's or is there anything production which has a dependency on your lan?
When I say priority, I mean after the current release. So I align with your opinion.
No production relies on dev. Timing dependency is when Nicky needs his own dev then I will build him a Internet backbone dev rather then my Lan dev. The image stuff will have to be in place by then so dev can work off an independent image storage. Maybe we can use static with a prod image store and a dev image storage so we only need one large disk for all our devs and prod.
Thinking out aloud of an easy implementation strategy
We should now be accessing some performance benefits. But work could continue as follows:
Something to consider is swapping to a content addressable way of storing files. This is how git, dropbox, moodle and almost every other big store / cdn works
https://en.wikipedia.org/wiki/Content-addressable_storage
ie this is what the moodles internal file store looks like:
brendan@silverbirch 09:21:40 /var/lib/sitedata/vanilla-moodle/filedir $ find |less
.
./d0
./d0/2d
./d0/2d/d02dc787583d8b59bdd6fb045c344708ccefaba4
./d0/be
./d0/be/d0be22d2cb1240b0142d5692b992a28c2dec1a04
./d0/3d
./d0/3d/d03d73a6237f9e8082500d9ebe59de01e9a76b4c
./7d
./7d/3a
./7d/3a/7d3ac1f20e669d5dc1c60704dbcbbb6a65a013e8
./44
./44/56
./44/56/4456a4614ac765b4135d339e055f97b2f7230b2f
and the db model (just a small subset):
id | contenthash | filepath | filename | mimetype
----+------------------------------------------+----------+----------------+------------
1 | 41cfeee5884a43a4650a851f4f85e7b28316fcc9 | / | background.jpg | image/jpeg
3 | fb262df98d67c4e2a5c9802403821e00cf2992af | / | smile.png | image/png
5 | a4f146f120e7e00d21291b924e26aaabe9f4297a | / | sad.png | image/png
6 | 33957e31ba9c763a74638b825f0a9154acf475e1 | / | tick.png | image/png
7 | d613d55f37bb76d38d4ffb4b7b83e6c694778c30 | / | cross.png | image/png
Every file is stored as the sha1 hash of it's contents. This has many benefits like automattic de-duplicating at the store level. And it makes it much easier to shard the file system across multiple servers or mounts later, or re-shard as the content grows without touching the file metadata (eg we don't need the 'volume' concept any more)
FYI, I have installed thumbor on static
out of the box install without any config. Note I had to install about three undocumented obscure dependencies.
apt-get install python-pip python-dev build-essential pip install --upgrade pip # note that this removed pip from path so I needed to do sh to continue pip install --upgrade virtualenv apt-get install libcurl4-openssl-dev pip install --upgrade setuptools pip install thumbor thumbor&
Notes:
Ok some rough general todos:
http://static.thecrag.com/t/300x200/photo/00003/original/724238361
- [x] test it still all works via https
https://static.thecrag.com/t/300x200/photo/00003/original/724238361
- [ ] get rid of safe mode and get client running on dev to sign the urls https://github.com/thumbor/thumbor/wiki/Configuration#security_key
https://static.thecrag.com/t//cryptohash300x200/photo/00003/original/724238361
- [ ] get AUTO_WEBP = true serving turned on
https://static.thecrag.com/t/blahblahblah/300x200/photo/00003/modified/724238361.jpg
- [ ] get cahe and expire headers configured right, at the moment isn't far futures and has caching off, assume this is due to the unsafe mode stuff so may just start working right after that's fixed
- [ ] I don't mind face detection on for avatars, I'm on the fence for the general climbing pics (probably off), but they should definitely be off for the topos. May be better to just turn them all off for the first rollout
- [ ] check MAX_WIDTH and MAX_HEIGHT, we will be sending some big images some times (find a big image to test)
- [x] given that rotate is just another filter, and it's desirable to give access to the templates to use other filters (blur, colorize, rounded corners, there are a bunch I had immediate ways to use in different places), it seems that the addition of the rotate() has to be done in the templates rather than the backend sending a preformed url
and graft the route onto static (via /t/ ?)
I am a little confused at how we do this. In apache land and redirect to the thumbor port, or use something like Ngnix as front end, or just use the python Tornado webserver for all static web traffic. Or have I missed something alltogether?
it seems that the addition of the rotate() has to be done in the templates rather than the backend sending a preformed url
I am a little confused with what you are saying.
We needed to rotate because EXIF data did not always align with what looked like the natural orientation. In this case the user did the re-orientation, which was stored as an EXIF override meta data setting.
IMO we should not adjust the original for any reason, this includes correcting bad EXIF data. Instead we should store additional rotate meta data in the database model and pass this to the templates. Is this what you meant by 'done in the templates'?
There is design choice to make here. Given that we have to store rotate meta data in the db model, do we disable EXIF on the server (RESPECT_ORIENTATION) and store all EXIF in the db model, or do we enable EXIF on the server and just store the EXIF delta? (I vote for the former.)
I am not sure if this will be useful, but what if we had a subdomain image.thecrag.com (which pointed to static) and then redirected everything on image.thecrag.com to static.thecrag.com:8888
Here is an example of apache config:
Comments below
and graft the route onto static (via /t/ ?) I am a little confused at how we do this. In apache land and redirect to the thumbor port, or use something like Ngnix as front end, or just use the python Tornado webserver for all static web traffic. Or have I missed something alltogether?
It's trivial, google apache mod_proxy. We never want redirect only internal forwarding
it seems that the addition of the rotate() has to be done in the templates rather than the backend sending a preformed url
I am a little confused with what you are saying.
We needed to rotate because EXIF data did not always align with what looked like the natural orientation. In this case the user did the re-orientation, which was stored as an EXIF override meta data setting.
IMO we should not adjust the original for any reason, this includes correcting bad EXIF data. Instead we should store additional rotate meta data in the database model and pass this to the templates. Is this what you meant by 'done in the templates'?
Yup, pass the data to the templates so it can rotate and optionally apply other filters as well
There is design choice to make here. Given that we have to store rotate meta data in the db model, do we disable EXIF on the server (RESPECT_ORIENTATION) and store all EXIF in the db model, or do we enable EXIF on the server and just store the EXIF delta? (I vote for the former.)
Don't really have an opinion. My only requirement is that the exit data in the file is honoured as the default which we can then swivel
No redirects they are a performance killer.
We should be going the opposite direction, turn on spdy and http2 and then turn off static completely.
Sent from my iPhone
On 04/10/2015, at 2:56 PM, Simon Dale notifications@github.com wrote:
I am not sure if this will be useful, but what if we had a subdomain image.thecrag.com (which pointed to static) and then redirected everything on image.thecrag.com to static.thecrag.com:8888
Here is an example of apache config:
— Reply to this email directly or view it on GitHub https://github.com/theCrag/website/issues/2029#issuecomment-145315678.
From Wikipedia: "As of February 2015, Google has announced that following the recent final ratification of the HTTP/2 standard, support for SPDY would be deprecated, and that support for SPDY would be withdrawn completely in 2016"
Are you suggesting that http2 will remove one of the requirements that got us to use static. There are other reasons to have a static server, but these will probably go over time as well.
Yup SPDY was the basis for the start of the http2 spec which replaces it. Between them both support is great at around 70% (using our GA stats with caniuse). Every evergreen browser supports it.
http://caniuse.com/#feat=http2
And yes the primary reason why you want a static server is cookieless domains, and domain sharding. Both of these are solved by the new specs and once we turn them on became an anti feature. The other things you get from a full cdn are early ssl termination, edge serving, etc which would be nice too but aren't a big deal for us until we grow a bunch more
And I am completely agnostic about to underlying server architecture, the only thing I care about here is the client facing services. We could still have thumber being served by static under the hood and reverse proxied from www.thecrag.com, but honestly given it's current bandwidth constraints I think thats not ideal.
Whenever we want, don't we just turn on http2 on in apache in prod. I presume our version of apache supports this mod. Let's just get thumbor image serving working first, I don't want to many distractions.
@brendanheywood do we want a separate repository for dev and prod images, or have one for both. We need to solve images issue in dev as well as prod. If we just have one image repository for both dev and prod then this will save a lot of system resources.
Also in reference to your response for EXIF rotation.
Don't really have an opinion. My only requirement is that the exit data in the file is honoured as the default which we can then swivel
If we do it this way then we must store the rotation change between EXIF and what the user selected. Currently we store the absolute as an EXIF override. We will have to test this thoroughly to make sure thumbor behaves as I would expect it too.
I think we could easily implement a custom loader in dev so it looks for images locally, an then if it doesn't find it then it goes over the network or whatever and grabs the image from prod. So we'd have a dev and prod thumbor so we can play with dev config safely but store almost nothing.
@brendanheywood I have come across a blocking issue with something that I thought should be a trivial piece of work. I am going crazy trying to work out why I cannot set https for thumbor. There is nothing in the documentation. Am I missing something really obvious????
The way around this is to just do this in the proxy https -> http. This works, but I am guessing could be a security hole.
apache config: /etc/apache2/sites-enabled/003-cids-static thumbor config: /etc/thumbor.conf
I think we could easily implement a custom loader in dev so it looks for images locally, an then if it doesn't find it then it goes over the network or whatever and grabs the image from prod. So we'd have a dev and prod thumbor so we can play with dev config safely but store almost nothing. Yes but for the sake of a couple of spurious images do we need to write and maintain a custom image loader?
We are going to be loading through the filesystem, so I was thinking we could have something like /opt/image-store/thecrag /opt/image-store/thecrag/dev One way of doing this is a custom loader would try and load /opt/image-store/thecrag/dev first then if not there load /opt/image-store/thecrag.
Another way is to have our own local thumbors which tried local filesystem, if failed went to prod thumbor.
Alternatively we could generate a hash file key of something like d02dc787583d8b59bdd6fb045c344708ccefaba4-dev for images uploaded from dev whic is stored in a file ./d0/2d/d02dc787583d8b59bdd6fb045c344708ccefaba4-dev. This way dev and prod would look exactly the same, which will keep the code base small. I don' have to write and maintain a custom loader. We can still go through the image store and remove dev images if we wanted to. I am leaning to this method because it has the smallest code base. Downside is that dev need access to Internet to get images.
Re ssl, I would have done ssl termination in apache and proxy to thumbor exactly as you have done. This is standard, no security issues
Re custom loader, not too fussed but the "tried local filesystem, if failed went to prod thumbor" method I think is the simplest. The custom loader will be 20 lines of python so I don't see that as a big deal, just munge the file system and http loaders together
https://github.com/thumbor/thumbor/blob/master/thumbor/loaders/file_loader.py
How important is it to sign the urls. As far as I can see there are a couple of reasons why we would want to do this.
https://github.com/thumbor/thumbor/wiki/Security
I ask because it potentially has implications on the client side. The whole url plus arguments must be encoded in the hash. This means the template must be doing this on the fly or javascript must be doing this on the fly where images are being brought in by javascript. Actually the javascript is a problem because it will expose the security key, so we need to pass in the whole encoded url from the template, which means we can only use pre-defined arguments for images in javascript, or write a service to encode arguments.
Is it worth the effort for the initial port to thumbor?
1) is the main driver. If its not an issue then probably doesn't matter
However long term I think we should sign so we could go live without and migrate all urls to images to server side stuff and then turn on later
I think 1) is potentially an issue, but there are a lot of other use cases on the server with this issue. So I think we plan to release without signing. If I get the urge I will put on signing, otherwise we can do it for a future release.
Given that I have taken the word 'unsafe' out of the static proxy url. We do not want the word unsafe being seen by our customers. So access to images will now look something like:
FYI, I have decided to bite the bullet and learn how to branch with git. This work is on the thumbor-int branch. The config file changes have been pushed to repo.
Also, just thinking out aloud, if somebody is embedding a theCrag photo in a discussion then I presume we just call the service with no arguments and let thumbor worry about the size???
https://static.thecrag.com/img/http://www.thecrag.com/image/photo/00003/modified/724238361.jpg
Found this:
http://blogs.bethel.edu/web-services/2014/04/the-responsive-image-challenge-part-2/
Have you ever used imager.js or something like http://placehold.it/ ?
In this comment I am thinking out aloud, but feel free to put in your views.
The db model currently has a few tables which hold meta data about images. Actually the meta data currently the width and height of the various dimensions (NULL if not available). The image id comes from the photo record ID (or topo record id, etc) and is what is used in the filesystem.
Note that there use to be a 'Resource' table which maps fairly well to the concept of an asset.
I think we now want a single original image table that is referenced by each of the Photo, Topo, etc tables. This Image table stores the
I don't want to go back to the Resource table because it does not store the width and height.
We need to know the width and height of the original so that we do not request images from thumbor which are bigger than the original and/or force cropping unless we want it cropped.
We then need some helper functions which take the original height and width and create standard sizes. For example if we want images which are always 100 pixels wide but not cropped then we have to work out the target height for thumbor without cropping. The helper functions should possible just replicate the standard dimensions initially. This means we can cutover functionality.
I updated Rotation. We need to know any final user rotation required after upload.
Also, just thinking out aloud, if somebody is embedding a theCrag photo in a discussion then I presume we just call the service with no arguments and let thumbor worry about the size???
https://static.thecrag.com/img/http://www.thecrag.com/image/photo/00003/modified/724238361.jpg
We should parse it and rewrite it back to the size we want, the size we want should be determined by the markdown output and not by the author, ie in some context would want small, in others big etc.
Also can you rewrite out the second www part so instead of this:
it's just this:
https://static.thecrag.com/img/300x200/image/photo/00003/modified/724238361.jpg
or possibly even this: (.jpg may be wrong if retyped to webp on the fly)
https://static.thecrag.com/img/300x200/photo/00003/original/724238361
(we should only ever be dealing with scaled versions of 'original' and never 'modified' etc, the small versions will all be removed??)
"The db model currently has a few tables which hold meta data about images."
Yeah the internals is all up to you. Don't know it well enough to comment. I didn't think it needed to change that much really, we could still have the 'sizes' array but it would only ever contain an 'original' size and all the others would get removed. The only extra data model field would be the rotation.
We then need some helper functions which take the original height and width and create standard sizes. For example if we want images which are always 100 pixels wide but not cropped then we have to work out the target height for thumbor without cropping. The helper functions should possible just replicate the standard dimensions initially. This means we can cutover functionality.
I think you get that all out of the box with thumbor, eg:
vs
https://static.thecrag.com/img/300x/http://www.thecrag.com/image/photo/00003/modified/724238361.jpg
Agreed.
Thanks for the heads up with the non-cropping resize for thumbor. This will definitely make the helper function somewhat lighter.
Found this: http://blogs.bethel.edu/web-services/2014/04/the-responsive-image-challenge-part-2/
I missed this comment before: there are heaps of different script that do stuff like this, I think this is more or less a duplicate of #1761 (lazy defer load)
Have you ever used imager.js or something like http://placehold.it/ ?
Re imager.js yes, also native srcset is coming along nicely too (at 42% http://caniuse.com/#feat=srcset)
Re placehold.it yes but not sure what you have in mind for this? Typically these types of services are used for mockups by design agencies before real content has arrived. (I like https://placekitten.com/)
If you are thinking of having something load in the interim until the real image loads then I wouldn't use yet another image, I'd do it as very lightweight inline html / css, like just a grey box with the alt text visible which would then get replaced by the image when you scrolled down andor it loaded etc.
I have added some extra stuff to the DB Model.
+---------------------------+---------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------------------------+---------------------+------+-----+---------+-------+
| CreateDate | int(11) | NO | MUL | NULL | |
| ID | bigint(20) | NO | PRI | NULL | |
| RecordOrder | bigint(20) unsigned | NO | | NULL | |
| Height | double | NO | | NULL | |
| Width | double | NO | | NULL | |
| CreatedBy | bigint(20) | NO | MUL | NULL | |
| HashID | varchar(255) | NO | MUL | NULL | |
| Rotate | double | YES | | NULL | |
| RecordStatus | tinyint(1) | NO | | NULL | |
| LastUpdated | int(11) | YES | | NULL | |
| MimeType | varchar(255) | YES | | NULL | |
| Filename | varchar(255) | YES | | NULL | |
| Path | varchar(255) | YES | | NULL | |
| FieldOfView | varchar(255) | YES | | NULL | |
| GPSDirection | varchar(255) | YES | | NULL | |
| GPSPoint | varchar(255) | YES | | NULL | |
| DepthOfField | varchar(255) | YES | | NULL | |
| TimeTaken | int(11) | YES | MUL | NULL | |
| EXIFOrientationCorrection | double | YES | | NULL | |
| CameraMake | varchar(255) | YES | MUL | NULL | |
| CameraModel | varchar(255) | YES | MUL | NULL | |
| EXIFOrientation | double | YES | | NULL | |
+---------------------------+---------------------+------+-----+---------+-------+
MimeType as detected. Filename as uploaded. Path as you suggested. I am not sure if we will end up using these as the images will be served by Thumbor which will not have access to the db model, just the filesystem. Anyway they are here for completeness.
EXIFOrientation is the Orientation stored in the EXIF data. From time to time the user rotates the images when they are uploaded, in which case this is stored in the EXIFOrientationCorrection field. Note that Thumbor has been set up to honor the EXIF Orientation data, however it does not know about the EXIFOrientationCorrection value. This is where the Rotate field comes in, which tells the template to set a rotate value (90,180 or 270) so that the image is rotated from the EXIFOrientation to the EXIFOrientationCorrection. (Note that this is a delta rotation from the already encoded EXIF data in the original image). This must be tested thoroughly.
The GPSPoint, GPSDirection, FieldOfView, DepthOfField, TimeTaken, CameraMake and CameraModel fields are all deduced from EXIF data. We had previously discussed and implemented this directly onto the Photo and Topo tables. I am moving them to the new image db model, so it collects it for all image uploads.
GPSPoint, GPSDirection, FieldOfView, DepthOfField, TimeTaken, CameraMake
hey that's a nice bonus :) will this get retro extracted from the existing body of photos? I have some quick thoughts on that which I'll knock up in a new issue...
Yup, will be retro populated. Actually it was already in the photo and topo fields. It is way better in a single common table. On Oct 7, 2015 2:22 PM, "Brendan Heywood" notifications@github.com wrote:
GPSPoint, GPSDirection, FieldOfView, DepthOfField, TimeTaken, CameraMake
hey that's a nice bonus :) will this get retro extracted from the existing body of photos? I have some quick thoughts on that which I'll knock up in a new issue...
— Reply to this email directly or view it on GitHub https://github.com/theCrag/website/issues/2029#issuecomment-146068338.
Some test use cases for rotation
http://www.thecrag.com/photo/183901959 # EXIF 6, corrected 1, Rotate 270 http://www.thecrag.com/photo/186275373 # EXIF 1, corrected 6, Rotate 90 http://www.thecrag.com/photo/275428440 # No EXIF, corrected 6, Rotate 90 - note that this is an avatar that has been intentionally rotated. http://www.thecrag.com/photo/25060522 # EXIF 6, no correction, no rotation
Ahh, I assumed Thumbor rotates clockwise, it is anti-clockwise. Also there are some older images which we did not rotate in accordance with the EXIF. For example the following image has an EXIF Orientation of 6, but is stored as if it had no Orientation:
http://www.thecrag.com/photo/103356465
When I regenerated the image it came good. This implies there was an historical bug in my code somewhere. The new process will automatically correct a lot of examples, but the inconsistency worries me a little. We will have to see how the end result looks.
New set of test examples and results:
http://www.thecrag.com/photo/183901959 # EXIF 6, corrected 1, Rotate 90 http://www.thecrag.com/photo/186275373 # EXIF 1, corrected 6, Rotate 270 http://www.thecrag.com/photo/275428440 # No EXIF, corrected 6, Rotate 270 - note that this is an avatar that has been intentionally rotated. http://www.thecrag.com/photo/427087752 # EXIF 6, no correction, no rotation
tested the above and am happy that the rotation algorithm works as I expect it to.
I think we will have to allow users to rotate images after they have been uploaded.
Next phase involves integration. Here is a rough task list:
Config
Upload
Prod Loader
Dev Loader
Data
Web templates
grep image * | grep volume
in various template directories should get them allDelete
Thumbor config
Specific PDF stuff:
I'll mention this now as I'm not sure whether it's a distraction or would be more convenient to implement now, but with the topos it would be much better if the topo tool rendered exclusively as some large size, probably modified or it's equivalent, and then the output topo image is then fed into thumbor to be resized as needed. This may mean thumbor gets used twice, but will mean that the topos will always come out looking consistent when scaled. (ie no really fat looking lines on small topos)
Later, distraction. Will topo lines be ok being reduced a second time given they will not be vector graphics? On Oct 9, 2015 7:01 PM, "Brendan Heywood" notifications@github.com wrote:
I'll mention this now as I'm not sure whether it's a distraction or would be more convenient to implement now, but with the topos it would be much better if the topo tool rendered exclusively as some larger know size, probably modified or it's equivalent, and then the output topo image is then fed into thumbor to be resized as needed. This may mean thumbor gets used twice, but will mean that the topos will always come out looking consistent when scaled. (ie no really fat looking lines on small topos)
— Reply to this email directly or view it on GitHub https://github.com/theCrag/website/issues/2029#issuecomment-146790396.
Also, we can just make it bigger, but use the same width in cm. This will do the same thing. On Oct 9, 2015 8:59 PM, "Simon Dale" scd@thecrag.com wrote:
Later, distraction. Will topo lines be ok being reduced a second time given they will not be vector graphics? On Oct 9, 2015 7:01 PM, "Brendan Heywood" notifications@github.com wrote:
I'll mention this now as I'm not sure whether it's a distraction or would be more convenient to implement now, but with the topos it would be much better if the topo tool rendered exclusively as some larger know size, probably modified or it's equivalent, and then the output topo image is then fed into thumbor to be resized as needed. This may mean thumbor gets used twice, but will mean that the topos will always come out looking consistent when scaled. (ie no really fat looking lines on small topos)
— Reply to this email directly or view it on GitHub https://github.com/theCrag/website/issues/2029#issuecomment-146790396.
Should be, needs testing and tuning
If we want to pass our static topos through thumbor we will have to write a custom loader for dev and for prod.
In prod:
In dev
It is way more convenient if I do this release in two stages. First stage will just build the hash filesytem in parallel to the existing filesystem. This will allow me then to test the cutover on dev without having to find more disk space somehow. ie images which are not here locally go to static.
I will try and kick off stage one early this week.
Interesting minor complication with hard deletes of images. Because the new image filesystem storage will store the same image at the same address (ie the content is the same the filename will be the same) this means that if somebody uploads a photo twice there will be one image stored, but two photo records pointing to the same image record/file. Also if somebody makes a topo from an existing photo, then there will be both a photo and topo record pointing to the same image record/file.
Soft deleting should work in exactly the same way by changing the status of the master Photo/Topo record.
Because of the possibility of multiple references to one image record/file, hard deleting needs to take into an additional consideration. Hard deleting should soft delete plus remove the link from the photo and if there are no more references to the image record/file then remove that as well.
High level:
Current tool of choice is:
http://thumbor.org/
Rough todo list: