simonw / til

Today I Learned
https://til.simonwillison.net
Apache License 2.0

Move post card images to S3 #74

simonw closed this issue 1 year ago

simonw commented 1 year ago

Following:

simonw commented 1 year ago

I can increase the resolution of the images when I do this too, since they will no longer need to be kept small to save space.

I can use the til.simonwillison.net S3 bucket for this.

simonw commented 1 year ago

Images are currently generated by shot-scraper run from this Python script:

https://github.com/simonw/til/blob/e2e4819d33613410efa533541262599f23fd6223/generate_screenshots.py#L15-L36

simonw commented 1 year ago

Huh... those are PNGs. I bet they'd be a lot smaller if they were JPEGs, and even retina JPEGs might be smaller while still displaying well.

simonw commented 1 year ago

Ran this locally:

datasette . --get /sqlite/multiple-indexes > generate.html

Then:

shot-scraper shot generate.html -w 800 -h 400 --retina

Got this 216KB image:

[Image: generate-html]

Tried a JPEG too. Quality 80 was almost as big as the PNG, but quality 60 got a smaller image (159KB):

shot-scraper shot generate.html -w 800 -h 400 --retina --quality 60

[Image: generate-html 1]
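
The two runs above differ only in the `--quality` flag, so the command is easy to build programmatically. A minimal sketch, assuming the script shells out to `shot-scraper`; the helper name `shot_command` and its defaults are hypothetical and not taken from `generate_screenshots.py`:

```python
from typing import List, Optional


def shot_command(
    target: str,
    width: int = 800,
    height: int = 400,
    retina: bool = True,
    quality: Optional[int] = None,
    output: Optional[str] = None,
) -> List[str]:
    """Build the argument list for a `shot-scraper shot` invocation.

    Hypothetical helper: passing `quality` asks shot-scraper for a JPEG,
    leaving it as None keeps the default PNG output.
    """
    cmd = ["shot-scraper", "shot", target, "-w", str(width), "-h", str(height)]
    if retina:
        cmd.append("--retina")
    if quality is not None:
        cmd += ["--quality", str(quality)]
    if output is not None:
        cmd += ["-o", output]
    return cmd


if __name__ == "__main__":
    # Same flags as the quality-60 run above.
    print(" ".join(shot_command("generate.html", quality=60)))
```

The resulting list could be handed to `subprocess.run(cmd, check=True)`.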

simonw commented 1 year ago

The biggest question is how to tell whether an image has already been created in S3.

I'm tempted to do it based on the filename: use the shot hash as the filename, do a quick list-files operation to see which files already exist, then create the ones that don't.
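
A minimal sketch of that filename-based check, assuming the list of existing keys has already been fetched from the bucket. The function name `missing_shots` and the `.jpg` suffix default are illustrative assumptions, not the code that ended up in the repository:

```python
from typing import Iterable, List


def missing_shots(
    shot_hashes: Iterable[str],
    existing_keys: Iterable[str],
    suffix: str = ".jpg",
) -> List[str]:
    """Return the object keys that still need to be generated and uploaded.

    Each shot hash maps to the key "<hash><suffix>". Keys already present
    in the bucket listing are skipped, as are duplicate hashes.
    """
    existing = set(existing_keys)
    queued = set()
    missing = []
    for shot_hash in shot_hashes:
        key = shot_hash + suffix
        if key not in existing and key not in queued:
            queued.add(key)
            missing.append(key)
    return missing


if __name__ == "__main__":
    already_there = ["0cf1e455f161435a4aea07480c27da89.jpg"]
    wanted = [
        "0cf1e455f161435a4aea07480c27da89",
        "1447c8cdd4caa68e5514a1bb5b9f9f49",
    ]
    print(missing_shots(wanted, already_there))
```

Because the key is derived from the shot hash, a changed page produces a new key and is picked up as missing, while unchanged pages cost nothing beyond the single listing call.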

simonw commented 1 year ago

That should run in GitHub Actions and generate JPEGs for every post and upload them to S3.

https://github.com/simonw/til/actions/runs/4842339363/jobs/8629221973

simonw commented 1 year ago

It's working...

% s3-credentials list-bucket til.simonwillison.net
[
  {
    "Key": "0cf1e455f161435a4aea07480c27da89.jpg",
    "LastModified": "2023-04-30 03:54:06+00:00",
    "ETag": "\"c1ef69673fda4ebf1cd1cfa41d8dc255\"",
    "Size": 90039,
    "StorageClass": "STANDARD"
  },
  {
    "Key": "1447c8cdd4caa68e5514a1bb5b9f9f49.jpg",
    "LastModified": "2023-04-30 03:54:12+00:00",
    "ETag": "\"4adfdd03def8e54c651451f5b56e43b9\"",
    "Size": 111841,
    "StorageClass": "STANDARD"
  },
  {
    "Key": "14e4b902d5511a639a6c8d1e91d3dabb.jpg",
    "LastModified": "2023-04-30 03:54:35+00:00",
    "ETag": "\"2d3e29f3eaca62ba688c04a82d923fba\"",
    "Size": 118002,
    "StorageClass": "STANDARD"
  },

simonw commented 1 year ago

Generated image example: http://s3.amazonaws.com/til.simonwillison.net/f19a4a99ca28b20786ed7e35d8f9a8e7.jpg

simonw commented 1 year ago

To see how many are done:

% s3-credentials list-bucket til.simonwillison.net | jq length
43

410 total.
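
The same count can be turned into a progress figure. A small sketch, assuming the JSON array printed by `s3-credentials list-bucket` has been captured as a string; `upload_progress` is a hypothetical helper name:

```python
import json
from typing import Tuple


def upload_progress(listing_json: str, total: int) -> Tuple[int, float]:
    """Return (objects uploaded so far, fraction of the expected total).

    `listing_json` is expected to be a JSON array with one entry per object,
    which is the same thing `jq length` counts.
    """
    objects = json.loads(listing_json)
    done = len(objects)
    return done, done / total


if __name__ == "__main__":
    sample = '[{"Key": "0cf1e455f161435a4aea07480c27da89.jpg", "Size": 90039}]'
    done, fraction = upload_progress(sample, total=410)
    print(f"{done} of 410 uploaded ({fraction:.1%})")
```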

simonw commented 1 year ago

Partial logs from that GitHub Actions run:

Stored 96126 byte JPEG for github-actions_grep-tests.md shot hash 3e71efb58ec2d72ce37d6c93d7ace74e
Stored 70990 byte JPEG for github-actions_commit-if-file-changed.md shot hash 3b4a2012993962434fc8f5853cf5396b
Stored 72935 byte JPEG for bash_loop-over-csv.md shot hash d06963c31326ae773a8e7face614668c
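
Log lines in this shape are easy to summarise. A sketch that parses them, assuming every line follows the `Stored N byte JPEG for PATH shot hash HASH` pattern seen above; the regular expression and the names `StoredShot` and `parse_stored` are my own, not part of the script:

```python
import re
from typing import List, NamedTuple

# Assumed line format, inferred from the three log lines above.
LOG_LINE = re.compile(
    r"^Stored (\d+) byte JPEG for (\S+) shot hash ([0-9a-f]{32})$"
)


class StoredShot(NamedTuple):
    size: int
    path: str
    shot_hash: str


def parse_stored(log_text: str) -> List[StoredShot]:
    """Extract (size, path, shot_hash) from each matching log line."""
    shots = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match:
            shots.append(
                StoredShot(int(match.group(1)), match.group(2), match.group(3))
            )
    return shots


if __name__ == "__main__":
    log = (
        "Stored 96126 byte JPEG for github-actions_grep-tests.md "
        "shot hash 3e71efb58ec2d72ce37d6c93d7ace74e\n"
        "Stored 72935 byte JPEG for bash_loop-over-csv.md "
        "shot hash d06963c31326ae773a8e7face614668c\n"
    )
    shots = parse_stored(log)
    print(len(shots), "images,", sum(s.size for s in shots), "bytes")
```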

simonw commented 1 year ago

It finished. All 410 images should be there now.

simonw commented 1 year ago

This query shows all the images on one page:

select
  json_object(
    'img_src',
    'https://s3.amazonaws.com/til.simonwillison.net/' || shot_hash || '.jpg',
    'width',
    400
  ) as img
from
  til

https://til.simonwillison.net/tils

I scrolled through and they all look good. This one was a favourite: https://s3.amazonaws.com/til.simonwillison.net/990ce33b65e40356be0035f185b3484c.jpg
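
For checking the URLs outside Datasette, the same construction can be sketched in Python against SQLite directly. This assumes only that the `til` table has a `shot_hash` column, as the query above does; the in-memory table and the helper names are hypothetical:

```python
import sqlite3
from typing import Dict, List, Union

BASE_URL = "https://s3.amazonaws.com/til.simonwillison.net/"


def image_cards(
    conn: sqlite3.Connection, width: int = 400
) -> List[Dict[str, Union[str, int]]]:
    """Build one {"img_src", "width"} dict per row, like the SQL above."""
    rows = conn.execute(
        "select ? || shot_hash || '.jpg' from til", (BASE_URL,)
    )
    return [{"img_src": row[0], "width": width} for row in rows]


def demo_connection() -> sqlite3.Connection:
    """Hypothetical stand-in for the real database: one row, one hash."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table til (shot_hash text)")
    conn.execute(
        "insert into til values (?)", ("990ce33b65e40356be0035f185b3484c",)
    )
    return conn


if __name__ == "__main__":
    for card in image_cards(demo_connection()):
        print(card["img_src"])
```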

simonw commented 1 year ago

Last steps:

simonw commented 1 year ago

Oops, broke it:

Traceback (most recent call last):
  File "generate_screenshots.py", line 92, in <module>
    generate_screenshots(root)
  File "generate_screenshots.py", line 55, in generate_screenshots
    shot_html_hash.update(filepath.read_text().encode("utf-8"))
  File "/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/pathlib.py", line 1236, in read_text
    with self.open(mode='r', encoding=encoding, errors=errors) as f:
  File "/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/pathlib.py", line 1222, in open
    return io.open(self, mode, buffering, encoding, errors, newline,
  File "/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/pathlib.py", line 1078, in _opener
    return self._accessor.open(self, flags, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/home/runner/work/til/til/main/templates/row.html'
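
The traceback shows the hash being updated with the contents of a template file that does not exist at the path the script computed. A sketch of one way to make that failure more obvious, checking every path before hashing. This is not the fix that was applied here; `shot_hash` is a hypothetical helper and the use of MD5 is an assumption based on the 32-character hashes:

```python
import hashlib
from pathlib import Path
from typing import Iterable


def shot_hash(html: str, template_paths: Iterable[Path]) -> str:
    """Hash the page HTML together with the templates that affect the shot.

    Raises FileNotFoundError up front, naming every missing template, rather
    than failing partway through inside read_text().
    """
    paths = [Path(p) for p in template_paths]
    missing = [str(p) for p in paths if not p.is_file()]
    if missing:
        raise FileNotFoundError(
            "Template files not found (check the working directory): "
            + ", ".join(missing)
        )
    digest = hashlib.md5()  # assumption: any stable digest would do
    digest.update(html.encode("utf-8"))
    for path in paths:
        digest.update(path.read_text(encoding="utf-8").encode("utf-8"))
    return digest.hexdigest()


if __name__ == "__main__":
    try:
        shot_hash("<h1>demo</h1>", [Path("no-such-dir/row.html")])
    except FileNotFoundError as error:
        print(error)
```

Including the templates in the hash means a template change regenerates every image, which is why a wrong template path breaks the whole run.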

simonw commented 1 year ago

That's deployed now.

https://developers.facebook.com/tools/debug/?q=https%3A%2F%2Ftil.simonwillison.net%2Fllms%2Ftraining-nanogpt-on-my-blog shows this:

[Image: screenshot of the Facebook sharing debugger result]

simonw commented 1 year ago

Wrote this up as a TIL: https://til.simonwillison.net/shot-scraper/social-media-cards