napari / napari.github.io

website for the napari image viewer
https://napari.org/

Add script to create symlinks to duplicated videos #412

Closed melissawm closed 1 month ago

melissawm commented 1 month ago

Because of the multiple versions being deployed to the gh-pages website, our artifacts are growing and contain many duplicated videos (and images).

This PR adds a script, run from the unversioned_pages.yml action, that replaces duplicated videos in older versions of the docs (detected by their sha256sum) with symlinks to the most recent version (in dev). I chose this direction because dev will almost certainly contain a superset of the videos in the repo, so it made more sense. It does mean we are touching the older deployments, though, so this could be destructive if we don't keep the full repo history from now on.

A couple of points:

Any feedback is appreciated!
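The approach can be sketched roughly like this (a hedged sketch, not the actual script from the PR; the directory layout, the `.webm` extension, and the `dedup_videos` name are my assumptions):

```shell
#!/usr/bin/env bash
# Rough sketch of the idea: index the videos in dev/ by sha256, then replace
# byte-identical videos in older versioned directories with relative symlinks
# into dev/. Requires bash >= 4 (associative arrays) and GNU coreutils
# (sha256sum, realpath --relative-to).
dedup_videos() {
    local root=$1 f h version
    local -A dev_hash

    # Index dev/ videos by content hash.
    while IFS= read -r -d '' f; do
        h=$(sha256sum "$f" | cut -d' ' -f1)
        dev_hash["$h"]=$f
    done < <(find "$root/dev" -type f -name '*.webm' -print0)

    # Replace matching videos in each versioned directory with a symlink.
    for version in "$root"/0.*; do
        [[ -d $version ]] || continue
        while IFS= read -r -d '' f; do
            h=$(sha256sum "$f" | cut -d' ' -f1)
            if [[ -n ${dev_hash[$h]:-} ]]; then
                # Relative symlink, so links stay valid after deployment.
                ln -sf "$(realpath --relative-to="$(dirname "$f")" "${dev_hash[$h]}")" "$f"
            fi
        done < <(find "$version" -type f -name '*.webm' -print0)
    done
}
```

Making the symlinks relative to each file's own directory keeps them valid when the whole tree is pushed to gh-pages, where the absolute paths of the runner no longer exist.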

Czaki commented 1 month ago

Do you expect that the filenames of movies could change while keeping the same content?

melissawm commented 1 month ago

Not really at this point, but @jni mentioned we may want to do it in the future, in case we replace an old video with a new one (for example, showing the same functionality but with a new GUI look or theme).

jni commented 1 month ago

> I only have superficial knowledge of bash so I could certainly use reviews on the script 😅

🤣 Literally @DragaDoncila and I were sitting here looking at the script going 😮 "I can't believe she didn't do this in Python" 😂

But the comments are super good so we like it! Have you tested it on your local napari.github.io clone and checked that it works? 🙏

> SHA256 is not unique

I don't think that's a concern — this is how duplicate finders work throughout history. If we manage to make a new video with a hash that matches an existing video, I think it will be exciting enough that we won't care that we overwrote the old one. 😂

> I am only checking video files right now, but we could do the same to images

I think we should do this.

Anyway, this is amazing! 🥳 Gonna test locally and (hopefully) approve.

jni commented 1 month ago

OK, so I can't figure out how to run mapfile on macOS 😅: the command is missing and brew doesn't have it. So can you confirm that this works locally, @melissawm, and then we can merge? I'm super excited about this btw. 🥳 Optionally change the directory handling, but that's just a minor readability nitpick.

psobolewskiPhD commented 1 month ago

Looks like it's a bash builtin as of version 4, so if you `brew install bash` it should work?
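For reference (my illustration, not from the thread): `mapfile` (alias `readarray`) arrived in bash 4.0, while macOS still ships bash 3.2, so a script relying on it can fail fast with a useful message:

```shell
# mapfile is a bash >= 4.0 builtin; macOS ships bash 3.2, so guard first:
if ((BASH_VERSINFO[0] < 4)); then
    echo "bash >= 4 required (found $BASH_VERSION); try 'brew install bash'" >&2
    exit 1
fi

# mapfile reads lines from stdin into an array; -t strips the newlines.
mapfile -t versions < <(printf '0.4.19\n0.5.0\n0.5.1\n')
echo "${#versions[@]} versions, newest: ${versions[-1]}"
# -> 3 versions, newest: 0.5.1
```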

jni commented 1 month ago

Geez!

Apple: Please turn on automatic updates for all our OSs, it's good for your safety.

Also Apple: we ship our latest OS with a 10 year old version of bash.

😂

I ran this locally and everything works 🥳 All the older videos have symlinks and browsing them using the version switcher doesn't throw any problems. Awesome stuff @melissawm!

I would say definitely let's do the same thing with images because it's still a fair bit of disk usage even without the movies:

$ du -hs ./*
 72M    ./0.4.15
105M    ./0.4.16
103M    ./0.4.17
105M    ./0.4.18
108M    ./0.4.19
120M    ./0.5.0
124M    ./0.5.1

[...]

182M    ./dev

[...]
melissawm commented 1 month ago

Perfect! Will update. Cheers!

melissawm commented 1 month ago

For context: I did this in bash because otherwise the CI would have to set up Python etc., so I thought this would be cheaper. Also, it keeps me on my toes 🦉

melissawm commented 1 month ago

With this change:

➜  napari.github.io git:(duplicated-videos) du -hs ./*
63M     ./0.4.15
97M     ./0.4.16
96M     ./0.4.17
91M     ./0.4.18
93M     ./0.4.19
105M    ./0.5.0
84M     ./0.5.1

184M    ./dev

Which is OK but not great: the Sphinx gallery images are not being detected as duplicates because they get regenerated every time, so I guess the hash is different. We may have to think of another strategy there.

melissawm commented 1 month ago

Running pngquant on the images helps a lot, maybe we want to consider that?

Czaki commented 1 month ago

> Which is OK but not great: the Sphinx gallery images are not being detected as duplicates because they get regenerated every time, so I guess the hash is different. We may have to think of another strategy there.

The problem is that metadata (like generation time) affects the SHA. For images, there would need to be a pixel-wise comparison.

jni commented 1 month ago

This is already a huge improvement and will make sure that we are under 1 GB for 0.5.2, so I'm going to merge, thanks @melissawm! We can indeed improve things later by compressing the PNGs (Matthias used optipng iirc, but I guess pngquant does the same?) and doing pixel-wise comparisons. (But I guess those will also fail any time we change the viewer, for example. And in that case it's OK: the images are different, so we shouldn't try to deduplicate things that aren't actually duplicates.)