**Closed** · melissawm closed this 1 month ago
Did you expect that filenames of movies could change while keeping the same content?
Not really at this point, but @jni mentioned we may want to do it in the future in case we replace an old video with a new one (for example, we may be showing the same functionality but with a new GUI look or theme).
I only have superficial knowledge of bash so I could certainly use reviews on the script 😅
🤣 Literally @DragaDoncila and I were sitting here looking at the script going 😮 "I can't believe she didn't do this in Python" 😂
But the comments are super good so we like it! Have you tested it on your local napari.github.io clone and checked that it works? 🙏
SHA256 is not unique
I don't think that's a concern — this is how duplicate finders work throughout history. If we manage to make a new video with a hash that matches an existing video, I think it will be exciting enough that we won't care that we overwrote the old one. 😂
I am only checking video files right now, but we could do the same to images
I think we should do this.
Anyway, this is amazing! 🥳 Gonna test locally and (hopefully) approve.
Ok so I can't figure out how to run mapfile on macOS 😅 — command is missing and brew doesn't have it. So can you confirm that this works locally @melissawm and then we can merge? I'm super excited about this btw. 🥳 Optionally change the directory handling, but that's just a minor readability nitpick.
Looks like it's a bash builtin as of version 4, so if you `brew install bash` it should work?
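For reference, a quick way to check it's working (filenames here are made up):

```shell
#!/usr/bin/env bash
# mapfile (alias: readarray) is a bash >= 4 builtin that reads stdin
# lines into an array; macOS ships bash 3.2, which predates it.
printf 'a.webm\nb.webm\n' > videos.txt
mapfile -t videos < videos.txt   # -t strips the trailing newlines
echo "count=${#videos[@]} first=${videos[0]}"
# prints: count=2 first=a.webm
rm videos.txt
```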
Geez!
Apple: Please turn on automatic updates for all our OSs, it's good for your safety.
Also Apple: we ship our latest OS with a 10-year-old version of bash.
😂
I ran this locally and everything works 🥳 All the older videos have symlinks and browsing them using the version switcher doesn't throw any problems. Awesome stuff @melissawm!
I would say definitely let's do the same thing with images because it's still a fair bit of disk usage even without the movies:
```
$ du -hs ./*
 72M  ./0.4.15
105M  ./0.4.16
103M  ./0.4.17
105M  ./0.4.18
108M  ./0.4.19
120M  ./0.5.0
124M  ./0.5.1
[...]
182M  ./dev
[...]
```
Perfect! Will update. Cheers!
For context: I did this in bash because otherwise the CI would have to set up Python etc., so I thought this would be cheaper. Also, it keeps me on my toes 🦉
With this change:
```
➜ napari.github.io git:(duplicated-videos) du -hs ./*
 63M  ./0.4.15
 97M  ./0.4.16
 96M  ./0.4.17
 91M  ./0.4.18
 93M  ./0.4.19
105M  ./0.5.0
 84M  ./0.5.1
184M  ./dev
```
Which is OK but not great: the Sphinx Gallery images are not being detected as duplicates because they get regenerated every time, so I guess the hash is different. We may have to think of another strategy there.
Running pngquant on the images helps a lot; maybe we want to consider that?
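For reference, a typical pngquant invocation (lossy palette quantization, so worth eyeballing the visual quality; the `docs/_build/html/_images/` path below is just a placeholder):

```shell
# Recompress PNGs in place, skipping any that would end up larger.
pngquant --quality=65-90 --skip-if-larger --force --ext .png \
    docs/_build/html/_images/*.png
```

Note that pngquant is lossy (it reduces to a palette), unlike optipng, which only recompresses losslessly.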
> Which is ok but not great - the sphinx gallery images are not being detected as duplicated because they get generated every time so I guess the hash is different. We may have to thing of another strategy there.

The problem is that metadata (like generation time) affects the SHA. For images, there would need to be a pixel-wise comparison.
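One possible approach, sketched here assuming ImageMagick is available (`old.png`/`new.png` are placeholder names): `compare -metric AE` reports the absolute error, i.e. the number of differing pixels, so a result of 0 means pixel-identical content even when metadata made the file hashes differ.

```shell
# Pixel-wise comparison via ImageMagick (IM6: `compare`,
# IM7: `magick compare`). `-metric AE` prints the count of differing
# pixels on stderr; `null:` discards the diff image we don't need.
# `compare` exits nonzero for dissimilar images, hence `|| true`.
ae=$(compare -metric AE old.png new.png null: 2>&1) || true
if [ "$ae" = "0" ]; then
    echo "pixel-identical: safe to deduplicate"
fi
```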
This is already a huge improvement and will make sure that we are under 1GB for 0.5.2, so I'm going to merge, thanks @melissawm! We can indeed improve things later by compressing the PNGs (Matthias used optipng IIRC, but I guess pngquant does the same?) and by pixel-wise comparisons. (But I guess those will also fail any time we change the viewer, for example. And in that case it's OK: the images are different, so we shouldn't try to deduplicate things that aren't actually duplicates.)
Because of the multiple versions being deployed to the gh-pages website, our artifacts are growing and contain many duplicated videos (and images).
This PR adds a script, run from the `unversioned_pages.yml` action, that replaces duplicated videos in older versions of the docs (as tested by their `sha256sum`) with the most recent version (in `dev`). I chose to do it this way because `dev` will almost certainly contain a superset of the videos in the repo, so it made more sense, but it does mean we are touching the older deployments, so this could be destructive if we don't have the full repo history from now on.

A couple of points:
Any feedback is appreciated!
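The approach described above can be sketched roughly as follows (a minimal self-contained illustration, not the actual script from this PR; the directory layout and filenames are made up):

```shell
#!/usr/bin/env bash
# Sketch: index dev/ videos by sha256, then replace byte-identical
# copies under versioned dirs with relative symlinks into dev/.
set -euo pipefail

site=$(mktemp -d)
mkdir -p "$site/dev" "$site/0.5.1"
echo "frame data" > "$site/dev/intro.webm"
echo "frame data" > "$site/0.5.1/intro.webm"   # duplicate of the dev copy

declare -A by_hash
# Index every video in dev/ by its content hash.
while IFS= read -r -d '' f; do
    h=$(sha256sum "$f" | cut -d' ' -f1)
    by_hash[$h]=$f
done < <(find "$site/dev" -name '*.webm' -print0)

# Replace duplicates outside dev/ with symlinks pointing into dev/.
while IFS= read -r -d '' f; do
    h=$(sha256sum "$f" | cut -d' ' -f1)
    target=${by_hash[$h]:-}
    if [ -n "$target" ]; then
        ln -sf "../dev/$(basename "$target")" "$f"
    fi
done < <(find "$site" -path "$site/dev" -prune -o -name '*.webm' -print0)

[ -L "$site/0.5.1/intro.webm" ] && echo "deduplicated"
```

Because the symlinks are relative (`../dev/...`), they keep resolving after the whole tree is deployed to `gh-pages`.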