internetarchive / dweb-mirror

Offline Internet Archive project
https://www-dweb-mirror.dev.archive.org/
GNU Affero General Public License v3.0

Mediawiki - image sizes and duplication #297

Open mitra42 opened 4 years ago

mitra42 commented 4 years ago

Open question about image sizes: how to avoid wasting substantial disk space, and ideally how to integrate with the images already stored by dweb-mirror.

mitra42 commented 4 years ago

David says there should be full size and thumbnail (~400px) only, but that's not what I'm seeing:

-rw-r--r-- 1 www-data pi   1746 Dec 28 23:40 images/c/cc/120px-usadha-cetik_56.jpeg
-rw-r--r-- 1 www-data pi   3048 Dec 28 23:39 images/0/04/180px-usadha-cetik_56.jpeg
-rw-r--r-- 1 www-data pi   4539 Dec 28 23:40 images/b/b0/240px-usadha-cetik_56.jpeg
-rw-r--r-- 1 www-data pi   7120 Dec 28 23:40 images/7/7c/320px-usadha-cetik_56.jpeg
-rw-r--r-- 1 www-data pi  11038 Dec 28 23:40 images/f/f1/400px-usadha-cetik_56.jpeg
-rw-r--r-- 1 www-data pi  23891 Dec 28 23:40 images/5/50/600px-usadha-cetik_56.jpeg
-rw-r--r-- 1 www-data pi  45855 Dec 28 23:40 images/f/ff/800px-usadha-cetik_56.jpeg
-rw-r--r-- 1 www-data pi 109011 Dec 28 23:39 images/6/6e/1200px-usadha-cetik_56.jpeg
-rw-r--r-- 1 www-data pi 225615 Dec 28 23:40 images/9/9e/1599px-usadha-cetik_56.jpeg
-rw-r--r-- 1 www-data pi 644082 Dec 29 12:36 images/7/72/20190913062124!usadha-cetik_56.jpeg
-rw-r--r-- 1 www-data pi 644082 Dec 29 12:51 images/d/d6/usadha-cetik_56.jpeg 

This shows, I believe, 11 copies of the same image.
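To check how widespread this is, here is a minimal sketch that counts how many stored files share each base image name. It assumes the naming seen in the listing above (thumbnails as `<width>px-<name>`, archived versions as `<timestamp>!<name>`); `count_variants` is a hypothetical helper name, not part of MediaWiki or dweb-mirror.

```shell
# Count stored variants per base image name under a directory.
# Assumption: thumbnails are "<width>px-<name>" and archived versions
# are "<timestamp>!<name>", as in the listing above.
count_variants() {
    find "$1" -type f |
        # Strip the directory part, then any thumbnail or archive prefix.
        sed -E 's|.*/||; s|^[0-9]+px-||; s|^[0-9]+!||' |
        sort | uniq -c | sort -rn
}
```

Running `count_variants "$MW/images" | head` would list the images with the most copies first.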

mitra42 commented 4 years ago

From Slack:

MA> so there is 400px for thumbnail, 2000px for ‘full-size’ on MW, and then the original at much larger resolution, which you are getting via IIIF for the enhancement process. So I’d expect we have the 400px and 2000px in MW and cached in DW; it would be nice (but not if it's too hard) to eliminate duplication, and then a full-size that will be cached in DW if someone starts editing the image, and passed to React via IIIF.

DK> enhancement occurs in either viewing or editing, but only when zoomed in, so the only sensible way to make it work offline is to already have the original IA image available. Enhancement is not really optional -- you can't comfortably read everything without it -- hence my idea of removing the 2000px version and going directly to the original.

mitra42 commented 4 years ago

For now, we'll keep the 400px and 2000px versions in MW, and cache the full image on DM.

mitra42 commented 4 years ago

cd $MW/images ; find . -name "[0-9]*" -delete ; # Freed up space from 5.5GB of images to 3.02GB (51000 images) 
php maintenance/checkImages.php | grep missing | sed -E 's/^(.+):.+/File:\1/' |sudo php 
# Should then fix the DB
php maintenance/deleteBatch.php # VERY slow 
# And to clean up
php maintenance/deleteArchivedRevisions.php --delete
php maintenance/deleteArchivedFiles.php --delete
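One caveat on the `find . -name "[0-9]*" -delete` step above: the pattern matches any name that starts with a digit, so it also covers archived versions such as `20190913062124!usadha-cetik_56.jpeg` and the digit-named hash directories. A possible dry run, sketched here on the assumption that only `<width>px-` thumbnails should go and that GNU find is available (for `-printf`), reports what a narrower pattern would match without deleting anything. `preview_thumbs` is a hypothetical helper name.

```shell
# Report count and total size of thumbnail-style files (e.g. 120px-foo.jpeg)
# under a directory. Nothing is deleted.
# Assumption: GNU find (for -printf); the pattern is a heuristic, not a
# strict match on MediaWiki's thumbnail naming.
preview_thumbs() {
    find "$1" -type f -name '[0-9]*px-*' -printf '%s\n' |
        awk '{ n++; bytes += $1 } END { printf "%d files, %.0f bytes\n", n, bytes }'
}
```

If the numbers from `preview_thumbs "$MW/images"` look right, the same `-type f -name '[0-9]*px-*'` test could replace the broader pattern in the delete command.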