Open desb42 opened 5 years ago
Thanks. That's a useful query to have.
I'd like to reduce disk space, primarily so it would make file upload / download easier. 5 GB is significant too.
[Oops. Premature comment]
My main concern is that some thumbnails are small (20x20) and I remember that downloading a full-size image for them (let's say 1000x1000) had a noticeable performance impact, especially on Android. I did write code that did remove code by approximation (if I had an image that was 100x100, use it for anything that was less than 100x100) but that only saved 1 or 2 GB, and didn't seem worth the maintenance headache.
I'll keep this in mind for any future work on the file system dbs, but for now, I feel the redundancy percentage is small (3% - 5%)
On a separate note the first two results seem to be for the same image Flag_of_the_United_States.svg but they have different fil_ids 846 and 506149
Yeah, this is non-obvious, but they belong to two separate wikis: enwiki and commonswiki. The fil_owner_id indicates as much. See excerpt below.
I've seen different images with the same name in both enwiki and commonswiki. So I uniquefy by wiki+file_name.
sqlite> SELECT * FROM fsdb_fil WHERE fil_name = 'Flag_of_the_United_States.svg' LIMIT 10;
fil_id fil_owner_id fil_xtn_id fil_ext_id fil_bin_db_id fil_name fil_size fil_modified fil_hash
---------- ------------ ---------- ---------- ------------- ----------------------------- ---------- ------------ ----------
846 20 1 7 -1 Flag_of_the_United_States.svg -1
506149 1 1 7 -1 Flag_of_the_United_States.svg -1
sqlite> SELECT * FROM fsdb_dir;
dir_id dir_owner_id dir_name
---------- ------------ ---------------------
1 0 commons.wikimedia.org
20 0 en.wikipedia.org
looking at en.wikipedia.org-file-core.xowa (2019-03-01)
fsdb_thm has (potentially) multiple entries for an image at different file dimensions (and hence different filesizes) I wanted to check which images had the most different file dimensions and what amount of filespace they used I wrote this query
With the cutoff of a minimum of 5 different sizes there are 56,683 images Summing up the count (cnt) gives 391,830 entries in fsdb_thm total space used 5.47Gb
Would it be reasonable to suggest that only the largest needs to be stored? That is, always serve the largest file regardless of the other (smaller) dimensions desired (the browser should scale the image appropriately)
In that case, if only the largest is stored then the space would be 1.65Gb (in this case a saving of approx 4Gb)
These are the results of the query query.zip
On a separate note the first two results seem to be for the same image Flag_of_the_United_States.svg but they have different fil_ids 846 and 506149