gnosygnu / xowa

xowa offline wiki application
Other
374 stars 41 forks source link

stored thumbnail images at different dimensions #447

Open desb42 opened 5 years ago

desb42 commented 5 years ago

looking at en.wikipedia.org-file-core.xowa (2019-03-01)

fsdb_thm has (potentially) multiple entries for an image at different file dimensions (and hence different filesizes) I wanted to check which images had the most different file dimensions and what amount of filespace they used I wrote this query

select d.tid, f.fil_name, d.cnt, d.mts "max size", d.sts "tot size", d.sts/d.cnt "avg" from fsdb_fil f
join 
(select c.tid, count(*) cnt, max(thm_size) mts, sum(thm_size) sts from
(select thm_owner_id as tid, thm_id, thm_size from fsdb_thm 
) c
group by c.tid ) d on d.tid=f.fil_id
where d.cnt > 4
order by d.cnt desc

With the cutoff of a minimum of 5 different sizes there are 56,683 images Summing up the count (cnt) gives 391,830 entries in fsdb_thm total space used 5.47Gb

Would it be reasonable to suggest that only the largest needs to be stored? That is, always serve the largest file regardless of the other (smaller) dimensions desired (the browser should scale the image appropriately)

In that case, if only the largest is stored then the space would be 1.65Gb (in this case a saving of approx 4Gb)

These are the results of the query query.zip

On a separate note the first two results seem to be for the same image Flag_of_the_United_States.svg but they have different fil_ids 846 and 506149

gnosygnu commented 5 years ago

Thanks. That's a useful query to have.

I'd like to reduce disk space, primarily so it would make file upload / download easier. 5 GB is significant too.

[Oops. Premature comment]

My main concern is that some thumbnails are small (20x20) and I remember that downloading a full-size image for them (let's say 1000x1000) had a noticeable performance impact, especially on Android. I did write code that did remove code by approximation (if I had an image that was 100x100, use it for anything that was less than 100x100) but that only saved 1 or 2 GB, and didn't seem worth the maintenance headache.

I'll keep this in mind for any future work on the file system dbs, but for now, I feel the redundancy percentage is small (3% - 5%)

On a separate note the first two results seem to be for the same image Flag_of_the_United_States.svg but they have different fil_ids 846 and 506149

Yeah, this is non-obvious, but they belong to two separate wikis: enwiki and commonswiki. The fil_owner_id indicates as much. See excerpt below.

I've seen different images with the same name in both enwiki and commonswiki. So I uniquefy by wiki+file_name.

sqlite> SELECT * FROM fsdb_fil WHERE fil_name = 'Flag_of_the_United_States.svg' LIMIT 10;
fil_id      fil_owner_id  fil_xtn_id  fil_ext_id  fil_bin_db_id  fil_name                       fil_size    fil_modified  fil_hash
----------  ------------  ----------  ----------  -------------  -----------------------------  ----------  ------------  ----------
846         20            1           7           -1             Flag_of_the_United_States.svg  -1
506149      1             1           7           -1             Flag_of_the_United_States.svg  -1
sqlite> SELECT * FROM fsdb_dir;
dir_id      dir_owner_id  dir_name
----------  ------------  ---------------------
1           0             commons.wikimedia.org
20          0             en.wikipedia.org