audiodude opened 3 months ago
Here is a .tsv for every entry in the ZIMs. It has the columns: `path`, `june size`, `july size`.
Doing some analysis in pandas, we see that there are 681 webps that are larger in July, out of 2969 total webps:
The mean size difference is +10,661 bytes for those webps that are larger.
However, the total difference, including webps that are smaller in July, is only 6.77 MB:
Clearly I'm doing something wrong, because the total sums of July sizes and June sizes are only 107 MB and 110 MB:
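The pandas analysis described above can be sketched like this; a synthetic three-row table stands in for `comparison.tsv` (which the dump script below actually produces), so the numbers here are illustrative only:

```python
import pandas as pd

# Synthetic stand-in for comparison.tsv (columns: path, june, july).
df = pd.DataFrame({
    'path': ['A/foo.webp', 'A/bar.webp', 'A/baz.html'],
    'june': [100, 300, 50],
    'july': [250, 200, 50],
})

webps = df[df['path'].str.endswith('.webp')]
larger = webps[webps['july'] > webps['june']]
print(len(larger), 'of', len(webps), 'webps are larger in July')
print('mean growth of those:', (larger['july'] - larger['june']).mean())
print('net change across all webps:', (webps['july'] - webps['june']).sum())
```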
I tried to iterate over all entries in the ZIM using the `_get_entry_by_id` hack:
```python
import csv

from libzim.reader import Archive

june = Archive("zims/wikipedia_ab_all_maxi_2024-06.zim")
july = Archive("zims/wikipedia_ab_all_maxi_2024-07.zim")

path_to_sizes = {}

# Record the June size for every entry.
for i in range(june.all_entry_count):
    entry = june._get_entry_by_id(i)
    path_to_sizes[entry.path] = [entry.get_item().size]

# Add the July size, or a None placeholder for paths new in July.
for i in range(july.all_entry_count):
    entry = july._get_entry_by_id(i)
    if entry.path in path_to_sizes:
        path_to_sizes[entry.path].append(entry.get_item().size)
    else:
        path_to_sizes[entry.path] = [None, entry.get_item().size]

# Pad paths that disappeared after June.
for sizes in path_to_sizes.values():
    if len(sizes) == 1:
        sizes.append(None)

with open('comparison.tsv', 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile, delimiter='\t')
    csvwriter.writerow(('path', 'june', 'july'))
    for key in sorted(path_to_sizes):
        csvwriter.writerow((key, *path_to_sizes[key]))
```
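As an aside, the same outer join can be expressed with pandas `merge`, which fills missing sides with NaN automatically; the two small DataFrames here are hypothetical stand-ins for the per-archive size tables:

```python
import pandas as pd

# Hypothetical per-archive tables: one row per entry path and its item size.
june = pd.DataFrame({'path': ['A/x.webp', 'A/y.webp'], 'june': [100, 200]})
july = pd.DataFrame({'path': ['A/y.webp', 'A/z.webp'], 'july': [220, 50]})

# An outer merge keeps paths present in only one archive (NaN for the other),
# mirroring the None padding in the dict-based script above.
comparison = june.merge(july, on='path', how='outer').sort_values('path')
comparison.to_csv('comparison.tsv', sep='\t', index=False)
```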
Your tsv is not filtered on WEBP files; it contains all entries, including compressed ones (text) and indexes.
```python
sum(
    june._get_entry_by_id(i).get_item().size
    for i in range(june.all_entry_count)
    if june._get_entry_by_id(i).get_item().mimetype == "image/webp"
)
# 13594704  (12.96 MiB)
```

July WEBPs total 24129838 bytes (23.01 MiB).
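Those MiB figures are plain byte conversions (1 MiB = 1024² bytes), which can be checked directly:

```python
june_webp_bytes = 13_594_704
july_webp_bytes = 24_129_838
print(round(june_webp_bytes / 1024**2, 2))  # 12.96
print(round(july_webp_bytes / 1024**2, 2))  # 23.01
```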
> Your tsv is not filtered on WEBP files; it contains all entries, including compressed ones (text) and indexes.
The pandas code limits it to webp:

```python
webps = df[df['path'].str.endswith('.webp')]
```
The larger question is why is my total size 118 MB?
Edit: I misread the original ZIM sizes as 26/36 GB instead of MB. So actually, uncompressed, 110/118 MB makes sense.
Here is my Jupyter notebook with analysis: https://github.com/audiodude/zim-investigation/blob/main/compare.ipynb
Doing a more "apples to apples" comparison of the wiki scraped right now with 1.13 versus 1.14, the discrepancy is much less:
```
30244	zims/wikipedia_ab_all_maxi_2024-08.113.zim
34728	zims/wikipedia_ab_all_maxi_2024-08.114.zim
```
Dumping some of the webps from the respective ZIMs, we see that the 1.14 ones are much bigger:
```
$ du zim_file_dump/wikipedia_ab_all_maxi_2024-08.113.zim/*
16	zim_file_dump/wikipedia_ab_all_maxi_2024-08.113.zim/1924WOlympicPoster.jpg.webp
16	zim_file_dump/wikipedia_ab_all_maxi_2024-08.113.zim/200412_-_Plaqueminier_et_ses_kakis.jpg.webp
12	zim_file_dump/wikipedia_ab_all_maxi_2024-08.113.zim/Ambara_church_ruins_in_Abkhazia%2C_1899.jpg.webp
20	zim_file_dump/wikipedia_ab_all_maxi_2024-08.113.zim/Carmen_habanera_original.jpg.webp
20	zim_file_dump/wikipedia_ab_all_maxi_2024-08.113.zim/Carmen_-_illustration_by_Luc_for_Journal_Amusant_1911.jpg.webp
8	zim_file_dump/wikipedia_ab_all_maxi_2024-08.113.zim/Christos_Acheiropoietos.jpg.webp
28	zim_file_dump/wikipedia_ab_all_maxi_2024-08.113.zim/Hovenia_dulcis.jpg.webp
12	zim_file_dump/wikipedia_ab_all_maxi_2024-08.113.zim/Lashkendar_temple_ruins.JPG.webp
28	zim_file_dump/wikipedia_ab_all_maxi_2024-08.113.zim/Paliurus_fg01.jpg.webp
20	zim_file_dump/wikipedia_ab_all_maxi_2024-08.113.zim/Tsebelda_iconostasis.jpg.webp
$ du zim_file_dump/wikipedia_ab_all_maxi_2024-08.114.zim/*
112	zim_file_dump/wikipedia_ab_all_maxi_2024-08.114.zim/1924WOlympicPoster.jpg.webp
16	zim_file_dump/wikipedia_ab_all_maxi_2024-08.114.zim/200412_-_Plaqueminier_et_ses_kakis.jpg.webp
120	zim_file_dump/wikipedia_ab_all_maxi_2024-08.114.zim/Ambara_church_ruins_in_Abkhazia%2C_1899.jpg.webp
112	zim_file_dump/wikipedia_ab_all_maxi_2024-08.114.zim/Carmen_habanera_original.jpg.webp
144	zim_file_dump/wikipedia_ab_all_maxi_2024-08.114.zim/Carmen_-_illustration_by_Luc_for_Journal_Amusant_1911.jpg.webp
112	zim_file_dump/wikipedia_ab_all_maxi_2024-08.114.zim/Christos_Acheiropoietos.jpg.webp
28	zim_file_dump/wikipedia_ab_all_maxi_2024-08.114.zim/Hovenia_dulcis.jpg.webp
168	zim_file_dump/wikipedia_ab_all_maxi_2024-08.114.zim/Lashkendar_temple_ruins.JPG.webp
28	zim_file_dump/wikipedia_ab_all_maxi_2024-08.114.zim/Paliurus_fg01.jpg.webp
140	zim_file_dump/wikipedia_ab_all_maxi_2024-08.114.zim/Tsebelda_iconostasis.jpg.webp
```
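Summing the `du` figures (reported in KiB blocks) for those ten sample images gives a rough growth factor; the lists below simply transcribe the listings above:

```python
# du-reported KiB for the same ten images in the 1.13 and 1.14 dumps.
v113 = [16, 16, 12, 20, 20, 8, 28, 12, 28, 20]
v114 = [112, 16, 120, 112, 144, 112, 28, 168, 28, 140]
print(sum(v113), 'KiB vs', sum(v114), 'KiB')
print('growth factor:', round(sum(v114) / sum(v113), 1))
```

For this sample, the 1.14 images take roughly 5.4× the space of the 1.13 ones.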
Confirmed manually that the 1.14 images have much bigger dimensions.
Confirmed this is due to larger images.
@audiodude What does that mean concretely in terms of resolution and quality? Are they all impacted in the same manner?
@kelson42 The resolutions are much bigger: the images have larger widths and heights in pixels. I haven't done a systematic analysis of how widespread that is, but it's most likely due to #1925
@audiodude I'm not against downscaling images, but:
> Doing a more "apples to apples" comparison of the wiki scraped right now with 1.13 versus 1.14, the discrepancy is much less:
>
> ```
> 30244	zims/wikipedia_ab_all_maxi_2024-08.113.zim
> 34728	zims/wikipedia_ab_all_maxi_2024-08.114.zim
> ```
I noticed this with https://download.kiwix.org/zim/wikivoyage/wikivoyage_en_all_maxi_2024-08.zim, which is scraped with 1.13 from the new endpoint and has larger images, at least in terms of display dimensions, yet hardly increases the ZIM size compared to ZIMs scraped from the old endpoint.
I actually rather like the larger display size for images, at least in that Wikivoyage version (which I've just released as a packaged app). If we could hit that sweet spot of display size versus compression, it would be a good solution IMHO. What is 1.13 doing right here?
The ZIM scraped in July 2024 by 1.14 has a different size than the one scraped in June 2024 by 1.13:
Oddly, this is the opposite of the problem in #2070. We don't yet know what the cause might be.