openzim / mwoffliner

Mediawiki scraper: all your wiki articles in one highly compressed ZIM file
https://www.npmjs.com/package/mwoffliner
GNU General Public License v3.0

ZIM for ab_all_maxi has different sizes between 1.13 and 1.14 #2071

Open audiodude opened 3 months ago

audiodude commented 3 months ago

The ZIM scraped in July 2024 with 1.14 has a different size than the one scraped in June 2024 with 1.13:

wikipedia_ab_all_maxi_2024-06.zim             2024-06-17 02:21   26M   
wikipedia_ab_all_maxi_2024-07.zim             2024-07-22 20:56   36M

Oddly, this is the opposite of the problem in #2070. We don't yet know what the issue might be.

audiodude commented 3 months ago

ZIM file links:

wikipedia_ab_all_maxi_2024-06.zim

wikipedia_ab_all_maxi_2024-07.zim

audiodude commented 3 months ago

Here is a .tsv for every entry in the ZIMs. It has the format:

path, june size, july size

comparison.zip

Doing some analysis in pandas, we see that there are 681 webps that are larger in July, out of 2969 total webps:

[screenshot: pandas output]

The mean size difference is +10,661 bytes for those webps that are larger.

However, the total difference, including webps that are smaller in July, is only 6.77 MB:

[screenshot: pandas output]
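
For reference, this is roughly the pandas computation behind those numbers; a minimal sketch, assuming the comparison.tsv produced by the script in my next comment (tab-separated, columns path/june/july):

import pandas as pd

# Load the per-entry size comparison and keep only webps present in both ZIMs.
df = pd.read_csv('comparison.tsv', sep='\t')
webps = df[df['path'].str.endswith('.webp')].dropna()

diff = webps['july'] - webps['june']
print(len(webps))              # total webps
print((diff > 0).sum())        # webps that are larger in July
print(diff[diff > 0].mean())   # mean growth of those, in bytes
print(diff.sum() / 1024 ** 2)  # net difference across all webps, in MiB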

audiodude commented 3 months ago

Clearly I'm doing something wrong, because the total sums of July sizes and June sizes are only 107 MB and 110 MB:

[screenshot: pandas output]

I tried to iterate over all entries in the ZIM using the _get_entry_by_id hack:

import csv

from libzim.reader import Archive

june = Archive("zims/wikipedia_ab_all_maxi_2024-06.zim")
july = Archive("zims/wikipedia_ab_all_maxi_2024-07.zim")

# Map each entry path to [june size, july size]; None marks a side
# where the path is absent.
path_to_sizes = {}

for i in range(june.all_entry_count):
  entry = june._get_entry_by_id(i)
  path_to_sizes[entry.path] = [entry.get_item().size]

for i in range(july.all_entry_count):
  entry = july._get_entry_by_id(i)
  if entry.path in path_to_sizes:
    path_to_sizes[entry.path].append(entry.get_item().size)
  else:
    # Present in July only.
    path_to_sizes[entry.path] = [None, entry.get_item().size]

# Pad paths that were present in June only.
for sizes in path_to_sizes.values():
  if len(sizes) == 1:
    sizes.append(None)

with open('comparison.tsv', 'w', newline='') as csvfile:
  csvwriter = csv.writer(csvfile, delimiter='\t')
  csvwriter.writerow(('path', 'june', 'july'))
  for key in sorted(path_to_sizes.keys()):
    csvwriter.writerow((key, *path_to_sizes[key]))
rgaudin commented 3 months ago

Your TSV is not filtered to WebP files; it contains all entries, including compressed ones (text) and indexes.

rgaudin commented 3 months ago
sum(june._get_entry_by_id(i).get_item().size for i in range(0, june.all_entry_count) if june._get_entry_by_id(i).get_item().mimetype == "image/webp")
> 13594704  # 12.96 MiB
rgaudin commented 3 months ago

The July WebPs total 24129838 bytes / 23.01 MiB.
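
For reference, a one-pass sketch of the same measurement for both files; it reuses the private _get_entry_by_id accessor from the script above, but caches the item instead of fetching it twice per index:

from libzim.reader import Archive

def webp_bytes(zim_path):
  # Sum the sizes of all image/webp items in the given ZIM.
  zim = Archive(zim_path)
  total = 0
  for i in range(zim.all_entry_count):
    item = zim._get_entry_by_id(i).get_item()
    if item.mimetype == "image/webp":
      total += item.size
  return total

print(webp_bytes("zims/wikipedia_ab_all_maxi_2024-06.zim"))  # 13594704 (12.96 MiB)
print(webp_bytes("zims/wikipedia_ab_all_maxi_2024-07.zim"))  # 24129838 (23.01 MiB)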

audiodude commented 3 months ago

> Your TSV is not filtered to WebP files; it contains all entries, including compressed ones (text) and indexes.

The pandas code limits it to WebP files:

webps = df[df['path'].str.endswith('.webp')]
audiodude commented 3 months ago

The larger question is: why is my total size 118 MB?

Edit: I misread the original ZIM sizes as 26/36 GB instead of MB. So actually, uncompressed, 110/118 MB makes sense.

audiodude commented 3 months ago

Here is my Jupyter notebook with analysis: https://github.com/audiodude/zim-investigation/blob/main/compare.ipynb

audiodude commented 3 months ago

Doing a more "apples to apples" comparison of the wiki scraped right now with 1.13 versus 1.14, the discrepancy is much smaller:

30244   zims/wikipedia_ab_all_maxi_2024-08.113.zim
34728   zims/wikipedia_ab_all_maxi_2024-08.114.zim
audiodude commented 3 months ago

Dumping some of the webps from the respective ZIMs, we see that the 1.14 ones are much bigger:

$ du zim_file_dump/wikipedia_ab_all_maxi_2024-08.113.zim/*
16  zim_file_dump/wikipedia_ab_all_maxi_2024-08.113.zim/1924WOlympicPoster.jpg.webp
16  zim_file_dump/wikipedia_ab_all_maxi_2024-08.113.zim/200412_-_Plaqueminier_et_ses_kakis.jpg.webp
12  zim_file_dump/wikipedia_ab_all_maxi_2024-08.113.zim/Ambara_church_ruins_in_Abkhazia%2C_1899.jpg.webp
20  zim_file_dump/wikipedia_ab_all_maxi_2024-08.113.zim/Carmen_habanera_original.jpg.webp
20  zim_file_dump/wikipedia_ab_all_maxi_2024-08.113.zim/Carmen_-_illustration_by_Luc_for_Journal_Amusant_1911.jpg.webp
8   zim_file_dump/wikipedia_ab_all_maxi_2024-08.113.zim/Christos_Acheiropoietos.jpg.webp
28  zim_file_dump/wikipedia_ab_all_maxi_2024-08.113.zim/Hovenia_dulcis.jpg.webp
12  zim_file_dump/wikipedia_ab_all_maxi_2024-08.113.zim/Lashkendar_temple_ruins.JPG.webp
28  zim_file_dump/wikipedia_ab_all_maxi_2024-08.113.zim/Paliurus_fg01.jpg.webp
20  zim_file_dump/wikipedia_ab_all_maxi_2024-08.113.zim/Tsebelda_iconostasis.jpg.webp

$ du zim_file_dump/wikipedia_ab_all_maxi_2024-08.114.zim/*
112 zim_file_dump/wikipedia_ab_all_maxi_2024-08.114.zim/1924WOlympicPoster.jpg.webp
16  zim_file_dump/wikipedia_ab_all_maxi_2024-08.114.zim/200412_-_Plaqueminier_et_ses_kakis.jpg.webp
120 zim_file_dump/wikipedia_ab_all_maxi_2024-08.114.zim/Ambara_church_ruins_in_Abkhazia%2C_1899.jpg.webp
112 zim_file_dump/wikipedia_ab_all_maxi_2024-08.114.zim/Carmen_habanera_original.jpg.webp
144 zim_file_dump/wikipedia_ab_all_maxi_2024-08.114.zim/Carmen_-_illustration_by_Luc_for_Journal_Amusant_1911.jpg.webp
112 zim_file_dump/wikipedia_ab_all_maxi_2024-08.114.zim/Christos_Acheiropoietos.jpg.webp
28  zim_file_dump/wikipedia_ab_all_maxi_2024-08.114.zim/Hovenia_dulcis.jpg.webp
168 zim_file_dump/wikipedia_ab_all_maxi_2024-08.114.zim/Lashkendar_temple_ruins.JPG.webp
28  zim_file_dump/wikipedia_ab_all_maxi_2024-08.114.zim/Paliurus_fg01.jpg.webp
140 zim_file_dump/wikipedia_ab_all_maxi_2024-08.114.zim/Tsebelda_iconostasis.jpg.webp

Confirmed manually that the 1.14 images have much bigger dimensions.
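
A minimal sketch of that dimension check, assuming python-libzim plus Pillow built with WebP support (item.content is the raw image payload):

import io

from libzim.reader import Archive
from PIL import Image

def print_webp_dimensions(zim_path):
  # Print the pixel (width, height) of every image/webp item in the ZIM.
  zim = Archive(zim_path)
  for i in range(zim.all_entry_count):
    item = zim._get_entry_by_id(i).get_item()
    if item.mimetype == "image/webp":
      with Image.open(io.BytesIO(bytes(item.content))) as im:
        print(item.path, im.size)

print_webp_dimensions("zims/wikipedia_ab_all_maxi_2024-08.113.zim")
print_webp_dimensions("zims/wikipedia_ab_all_maxi_2024-08.114.zim")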

audiodude commented 3 months ago

Confirmed this is due to larger images.

kelson42 commented 3 months ago

@audiodude What does that mean concretely in terms of resolution and quality? Are they all impacted in the same manner?

audiodude commented 3 months ago

@kelson42 The resolutions are much bigger: the images have larger widths and heights in pixels. I haven't systematically analyzed to what degree that is the case, but it's most likely due to #1925.

kelson42 commented 3 months ago

@audiodude I'm not against downscaling images, but:

Jaifroid commented 3 months ago

> Doing a more "apples to apples" comparison of the wiki scraped right now with 1.13 versus 1.14, the discrepancy is much smaller:
>
> 30244 zims/wikipedia_ab_all_maxi_2024-08.113.zim
> 34728 zims/wikipedia_ab_all_maxi_2024-08.114.zim

I noticed this with https://download.kiwix.org/zim/wikivoyage/wikivoyage_en_all_maxi_2024-08.zim, which is scraped with 1.13 from the new endpoint: it has larger images, at least in terms of display dimensions, yet is hardly any larger than ZIMs scraped from the old endpoint.

I actually rather like the larger display size for images, at least in that Wikivoyage version (which I've just released as a packaged app). If we could hit that sweet spot in terms of display size vs. compression, it would be a good solution IMHO. What is 1.13 doing right here?