openzim / mwoffliner

Mediawiki scraper: all your wiki articles in one highly compressed ZIM file
https://www.npmjs.com/package/mwoffliner
GNU General Public License v3.0

ZIM for bm_all_maxi has different sizes between 1.13 and 1.14 #2070

Closed audiodude closed 1 week ago

audiodude commented 1 month ago

The ZIM that was scraped in July 2024 by 1.14 for bm_all_maxi is about half the size of the one for June, scraped by 1.13:

wikipedia_bm_all_maxi_2024-06.zim         2024-06-12 00:05   41M   
wikipedia_bm_all_maxi_2024-07.zim         2024-07-22 05:40   23M

We've started looking at the ZIMs and there is definitely a disparity in image resolution. Many of the images in the July ZIM have much smaller dimensions.

This could have been caused by clearing the image cache between runs. If 1.14 didn't find the image in the cache, it may have resorted to either:

  1. Downloading it again at a lower resolution
  2. Downloading it and transcoding it to WebP at a different resolution

audiodude commented 1 month ago

ZIM file links:

wikipedia_bm_all_maxi_2024-06.zim

wikipedia_bm_all_maxi_2024-07.zim

Jaifroid commented 1 month ago

It seems clear this is the same issue as #2071. Perhaps close this and generalize the title of that?

audiodude commented 4 weeks ago

It's actually the opposite problem: the version scraped with 1.14 is half the size (smaller).

audiodude commented 4 weeks ago

The first step in analyzing this would be to do an "apples to apples" comparison: scrape the wiki as it is now with both 1.13 and 1.14.

audiodude commented 1 week ago

Here are the results of scraping the current wiki with 1.13 and 1.14:

14M output/wikipedia_bm_all_maxi_2024-08.113.zim
22M output/wikipedia_bm_all_maxi_2024-08.114.zim

It is clear there were major structural changes to the wiki itself between June and July that cause the most recent scrapes to be smaller, regardless of scraper version.
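
To make the two comparisons in this thread concrete, here is a rough sketch of the arithmetic. The sizes are the rounded values from the listings quoted above; the dictionary layout and the `ratio` helper are illustrative assumptions, not part of mwoffliner.

```python
# Sizes as shown in the listings in this thread (rounded, in "M" as reported).
# Keys are (scrape month, scraper version); this layout is only for illustration.
SIZES_M = {
    ("2024-06", "1.13"): 41,
    ("2024-07", "1.14"): 23,
    ("2024-08", "1.13"): 14,
    ("2024-08", "1.14"): 22,
}


def ratio(numerator, denominator):
    """Return the size of one scrape relative to another (hypothetical helper)."""
    return SIZES_M[numerator] / SIZES_M[denominator]


# Original report: different wiki snapshots AND different scraper versions.
july_vs_june = ratio(("2024-07", "1.14"), ("2024-06", "1.13"))

# Apples-to-apples run: same wiki snapshot, only the scraper version differs.
v114_vs_v113 = ratio(("2024-08", "1.14"), ("2024-08", "1.13"))

print(f"July (1.14) / June (1.13): {july_vs_june:.2f}")
print(f"Aug 1.14 / Aug 1.13:       {v114_vs_v113:.2f}")
```

With these numbers the first ratio is about 0.56 and the second about 1.57, so once the wiki content is held fixed, the 1.14 output is the larger of the two.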

audiodude commented 1 week ago

In the end, it turns out this is in fact the same issue as #2071. Closing as duplicate.