openzim / mwoffliner

Mediawiki scraper: all your wiki articles in one highly compressed ZIM file
https://www.npmjs.com/package/mwoffliner
GNU General Public License v3.0

Multimedia is not always scraped properly (was: 20-40% size decrease for Spanish zim files?) #456

Closed Popolechien closed 5 years ago

Popolechien commented 5 years ago

wikipedia_es_medicine_2018-10.zim (630 MB) is smaller than the earlier wikipedia_es_medicine_2018-09.zim (845 MB).

I don't see that pattern in French, for instance.

Any way we can test what's missing?
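
A rough sanity check of the size gap, as a sketch only: the byte counts below are the ones that `ls -la` reports for the two files later in this thread, and the variable names are illustrative. The rounded figures quoted above correspond to sizes expressed in MiB (1024 * 1024 bytes).

```shell
#!/bin/sh
# Byte sizes as reported by `ls -la` later in this thread.
old=886548766   # wikipedia_es_medicine_2018-09.zim
new=660456297   # wikipedia_es_medicine_2018-10.zim

# Convert to MiB and compute the relative decrease (LC_ALL=C keeps "." as decimal separator).
summary=$(LC_ALL=C awk -v old="$old" -v new="$new" 'BEGIN {
  mib = 1024 * 1024
  printf "old=%.1f MiB new=%.1f MiB diff=%.1f MiB (%.1f%%)", old / mib, new / mib, (old - new) / mib, 100 * (old - new) / old
}')
echo "$summary"
```

This prints `old=845.5 MiB new=629.9 MiB diff=215.6 MiB (25.5%)`, so the drop falls inside the 20-40% range named in the issue title.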

kelson42 commented 5 years ago

The problem seems to be with the pictures: both files contain almost the same pictures, but in the October version they look bigger. For example:

This kind of situation is not new, so it is probably not a regression, but the difference is too large to call normal.

@ISNIT0 I suspect a weakness in the image optimization process

ISNIT0 commented 5 years ago

After a cursory look, they seem to have a similar number of articles:

> zimdump -l wikipedia_es_medicine_2018-09.zim | wc -l
57406
> zimdump -l wikipedia_es_medicine_2018-10.zim | wc -l
57461

@kelson42 It seems the more recent file is smaller than the older one even though it has more files

Possibly an optimization issue that has been fixed?

ISNIT0 commented 5 years ago

Okay, will look into it

ISNIT0 commented 5 years ago

Still finding weird numbers...

> zimdump wikipedia_es_all_nopic_2017-08.zim -l | wc -l
3142316
> zimdump wikipedia_es_all_nopic_2018-04.zim -l | wc -l 
3230955

The larger file has fewer articles...

kelson42 commented 5 years ago

@ISNIT0 Please focus on the medical example. For the wikipedia_es_all_nopic example (which has been removed from the comment), I think the explanation is easy: one has the full-text index, the other does not.

kelson42 commented 5 years ago

I have copied the files to http://tmp.kiwix.org/, to be sure we keep them.

Popolechien commented 5 years ago

Yeah, I figured out afterwards that the es_all example was comparing 2018 to 2017. On a side note, as soon as we have a reliable update process we should definitely get rid of (or at least separate) the "last two zims in folder" concept.

ISNIT0 commented 5 years ago

@kelson42 The main difference seems to be the webm files. For example, the 09 file has:

m%2fAneurysms.webm.160p.webm
14710995 bytes

The 10 file has:

m%2fAneurysms.webm.120p.vp9.webm
8437019 bytes

There are about 50 similar occurrences

I think we can put it down to the lower-resolution videos. Is this a problem?
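
One way to surface renamed variants like these is to diff the two listings. The sketch below assumes each listing holds one entry per line (as the `zimdump -l ... | wc -l` counts earlier in the thread suggest). The `only_in` helper, the temporary file names, and the two-line sample listings are hypothetical; the samples are built only from entry names quoted in this thread.

```shell
#!/bin/sh
# only_in FIRST SECOND: print the lines of FIRST that do not occur in SECOND.
only_in() {
  awk 'FILENAME == ARGV[1] { seen[$0] = 1; next } !($0 in seen)' "$2" "$1"
}

# Hypothetical sample listings, built from entry names quoted in this thread.
tmp=$(mktemp -d)
printf '%s\n' 'I/m/f1,4-butanediol.svg.png' 'm%2fAneurysms.webm.160p.webm' > "$tmp/2018-09.txt"
printf '%s\n' 'I/m/f1,4-butanediol.svg.png' 'm%2fAneurysms.webm.120p.vp9.webm' > "$tmp/2018-10.txt"

removed=$(only_in "$tmp/2018-09.txt" "$tmp/2018-10.txt")   # entries only in the 09 listing
added=$(only_in "$tmp/2018-10.txt" "$tmp/2018-09.txt")     # entries only in the 10 listing
rm -rf "$tmp"

echo "only in 09: $removed"
echo "only in 10: $added"
```

With the real files, each listing could be produced with `zimdump -l <file>.zim > listing.txt` and passed to `only_in` in both directions.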

kelson42 commented 5 years ago

@ISNIT0 How do you explain the difference in size of I/m/f1,4-butanediol.svg.png?

kelson42 commented 5 years ago

$ ls -la *zim
-rw-rw-r-- 1 kelson kelson 886548766 Jun 17 08:02 wikipedia_es_medicine_2018-09.zim
-rw-rw-r-- 1 kelson kelson 660456297 Jun 17 08:01 wikipedia_es_medicine_2018-10.zim
$ zimdump -n M -f Counter -d wikipedia_es_medicine_2018-09.zim ; echo
application/javascript=12;application/ogg=15;image/gif=408;image/jpeg=7729;image/png=7236;image/svg+xml=552;ogv=5;text/css=16;text/html=12402;wav=4;webm=110;
$ zimdump -n M -f Counter -d wikipedia_es_medicine_2018-10.zim ; echo
application/javascript=12;image/gif=409;image/jpeg=7770;image/png=7238;image/svg+xml=553;ogv=3;text/css=16;text/html=12424;webm=111;

We see here that the main reason is that in wikipedia_es_medicine_2018-09.zim, we have far more multimedia content. We have in particular a file Pure_Alpha_Waves_7_to_12,9_Hz_Binaural_Beats_V3.wav which is almost 100MB.

It looks like wikipedia_es_medicine_2018-10.zim is missing some multimedia content. I'll check that with MWoffliner 1.9.3.
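
The two Counter strings above can also be compared type by type instead of by eye. This is a minimal sketch: `counter_diff` is a hypothetical helper, and the two strings are copied verbatim from the `zimdump -n M -f Counter -d` output above.

```shell
#!/bin/sh
# Counter metadata copied verbatim from the two zimdump outputs above.
old='application/javascript=12;application/ogg=15;image/gif=408;image/jpeg=7729;image/png=7236;image/svg+xml=552;ogv=5;text/css=16;text/html=12402;wav=4;webm=110;'
new='application/javascript=12;image/gif=409;image/jpeg=7770;image/png=7238;image/svg+xml=553;ogv=3;text/css=16;text/html=12424;webm=111;'

# counter_diff OLD NEW: print "type old new delta" for every type whose count changed.
# A type missing from one side is counted as 0 there.
counter_diff() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    {
      n = split($0, pairs, ";")
      for (i = 1; i <= n; i++) {
        if (pairs[i] == "") continue
        split(pairs[i], kv, "=")
        types[kv[1]] = 1
        if (NR == 1) a[kv[1]] = kv[2]; else b[kv[1]] = kv[2]
      }
    }
    END {
      for (t in types) {
        o = a[t] + 0; c = b[t] + 0
        if (o != c) printf "%s %d %d %d\n", t, o, c, c - o
      }
    }' | sort
}

counter_diff "$old" "$new"
```

Run as is, it lists nine changed types, including `application/ogg 15 0 -15` and `wav 4 0 -4`, which matches the missing audio content noted above.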

kelson42 commented 5 years ago

@ISNIT0 I confirm, with a new ZIM file created with MWoffliner 1.9.3 with videos:

mwoffliner --adminEmail="contact@kiwix.org" --articleList="http://download.kiwix.org/wp1/eswiki/customs/medicine" --customMainPage="Wikiproyecto:Medicina/Open_Textbook_of_Medicine" --customZimDescription="La Enciclopedia medical de Wikipedia" --customZimFavicon="https://upload.wikimedia.org/wikipedia/commons/2/25/Wiki_Offline_Spanish_logo_colored_final.png" --customZimTitle="Wikipedia Médica" --mwUrl="https://es.wikipedia.org/" --format="" --verbose

Pure_Alpha_Waves_7_to_12,9_Hz_Binaural_Beats_V3.wav is not included, because it is not part of the article "Pulsos_binaurales" in the new ZIM (http://localhost:8080/wikipedia_es_medicine_2019-06/A/Pulsos_binaurales), unlike in the online version.

ISNIT0 commented 5 years ago

I think this is related to #842

kelson42 commented 5 years ago

@ISNIT0 Probably. We currently have 3-4 tickets related to audio scraping problems. I think we can say it's totally broken, and in different ways. It would be great to have a PR fixing and testing all the different reported buggy scenarios.

ISNIT0 commented 5 years ago

Closing this for now; the audio issues we know about are now fixed.

kelson42 commented 5 years ago

@ISNIT0 Considering that we still seem to have problems around multimedia management (see #381, reopened this morning), I will reopen that ticket and retest everything myself once we get that far.