Closed Popolechien closed 5 years ago
The problem is with the pictures: both files contain almost the same pictures, but they appear to be bigger in the October version. For example, I/m/f1,4-butanediol.svg.png is 910 bytes in October, but 1802 bytes in November.
This kind of situation is not new, so this is probably not a regression, but the difference is too large to be considered normal.
@ISNIT0 I suspect a weakness in the image optimization process
After a cursory look, they seem to have a similar number of articles:
> zimdump -l wikipedia_es_medicine_2018-09.zim | wc -l
57406
> zimdump -l wikipedia_es_medicine_2018-10.zim | wc -l
57461
@kelson42 It seems the more recent file is smaller than the older one even though it has more files
Possibly an optimization issue that has been fixed?
Okay, will look into it
Still finding weird numbers...
> zimdump wikipedia_es_all_nopic_2017-08.zim -l | wc -l
3142316
> zimdump wikipedia_es_all_nopic_2018-04.zim -l | wc -l
3230955
The larger file has fewer articles...
@ISNIT0 Please focus on the medical example. For the wikipedia_es_all_nopic example (which has been removed from the comment), I think the explanation is easy: one has the full-text index, the other does not.
I have copied the files to http://tmp.kiwix.org/, to be sure we keep them.
Yeah, I figured out afterwards that the es_all example was comparing 2018 to 2017. On a side note, as soon as we have a reliable update process we should definitely get rid of (or at least separate) the "last two zims in folder" concept.
@kelson42 The main difference seems to be the webm files. For example, the 09 file has:
m%2fAneurysms.webm.160p.webm
14710995 bytes
The 10 file has:
m%2fAneurysms.webm.120p.vp9.webm
8437019 bytes
There are about 50 similar occurrences
I think we can put it down to the lower resolution videos. Is this a problem?
@ISNIT0 How do you explain the difference in size of I/m/f1,4-butanediol.svg.png?
$ ls -la *zim
-rw-rw-r-- 1 kelson kelson 886548766 Jun 17 08:02 wikipedia_es_medicine_2018-09.zim
-rw-rw-r-- 1 kelson kelson 660456297 Jun 17 08:01 wikipedia_es_medicine_2018-10.zim
$ zimdump -n M -f Counter -d wikipedia_es_medicine_2018-09.zim ; echo
application/javascript=12;application/ogg=15;image/gif=408;image/jpeg=7729;image/png=7236;image/svg+xml=552;ogv=5;text/css=16;text/html=12402;wav=4;webm=110;
$ zimdump -n M -f Counter -d wikipedia_es_medicine_2018-10.zim ; echo
application/javascript=12;image/gif=409;image/jpeg=7770;image/png=7238;image/svg+xml=553;ogv=3;text/css=16;text/html=12424;webm=111;
We see here that the main reason is that wikipedia_es_medicine_2018-09.zim has far more multimedia content. In particular, it contains a file Pure_Alpha_Waves_7_to_12,9_Hz_Binaural_Beats_V3.wav which is almost 100MB.
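To make the comparison of the two Counter strings above easier to read, here is a minimal sketch that parses them and prints only the MIME types whose counts differ. The helper names `parse_counter` and `counter_diff` are hypothetical, and the sketch assumes the simple `type=count;` layout shown in the output above (no MIME parameters containing `;`).

```python
def parse_counter(raw: str) -> dict[str, int]:
    """Parse a 'mime=count;mime=count;' Counter string into a dict."""
    counts = {}
    for field in raw.strip().split(";"):
        if not field:
            continue  # skip the empty field after the trailing ';'
        mime, _, number = field.rpartition("=")
        counts[mime] = int(number)
    return counts


def counter_diff(old: dict[str, int], new: dict[str, int]) -> dict[str, tuple[int, int]]:
    """Return {mime: (old_count, new_count)} for every type whose count changed."""
    return {
        mime: (old.get(mime, 0), new.get(mime, 0))
        for mime in sorted(set(old) | set(new))
        if old.get(mime, 0) != new.get(mime, 0)
    }


# Counter metadata copied verbatim from the zimdump output in this thread.
COUNTER_2018_09 = (
    "application/javascript=12;application/ogg=15;image/gif=408;"
    "image/jpeg=7729;image/png=7236;image/svg+xml=552;ogv=5;text/css=16;"
    "text/html=12402;wav=4;webm=110;"
)
COUNTER_2018_10 = (
    "application/javascript=12;image/gif=409;image/jpeg=7770;image/png=7238;"
    "image/svg+xml=553;ogv=3;text/css=16;text/html=12424;webm=111;"
)

changes = counter_diff(parse_counter(COUNTER_2018_09), parse_counter(COUNTER_2018_10))
for mime, (before, after) in changes.items():
    print(f"{mime}: {before} -> {after}")
```

With the two strings above, the types that disappear entirely in 2018-10 are `application/ogg` (15 entries) and `wav` (4 entries), while `ogv` drops from 5 to 3.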
Looks like wikipedia_es_medicine_2018-10.zim is missing some multimedia content. I'll check that with MWoffliner 1.9.3.
@ISNIT0 I confirm, with a new ZIM file created with MWoffliner 1.9.3 with videos:
mwoffliner --adminEmail="contact@kiwix.org" --articleList="http://download.kiwix.org/wp1/eswiki/customs/medicine" --customMainPage="Wikiproyecto:Medicina/Open_Textbook_of_Medicine" --customZimDescription="La Enciclopedia medical de Wikipedia" --customZimFavicon="https://upload.wikimedia.org/wikipedia/commons/2/25/Wiki_Offline_Spanish_logo_colored_final.png" --customZimTitle="Wikipedia Médica" --mwUrl="https://es.wikipedia.org/" --format="" --verbose
Pure_Alpha_Waves_7_to_12,9_Hz_Binaural_Beats_V3.wav is not included because it is not part of the article "Pulsos_binaurales" in the ZIM (http://localhost:8080/wikipedia_es_medicine_2019-06/A/Pulsos_binaurales), as it is in the online article.
I think this is related to #842
@ISNIT0 Probably. We currently have 3-4 tickets related to audio scraping problems. I think we can say: it's totally broken, and in different ways. It would be great to have a PR fixing/testing all the different reported buggy scenarios.
Closing this for now, the audio issues we know about are now fixed
@ISNIT0 Since it seems we still have problems around multimedia management (see #381, reopened this morning), I will reopen that ticket and retest everything myself once we get that far.
wikipedia_es_medicine_2018-10.zim (630 MB) is smaller than the earlier wikipedia_es_medicine_2018-09.zim (845 MB).
I don't see that pattern in French, for instance.
Any way we can test what's missing?
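One way to see what is missing could be to save each `zimdump -l` listing to a text file and compare the two sets of entries. The sketch below assumes one entry per line in that output; `listing_entries` and `missing_entries` are hypothetical helper names, and the two sample listings are illustrative strings built from file names mentioned in this thread, not real zimdump output.

```python
def listing_entries(text: str) -> set[str]:
    """Collect the non-empty lines of a saved listing into a set."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def missing_entries(old_text: str, new_text: str) -> list[str]:
    """Entries present in the older listing but absent from the newer one."""
    return sorted(listing_entries(old_text) - listing_entries(new_text))


# Illustrative samples only (assumed format: one entry per line).
OLD_SAMPLE = """\
I/m/f1,4-butanediol.svg.png
m%2fAneurysms.webm.160p.webm
Pure_Alpha_Waves_7_to_12,9_Hz_Binaural_Beats_V3.wav
"""
NEW_SAMPLE = """\
I/m/f1,4-butanediol.svg.png
m%2fAneurysms.webm.120p.vp9.webm
"""

for entry in missing_entries(OLD_SAMPLE, NEW_SAMPLE):
    print(entry)
```

For real files, the same functions could be fed the contents of the two saved listings, for example `missing_entries(open("2018-09.txt").read(), open("2018-10.txt").read())`, where both file names are placeholders.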