openzim / zim-requests

Want a new ZIM file? Propose ZIM content improvements or fixes? Here you are!
https://farm.openzim.org

Check data source for wikipedia_en_100_maxi (seems out of date) #352

Open Jaifroid opened 3 years ago

Jaifroid commented 3 years ago

I'm not sure if the top 100 ZIMs are supposed to have a dynamic data source, whether it is for the past year, all-time, etc. Whatever it is, it seems out of date. The most startling example is that COVID-19 still hasn't made its way into the Top 100, whereas according to Wikipedia's own all-time list of top 100, it's been in the top 100 since June 2020. See:

https://en.wikipedia.org/wiki/Wikipedia:Multiyear_ranking_of_most_viewed_pages#Top-100_list

And slightly off-topic: I wonder if there is a way to reduce the size of the Top-100 ZIM. It is about 10 times larger than Ray Charles, which has the same number of pages. I think it's mostly due to images. Would it be possible to do a mini version with pictures, i.e. with only the lead picture(s)? I use this as a sample ZIM to distribute with the vanilla app -- I think a sample ZIM should have pictures -- but 28MB is too big and bloats the app (which on its own would be only about 4MB for the UWP version). Alternatively, a Top-25 ZIM? See https://en.wikipedia.org/wiki/Wikipedia:Top_25_Report/May_16_to_22,_2021 .

kelson42 commented 3 years ago

@Jaifroid I guess you are talking about this recipe: https://farm.openzim.org/recipes/wikipedia_en_100?

Jaifroid commented 3 years ago

@kelson42 Yes, that's the one!

kelson42 commented 3 years ago

I can't really judge why COVID-19 is not in the top 100, but what is certain is that the Top 100 does not count popularity alone and that it is designed not to change too much. So honestly, I'm not that surprised. You can check the details of all inputs for all articles at https://download.openzim.org/wp1/enwiki_2021-07/all.tsv.zip

Do you want a TOP10? http://download.openzim.org/wp1/enwiki/tops/10.tsv

Jaifroid commented 3 years ago

But what's the data source? Is it dynamic? In the link I gave above, Wikipedia itself publishes a list of the top 100 articles over the full period 2007-2021, which is quite different from our top 100, and in which COVID-19 is number 58. If it's number 58 on the all-time list, I don't see how it could NOT be on any other top-100 list. That's why I think maybe our data source has got stuck on an old or static list.

Top 10 would be great, and would reduce the size of the app download to make a really lightweight app like it used to be (when I distributed only Ray Charles). Top 20 or 25 (if it's possible) might be a slightly better compromise, but I'll settle for 10 if it's what's possible!

Jaifroid commented 3 years ago

PS I'm downloading the ZIP to look at the source!

Jaifroid commented 3 years ago

Enormous file! If we compare these entries:

Elizabeth_II    12153654    116302    19034    175    15703921
COVID-19        63030231    243465    14881    160    2336534

What determines that the first one is in the top 100, and the second one isn't? I can't see anything obvious in the category list either. Sorry for my ignorance of how this works. It's not something you should spend any time on, I'm just curious...

kelson42 commented 3 years ago

25% more links + 8x more visitors probably explains why Elizabeth_II is in the list and COVID-19 is not. The Perl script which builds the scores is here: https://github.com/openzim/wp1_selection_tools/blob/master/build_scores.pl
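For anyone who wants to check those figures against the two rows quoted above, here is a minimal sketch that divides one row by the other, column by column. The thread does not say what each column in all.tsv means, so the columns are referred to by position only; this does not reproduce anything in build_scores.pl. With these numbers, Elizabeth_II is ahead only in the third column (about 1.28x), the fourth (about 1.09x) and the fifth (about 6.7x); which of these correspond to "links" and "visitors" is an assumption.

```python
# Rows copied from the comment above. The meaning of each column is NOT
# stated in the thread, so columns are identified by position only.
rows = {
    "Elizabeth_II": [12153654, 116302, 19034, 175, 15703921],
    "COVID-19": [63030231, 243465, 14881, 160, 2336534],
}


def column_ratios(a, b):
    """Return a[i] / b[i] for each column position."""
    return [x / y for x, y in zip(a, b)]


ratios = column_ratios(rows["Elizabeth_II"], rows["COVID-19"])
for position, ratio in enumerate(ratios, start=1):
    print(f"column {position}: Elizabeth_II / COVID-19 = {ratio:.2f}x")
```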

Jaifroid commented 3 years ago

Thinking about this, while I understand the intent, I believe there are quite a few distortions produced by this algorithm. Some examples:

Of course these are all arguable. But such selections suggest an over-emphasis on backlinks. Maybe this algorithm needs tweaking to give a bit more weight to what people are interested in, averaged out over a long time period? Obviously it's good that it's not completely filled up with "Donald Trump", "Joe Biden", "QAnon", or the top pop star of the month etc. But it seems pointless publishing a "Top 100" updated monthly if it is an almost completely static list that major world-shattering events can't get into. I think the average user downloading a "Top 100" ZIM is going to be completely baffled by this selection. I wonder if @Popolechien has an opinion on this?

kelson42 commented 3 years ago

@Jaifroid It is difficult to have the perfect algorithm. At this stage, the best thing would be for you to propose an alternative algorithm or simply a different tuning.

Jaifroid commented 3 years ago

@kelson42 OK, I'll experiment a bit with the weightings / tuning.

Popolechien commented 3 years ago

@Jaifroid same thinking here. Some sort of 2-year moving average (e.g. so that Trump slowly takes over Obama, and Biden over Trump) would make sense imho.
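A rough sketch of what such a moving average could look like, assuming monthly page-view counts per article are available as a plain list (that input format, the function name `trailing_mean` and the sample numbers are all hypothetical, and none of this comes from wp1_selection_tools):

```python
from collections import deque


def trailing_mean(monthly_views, window=24):
    """Yield the mean of the last `window` monthly values.

    During the first `window - 1` months the mean is taken over the
    months seen so far.
    """
    recent = deque(maxlen=window)
    total = 0
    for views in monthly_views:
        if len(recent) == window:
            total -= recent[0]  # oldest month is about to drop out
        recent.append(views)
        total += views
        yield total / len(recent)


# Made-up data: one article with steady interest, one that only starts
# getting views after the first year.
steady = [100] * 36
newcomer = [0] * 12 + [300] * 24

for month, (s, n) in enumerate(
    zip(trailing_mean(steady), trailing_mean(newcomer)), start=1
):
    if n > s:
        print(f"newcomer overtakes the steady article in month {month}")
        break
```

With a 24-month window the newcomer does not jump ahead the moment it appears; it overtakes only once enough high-traffic months have accumulated, which is the "slowly takes over" behaviour described above.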