Open Jaifroid opened 3 years ago
@Jaifroid I guess you are talking about this recipe: https://farm.openzim.org/recipes/wikipedia_en_100?
@kelson42 Yes, that's the one!
I can't really judge why COVID-19 is not in the top 100, but what is certain is that the top 100 does not count only popularity, and that it is designed not to change too much. So honestly, I'm not that surprised. You can check the details of all inputs for all articles at https://download.openzim.org/wp1/enwiki_2021-07/all.tsv.zip
Do you want a TOP10? http://download.openzim.org/wp1/enwiki/tops/10.tsv
But what's the data source? Is it dynamic? In the link I gave above, Wikipedia itself publishes a list of the top 100 articles over the full period 2007-2021, which is quite different to our top 100, and in which COVID-19 is number 58. If it's number 58 on the all-time list, I don't see how it could NOT be on any other top-100 list. That's why I think maybe our data source has got stuck on an old or static list.
Top 10 would be great, and would reduce the size of the app download to make a really lightweight app like it used to be (when I distributed only Ray Charles). Top 20 or 25 (if it's possible) might be a slightly better compromise, but I'll settle for 10 if it's what's possible!
PS I'm downloading the ZIP to look at the source!
Enormous file! If we compare these entries:
Elizabeth_II 12153654 116302 19034 175 15703921
COVID-19 63030231 243465 14881 160 2336534
What determines that the first one is in the top 100, and the second one isn't? I can't see anything obvious in the category list either. Sorry for my ignorance of how this works. It's not something you should spend any time on, I'm just curious...
25% more links + 8x more visitors probably explain why Elizabeth_II is in the list and not COVID-19. The Perl script which builds the scores is here: https://github.com/openzim/wp1_selection_tools/blob/master/build_scores.pl
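As a quick sanity check on that comparison, the ratios can be computed directly from the two rows quoted above. This is only a sketch: the thread never names the columns of all.tsv, so treating the third column as incoming links and the fifth as page views is an assumption, and `LINKS_COL` and `VIEWS_COL` are hypothetical names.

```python
# Rows copied verbatim from the thread (all.tsv, enwiki_2021-07).
rows = {
    "Elizabeth_II": [12153654, 116302, 19034, 175, 15703921],
    "COVID-19": [63030231, 243465, 14881, 160, 2336534],
}

# ASSUMPTION: column meanings are not stated in the thread.
LINKS_COL = 2  # assumed to be the number of incoming links
VIEWS_COL = 4  # assumed to be the number of page views


def ratio(col: int) -> float:
    """Return the Elizabeth_II value divided by the COVID-19 value for one column."""
    return rows["Elizabeth_II"][col] / rows["COVID-19"][col]


links_ratio = ratio(LINKS_COL)
views_ratio = ratio(VIEWS_COL)

print(f"links ratio: {links_ratio:.2f}")
print(f"views ratio: {views_ratio:.2f}")
```

Under those column assumptions the ratios come out at roughly 1.28 for links and 6.7 for views, which is in the same ballpark as the figures quoted.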
Thinking about this, while I understand the intent, I believe this algorithm produces quite a few distortions. Some examples:
- Sigmund Freud (not in our top 100), who has 15.5 million page views since 2015 (source: https://pageviews.toolforge.org/langviews/ - select All Time, page views for English Wikipedia).
- Hoover Dam, which seems quite arbitrary to me (6m page views since 2015) compared to Chernobyl Disaster (not in our top 100), which has 49m page views since 2015.
- Cougar: I bet this is accidentally in our top 100 because of porn-based searches rather than because it is such an important animal.
- Gastropoda, Mollusca, Amphibian and Bivalvia are in the list, but Cat (20m views) and Dog (18m views) are not in it.
Of course these are all arguable. But such selections suggest an over-emphasis on backlinks. Maybe this algorithm needs tweaking to give a bit more weight to what people are interested in, averaged out over a long time period? Obviously it's good that it's not completely filled up with "Donald Trump", "Joe Biden", "QAnon", or the top pop star of the month etc. But it seems pointless publishing a "Top 100" updated monthly if it is an almost completely static list that major world-shattering events can't get into. I think the average user downloading a "Top 100" ZIM is going to be completely baffled by this selection. I wonder if @Popolechien has an opinion on this?
@Jaifroid It is difficult to have the perfect algorithm. At this stage, the best thing would be for you to propose an alternative algorithm or simply a different tuning.
@kelson42 OK, I'll experiment a bit with the weightings / tuning.
@Jaifroid same thinking here. Some sort of 2-year moving average (e.g. so that Trump slowly takes over Obama, and Biden over Trump) would make sense imho.
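To make the moving-average idea concrete, here is a minimal sketch of a trailing 24-month mean over monthly page views. It is not taken from build_scores.pl; the function name `trailing_mean` and the two monthly series are made up purely for illustration.

```python
def trailing_mean(monthly_views, window=24):
    """Mean of the last `window` months at each point (fewer at the start)."""
    averages = []
    running = 0
    for i, views in enumerate(monthly_views):
        running += views
        if i >= window:
            # Drop the month that just fell out of the window.
            running -= monthly_views[i - window]
        averages.append(running / min(i + 1, window))
    return averages


# HYPOTHETICAL data: an established article with steady traffic, and a
# newer topic that only starts getting views in month 12.
established = [100] * 36
newcomer = [0] * 12 + [150] * 24

avg_established = trailing_mean(established)
avg_newcomer = trailing_mean(newcomer)

# First month in which the newcomer's 2-year average exceeds the other's.
crossover = next(
    i for i, (new, old) in enumerate(zip(avg_newcomer, avg_established))
    if new > old
)
print("newcomer overtakes at month index", crossover)
```

With these made-up numbers, the newcomer draws more monthly views from month 12 onward but only overtakes on the averaged figure well over a year later, which is the kind of slow hand-over described above.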
I'm not sure if the top 100 ZIMs are supposed to have a dynamic data source, whether it is for the past year, all-time, etc. Whatever it is, it seems out of date. The most startling example is that COVID-19 still hasn't made its way into the Top 100, whereas according to Wikipedia's own all-time list of top 100, it's been in the top 100 since June 2020. See:
https://en.wikipedia.org/wiki/Wikipedia:Multiyear_ranking_of_most_viewed_pages#Top-100_list
And slightly off-topic: I wonder if there is a way to reduce the size of the Top-100 ZIM. It is about 10 times larger than Ray Charles, which has the same number of pages. I think it's mostly due to images. Would it be possible to do a mini version with pictures, i.e. with the lead picture(s)? I use this as a sample ZIM to distribute with the vanilla app (I think a sample ZIM should have pictures), but 28MB is too big and bloats the app (which on its own would be only about 4MB for the UWP version). Alternatively, a Top-25 ZIM? See https://en.wikipedia.org/wiki/Wikipedia:Top_25_Report/May_16_to_22,_2021 .