Closed kelson42 closed 2 years ago
Hi, I will be needing these details in order to proceed.
@proteek-dev Thank you for commenting here. Before starting any code, please first explain here what you plan to do and how, so it can be validated.
The dashboard provides all the necessary information for anonymous users. I am unsure about the API, but as soon as coding starts, we will provide a token if needed.
The popularity is based on the number of unique visitors for a ZIM file. The ZIM file with the most unique visitors is the most popular. The browsers, keywords, etc... don't play a role in this. The solution should retrieve for all ZIM files downloaded on https://download.kiwix.org/zim (and related torrent links) the number of unique visitors, sort them, and rank them linearly on a scale from 0 to 100 (on the output for example).
Then, considering that there is zero code base available for the CMS yet, it won't be possible to generate the library.xml anyway right now. Writing library.xml is one of the main goals of the solution but cannot be achieved without having all the infrastructure in place.
I was able to find a way to get the necessary information about the ZIM files. I had to download the data manually from the dashboard (download column), and was then able to code against it. Here is the progress so far:
    import json


    class UniqueVis:
        def __init__(self, jsonfile_path):
            self.jsonfile_path = jsonfile_path

        def get_unique_visits(self):
            # Load the JSON export downloaded manually from the dashboard.
            with open(self.jsonfile_path) as readjson:
                data = json.load(readjson)
            # Keep only the entries whose URL points to a ZIM torrent.
            zim_data = [
                subtable
                for subtable in data[0]['subtable']
                if subtable.get("url", "").endswith("zim.torrent")
            ]
            self.calculate_popularity(zim_data)

        def calculate_popularity(self, zim_download_data):
            if not zim_download_data:
                return
            # Percentages are relative to the most visited entry.
            max_visits = max(entry['nb_uniq_visitors'] for entry in zim_download_data)
            for entry in zim_download_data:
                print("Url Downloaded: {}".format(entry['url']))
                print("No. of unique visitors: {}".format(entry['nb_uniq_visitors']))
                print("No. of hits: {}".format(entry['nb_hits']))
                percent = (entry['nb_uniq_visitors'] / max_visits) * 100
                print("Percentage: {}".format(float(percent)))
                print("\n==================")


    if __name__ == "__main__":
        obj = UniqueVis("download_kiwix_org.json")
        obj.get_unique_visits()
@proteek-dev thx. @rgaudin can you please handle it from now?
Thank you @proteek-dev ; could you please share the URL of that JSON? Looks like you manually downloaded it for this test but we'd need to automate that of course.
Is there a reason for getting this data only for torrent files?
@proteek-dev, I checked with @kelson42 and what we'd like is for you to submit a pull request with a standalone python script (can have external dependencies of course) that produces an output similar to:
rank, score, zim
1, 100, wikipedia_en_all_nopic
2, 95.2, wikipedia_fr_all_maxi
[...]
1432, 1.4, ted_en_playlist-the-quest-to-end-poverty
The output can be printed or written to a file; it doesn't matter at this point.
In this output, there are three pieces of information per line:
zim: the stem of the name of zim files. All zim files we publish are suffixed with a YYYY-MM period string. You must aggregate results for each version of the Zim.
score: this is a 0-100 score computed from the number of uniq visitors for this zim compared to the total number of uniq_visitors for all zims' downloads. This is not what you computed above.
rank: just the position of the zim in the list.
Hope this clarifies the objective; please let me know if you have any questions.
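For illustration, the aggregation step described above could be sketched as follows. This is only a sketch under assumptions: it assumes each entry is a dict with `url` and `nb_uniq_visitors` keys (as in the script posted earlier), that file names look like `<stem>_YYYY-MM.zim` with an optional `.torrent` suffix, and that adding up unique visitors across versions is acceptable (the same visitor may be counted more than once). The function name `aggregate_by_stem`, the paths, and the sample numbers are hypothetical.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Assumed file name shape: <stem>_<YYYY-MM>.zim, optionally followed by .torrent
ZIM_NAME = re.compile(r"^(?P<stem>.+)_\d{4}-\d{2}\.zim(?:\.torrent)?$")


def aggregate_by_stem(entries):
    """Sum unique visitors over every dated version of the same ZIM."""
    totals = defaultdict(int)
    for entry in entries:
        # Works for full URLs and for bare paths alike
        filename = urlparse(entry["url"]).path.rsplit("/", 1)[-1]
        match = ZIM_NAME.match(filename)
        if match:  # skip anything that is not a dated ZIM file
            totals[match.group("stem")] += entry["nb_uniq_visitors"]
    return dict(totals)


# Made-up sample data, for illustration only
sample = [
    {"url": "/zim/wikipedia/wikipedia_en_all_nopic_2020-01.zim", "nb_uniq_visitors": 40},
    {"url": "/zim/wikipedia/wikipedia_en_all_nopic_2020-02.zim.torrent", "nb_uniq_visitors": 10},
    {"url": "/zim/wikipedia/wikipedia_fr_all_maxi_2020-01.zim", "nb_uniq_visitors": 30},
    {"url": "/zim/README", "nb_uniq_visitors": 99},
]
totals = aggregate_by_stem(sample)
```

With the sample above, both versions of `wikipedia_en_all_nopic` end up under a single key, and the non-ZIM entry is ignored.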
I have a question about the score calculation.
In the examples given, the max value for the score of wikipedia_en_all_nopic is 100. That value can only be the result of the calculation if the calculation is entry.visitors / max(entries.visitors). If the calculation is entry.visitors / sum(entries.visitors), the maximum value cannot be 100 but less.
Or am I missing something?
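The difference between the two candidate formulas can be checked on a small example. The visitor counts below are made up purely for illustration, and the variable names are not taken from any existing code.

```python
# Made-up visitor counts, for illustration only
visitors = {"a": 50, "b": 30, "c": 20}

top = max(visitors.values())
total = sum(visitors.values())

# entry.visitors / max(entries.visitors): the top entry always scores 100
score_by_max = {name: 100 * n / top for name, n in visitors.items()}

# entry.visitors / sum(entries.visitors): the scores add up to 100, so the
# top entry only reaches 100 when it is the sole entry with visitors
score_by_sum = {name: 100 * n / total for name, n in visitors.items()}
```

Here the top entry gets 100 under the first formula but only 50 under the second, which matches the observation in the question.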
From @kelson42's comment above:
The solution should retrieve for all ZIM files downloaded on download.kiwix.org the number of unique visitors, sort them, and rank them linearly on a scale from 0 to 100
My understanding of “linearly on a scale from 0 to 100” is that max(entry.visitors / sum(entries.visitors)) becomes 100, min(entry.visitors / sum(entries.visitors)) is 0, and anything in between is placed using this function, but you might be right as well. @kelson42 can you confirm which one it is?
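This reading amounts to a min-max rescaling. A minimal sketch, assuming a plain dict of visitor counts per ZIM: because dividing every count by the same sum is a constant factor, it cancels out, so the sketch works on raw counts directly. The function name `min_max_scale`, the sample numbers, and the handling of the all-equal case are assumptions.

```python
def min_max_scale(visitors):
    """Map the lowest count to 0 and the highest to 100, linearly in between."""
    low, high = min(visitors.values()), max(visitors.values())
    if high == low:  # every entry ties; assumption: give them all 100
        return {name: 100.0 for name in visitors}
    return {name: 100 * (n - low) / (high - low) for name, n in visitors.items()}


# Made-up visitor counts, for illustration only
scaled = min_max_scale({"a": 50, "b": 30, "c": 20})
```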
Only the ranking matters, not the values, so I don't understand why a ratio should be computed. Just sort the entries and put them on a scale from 0 to 100.
In that case, can we get confirmation that the formula for the score is 100 * (total - rank) / (total - 1)? That is, rank 1 is 100 and the last rank is 0?
Yes, but the ranks should be linearly spread across the scale (0-100).
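A minimal sketch of the rank-based score discussed here, where only the ordering matters and the ranks are spread evenly over 0-100. The function name `rank_scores`, the sample counts, and the choice to give a lone entry a score of 100 are assumptions. Ties are left in whatever order the sort produces.

```python
def rank_scores(visitors):
    """Return (rank, score, name) rows, most visited first."""
    ordered = sorted(visitors, key=visitors.get, reverse=True)
    total = len(ordered)
    rows = []
    for rank, name in enumerate(ordered, start=1):
        # Assumption: a single entry gets 100, since the formula would divide by zero
        score = 100.0 if total == 1 else 100 * (total - rank) / (total - 1)
        rows.append((rank, round(score, 1), name))
    return rows


# Made-up visitor counts, for illustration only
rows = rank_scores({"a": 50, "b": 30, "c": 20, "d": 5, "e": 1})
print("rank, score, zim")
for rank, score, name in rows:
    print(f"{rank}, {score}, {name}")
```

With five entries the scores come out evenly spaced (100, 75, 50, 25, 0), regardless of how far apart the visitor counts are.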
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.