Retrieve popularity script

kelson42 commented 3 years ago

Scrape information from https://stats.kiwix.org
Make popularity computation
Write popularity in the library.xml

proteek-dev commented 3 years ago

Hi, I will be needing these details in order to proceed.

The dashboard has multiple segments such as Websites, Browsers, Keywords, Visits over time, etc. For which segment do I need to compute popularity?
Accessing the Kiwix dashboard (https://stats.kiwix.org) requires login credentials. Could I be provided with dummy login credentials for testing my code? Hitting the server through the API requires Basic Auth.
To write the popularity, kindly let me know the location of library.xml.

kelson42 commented 3 years ago

@proteek-dev Thank you for commenting here. Before starting any code, please first explain here what you plan to do and how, so it can be validated.

The dashboard provides all the necessary information to anonymous users. I am unsure about the API, but as soon as coding requires it, we will provide a token if needed.

The popularity is based on the number of unique visitors for a ZIM file. The ZIM file with the most unique visitors is the most popular. The browsers, keywords, etc... don't play a role in this. The solution should retrieve for all ZIM files downloaded on https://download.kiwix.org/zim (and related torrent links) the number of unique visitors, sort them, and rank them linearly on a scale from 0 to 100 (on the output for example).

Then, considering that there is no code base available for the CMS yet, it won't be possible to generate the library.xml right now anyway. Writing library.xml is one of the main goals of the solution, but it cannot be achieved without having all the infrastructure in place.

proteek-dev commented 3 years ago

I was able to find a way to get the necessary information about ZIM files. For now I had to download the data manually from the dashboard > download column, and I was able to code against it. Progress so far:

From the downloaded JSON file, I was able to parse and keep only the "zim.torrent" downloads.
I wrote code to fetch the unique visitors for each ZIM file and place them linearly on a scale (0 to 100).
I am currently working on creating an XML file to populate (and also learning XML).
Posting the code written so far.

import json


class UniqueVis:
    """Read a JSON export from the stats dashboard and print per-file popularity."""

    def __init__(self, jsonfile_path):
        self.jsonfile_path = jsonfile_path

    def get_unique_visits(self):
        with open(self.jsonfile_path) as readjson:
            data = json.load(readjson)

        # Keep only the rows whose URL points at a ZIM torrent.
        zim_data = [
            subtable
            for subtable in data[0]['subtable']
            if subtable.get("url", "").endswith("zim.torrent")
        ]
        self.calculate_popularity(zim_data)

    def calculate_popularity(self, zim_download_data):
        if not zim_download_data:
            print("No ZIM torrent downloads found.")
            return

        # Take the real maximum instead of assuming the first row is the largest.
        max_visits = max(row['nb_uniq_visitors'] for row in zim_download_data)

        for row in zim_download_data:
            print("Url Downloaded: {}".format(row['url']))
            print("No. of unique visitors: {}".format(row['nb_uniq_visitors']))
            print("No. of hits: {}".format(row['nb_hits']))
            # Popularity as a percentage of the most visited file.
            percent = (row['nb_uniq_visitors'] / max_visits) * 100
            print("Percentage: {}".format(float(percent)))
            print("\n==================")


if __name__ == "__main__":
    obj = UniqueVis("download_kiwix_org.json")
    obj.get_unique_visits()

kelson42 commented 3 years ago

@proteek-dev thx. @rgaudin can you please handle it from now?

rgaudin commented 3 years ago

Thank you @proteek-dev ; could you please share the URL of that JSON? Looks like you manually downloaded it for this test but we'd need to automate that of course.

Is there a reason for getting this data only for torrent files?

rgaudin commented 3 years ago

@proteek-dev, I checked with @kelson42 and what we'd like is for you to submit a pull request with a standalone python script (can have external dependencies of course) that produces an output similar to:

rank, score, zim
1, 100, wikipedia_en_all_nopic
2, 95.2, wikipedia_fr_all_maxi
[...]
1432, 1.4, ted_en_playlist-the-quest-to-end-poverty

The output can be printed or written to a file; it doesn't matter at this point.

In this output, there are three pieces of information per line:

zim: the stem of the name of zim files. All zim files we publish are suffixed with a YYYY-MM period string. You must aggregate results for each version of the Zim.
score: this is a 0-100 score computed from the number of uniq visitors for this zim compared to the total number of uniq_visitors for all zims' downloads. This is not what you computed above.
rank is just the position of the zim in the list.
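
The aggregation step described for the zim column could be sketched as follows. This is only an illustration under assumptions: the `url` and `nb_uniq_visitors` keys come from the JSON used in the script earlier in the thread, while the filename shape `<stem>_<YYYY-MM>.zim` (optionally followed by an extension such as `.torrent`) and the helper names `zim_stem` and `aggregate_visitors` are hypothetical.

```python
import re
from collections import defaultdict

# Assumed filename shape: <stem>_<YYYY-MM>.zim, optionally followed by
# another extension such as .torrent.
PERIOD_SUFFIX = re.compile(r"_\d{4}-\d{2}\.zim(\..+)?$")


def zim_stem(url):
    """Return the ZIM name without its YYYY-MM period, or None if no match."""
    filename = url.rsplit("/", 1)[-1]
    match = PERIOD_SUFFIX.search(filename)
    if match is None:
        return None
    return filename[: match.start()]


def aggregate_visitors(rows):
    """Sum unique visitors over every version and link type of the same ZIM."""
    totals = defaultdict(int)
    for row in rows:
        stem = zim_stem(row["url"])
        if stem is not None:
            totals[stem] += row["nb_uniq_visitors"]
    return dict(totals)
```

Summing per-URL unique visitors may count the same visitor more than once across versions; whether that is acceptable is not settled in this thread.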

Hope this clarifies the objective ; please let me know if you have any question.

christianhujer commented 3 years ago

I have a question about the score calculation.

In the examples given, the max value for the score of wikipedia_en_all_nopic is 100. That value can only be the result of the calculation if the calculation is entry.visitors / max(entries.visitors).

If the calculation is entry.visitors / sum(entries.visitors), the maximum value cannot be 100 but less.

Or am I missing something?
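
To make the difference concrete, here is a small sketch comparing the two calculations on made-up visitor counts. The function names and the sample data are hypothetical and only illustrate the point raised in this comment.

```python
def score_by_max(visitors):
    """Score each entry relative to the most visited entry."""
    top = max(visitors.values())
    return {name: 100 * count / top for name, count in visitors.items()}


def score_by_sum(visitors):
    """Score each entry as its share of all visitors."""
    total = sum(visitors.values())
    return {name: 100 * count / total for name, count in visitors.items()}


# Made-up counts, for illustration only.
visitors = {"zim_a": 600, "zim_b": 300, "zim_c": 100}

by_max = score_by_max(visitors)  # top entry reaches 100
by_sum = score_by_sum(visitors)  # top entry stays below 100 (60 here)
```

With more than one entry that has visitors, dividing by the sum can never yield 100 for the top entry, because its count is always smaller than the total.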

rgaudin commented 3 years ago

From @kelson42's comment above:

The solution should retrieve for all ZIM files downloaded on download.kiwix.org the number of unique visitors, sort them, and rank them linearly on a scale from 0 to 100

My understanding of “linearly on a scale from 0 to 100” is that max(entry.visitors / sum(entries.visitors)) becomes 100, min(entry.visitors / sum(entries.visitors)) becomes 0, and anything in between is placed using this function, but you might be right as well. @kelson42 can you confirm which one it is?
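
The interpretation described above (largest share maps to 100, smallest to 0, the rest placed linearly in between) could be sketched as a min-max rescaling. The function name and the handling of the all-equal case are assumptions, not something decided in this thread.

```python
def min_max_scores(visitors):
    """Rescale each entry's share of visitors linearly onto 0-100."""
    total = sum(visitors.values())
    shares = {name: count / total for name, count in visitors.items()}
    low, high = min(shares.values()), max(shares.values())
    if high == low:
        # All entries are equal (or there is only one): assumption, score 100.
        return {name: 100.0 for name in shares}
    return {
        name: 100 * (share - low) / (high - low)
        for name, share in shares.items()
    }
```

Note that dividing by the sum cancels out in this rescaling, so applying the same min-max formula to the raw visitor counts gives the same scores.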

kelson42 commented 3 years ago

Only the ranking matters, not the values, so I don't understand why a ratio should be computed. Just sort the entries and put them on a scale from 0 to 100.

christianhujer commented 3 years ago

In that case, can we get confirmation that the formula for the score is 100 * (total - rank) / (total - 1)? That is, rank 1 is 100, last rank is 0?

rgaudin commented 3 years ago

Yes, but the ranks should be linearly spread across the scale (0-100).
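
A minimal sketch of the rank-based formula discussed above, 100 * (total - rank) / (total - 1), printing the rank, score, zim layout requested earlier in the thread. The function name, the rounding to one decimal, the score given to a single entry, and the sample counts are all assumptions for illustration.

```python
def rank_scores(visitors):
    """Return (rank, score, zim) tuples, most visited first."""
    ordered = sorted(visitors.items(), key=lambda item: item[1], reverse=True)
    total = len(ordered)
    results = []
    for rank, (name, _count) in enumerate(ordered, start=1):
        # A single entry would divide by zero; scoring it 100 is an assumption.
        score = 100.0 if total == 1 else 100 * (total - rank) / (total - 1)
        results.append((rank, round(score, 1), name))
    return results


if __name__ == "__main__":
    sample = {"zim_a": 600, "zim_b": 300, "zim_c": 100}  # made-up counts
    print("rank, score, zim")
    for rank, score, name in rank_scores(sample):
        print(f"{rank}, {score}, {name}")
```

Entries with equal visitor counts still get distinct ranks here, in the order `sorted` leaves them; how ties should be handled is not addressed in this thread.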

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

openzim / cms

Retrieve popularity script #11