Add popularity computation script

openzim / cms

ZIM file Publishing Platform

https://cms.openzim.org

GNU General Public License v3.0

Add popularity computation script #14

Closed anshulxyz closed 2 years ago

anshulxyz commented 2 years ago

This is regarding issue #11

To calculate the score, I am using the formula

y = (max - x) / (max - min) * 100

Where:

x is the input rank
min is the minimum rank in the series
max is the maximum rank in the series
y is the resulting rescaled score

My resulting output file looks like this:

output.csv

rank,score,zim
1,100.00,wikipedia_en_all_maxi
2,95.83,wikipedia_en_all_novid
3,91.67,wikipedia_es_all_maxi
4,87.50,wikipedia_zh_all_maxi
5,83.33,wikipedia_en_all_nopic
6,79.17,wikipedia_es_all_nopic
7,75.00,wiktionary_es_all_novid
8,70.83,wikipedia_ar_all_novid
9,66.67,wikipedia_zh_all_novid
...
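
The sample rows above are consistent with an inverted min-max rescaling, y = (max - x) / (max - min) * 100, in which rank 1 maps to 100 and the last rank maps to 0. A minimal sketch, assuming 25 ranked entries (the row count mentioned later in this thread); `rescale_ranks` is a hypothetical name and not taken from the script in this PR:

```python
def rescale_ranks(ranks):
    """Map ranks to 0-100 scores so that the best (lowest) rank gets 100."""
    lo, hi = min(ranks), max(ranks)
    if hi == lo:
        # Only one distinct rank: avoid dividing by zero.
        return [100.0 for _ in ranks]
    return [(hi - r) / (hi - lo) * 100 for r in ranks]


ranks = list(range(1, 26))  # assumption: 25 ranked entries
scores = rescale_ranks(ranks)

# Print the first rows in the same "rank,score" layout as output.csv.
for rank, score in list(zip(ranks, scores))[:3]:
    print(f"{rank},{score:.2f}")
```

With 25 entries, the first three rows come out as 100.00, 95.83 and 91.67, matching the sample output.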

How to run

Install dependencies

pip install -r requirements.txt

# for testing
pip install -r requirements-dev.txt

Run the tests

pytest

Run the script

python src/script.py

To view output, open the output.csv file.

rgaudin commented 2 years ago

@kelson42 I'm gonna need some directions to review this. Let me know when you have a few minutes to discuss it.

anshulxyz commented 2 years ago

Hi @rgaudin, I have updated the script as per the feedback.

anshulxyz commented 2 years ago

I had missed this review comment: https://github.com/openzim/cms/pull/14#discussion_r723402447

So I updated the commit and force-pushed.

anshulxyz commented 2 years ago

how come the generated output only has 25 rows while the downloaded JSON appears to have 57 .zim.torrent entries in the first row?

Because I am clubbing together entries like

wikipedia_en_all_maxi_2020-12
wikipedia_en_all_maxi_2020-06
wikipedia_en_all_maxi_2021-02

and computing a collective score for wikipedia_en_all_maxi.
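
The grouping described above could be sketched as follows. This is only an illustration: it assumes entry names end in a `_YYYY-MM` suffix, that the "collective" value is a plain sum of per-entry counts, and the counts themselves are made up. `base_name`, `collective_counts` and the `wikipedia_es_all_maxi_2021-01` entry are hypothetical and not taken from the script in this PR.

```python
import re
from collections import defaultdict

# Matches a trailing "_YYYY-MM" period suffix such as "_2020-12".
DATE_SUFFIX = re.compile(r"_\d{4}-\d{2}$")


def base_name(entry):
    """Strip the period suffix so all editions of a ZIM share one key."""
    return DATE_SUFFIX.sub("", entry)


def collective_counts(counts):
    """Sum per-entry counts under their shared base name (assumed aggregation)."""
    totals = defaultdict(int)
    for entry, count in counts.items():
        totals[base_name(entry)] += count
    return dict(totals)


# Made-up counts, for illustration only.
sample = {
    "wikipedia_en_all_maxi_2020-12": 10,
    "wikipedia_en_all_maxi_2020-06": 4,
    "wikipedia_en_all_maxi_2021-02": 7,
    "wikipedia_es_all_maxi_2021-01": 5,
}

totals = collective_counts(sample)
# Highest total first; position in this list plus one is the rank.
ranked = sorted(totals, key=totals.get, reverse=True)
print(ranked)
```

Grouping this way is why the output has fewer rows than the raw JSON has entries: several dated editions collapse into one row.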

rgaudin commented 2 years ago

how come the generated output only have 25 rows while the downloaded JSON appears to have 57 .zim.torrent entries in the first row

Because I am clubbing together entries like wikipedia_en_all_maxi_2020-12, wikipedia_en_all_maxi_2020-06, wikipedia_en_all_maxi_2021-02 and getting the collective score for the wikipedia_en_all_maxi

Yeah, that makes sense. So my question would be: how come we have so few results? The JSON from that request seems to provide only 100 results, while we have more than a thousand ZIM files. Is it capped? We obviously need to compute this for all ZIMs.

anshulxyz commented 2 years ago

@rgaudin if I wanted to submit changes to the script, how should I go about it? Do you want me to submit a PR, or something else (like TBD)?

I want to fix the issue raised in https://github.com/openzim/cms/pull/14#discussion_r724839900

rgaudin commented 2 years ago

@anshulxyz, if it's just about that one thing, we can live with it, and when the time comes to reuse this code, it can be fixed then. If you have additional contributions, open another PR.

Thanks again for your work.