openzim / cms

ZIM file Publishing Platform
https://cms.openzim.org
GNU General Public License v3.0

Add popularity computation script #14

Closed: anshulxyz closed this 2 years ago

anshulxyz commented 2 years ago

This is regarding issue #11

To calculate the score, I am using the formula:

y = (x – min) / (max – min) * 100

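Where x is the raw value for a ZIM, min and max are the smallest and largest values across all ZIMs, and y is the resulting score from 0 to 100.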
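A minimal Python sketch of this min-max formula, assuming x is some raw per-ZIM value and min and max are the extremes over all ZIMs (the thread does not say which quantity is used). The function name, the sample names and values, and the handling of the max == min case are all hypothetical and not taken from the script.

```python
def min_max_score(x, lo, hi):
    """Scale x from the range [lo, hi] onto 0..100."""
    if hi == lo:
        # Every value is identical, so the formula would divide by zero.
        # Returning 100.0 is an arbitrary choice made for this sketch.
        return 100.0
    return (x - lo) / (hi - lo) * 100


# Hypothetical raw values, for illustration only.
raw = {"zim_a": 40, "zim_b": 10, "zim_c": 25}
lo, hi = min(raw.values()), max(raw.values())
scores = {name: round(min_max_score(v, lo, hi), 2) for name, v in raw.items()}
print(scores)
```

Guarding the equal-extremes case matters whenever the input may contain a single entry or only identical values.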

My resulting output file is like this

output.csv

rank,score,zim
1,100.00,wikipedia_en_all_maxi
2,95.83,wikipedia_en_all_novid
3,91.67,wikipedia_es_all_maxi
4,87.50,wikipedia_zh_all_maxi
5,83.33,wikipedia_en_all_nopic
6,79.17,wikipedia_es_all_nopic
7,75.00,wiktionary_es_all_novid
8,70.83,wikipedia_ar_all_novid
9,66.67,wikipedia_zh_all_novid
...
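The scores in this sample are evenly spaced in steps of 100/24, which is what the formula yields if x is simply the position in a 25-row ranking (25 being the row count mentioned later in this thread). That reading is an assumption, not something the PR states. A short sketch that reproduces the score column under that assumption, with a hypothetical function name:

```python
def rank_scores(n):
    """Return formatted scores for ranks 1..n, using position as x (n >= 2)."""
    out = []
    for rank in range(1, n + 1):
        x = n - rank + 1             # best rank gets the largest x
        y = (x - 1) / (n - 1) * 100  # min = 1, max = n
        out.append(f"{y:.2f}")
    return out


scores = rank_scores(25)
print(scores[:9])
```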

How to run

  1. Install dependencies
     pip install -r requirements.txt

     # for testing
     pip install -r requirements-dev.txt
  2. Run the tests
     pytest
  3. Run the script
     python src/script.py
  4. To view the output, open the output.csv file.

rgaudin commented 2 years ago

@kelson42 I'm gonna need some directions to review this. Let me know when you have a few minutes to discuss it.

anshulxyz commented 2 years ago

Hi @rgaudin, I have updated the script as per the feedback.

anshulxyz commented 2 years ago

I had missed this review comment: https://github.com/openzim/cms/pull/14#discussion_r723402447

So I updated the commit and force-pushed.

anshulxyz commented 2 years ago

How come the generated output only has 25 rows while the downloaded JSON appears to have 57 .zim.torrent entries in the first row?

Because I am grouping together entries like

wikipedia_en_all_maxi_2020-12
wikipedia_en_all_maxi_2020-06
wikipedia_en_all_maxi_2021-02

and computing a collective score for wikipedia_en_all_maxi.
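One way this grouping could work is sketched below. It assumes every entry ends in a "_YYYY-MM" date suffix and that the collective value is a plain sum; the thread confirms neither. The function names, the counts, and the wikipedia_es entry are hypothetical.

```python
import re
from collections import defaultdict

# Trailing "_YYYY-MM" date suffix, as in wikipedia_en_all_maxi_2020-12.
DATE_SUFFIX = re.compile(r"_\d{4}-\d{2}$")


def base_name(entry):
    """Strip the date suffix so dated releases share one key."""
    return DATE_SUFFIX.sub("", entry)


def group_counts(counts):
    """Sum per-file values under their undated base name (assumed aggregation)."""
    totals = defaultdict(int)
    for entry, value in counts.items():
        totals[base_name(entry)] += value
    return dict(totals)


# Hypothetical per-file values, for illustration only.
counts = {
    "wikipedia_en_all_maxi_2020-12": 5,
    "wikipedia_en_all_maxi_2020-06": 3,
    "wikipedia_en_all_maxi_2021-02": 7,
    "wikipedia_es_all_maxi_2021-01": 4,
}
grouped = group_counts(counts)
print(grouped)
```

Entries that still carry a file extension such as .zim.torrent would need it removed before the suffix pattern can match.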

rgaudin commented 2 years ago

Yeah, that makes sense; so my question would thus be: how come we have so few results? The JSON from that request seems to only provide 100 results while we have more than a thousand ZIM files. Is it capped? We obviously need to compute this for all ZIMs.
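Whether the endpoint is capped or paginated is not established in this thread. If it does return results in pages, collecting all of them could look like the sketch below. Here fetch_page is a hypothetical callable standing in for the real request, and the offset and limit parameters are assumptions rather than a known API.

```python
def fetch_all(fetch_page, page_size=100):
    """Collect every result by requesting pages until one comes back short."""
    results = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            break  # nothing left to fetch
        results.extend(page)
        if len(page) < page_size:
            break  # a short page means this was the last one
        offset += len(page)
    return results


# Stand-in data source so the sketch runs without any network access.
DATA = [f"item_{i}" for i in range(250)]


def fake_fetch(offset, limit):
    return DATA[offset:offset + limit]


all_items = fetch_all(fake_fetch)
print(len(all_items))
```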

anshulxyz commented 2 years ago

@rgaudin if I wanted to submit changes to the script, how should I go about it? Do you want me to submit a PR, or something else (like TBD)?

I want to fix the issue raised in this review comment: https://github.com/openzim/cms/pull/14#discussion_r724839900

rgaudin commented 2 years ago

@anshulxyz, if it's just about that one thing, we can live with it, and when the time comes to reuse this code, it can be fixed then. If you have additional contributions, open another PR.

Thanks again for your work.