openzim / zimfarm

Farm operated by bots to grow and harvest new zim files
https://farm.openzim.org
GNU General Public License v3.0

Track popular requests #912

Open Popolechien opened 7 months ago

Popolechien commented 7 months ago

This was a soft requirement from the sponsor of the zimit 2.0 update: we should be able to identify recurring requests over a given period of time.

rgaudin commented 7 months ago

The difficulty here is defining what a recurring request is. Is the domain name enough? We see many requests for subreddits in youzim.it, for instance.

Popolechien commented 7 months ago

Top-level URL should be enough (so https://www.reddit.com/ rather than reddit.com/r/fubar in your example). There are also quite a few wikipedia requests, surprisingly enough.

benoit74 commented 7 months ago

Is it OK if we report only www.reddit.com in your example (instead of https://www.reddit.com)? It would be very easy to report this manually once in a while with a small SQL query based on recipe names.

benoit74 commented 7 months ago

And do you need to have this in a UI, or is it something that could be done manually by a developer?

I imagine we could have a prepared SQL query that builds the list of domain names over a given period; it would "consume" only a few minutes of a dev's time to run it upon an inquiry from someone.
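For illustration, here is a minimal Python sketch of the per-hostname count discussed here. It assumes the requested URLs have already been fetched as plain strings (how they are read from the database is not shown); `hostname_of` and `count_hostnames` are hypothetical names, not part of zimfarm.

```python
from collections import Counter
from urllib.parse import urlsplit


def hostname_of(url: str) -> str:
    """Return the lowercased hostname of a requested URL, or "" if none."""
    # urlsplit only fills in the host when a scheme or "//" is present,
    # so bare inputs such as "reddit.com/r/fubar" get a "//" prefix.
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    return (urlsplit(url).hostname or "").lower()


def count_hostnames(urls):
    """Count how many requests target each hostname (empty hosts skipped)."""
    return Counter(h for h in map(hostname_of, urls) if h)
```

Calling `count_hostnames(urls).most_common(20)` would then list the most requested hostnames. Note that this sketch keeps `www.reddit.com` and `reddit.com` as separate entries; merging them would be an extra normalization step.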

kelson42 commented 7 months ago

I would propose implementing a CSV export of tasks; this kind of sorting/filtering can then be done in third-party software.
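A rough Python sketch of what such an export could look like, using only the standard `csv` module. It assumes tasks are available as dictionaries; the function name `tasks_to_csv` and the column names in the usage note are hypothetical, not zimfarm's actual schema.

```python
import csv
import io


def tasks_to_csv(tasks, fieldnames):
    """Serialize an iterable of task dicts to CSV text.

    Keys not listed in `fieldnames` are ignored; missing keys become
    empty cells.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    for task in tasks:
        writer.writerow(task)
    return buffer.getvalue()
```

For example, `tasks_to_csv(tasks, ["recipe", "status"])` would produce a header row followed by one row per task, ready to be opened in a spreadsheet.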

benoit74 commented 7 months ago

Should we move this issue to zimfarm then? Or does it need to be available in the zimit UI? (I don't get who will use this feature.)

Popolechien commented 7 months ago

@benoit74 Yes, it is probably a good idea to move it to zimfarm, and yes to reddit.com instead of whatever longer and equally informative alternative.

benoit74 commented 1 week ago

What has been discussed is a monthly CSV export of the previous month's tasks, probably running in GitHub CI (of kiwix/operations?), based on a (Python) script and publishing GitHub CI artifacts.
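One small piece such a script would need is the date range of "last month". A minimal sketch, assuming the job runs at some point during the following month; `previous_month_bounds` is a hypothetical helper name.

```python
from datetime import date


def previous_month_bounds(today: date):
    """Return (start, end) of the month before `today`.

    `start` is the first day of the previous month and `end` is the
    first day of the current month, so `end` is exclusive.
    """
    end = today.replace(day=1)
    if end.month == 1:
        # January: the previous month is December of the previous year.
        start = end.replace(year=end.year - 1, month=12)
    else:
        start = end.replace(month=end.month - 1)
    return start, end
```

Tasks would then be selected with a `start <= timestamp < end` filter, which avoids having to know how many days the month has.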

Popolechien commented 1 week ago

LGTM. What are GitHub artifacts?

rgaudin commented 1 week ago

In CI/CD runs, you can upload files directly to GitHub; they expire automatically. See the bottom of this one for an example.

benoit74 commented 5 days ago

I now think that artifacts are in fact going to be painful: it is hard to see all artifacts from a given CI, so we cannot easily send them to our client.

Shall we simply publish it to https://download.openzim.org/? And run it in our k8s cluster? (It is a monthly job, and we do not expect it to take a lot of resources.)

benoit74 commented 1 day ago

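@kelson42 @Popolechien I now need your input on this issue:

  • where do we publish these stats files from farm.zimit.kiwix.org? (GitHub artifacts are not OK for end users from my PoV)
  • should we also publish the ones from farm.openzim.org?
  • how should we name each file (is farm.zimit.kiwix.org_2024-08.csv OK?)

We should probably discuss it live for simplicity.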

kelson42 commented 1 day ago

  • where do we publish these stats files from farm.zimit.kiwix.org? (GitHub artifacts are not OK for end users from my PoV)

I would put them in the drive

  • should we also publish the ones from farm.openzim.org?

Not yet

  • how should we name each file (is farm.zimit.kiwix.org_2024-08.csv OK?)

farm.zimit.kiwix.org_tasks_2024-08.csv
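The naming scheme agreed above can be sketched as a small Python helper. `export_filename` is a hypothetical name; only the resulting pattern comes from this thread.

```python
def export_filename(farm_host: str, year: int, month: int) -> str:
    """Build the monthly export filename: <farm host>_tasks_<YYYY>-<MM>.csv"""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be between 1 and 12, got {month}")
    # Zero-pad the month so filenames sort chronologically.
    return f"{farm_host}_tasks_{year:04d}-{month:02d}.csv"
```

With these arguments, `export_filename("farm.zimit.kiwix.org", 2024, 8)` gives the filename proposed above.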