Open Popolechien opened 7 months ago
The difficulty here is to define what a recurring request is. Is the domain name enough? We see many requests for subreddits in youzim.it, for instance.
Top-level URL should be enough (so https://www.reddit.com/ rather than reddit.com/r/fubar in your example). There are also quite a few wikipedia requests, surprisingly enough.
Is it ok if we report only www.reddit.com in your example (instead of https://www.reddit.com)? It would be very easy to report this manually once in a while with a small SQL query based on recipe names.
And do you need to have this in a UI, or is it something which could be done manually by a developer?
I imagine we could have an already prepared SQL query which would build the list of domain names over a given period, and it would "consume" only a few minutes for a dev to run it upon an inquiry from someone.
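To make the idea of a prepared query concrete, here is a minimal sketch using an in-memory SQLite database. The `tasks` table, its `hostname` and `requested_on` columns, and the sample rows are all assumptions made up for illustration; the real Zimfarm schema is different and the query would need adapting.

```python
import sqlite3

# Hypothetical schema and sample data, for illustration only;
# the real Zimfarm tables differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (hostname TEXT, requested_on TEXT)")
conn.executemany(
    "INSERT INTO tasks VALUES (?, ?)",
    [
        ("www.reddit.com", "2024-08-03"),
        ("www.reddit.com", "2024-08-10"),
        ("www.reddit.com", "2024-08-21"),
        ("en.wikipedia.org", "2024-08-05"),
        ("en.wikipedia.org", "2024-08-30"),
        ("example.org", "2024-08-12"),
        ("www.reddit.com", "2024-09-01"),
    ],
)

# Count requests per hostname in the half-open period [start, end)
# and keep only hostnames requested more than once.
QUERY = """
SELECT hostname, COUNT(*) AS requests
FROM tasks
WHERE requested_on >= ? AND requested_on < ?
GROUP BY hostname
HAVING COUNT(*) > 1
ORDER BY requests DESC, hostname
"""


def recurring_hostnames(conn, start, end):
    """Return (hostname, count) rows for hostnames seen more than once."""
    return conn.execute(QUERY, (start, end)).fetchall()


print(recurring_hostnames(conn, "2024-08-01", "2024-09-01"))
```

ISO-formatted dates compare correctly as strings, which is why the period filter works on a TEXT column here.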
I would propose to implement an export of tasks in CSV and then this kind of sorting/filtering can be done in a third party software.
Should we move this issue to zimfarm then? Or does it need to be available in zimit UI? (I don't get who will use this feature)
@benoit74 Yes probably a good idea to move it to zimfarm, and yes for reddit.com instead of whatever longer and equally informative alternative.
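For illustration, reducing each requested URL to a short site name and counting repeats could look roughly like the sketch below. The helper names `site_key` and `recurring_sites` are hypothetical, and stripping a leading `www.` is an assumption about what "reddit.com" means here, not something decided in this thread.

```python
from collections import Counter
from urllib.parse import urlsplit


def site_key(url: str) -> str:
    """Reduce a requested URL to its hostname, without a leading 'www.'."""
    # urlsplit only detects a hostname when the URL has a '//' prefix,
    # so bare inputs such as 'reddit.com/r/fubar' get one added.
    if "//" not in url:
        url = "//" + url
    host = urlsplit(url).hostname or ""
    # Assumption: 'www.reddit.com' and 'reddit.com' count as the same site.
    return host[4:] if host.startswith("www.") else host


def recurring_sites(urls, min_count=2):
    """Return (site, count) pairs for sites requested at least min_count times."""
    counts = Counter(site_key(u) for u in urls)
    return [(site, n) for site, n in counts.most_common() if n >= min_count]


requests = [
    "https://www.reddit.com/r/fubar",
    "reddit.com/r/python",
    "https://www.reddit.com/",
    "https://en.wikipedia.org/wiki/Kiwix",
    "https://en.wikipedia.org/wiki/ZIM_(file_format)",
    "https://example.org/",
]
print(recurring_sites(requests))
```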
What has been discussed is a monthly CSV export of tasks from the previous month, probably running in GitHub CI (of kiwix/operations?), based on a (Python) script and publishing GitHub CI artifacts.
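As a rough sketch of what such a script could do, the snippet below computes the bounds of the previous month and writes the matching tasks as CSV. The task fields (`requested_on`, `url`) and the function names are assumptions for illustration; fetching tasks from the Zimfarm and publishing the file are left out.

```python
import csv
import io
from datetime import date, timedelta


def previous_month_bounds(today: date):
    """Return (first day of previous month, first day of current month)."""
    first_of_this_month = today.replace(day=1)
    last_of_previous = first_of_this_month - timedelta(days=1)
    return last_of_previous.replace(day=1), first_of_this_month


def export_tasks_csv(tasks, today, out):
    """Write tasks requested during the previous month to `out` as CSV.

    `tasks` is an iterable of dicts with hypothetical keys
    'requested_on' (a date) and 'url'. Returns the number of rows written.
    """
    start, end = previous_month_bounds(today)
    writer = csv.DictWriter(out, fieldnames=["requested_on", "url"])
    writer.writeheader()
    written = 0
    for task in tasks:
        if start <= task["requested_on"] < end:
            writer.writerow(
                {
                    "requested_on": task["requested_on"].isoformat(),
                    "url": task["url"],
                }
            )
            written += 1
    return written


sample_tasks = [
    {"requested_on": date(2024, 7, 31), "url": "https://example.org/"},
    {"requested_on": date(2024, 8, 3), "url": "https://www.reddit.com/r/fubar"},
    {"requested_on": date(2024, 8, 30), "url": "https://en.wikipedia.org/"},
    {"requested_on": date(2024, 9, 1), "url": "https://example.org/"},
]
buffer = io.StringIO()
row_count = export_tasks_csv(sample_tasks, date(2024, 9, 5), buffer)
print(row_count)
print(buffer.getvalue())
```

Writing to a file-like object keeps the sketch independent of where the export ends up being published.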
LGTM. What are Github artifacts?
In CI/CD runs, you can upload files directly to GitHub; they expire automatically. See the bottom of this one for an example.
I think artifacts are in fact going to be painful: it is hard to see all artifacts from a given CI, so we cannot easily send them to our client.
Shall we simply publish it to https://download.openzim.org/? And run it in our k8s cluster? (It is a monthly job, and we do not expect it to take a lot of resources.)
@kelson42 @Popolechien I now need your input on this issue:
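- where do we publish these stats files from farm.zimit.kiwix.org? (Github artifacts is not OK for end-users from my PoV)
- should we also publish the ones from farm.openzim.org?
- how should we name each file (farm.zimit.kiwix.org_2024-08.csv is OK?)
We should probably discuss it live for simplicity.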
- where do we publish these stats files from farm.zimit.kiwix.org? (Github artifacts is not OK for end-users from my PoV)
I would put them in the drive
- should we also publish the ones from farm.openzim.org?
Not yet
- how should we name each file (farm.zimit.kiwix.org_2024-08.csv is OK?)
farm.zimit.kiwix.org_tasks_2024-08.csv
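A tiny sketch of building that file name from the farm hostname and the reported month; the helper name `stats_filename` is hypothetical.

```python
def stats_filename(farm_host: str, year: int, month: int) -> str:
    """Build '<farm host>_tasks_<YYYY>-<MM>.csv' for the reported month."""
    return f"{farm_host}_tasks_{year:04d}-{month:02d}.csv"


print(stats_filename("farm.zimit.kiwix.org", 2024, 8))
```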
This was a soft requirement from the sponsor of the zimit 2.0 update, but we should be able to identify recurring requests over a given period of time.