Popolechien opened 4 years ago
@kelson42 @rgaudin The files are hosted in a Commons category. Do we have a tool to scrape that?
Since it's just a collection of files, a mini-scraper retrieving the list of files and their metadata via the API and producing a nautilus collection JSON might be a good option.
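A minimal sketch of that mini-scraper idea (the category name, the metadata kept, and the collection JSON layout are assumptions for illustration, not the actual nautilus schema):

```python
# Sketch: list every file of a Commons category via the MediaWiki API
# and emit a nautilus-style collection JSON.
import json

import requests

API = "https://commons.wikimedia.org/w/api.php"
# Descriptive User-Agent, as required for Wikimedia services (contact is hypothetical).
HEADERS = {"User-Agent": "terra-x-mini-scraper/0.1 (contact@example.org)"}


def category_files(category):
    """Yield (title, url) for every file in the category, following API continuation."""
    params = {
        "action": "query",
        "format": "json",
        "generator": "categorymembers",
        "gcmtitle": category,
        "gcmtype": "file",
        "gcmlimit": "max",
        "prop": "imageinfo",
        "iiprop": "url",
    }
    while True:
        data = requests.get(API, params=params, headers=HEADERS, timeout=30).json()
        for page in data.get("query", {}).get("pages", {}).values():
            info = page.get("imageinfo", [{}])[0]
            if "url" in info:
                yield page["title"], info["url"]
        if "continue" not in data:
            break
        params.update(data["continue"])


# "Category:Videos of Terra X" is an assumed category name.
collection = [
    {"title": title.removeprefix("File:"), "files": [{"url": url}]}
    for title, url in category_files("Category:Videos of Terra X")
]
with open("collection.json", "w", encoding="utf-8") as fh:
    json.dump(collection, fh, ensure_ascii=False, indent=2)
```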
I have started a nautilus recipe at https://farm.openzim.org/recipes/Terra_x_de. I will update the status accordingly.
My attempt would be:
If it does not work, then this would be a bug
I don't see how nautilus could do the job without preparatory work (maybe this has been done).
The recipe name does not respect the naming convention.
@kelson42 Does WP1 work on Commons?
Hmmm, it should IMO. Sorry, on the road again, difficult for me to test right now.
> I don't see how nautilus could do the job without preparatory work (maybe this has been done).
It can't obviously. My comment mentioning nautilus clearly indicated it required a mini-scraper that would produce the nautilus-friendly data.
Many things are wrong with this recipe. The Archive config mentions a "URL to a ZIP archive containing all the documents". How can this Commons link be considered a ZIP archive?
The recipe failed… on the favicon because its URL is incorrect.
> Hmmm, it should IMO. Sorry, on the road again, difficult for me to test right now.
Commons is not in the list of Projects (neither simple, SPARQL, nor petscan).
I gave it a try with nautilus. Generation of JSON file for nautilus is mostly straightforward.
I moved the recipe to https://farm.openzim.org/recipes/commons.wikimedia.org_de_terra-x
Do we agree on this ZIM name?
Unfortunately, the recipe failed: https://farm.openzim.org/pipeline/5f4f0d93-41b1-41d5-880b-485491186e56/debug.
Since the files are hosted on upload.wikimedia.org, we must comply with the User-Agent policy at https://meta.wikimedia.org/wiki/User-Agent_policy.
I'll open an upstream bug. It is probably pretty easy to solve.
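For reference, a minimal sketch of what compliance looks like on the download side (the agent string, contact, and file URL are illustrative assumptions):

```python
import requests

# Per https://meta.wikimedia.org/wiki/User-Agent_policy: send a descriptive
# User-Agent string with a way to contact the operator. Values are illustrative.
headers = {"User-Agent": "nautilus-scraper/1.0 (https://farm.openzim.org; contact@example.org)"}

# Hypothetical file URL on upload.wikimedia.org, as returned by the API.
url = "https://upload.wikimedia.org/wikipedia/commons/a/a1/example.webm"
resp = requests.get(url, headers=headers, timeout=30)
resp.raise_for_status()
```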
@kelson42 Regarding mwoffliner usage: is mwoffliner capable of processing an Article List which is just a list of files, like https://commons.wikimedia.org/wiki/File:%22Ich_bin_ein_Berliner!%22_-_John_F._Kennedy_1963_in_Berlin.webm ?
This looks like a file and not an article, so I'm pretty sure mwoffliner is not capable of creating a ZIM out of it, but I would like you to confirm this, because it could still be an alternative (instead of generating a JSON for nautilus, it is way simpler to generate a CSV for mwoffliner).
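For illustration, such a list would presumably be just file page titles, one per line (assuming mwoffliner's plain-text one-title-per-line article-list format; the second entry is a made-up placeholder):

```
File:"Ich bin ein Berliner!" - John F. Kennedy 1963 in Berlin.webm
File:Some other Terra X video.webm
```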
@kelson42 It will, IMHO, but it won't give you the expected result (because it will render only the wikitext of the "File:" page).
For the record, the nautilus files generation script is at https://gist.github.com/benoit74/8b8c684822527b135d64fc5b1c7b6668 (very quick-and-dirty, and for now it only creates a ZIM of the first 10 videos, just to give it a try).
The problem I see with Nautilus is that we would need a script to generate a JSON every month or so (I don't know if the show is still running, or at what frequency). How perennial can we make this?
Once coded, the script is probably going to work unmodified for months or years. Running the script and uploading the JSON files takes less than 10 minutes, and I don't think we need to do it more than once a quarter. If this proves to take too much time, we can easily invest in a more perennial solution (typically, integrating the script into the scraper).