Popolechien opened 4 years ago
@kelson42 @rgaudin The files are hosted in a Commons category. Do we have a tool to scrape that?
Since it's just a collection of files, a mini-scraper retrieving the list of files and their metadata via the API and producing a nautilus collection JSON might be a good option.
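A minimal sketch of that mini-scraper idea (the category name, the metadata kept, and the collection JSON layout are assumptions for illustration, not the actual nautilus schema):

```python
# Sketch: list every file of a Commons category via the MediaWiki API
# and emit a nautilus-style collection JSON.
import json

import requests

API = "https://commons.wikimedia.org/w/api.php"
# Descriptive User-Agent, as required for Wikimedia services (contact is hypothetical).
HEADERS = {"User-Agent": "terra-x-mini-scraper/0.1 (contact@example.org)"}


def category_files(category):
    """Yield (title, url) for every file in the category, following API continuation."""
    params = {
        "action": "query",
        "format": "json",
        "generator": "categorymembers",
        "gcmtitle": category,
        "gcmtype": "file",
        "gcmlimit": "max",
        "prop": "imageinfo",
        "iiprop": "url",
    }
    while True:
        data = requests.get(API, params=params, headers=HEADERS, timeout=30).json()
        for page in data.get("query", {}).get("pages", {}).values():
            info = page.get("imageinfo", [{}])[0]
            if "url" in info:
                yield page["title"], info["url"]
        if "continue" not in data:
            break
        params.update(data["continue"])


# "Category:Videos of Terra X" is an assumed category name.
collection = [
    {"title": title.removeprefix("File:"), "files": [{"url": url}]}
    for title, url in category_files("Category:Videos of Terra X")
]
with open("collection.json", "w", encoding="utf-8") as fh:
    json.dump(collection, fh, ensure_ascii=False, indent=2)
```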
I have started a nautilus recipe at https://farm.openzim.org/recipes/Terra_x_de. I will update the status accordingly.
My attempt would be:
If it does not work, then this would be a bug
I don't see how nautilus could do the job without preparatory work (maybe this has been done).
The recipe name does not respect the naming convention.
@kelson42 Does WP1 work on Commons?
Hmmm, it should IMO. Sorry, on the road again, difficult for me to test right now.
> I don't see how nautilus could do the job without preparatory work (maybe this has been done).
It can't obviously. My comment mentioning nautilus clearly indicated it required a mini-scraper that would produce the nautilus-friendly data.
Many things are wrong with this recipe. The Archive config mentions a "URL to a ZIP archive containing all the documents". How can this Commons link be considered a ZIP archive?
The recipe failed… on the favicon because its URL is incorrect.
> Hmmm, it should IMO. Sorry, on the road again, difficult for me to test right now.
Commons is not in the list of Projects (neither simple, SPARQL, nor petscan).
I gave it a try with nautilus. Generation of JSON file for nautilus is mostly straightforward.
I moved the recipe to https://farm.openzim.org/recipes/commons.wikimedia.org_de_terra-x
Do we agree on this ZIM name?
Unfortunately, the recipe failed: https://farm.openzim.org/pipeline/5f4f0d93-41b1-41d5-880b-485491186e56/debug.
Since the files are hosted on upload.wikimedia.org, we must comply with the User-Agent policy at https://meta.wikimedia.org/wiki/User-Agent_policy.
I'll open an upstream bug. It is probably pretty easy to solve.
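For reference, a minimal sketch of what compliance looks like on the download side (the agent string, contact, and file URL are illustrative assumptions):

```python
import requests

# Per https://meta.wikimedia.org/wiki/User-Agent_policy: send a descriptive
# User-Agent string with a way to contact the operator. Values are illustrative.
headers = {"User-Agent": "nautilus-scraper/1.0 (https://farm.openzim.org; contact@example.org)"}

# Hypothetical file URL on upload.wikimedia.org, as returned by the API.
url = "https://upload.wikimedia.org/wikipedia/commons/a/a1/example.webm"
resp = requests.get(url, headers=headers, timeout=30)
resp.raise_for_status()
```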
@kelson42 Regarding mwoffliner usage: is mwoffliner capable of processing an Article List which is just a list of files, like https://commons.wikimedia.org/wiki/File:%22Ich_bin_ein_Berliner!%22_-_John_F._Kennedy_1963_in_Berlin.webm ?
This looks like a file and not an article, so I'm pretty sure mwoffliner is not capable of creating a ZIM out of it, but I would like you to confirm this, because it could still be an alternative (instead of generating a JSON for nautilus, it is way simpler to generate a CSV for mwoffliner).
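For illustration, such a list would presumably be just file page titles, one per line (assuming mwoffliner's plain-text one-title-per-line article-list format; the second entry is a made-up placeholder):

```
File:"Ich bin ein Berliner!" - John F. Kennedy 1963 in Berlin.webm
File:Some other Terra X video.webm
```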
@kelson42 It will, IMHO, but it won't give you the expected result (because it will render only the wikitext of the "File:" page).
For the record, the nautilus files generation script is at https://gist.github.com/benoit74/8b8c684822527b135d64fc5b1c7b6668 (very quick-and-dirty, and for now it only creates a ZIM of the first 10 videos, just to give it a try).
The problem I see with Nautilus is that we would need a script to generate a JSON every month or so (I don't know if the show is still running, or at what frequency). How perennial can we make this?
Once coded, the script is probably going to work unmodified for months or years. Running the script and uploading the JSON files takes less than 10 minutes, and I don't think we need to do it more than once a quarter. If this proves to take too much time, we can easily invest in a more perennial solution (typically, integrating the script into the scraper).