ipfs / distributed-wikipedia-mirror

Putting Wikipedia Snapshots on IPFS
https://github.com/ipfs/distributed-wikipedia-mirror#readme

Automate snapshot updates #58

Open lidel opened 5 years ago

lidel commented 5 years ago

This is a placeholder issue. It will be updated with more details once we gain a better understanding of what is needed here.

In the long run, we want to introduce CI/CD automation that does something along these lines:

Then a maintainer would review the PR and merge it. Updating the manifest in master would trigger an update of the DNSLink under <lang>.wikipedia-on-ipfs.org, propagating the change to the collaborative cluster, etc.
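As a rough illustration of the last step, the sketch below turns a manifest into DNSLink TXT records. The manifest shape (a plain mapping of language code to CID), the function name `dnslink_records`, and the placeholder CID are all assumptions made for this example; only the `_dnslink.<domain>` record name and the `dnslink=/ipfs/<cid>` value format come from the DNSLink convention.

```python
from typing import Dict


def dnslink_records(manifest: Dict[str, str],
                    base_domain: str = "wikipedia-on-ipfs.org") -> Dict[str, str]:
    """Map each language in a (hypothetical) manifest to its DNSLink TXT record.

    manifest: language code -> CID of the published snapshot (assumed shape).
    Returns: TXT record name -> TXT record value.
    """
    records = {}
    for lang, cid in sorted(manifest.items()):
        # DNSLink stores the content path in a TXT record on the _dnslink subdomain.
        records[f"_dnslink.{lang}.{base_domain}"] = f"dnslink=/ipfs/{cid}"
    return records


if __name__ == "__main__":
    # Placeholder CID, not a real snapshot.
    for name, value in dnslink_records({"en": "QmExampleEn"}).items():
        print(name, "TXT", value)
```

Actually pushing these records depends on the DNS provider, so a CI job would pass the resulting mapping to whatever API that provider offers.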

kelson42 commented 5 years ago

@lidel For the updates, we have started to advertise and use our OPDS feed (which works like an Atom feed). I would recommend using that in the future. See https://wiki.kiwix.org/wiki/OPDS (still in beta).

lidel commented 5 years ago

@kelson42 That sounds very useful! What would be a valid query to return the latest snapshot of the English or Turkish wiki?

I tried https://library.kiwix.org/catalog/search?lang=en&tag=wikipedia but it points at an old snapshot: wikipedia_en_wp1-0.8_orig_2010-12.zim

kelson42 commented 5 years ago

@lidel This feed delivers the most recent ZIM files... but a few of them simply have not been regenerated recently. Let me know if you find a recent file which is not in it.

lidel commented 5 years ago

@kelson42 I think things like https://github.com/kiwix/kiwix-tools/issues/231 and https://github.com/kiwix/kiwix-tools/issues/316 need to land before we can use the OPDS feed.

Right now, I am unable to come up with filters that return the latest English Wikipedia with pictures and without video (wikipedia_en_all_novid).

Looking at https://download.kiwix.org/zim/wikipedia/ directly sounds like the more robust solution at the moment.
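A minimal sketch of that approach, assuming the filenames in the directory listing follow the `<name>_<YYYY-MM>.zim` pattern seen in wikipedia_en_wp1-0.8_orig_2010-12.zim. The helper name `latest_snapshot` and the dated `novid` and `nopic` filenames in the example are made up for illustration, and fetching the listing itself is left out.

```python
import re
from typing import Iterable, Optional

# Assumed naming scheme: <prefix>_<YYYY-MM>.zim
ZIM_NAME = re.compile(r"^(?P<prefix>.+)_(?P<date>\d{4}-\d{2})\.zim$")


def latest_snapshot(filenames: Iterable[str], prefix: str) -> Optional[str]:
    """Return the newest filename whose prefix matches exactly, or None."""
    candidates = []
    for name in filenames:
        match = ZIM_NAME.match(name)
        if match and match.group("prefix") == prefix:
            # YYYY-MM strings sort chronologically when compared as text.
            candidates.append((match.group("date"), name))
    return max(candidates)[1] if candidates else None


if __name__ == "__main__":
    listing = [
        "wikipedia_en_wp1-0.8_orig_2010-12.zim",  # from the thread
        "wikipedia_en_all_novid_2019-10.zim",     # hypothetical
        "wikipedia_en_all_novid_2020-01.zim",     # hypothetical
        "wikipedia_en_all_nopic_2020-03.zim",     # hypothetical
    ]
    print(latest_snapshot(listing, "wikipedia_en_all_novid"))
```

Matching on the exact prefix rather than on language or tag filters avoids picking up other flavours of the same language.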

mkg20001 commented 5 years ago

Right now, I was unable to come up with filters to get the latest English Wikipedia with pictures and without video (wikipedia_en_all_novid)

In my solution I'm using a dynamic parser, which should solve that:

https://github.com/ipfs/distributed-wikipedia-mirror/pull/40/files#diff-31235a619c2d46324cca9e5429d49b3cR106-R132

kelson42 commented 5 years ago

@lidel Looks like you have identified pretty well what needs to be done. An alternative would be to rely on https://download.kiwix.org/library/library_zim.xml (it is not dynamic like the OPDS feed, but it is easier to parse than HTML and more robust).
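A hedged sketch of what parsing such a file could look like. It assumes each entry is a `<book>` element carrying `name`, `date` (ISO format) and `url` attributes; that layout, the inline sample and the example.org URLs are assumptions for illustration and should be checked against the real library_zim.xml.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical sample; the real file may use different attributes.
SAMPLE = """<library>
  <book name="wikipedia_en_all" date="2019-10-01" url="https://example.org/en_2019-10.zim"/>
  <book name="wikipedia_en_all" date="2020-01-01" url="https://example.org/en_2020-01.zim"/>
  <book name="wikipedia_tr_all" date="2019-12-01" url="https://example.org/tr_2019-12.zim"/>
</library>"""


def latest_book_url(xml_text: str, name: str) -> Optional[str]:
    """Return the url of the most recent <book> with the given name, or None."""
    root = ET.fromstring(xml_text)
    books = [b for b in root.iter("book")
             if b.get("name") == name and b.get("date")]
    if not books:
        return None
    # ISO dates (YYYY-MM-DD) compare correctly as plain strings.
    return max(books, key=lambda b: b.get("date")).get("url")


if __name__ == "__main__":
    print(latest_book_url(SAMPLE, "wikipedia_en_all"))
```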

alzinging commented 1 year ago

@kelson42 That sounds very useful! What would be a valid query to return the latest snapshot of the English or Turkish wiki?

I tried https://library.kiwix.org/catalog/search?lang=en&tag=wikipedia but it points at an old snapshot: wikipedia_en_wp1-0.8_orig_2010-12.zim

We need to be working off MWDumper.pl and the XML bz2 dataset from Wikipedia ... I will do an export to static HTML and collect the required code again; it's "known working".

I'd like to see more functionality here: we need "search and editing". Afaik there is not yet a good marriage of git or wiki and IPFS, and it should be core to ... us.