Closed lidel closed 3 years ago
@lidel Thank you for explaining in detail the problems we face here. We will do our best to fix the 4 tickets you have reported before the end of next January.
I have created two tickets to get binaries for:
Motivation
Creating a new snapshot requires unpacking data from a ZIM archive.
The legacy process relied on a customized extract_zim tool, which unfortunately is no longer able to unpack the latest snapshots (https://github.com/ipfs/distributed-wikipedia-mirror/issues/60#issuecomment-546905445).
Good news: we now have upstream openzim/zim-tools, which not only unpacks archives without a problem, but also removes the maintenance burden from the mirror project.
Prerequisites
@kelson42 I took a look at the output of zimdump v1.0.5 and believe we could switch to this tool once the issues below are addressed:
[x] Filenames should match article URLs (https://github.com/openzim/zim-tools/issues/24) (Replace spaces with underscores and add a .html suffix, so we can load them via HTTP gateway as-is) Example: https://en.wikipedia-on-ipfs.org/wiki/Vincent_van_Gogh.html
[x] Unescape paths before creating assets (https://github.com/openzim/zim-tools/issues/68)
[x] Redirects via HTML file with <meta http-equiv="refresh" (https://github.com/openzim/zim-tools/issues/23#issuecomment-493767980)
[x] Performance (https://github.com/openzim/zim-tools/issues/69)
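To make the first three prerequisites concrete, here is a minimal Python sketch of the behavior being asked for. It is not how zimdump implements any of this; the function names are hypothetical, and it assumes that "unescape" means percent-decoding and that a zero-second meta refresh is an acceptable redirect.

```python
import html
from urllib.parse import unquote


def article_filename(title: str) -> str:
    """Map an article title to the file name a gateway would serve as-is."""
    # Spaces become underscores and a .html suffix is appended.
    return title.replace(" ", "_") + ".html"


def unescape_path(path: str) -> str:
    """Percent-decode a path before it is used as a file name on disk.

    Assumption: the escaped paths are percent-encoded UTF-8.
    """
    return unquote(path)


def redirect_page(target: str) -> str:
    """Build a small HTML file that redirects to `target` via meta refresh."""
    # Escape the target so it cannot break out of the attribute value.
    safe = html.escape(target, quote=True)
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">'
        f'<meta http-equiv="refresh" content="0;url={safe}">'
        f'</head><body><a href="{safe}">{safe}</a></body></html>\n'
    )
```

For example, `article_filename("Vincent van Gogh")` yields `Vincent_van_Gogh.html`, which matches the example URL in the list above.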
Nice-to-haves
Not blockers, but things to consider in the future:
[ ] Prebuilt binaries for platforms other than 64-bit Linux. While folks on Windows and macOS could run this in a VM, it would be really nice if the pipeline for building a new snapshot worked on all three platforms.
[ ] Ability to skip processing of Xapian index (Not a hard blocker, but perhaps could speed up the build even further, as we don't use it atm?)