openzim / zim-tools

Various ZIM command line tools
https://download.openzim.org/release/zim-tools/
GNU General Public License v3.0

Question: what is the proper way to merge two zims? #238

Closed: tim-moody closed this issue 3 years ago

tim-moody commented 3 years ago

Given zim1 and zim2, where zim2 contains articles that are not in zim1 as well as articles changed from zim1, I ran the following (using version 2.1.0):

zimdiff zim1 zim2 zim3

which yielded

Assertion failed at ../../SOURCE/libzim_release/src/writer/cluster.cpp:266
 data.value.empty()[1] == false[0]
[0x41ef36]
[0x43d7a5]
[0x43d8a7]
[0x43df68]
[0x44060b]
[0x81f555]
[0x98f219]
terminate called after throwing an instance of 'std::runtime_error'
  what():
Assertion failed at ../../SOURCE/libzim_release/src/writer/cluster.cpp:266
 data.value.empty()[1] == false[0]
Aborted (core dumped)

I was expecting zim3 to be a file suitable for use with zimpatch.

data-man commented 3 years ago

@tim-moody

> using version 2.1.0

Try the git version.

tim-moody commented 3 years ago

Using 2.2.0, I got remarkably similar results:

Assertion failed at ../../SOURCE/libzim_release/src/writer/cluster.cpp:266
 data.value.empty()[1] == false[0]
[0x41ef36]
[0x43d7a5]
[0x43d8a7]
[0x43df68]
[0x44060b]
[0x81f555]
[0x98f219]
terminate called after throwing an instance of 'std::runtime_error'
  what():
Assertion failed at ../../SOURCE/libzim_release/src/writer/cluster.cpp:266
 data.value.empty()[1] == false[0]
Aborted (core dumped)

Built with:

#!/bin/bash -x
# download zim-tools, libzim, compile tools, and place in $PATH

PREFIX=/opt/iiab
cd $PREFIX
if [ ! -d "$PREFIX/zim-tools" ];then
   git clone https://github.com/openzim/zim-tools
fi
if [ ! -d "$PREFIX/libzim" ];then
   git clone https://github.com/openzim/libzim
fi

apt install -y libzstd-dev
apt install -y libdocopt-dev
apt install -y libgumbo-dev
apt install -y libmagic-dev
apt install -y liblzma-dev
apt install -y libxapian-dev
apt install -y libicu-dev
apt install -y docopt-dev
apt install -y ninja
apt install -y meson
apt install -y cmake
apt install -y pkgconf

cd $PREFIX/libzim
meson . build
ninja -C build
if [ $? -ne 0 ];then
   echo Build of libzim failed. Quitting . . .
   exit 1
fi
ninja -C build install
ldconfig

cd $PREFIX/zim-tools
meson . build
ninja -C build
if [ $? -ne 0 ];then
   echo Build of zim-tools failed. Quitting . . .
   exit 1
fi

rsync -a $PREFIX/zim-tools/build/src/zim* /usr/local/sbin/

data-man commented 3 years ago

@tim-moody

> Assertion failed at ../../SOURCE/libzim_release/src/writer/cluster.cpp:266

This assertion isn't from libzim-git. It's from version 6.3.0.

tim-moody commented 3 years ago

I cloned it with git clone https://github.com/openzim/libzim. Is that what you meant by libzim-git?

It's true the Changelog has only 6.3.0. The libzim.so produced is /usr/local/lib/x86_64-linux-gnu/libzim.so.7.0.0.

Please tell me how to proceed.

veloman-yunkan commented 3 years ago

Why don't you just build using kiwix-build?

kiwix-build --target-platform native_static zim-tools

It will download and build all dependencies on its own.

data-man commented 3 years ago

@tim-moody

> what you meant by libzim-git?

The master branch.

> The libzim.so produced is /usr/local/lib/x86_64-linux-gnu/libzim.so.7.0.0.

That's correct. Are you sure you don't have libzim in /usr/lib?

tim-moody commented 3 years ago

Yes. I found /usr/lib/x86_64-linux-gnu/libzim.so.6 -> libzim.so.6.2.2 and removed it, though it is my impression that zimdiff links explicitly with libzim.so.7. I ran again and got

Assertion failed at ../../SOURCE/libzim_release/src/writer/cluster.cpp:266
 data.value.empty()[1] == false[0]
[0x41ef36]
[0x43d7a5]
[0x43d8a7]
[0x43df68]
[0x44060b]
[0x81f555]
[0x98f219]
terminate called after throwing an instance of 'std::runtime_error'
  what():
Assertion failed at ../../SOURCE/libzim_release/src/writer/cluster.cpp:266
 data.value.empty()[1] == false[0]
Aborted (core dumped)

data-man commented 3 years ago

Oh, again.

> cluster.cpp:266

That's the old source.

data-man commented 3 years ago

Are you sure you don't have zimdiff in /usr/bin?

tim-moody commented 3 years ago

That was it. It wasn't the .so but zimdiff itself. 'which' showed the new one, but executing the command ran the old one.

I have now run the new one and it went to completion. Thanks.
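For anyone hitting the same symptom, here is a minimal sketch of a helper that lists every copy of a command along $PATH, so a stale build such as the old zimdiff in /usr/bin shows up next to the new one. The function name list_path_copies is hypothetical, not part of zim-tools.

```shell
# Print every executable file named "$1" found on $PATH, in lookup order.
# The first line printed is the copy a fresh shell would run.
# (list_path_copies is a hypothetical helper name, not part of zim-tools.)
list_path_copies() {
    name=$1
    old_ifs=$IFS
    IFS=:
    for dir in $PATH; do
        # An empty PATH entry means the current directory.
        candidate="${dir:-.}/$name"
        if [ -f "$candidate" ] && [ -x "$candidate" ]; then
            printf '%s\n' "$candidate"
        fi
    done
    IFS=$old_ifs
    return 0
}
```

Calling list_path_copies zimdiff prints one path per line; in bash, the built-in type -a zimdiff gives similar information. One possible explanation for 'which' and the shell disagreeing (an assumption, not confirmed in this thread) is that bash remembers where it first found a command; hash -r clears that cache.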

tim-moody commented 3 years ago

So back to my original question, is the process to run

zimdiff zim1 zim2 zim3

and then

zimpatch zim1 zim3 zim4

making zim4 = zim1 plus all changes and additions of zim2?

data-man commented 3 years ago

Yes. And check zimpatch in /usr/bin, please. :)

kelson42 commented 3 years ago

@tim-moody Why do you believe that zimdiff/zimpatch are tools for merging ZIM files? They are not. As for the rest, your bug report seems legitimate. These tools are almost never used, but I would be happy to consider fixing it if you share your ZIM files.

tim-moody commented 3 years ago

@kelson42 The zim files I wish to merge are those produced by

https://farm.openzim.org/recipes/wikipedia_en_medicine

and

https://farm.openzim.org/recipes/mdwiki

But there are others that a user might wish to merge such as

http://download.kiwix.org/zim/wikipedia/wikipedia_en_chemistry_maxi_2021-04.zim

and

http://download.kiwix.org/zim/wikipedia/wikipedia_en_physics_maxi_2021-03.zim

(or other combinations such as baseball, basketball, football)

Their usage texts led me to believe zimdiff/zimpatch might do this, which is why I asked the question. But all I really wanted to know was what is the proper way to merge two zims.

kelson42 commented 3 years ago

This tool does not exist, and even if it did (it would be relatively easy to build), the result would probably not be what you expect, because none of the HTML articles would be updated to benefit fully from the merge.

tim-moody commented 3 years ago

What I would expect is that the result would be a 3rd zim that would contain all articles in one or both of the source zims and that for articles that are in both, the one from one of the zims, in my case mdwiki, would take precedence and the same article in the other would be ignored. This is exactly what mdwiki is supposed to do through mirroring, but using a merge would be a more efficient means of producing the same result.
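The precedence rule described above (union of both sources, with one source winning on conflicts) can be sketched on plain directories. This is only an illustration of the expected semantics: it does not read or write ZIM files, the function name merge_trees is hypothetical, and it does nothing about the article links that kelson42 says would need updating.

```shell
# Merge two directory trees into OUT: every file from either tree is kept,
# and where both trees have the same path, the copy from PREFERRED wins.
# (merge_trees is a hypothetical name; this works on directories, not ZIM files.)
# Usage: merge_trees PREFERRED OTHER OUT
merge_trees() {
    preferred=$1
    other=$2
    out=$3
    mkdir -p "$out" || return 1
    # Copy the lower-priority tree first...
    cp -R "$other"/. "$out"/ || return 1
    # ...then overwrite with the preferred tree.
    cp -R "$preferred"/. "$out"/ || return 1
}
```

The order of the two copies is the whole design: whichever tree is copied last takes precedence.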

mgautierfr commented 3 years ago

> So back to my original question, is the process to run zimdiff zim1 zim2 zim3 and then zimpatch zim1 zim3 zim4 making zim4 = zim1 plus all changes and additions of zim2?

No, the process is:

zimdiff base_zim target_zim diff_zim (generates diff_zim)

then

zimpatch base_zim diff_zim final_zim (generates final_zim)

making final_zim == target_zim.

(These are the same semantics as diff/patch.)
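The two steps can be wrapped in a small shell function. This is a sketch that assumes zimdiff and zimpatch are on $PATH and take their arguments in the order given in this comment; the wrapper name zim_roundtrip is hypothetical.

```shell
# Run zimdiff followed by zimpatch, in the argument order described above.
# (zim_roundtrip is a hypothetical wrapper name, not part of zim-tools.)
# Usage: zim_roundtrip BASE_ZIM TARGET_ZIM DIFF_ZIM FINAL_ZIM
zim_roundtrip() {
    if [ "$#" -ne 4 ]; then
        echo "usage: zim_roundtrip BASE_ZIM TARGET_ZIM DIFF_ZIM FINAL_ZIM" >&2
        return 2
    fi
    base_zim=$1
    target_zim=$2
    diff_zim=$3
    final_zim=$4
    # Step 1: record what changed between base and target.
    zimdiff "$base_zim" "$target_zim" "$diff_zim" || return 1
    # Step 2: apply that diff to base to rebuild the target's content.
    zimpatch "$base_zim" "$diff_zim" "$final_zim" || return 1
}
```

For example, zim_roundtrip zim1 zim2 zim3 zim4 would leave zim4 with the content of zim2, which is not a merge of the two inputs.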

tim-moody commented 3 years ago

@mgautierfr thanks for clarifying. I think I now understand that zimdiff/patch works like the equivalent for files or repos such that if I have base I only need to obtain diff in order to generate target.

mgautierfr commented 3 years ago

Yes. But please be careful with those tools. They were created a long time ago (for a GSoC, I think) and they have never been tested properly. And one thing we know for sure is that final_zim is not totally equal to target_zim. They should contain the same content, but they are NOT binary equal.

data-man commented 3 years ago

> This tool does not exist and even if it would (relatively easy to build)

A new tool, zimjoin?

mgautierfr commented 3 years ago

I would discuss the use case before creating another tool. That will keep us from ending up with more unused tools like zimdiff/zimpatch.

ZIM files are not so easy to merge, and merging does not make much sense because ZIM files are by definition independent (no article links from one ZIM file to another).

tim-moody commented 3 years ago

The zim files I wish to merge are those produced by

https://farm.openzim.org/recipes/wikipedia_en_medicine

and

https://farm.openzim.org/recipes/mdwiki

I guess we'll wait to see if the problem can be solved at source (mdwiki.org) and if not then who is willing to do something about it.