kiwix / kiwix-tools

Command line Kiwix tools: kiwix-serve, kiwix-manage, ...
https://download.kiwix.org/release/kiwix-tools/
GNU General Public License v3.0

docker kiwix-serve crashing #579

Closed monstermaker closed 1 year ago

monstermaker commented 1 year ago

I have been running the stackoverflow zim file and found that the server crashes intermittently. Initially, it appeared to crash whenever the search was used with the keywords "add" and "delete", but then it stopped crashing on these and now just crashes sometimes. I cannot see the exit message: on crashing, it appears to list the contents of the /data folder, which shows up as a very long list in my console window, and I cannot scroll back far enough to see any messages before the list of /data/.npm and /data/.cache. So I cannot see what may be causing it.

We are running on Ubuntu 18.04 using the latest docker image.

kelson42 commented 1 year ago

Sounds really similar to #573. It seems there is some kind of instability, but so far it is pretty unclear where.

monstermaker commented 1 year ago

After running docker logs on the instance with less, I have found that there is no error output. I get just the standard output telling me the IP and port etc. and that it is running, and then immediately after that the debug output of "the content of /data is" followed by hundreds of lines of the content as described above. Additionally, docker stats on the instance shows extensive use of resources when using the search tool, with long delays before the autosuggest comes up and sometimes CPU use of over 100%. After a large spike in use, it crashes. I believe that perhaps it is the larger packages that are causing the issue. To try this out I have now loaded only the Wiki100 zim file, which is one of the smallest, and have 14 users continually searching to see if it crashes. I will run this for a few hours and post the results.

monstermaker commented 1 year ago

OK, so I ran this for 4 hours with up to 14 users trying it, sometimes together and sometimes separately, with no issues. The server stayed up, and while I monitored the docker container the CPU and memory usage were quite low, although higher in the stats than I would expect (docker is still fairly new to me). Everything appears stable on a small zim file. I tried the stackoverflow file again and, sure enough, it still crashed. I monitored the container in docker stats and it appears to crash when it shows a regular CPU usage of over 100%. I am not sure how you can use over 100% CPU, but that is what is said. As I understand it, the CPU usage stated should be how much host CPU the container uses, so I was confused about it being over 100%. I have monitored the host and its CPU usage has not gone over 11% at its peak, so I am very confused about this. I think it is crashing due to overusing its resources, but it is not crashing the host machine or even using excess resources there. Another note is that the host machine is in fact a VM running on VMWare. It does run other systems such as GitLab and Moodle at the same time.

I hope this may be giving some clues as to the issue. I never managed to get kiwix-serve working outside a docker container, so cannot see how it performs there.

kelson42 commented 1 year ago

@veloman-yunkan We should really try to reproduce the error with the SO specific file and get the core dump. From there it should be easier to diagnose the problem... hopefully.

veloman-yunkan commented 1 year ago

@kelson42 I will try to reproduce the crash

veloman-yunkan commented 1 year ago

Crash was reproduced using the latest (3.3.0-1) docker image of kiwix-serve and http://download.kiwix.org/zim/stack_exchange/stackoverflow.com_en_all_2022-05.zim. Will debug it.

mgautierfr commented 1 year ago

> I am not sure how you can use over 100% CPU, but that is what is said

CPU percentage is relative to one core. If you have 4 cores, the maximum CPU usage is 400%. So a percentage above 100% just means that we use more than one core (and with multithreading, that is easy).
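As a quick illustration of that arithmetic, here is a minimal sketch. The function names are made up for this example and are not part of docker or kiwix-tools; it only assumes that 100% stands for one fully used core.

```cpp
// Hypothetical helpers, assuming 100% corresponds to one fully used core.

// Upper bound of the reported CPU percentage on a machine with `cores` cores.
unsigned max_cpu_percent(unsigned cores) {
    return cores * 100u;
}

// Number of cores a reported percentage corresponds to, e.g. 250% means 2.5 cores.
double cores_in_use(double cpu_percent) {
    return cpu_percent / 100.0;
}
```

With 4 cores, `max_cpu_percent(4)` gives 400, and a reading of 250% corresponds to `cores_in_use(250.0)`, i.e. two and a half cores kept busy.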

monstermaker commented 1 year ago

thanks @mgautierfr that makes sense.

veloman-yunkan commented 1 year ago

After kiwix-serve is started, as few as two properly timed requests to the /suggest endpoint can result in a segmentation fault.

veloman-yunkan commented 1 year ago

On my machine, with a hot filesystem, the following script crashes kiwix-serve with quite high probability:

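#!/usr/bin/env bash

# Start the server in the background and give it a moment to come up.
./kiwix-serve --verbose -p 8080 stackoverflow.com_en_all_2022-05.zim &
sleep 1
# Fire two overlapping /suggest requests for the same book, 0.2 s apart.
(
  curl 'http://localhost:8080/suggest?content=stackoverflow.com_en_all_2022-05&userlang=en&term=c' &
  sleep 0.2; curl 'http://localhost:8080/suggest?content=stackoverflow.com_en_all_2022-05&userlang=en&term=co' &
)
# Block until the server process exits (for example, by crashing).
wait

veloman-yunkan commented 1 year ago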

Such a crash scenario should greatly facilitate debugging.

veloman-yunkan commented 1 year ago

These crashes caused by concurrent suggestion requests on the same book are most likely due to the combination of:

  1. the class zim::SuggestionSearcher and friends (zim::SuggestionSearch, zim::SuggestionDataBase, etc) not being thread safe, and
  2. caching of the searcher objects introduced by kiwix/libkiwix#620

~Therefore concurrent /search requests on the same book should be subject to a similar bug~. /suggest is simply more vulnerable to it because of the usage pattern. While /search requests are not temporally clustered in a particular way, /suggest requests (for the same book) tend to follow in bursts as the user types in the search box. On large ZIM files, where fulfilling a /suggest request may take quite long, two or more sequential requests may be performed concurrently using the same zim::SuggestionSearcher object, violating Xapian's requirements on concurrent access:

If you really want to access the same Xapian object from multiple threads, then you need to ensure that it won’t ever be accessed concurrently (if you don’t ensure this bad things are likely to happen - for example crashes or even data corruption). One way to prevent concurrent access is to require that a thread gets an exclusive lock on a mutex while the access is made.
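The mutex approach described in that quote can be sketched as follows. This is only an illustration of the pattern, not the libzim or libkiwix code: `UnsafeSearcher` and `LockedSearcher` are hypothetical stand-ins, and the real zim::SuggestionSearcher API is not used here.

```cpp
#include <mutex>
#include <string>
#include <vector>

// Hypothetical stand-in for a searcher that is NOT safe for concurrent use.
class UnsafeSearcher {
public:
    std::vector<std::string> suggest(const std::string& term) {
        ++calls_;  // unsynchronized state: racy if two threads call at once
        return {term + "-1", term + "-2"};
    }
    int calls() const { return calls_; }
private:
    int calls_ = 0;
};

// Wrapper that serializes every access to the cached searcher through one mutex.
class LockedSearcher {
public:
    std::vector<std::string> suggest(const std::string& term) {
        std::lock_guard<std::mutex> guard(mutex_);  // exclusive lock for the whole call
        return searcher_.suggest(term);
    }
    int calls() {
        std::lock_guard<std::mutex> guard(mutex_);
        return searcher_.calls();
    }
private:
    std::mutex mutex_;
    UnsafeSearcher searcher_;
};
```

The trade-off of this design is that overlapping requests on the same book are processed one after another, so a slow suggestion on a large ZIM file delays the next one instead of racing with it.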

kelson42 commented 1 year ago

Top prio to fix obviously, and no new release of libzim/libkiwix should be done before fixing. I would appreciate it if such scenarios were introduced in automated tests too.
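One possible shape for such an automated scenario is a small helper that issues the same request from several threads at once. The sketch below is an assumption about how a test could be structured, not existing libkiwix test code; `run_concurrently` and its `request` callback are hypothetical names, with `request` standing in for one /suggest round trip.

```cpp
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// Calls `request` from `n_threads` threads simultaneously, `calls_per_thread`
// times each, and returns how many calls completed.
int run_concurrently(int n_threads, int calls_per_thread,
                     const std::function<void()>& request) {
    std::atomic<int> completed{0};
    std::vector<std::thread> workers;
    workers.reserve(n_threads);
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < calls_per_thread; ++i) {
                request();    // in a real test: one suggestion query on the same book
                ++completed;  // atomic increment, safe across threads
            }
        });
    }
    for (auto& w : workers) w.join();  // wait for every worker before returning
    return completed.load();
}
```

A test would pass a callback that queries the same book and then check that all calls completed. Running such a test under a race detector (for example a build with `-fsanitize=thread`) would likely make the failure more deterministic than waiting for a segmentation fault.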

mgautierfr commented 1 year ago

The search should be protected against race conditions: https://github.com/kiwix/libkiwix/blob/master/include/library.h#L145-L158 and https://github.com/kiwix/libkiwix/blob/master/src/server/internalServer.cpp#L781

I don't know why this is not the same case for suggestion.

veloman-yunkan commented 1 year ago

> The search should be protected against race conditions: https://github.com/kiwix/libkiwix/blob/master/include/library.h#L145-L158 and https://github.com/kiwix/libkiwix/blob/master/src/server/internalServer.cpp#L781
>
> I don't know why this is not the same case for suggestion.

Protection against race conditions in the /search endpoint was introduced later on, in kiwix/libkiwix#729, in the context of implementing a significant enhancement to search. The fact that a similar bug existed in a similar piece of code went unnoticed.

mgautierfr commented 1 year ago

Yes, but I saw the issue for search (and so implemented the protection) and totally missed the case for suggestion.

monstermaker commented 1 year ago

Will the docker image be updated with this fix?

kelson42 commented 1 year ago

@monstermaker Yes, once the release is done... soon, but no clear date for the moment.