@holta The error message "system overload error" seems not to be generated by kiwix-serve after a first investigation.
Correct. The exact wording of the error message is something I'll try to obtain from one of the teachers in coming days. (Hopefully we can also get a photographed screenshot sent to us from the field, using WhatsApp?)
@holta Yes, screenshot would be even better.
@holta You can also start kiwix-serve with more threads, see the --threads option. Would be good to know if that helps.
@kelson42 here are the flags used by IIAB:
/opt/iiab/kiwix/bin/kiwix-serve --daemon --port 3000 --nolibrarybutton --library /library/zims/library.xml --urlRootLocation=/kiwix/
Perhaps we should also try...
-t, --threads number of threads to run in parallel (default: 4)
-v, --verbose print debug log to STDOUT
PS I expect screenshot(s) from Mexican teachers by the end of the week, hopefully.
@holta Yes, I would definitely try to use --threads > 4
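For instance (an illustrative, untested tweak; the value 8 is just a guess to try), the IIAB launch line shown above could simply gain the --threads flag:
/opt/iiab/kiwix/bin/kiwix-serve --daemon --threads 8 --port 3000 --nolibrarybutton --library /library/zims/library.xml --urlRootLocation=/kiwix/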
@kelson42 here is the error/warning that most every teacher/student is repeatedly running into: (in Oaxaca State, Mexico)
box.lan/kiwix/search?content=wikipedia_es_all_maxi_2019-09?pattern=AMLO
Service Unavailable
The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later.
This results from their using the search box (textfield) at the top right of the "maxi" Spanish Wikipedia page:
@holta Thank you for completing your ticket. The error message displayed in the picture is generated by the Apache reverse proxy, not kiwix-serve. Of course, that does not rule out a problem with kiwix-serve, which sits behind the proxy, but the proxy does somewhat mask what happens at the kiwix-serve level.
For the moment, my best theory is that because you start kiwix-serve with only 4 threads, they are all busy and kiwix-serve then does not deliver anything. This is the normal behaviour. If you want to serve n searches at the same time, you probably need at least n threads running.
Please just try to increase the number of threads and let me know if it works better.
1) @kelson42 we have not identified the pattern, despite trying, likely because we're no longer in Mexico and do not have a classroom full of 15 client PCs to reproduce this.
i.e. in isolated testing with a single client PC, we've so far been unable to see much of any difference between --threads 1 and --threads 8, despite trying with many browser tabs (we were hoping each tab would simulate a student/client PC... but so far we're unable to reproduce the "capacity problems" error page that Mexico schools are facing nearly constantly).
2) Here is another error, also from one of these Mexico schools. This 502 error is happening much less often than the 503 error mentioned above... (but with the very same Kiwix search/URL: is this related?)
The last error looks pretty similar to the first one: kiwix-serve does not answer.
--threads 8 would appear not to help: https://github.com/iiab/iiab/pull/2063#issuecomment-566698668
So far it would appear these serious problems {502, 503, 504 errors} arise regardless of whether Apache or NGINX is proxying.
@holta Simplify the problem: try to deal directly with kiwix-serve.
I wrote a puppeteer script to test the kiwix load on an RPi 4 with 2 GB RAM. It issues a set of 10 predetermined search URLs and 10 requests to the random-article endpoint, and does this 6 times, for 60 search and 60 random requests, all issued asynchronously within a couple of seconds over a wired connection.
With kiwix proxied through nginx I start to get gateway timeouts about halfway through the run. If I change the requests to go directly to port 3000 I get no 504 errors.
The 4 kiwix threads are pinned at nearly 100%, but tail off before all the results have been sent, even when going directly to port 3000. When proxied, it took about 40 seconds before any results appeared and the whole run took 2 minutes and 10 seconds. Each thread took 8 to 10% of memory, so I think it is not memory bound.
I tried 8 threads and they ran at about 50% CPU, so I think there is no benefit.
@tim-moody Could you please share your exact testing procedure so that I can try to reproduce it?
Here is the nodejs code:
// setup as non-root
// mkdir iiab-tester
// cd iiab-tester/
// npm init -y
// npm install puppeteer
const puppeteer = require('puppeteer');
//const host = 'http://192.168.3.96';      // via the nginx reverse proxy
const host = 'http://192.168.3.96:3000';   // directly to kiwix-serve
const randomUrl = host + '/kiwix/random?content=wikipedia_es_medicine_maxi_2019-11';
const testUrls = [
    '/kiwix/search?content=wikipedia_es_all_maxi_2019-09&pattern=mexico+',
    '/kiwix/search?content=wikipedia_es_all_maxi_2019-09&pattern=futbol+',
    '/kiwix/search?content=wikipedia_es_all_maxi_2019-09&pattern=amlo+',
    '/kiwix/search?content=wikipedia_es_all_maxi_2019-09&pattern=gobierno+',
    '/kiwix/search?content=wikipedia_es_all_maxi_2019-09&pattern=pais+',
    '/kiwix/search?content=wikipedia_es_all_maxi_2019-09&pattern=nuevo+',
    '/kiwix/search?content=wikipedia_es_all_maxi_2019-09&pattern=juego+',
    '/kiwix/search?content=wikipedia_es_all_maxi_2019-09&pattern=oaxaca+',
    '/kiwix/search?content=wikipedia_es_all_maxi_2019-09&pattern=gastronomia+',
    '/kiwix/search?content=wikipedia_es_all_maxi_2019-09&pattern=pueblo+'
];

process.setMaxListeners(Infinity); // avoid max listeners warning

// Fire 6 sets of requests (60 searches + 60 random articles in total).
function main() {
    loadSet();
    loadSet();
    loadSet();
    loadSet();
    loadSet();
    loadSet();
    //loadSet();
    //loadSet();
}

// One set: 10 search requests plus 10 random-article requests,
// all launched without waiting for the previous ones to finish.
function loadSet() {
    for (let i = 0; i < testUrls.length; i++) {
        console.log(i);
        loadPage(host + testUrls[i]);
        loadPage(randomUrl);
    }
}

// Load one URL in a headless browser, so sub-requests (css, js, images)
// are fetched and the page is rendered, like a real student's browser.
async function loadPage(url) {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    try {
        await page.goto(url, {waitUntil: 'load', timeout: 0});
        const title = await page.title();
        const html = await page.content();
        const pageBody = await page.evaluate(() => document.body.innerHTML);
        //const h1 = await page.evaluate(() => document.body.h1.innerHTML);
        console.log(title);
        //console.log(pageBody); // outputs a lot of text
    }
    catch (err) {
        console.log(err);
    }
    finally {
        await browser.close();
    }
}
main();
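A brief usage note on the script above: with puppeteer installed as in the setup comments at the top, it can be saved under any filename (e.g. kiwix-load-test.js, a name chosen here purely for illustration) and run with node kiwix-load-test.js; switching between the proxied run and the direct-to-port-3000 run is just a matter of which host line is commented out.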
kiwix is running as a service:
ExecStart=/opt/iiab/kiwix/bin/kiwix-serve --daemon --port 3000 --nolibrarybutton --library /library/zims/library.xml --urlRootLocation=/kiwix/
After investigating, there are several things:
Puppeteer is a full browser. When it makes a request to a search page, it also interprets the content and makes sub-requests if needed (images, css, js). Handling this and doing the layout takes time. When measuring the total time needed by all requests (9 of them) to render one page, kiwix-serve takes around 1.3s to handle them all. But when measuring the puppeteer request, it took more than 2s per page. This overhead is not on kiwix-serve and we cannot do a lot about it.
All sub-requests are due to the search bar. If we launch kiwix-serve with --nosearchbar, the search result is a plain HTML page without any sub-elements to download, and the full puppeteer request (download+rendering+...) is reduced by 1.5s. In real life, this is probably already reduced by the browser cache (and an Apache proxy cache if present). With the --nosearchbar option, the difference between what we measure on the kiwix-serve side and on the puppeteer side is under 300ms.
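As a rough way to see this overhead on one's own setup, the following sketch (not the measurement code used above; the host, ZIM name and search pattern simply reuse values from this thread and will need adjusting) times a bare HTTP GET of a search URL against a full puppeteer load of the same URL:

const http = require('http');
const puppeteer = require('puppeteer');

// Placeholder URL built from values seen in this thread; adjust to your setup.
const url = 'http://localhost:8080/kiwix/search?content=wikipedia_es_all_mini_2019-09&pattern=mexico+';

// Time a bare HTTP GET of the search page (no sub-requests, no rendering).
function timeBareRequest() {
    return new Promise((resolve, reject) => {
        const start = Date.now();
        http.get(url, (res) => {
            res.on('data', () => {});                        // drain the body
            res.on('end', () => resolve(Date.now() - start));
        }).on('error', reject);
    });
}

// Time a full puppeteer load of the same page (sub-requests and layout included).
async function timePuppeteerLoad() {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    const start = Date.now();
    await page.goto(url, {waitUntil: 'load', timeout: 0});
    const elapsed = Date.now() - start;
    await browser.close();
    return elapsed;
}

(async () => {
    console.log('bare HTTP GET       :', await timeBareRequest(), 'ms');
    console.log('full puppeteer load :', await timePuppeteerLoad(), 'ms');
})();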
Then came the interesting part.
It seems that most of the time is spent not doing the search, but getting the article content to generate the snippet used in the result HTML page. If we patch libzim to return a dummy text instead of getting the content and generating the snippet, we gain a lot: we go from around 1s per request to 100ms-300ms. So we could reduce the time to generate the page by storing a generic (not search-specific) snippet in the xapian database, but the database would take more space.
Sounds good? Wait, there is more.
Trying not to use puppeteer to do the requests, I rewrote the test script using the request module instead. And then... all request times are divided by at least 2 (and sometimes by 4). And the problem here is that this is the time measured on the kiwix-serve side, not the script side. So yes, the module used to make the request changes the time needed by kiwix-serve to handle it.
Right now, the only explanation I see is that there may be some delay in how puppeteer sends the request, so kiwix-serve spends more time waiting (blocking) for the data. But it definitely needs more investigation.
Here is my test script if you want to test on your side:
const request = require('request');
const host = 'http://localhost:8080';
const baseUrl = host + '/kiwix/search?content=wikipedia_es_all_mini_2019-09&pattern=';
const testUrls = [
    'mexico+',
    'futbol+',
    'amlo+',
    'gobierno+',
    'pais+',
    'nuevo+',
    'juego+',
    'oaxaca+',
    'gastronomia+',
    'pueblo+'
];

process.setMaxListeners(Infinity); // avoid max listeners warning

function main() {
    for (let i = 0; i < 6; i++) {
        loadSet(i);
    }
}

function loadSet(index) {
    for (let i = 0; i < testUrls.length; i++) {
        loadPage(testUrls[i], index);
    }
}

function loadPage(url, index) {
    const options = {
        url: baseUrl + url,
        // Some custom headers to track the requests on kiwix-serve side.
        // headers : {
        //     'x-debuginfo': url+index
        // }
    };
    console.time(url + index);
    request(options, (err, res, body) => {
        if (err)
            console.log(err);
        else
            console.timeEnd(url + index);
    });
}
main();
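To reproduce this, the request module has to be installed first (npm install request, in the same way puppeteer was installed for the earlier script); saving the code as e.g. search-timing.js (an arbitrary filename) and running node search-timing.js prints one console.timeEnd line per request as each response arrives.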
@tim-moody can you give me your setup? (Raspberry Pi version, kiwix-serve version, file on the SD card or an external drive?)
If it helps, in Mexican classrooms in early Dec 2019 where these issues were uncovered:
CLARIF:
I wrote a puppeteer script to test the kiwix load on an RPi 4 with 2 GB RAM.
Storage is an SD card. I think kiwix was, as @holta says, 3.0.1-8.
I started with request, but used puppeteer to simulate a real request with subrequests.
So, I've played a bit with kiwix-serve on an RPi 3. There are a few things that impact kiwix-serve search performance.
First requests on kiwix-serve (RPi just started) indeed take around 40s to finish. But request times slowly decrease to around 10s (in fewer than 10 requests). After that, all requests take around 10s to finish. If you clean the cache (echo 3 > /proc/sys/vm/drop_caches as root), the first requests again take 40s to finish.
There is nothing we can really do here (except not shutting down the RPi). And this cache is useful in a test context (where we have only one zim file), but in a real case the effect is probably reduced.
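A small sequential sketch (again with a placeholder URL reusing values from this thread, and the same kind of timedGet helper as the sketch earlier in this thread) makes the warm-up visible by timing the same search a few times in a row, so the first cold-cache run can be compared with the later ones:

const http = require('http');

// Placeholder URL reusing values from this thread; adjust to your setup.
const url = 'http://localhost:8080/kiwix/search?content=wikipedia_es_all_mini_2019-09&pattern=mexico+';

function timedGet(u) {
    return new Promise((resolve, reject) => {
        const start = Date.now();
        http.get(u, (res) => {
            res.on('data', () => {});                        // drain the body
            res.on('end', () => resolve(Date.now() - start));
        }).on('error', reject);
    });
}

(async () => {
    // The first run hits a cold page cache (after a reboot or after
    // `echo 3 > /proc/sys/vm/drop_caches`); later runs should be faster.
    for (let i = 1; i <= 5; i++) {
        console.log('run ' + i + ':', await timedGet(url), 'ms');
    }
})();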
To generate the snippet we have to read the article content (so uncompress the cluster; most of the time is spent here), parse it to remove the HTML, and give the raw text to xapian to generate a text corresponding to the request. If we don't include snippets in the result page, request times drop to less than a second.
I've tried to increase the internal libzim cache. It has some interesting effect, but only with a large cache (128 instead of 64 cached clusters) and with very few different URLs (around 3s per request instead of 10s). But with a lot of different requests, a large cache increases the request time. There may be a bug here, but increasing the internal cache is not a solution for now.
The number of requests has little impact on performance in itself. Requests are simply queued and kiwix-serve has to finish the previous ones first. It is just that people now see that the search request takes time.
An easy workaround to implement would be to add an option to kiwix-serve to generate the search page without snippets. The search results would contain only the titles of the articles, but it would greatly improve performance for heavily used kiwix-serve instances (library.kiwix.org) or small hardware (RPi).
A more reliable solution is to rethink how we generate the snippet. This probably needs another issue to track it, so I will be succinct here. We can display a snippet not related to the search (the first sentences of the article) or a related one (in which case we need text to give to xapian to do a "selection"). In both cases we need some easily accessible text. We have to figure out where to store it, in the xapian database or in the zim file itself (and where exactly). I don't know which is more efficient (in performance and memory), so we will probably have to implement the different solutions and test them.
We can also think about our cache system. We cache the clusters, but should we cache the articles instead? Should we compress them (lz4 is really fast: https://lz4.github.io/lz4/) to be able to cache more values? Rewrite it to be more efficient in a multi-threaded environment? ...
We could also "simply" improve the main decompression time by using another compression algorithme than lzma (zstd seems to have huge decompression speed x8 for the same compression speed and about the same ratio (+2%))
@mgautierfr To conclude, we have the first few full-text searches taking around 40s and then it slowly comes down to around 10s? All of them complete successfully?
@mgautierfr @holta @tim-moody Thank you very much to all of you for participating in this ticket.
Here are my conclusions:
1) The error message the users see is generated by the reverse proxy when the kiwix-serve backend has not given an answer. It is out-of-topic for this repo.
2) It has been shown that kiwix-serve, on certain devices and under certain conditions, takes up to 40s to give a response, and that having n requests of this kind blocking the whole kiwix-serve HTTP service is normal, even if this is not user friendly.
3) An older kiwix-serve version from June 2019 should be tested just to confirm this.
4) On kiwix-serve open threads, see #106 (probably a topic for Kiwix Hotspot).
Considering that there is no bug and nothing more we can do here, I propose to close this ticket.
Tested: Kiwix Serve 1.2.1 has the same limitation; it is even slower.
Kiwix load-testing stats published, thanks to @tim-moody:
https://github.com/iiab/iiab/issues/2152#issuecomment-581577976
Feedback from @holta
This happened in a few rural Mexican schools after we left, so we will not be able to reproduce this easily. But here's the background if it helps:
Teachers are using (trying to use) the most recent "maxi" Spanish Wikipedia ZIM file.
It only takes a couple kids to access Wikipedia (title search generally, very occasional fulltext search) to lock out all other students.
All kids are blocked when this happens, and teachers become extremely frustrated, as they can no longer depend on Wikipedia for any small group or classroom lessons going forward.
Nobody can use Wikipedia in any way for about 2 minutes or so, as a result of this new "system overload error" that did not occur in the past. (The RPi Internet-in-a-Box server is not actually overloaded as its other services all continue to work, during these times when kiwix-serve fails.)
There is a Kiwix error message they find very alarming, something to the effect of "system is overloaded", even when only a couple of students are using it, i.e. preventing the use of Wikipedia in group/classroom situations.
The language of the error message is very alarming to teachers, who believe everything is broken. The error message should be changed to explain better that access is generally restored after waiting a few minutes.