@holta The error message "system overload error" seems not to be generated by kiwix-serve after a first investigation.
Correct. The exact wording of the error message is something I'll try to obtain from one of the teachers in coming days. (Hopefully we can also get a photographed screenshot sent to us from the field, using WhatsApp?)
@holta Yes, screenshot would be even better.
@holta You can also start kiwix-serve with more threads, see the --threads option. Would be good to know if that helps.
@kelson42 here are the flags used by IIAB:
/opt/iiab/kiwix/bin/kiwix-serve --daemon --port 3000 --nolibrarybutton --library /library/zims/library.xml --urlRootLocation=/kiwix/
Perhaps we should also try...
-t, --threads number of threads to run in parallel (default: 4)
-v, --verbose print debug log to STDOUT
PS I expect screenshot(s) from Mexican teachers by the end of the week, hopefully.
@holta Yes, I would definitely try to use --threads > 4
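For instance (an illustrative, untested tweak; the value 8 is just a guess to try), the IIAB launch line shown above could simply gain the --threads flag:
/opt/iiab/kiwix/bin/kiwix-serve --daemon --threads 8 --port 3000 --nolibrarybutton --library /library/zims/library.xml --urlRootLocation=/kiwix/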
@kelson42 here is the error/warning that most every teacher/student is repeatedly running into: (in Oaxaca State, Mexico)
box.lan/kiwix/search?content=wikipedia_es_all_maxi_2019-09?pattern=AMLO
Service Unavailable
The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later.
This results from their using the search box (textfield) at the top right of the "maxi" Spanish Wikipedia page:
@holta Thank you for completing your ticket. The error message displayed in the picture is generated by the Apache reverse proxy, not kiwix-serve. Of course, that does not rule out a problem with kiwix-serve, which sits behind the proxy, but the proxy does somewhat mask what happens at the kiwix-serve level.
For the moment, my best theory is that because you start kiwix-serve with only 4 threads, they are all busy and kiwix-serve then does not deliver anything. This is the normal behaviour. If you want to serve n searches at the same time, you probably need at least n threads running.
Please just try to increase the number of threads and let me know if it works better.
1) @kelson42 we have not identified the pattern, despite trying, likely because we're no longer in Mexico and do not have a classroom full of 15 client PCs to reproduce this.
i.e. in isolated testing with a single client PC, we've so far been unable to see much of any difference between --threads 1 and --threads 8, despite trying with many browser tabs (we were hoping each tab would simulate a student/client PC... but so far we're unable to reproduce the "capacity problems" error page that Mexico schools are facing nearly constantly).
2) Here is another error, also from one of these Mexico schools. This 502 error is happening much less often than the 503 error mentioned above... (but with the very same Kiwix search/URL: is this related?)
The last error looks pretty similar to the first one: kiwix-serve does not answer.
--threads 8 would appear not to help: https://github.com/iiab/iiab/pull/2063#issuecomment-566698668
So far it would appear these serious problems {502, 503, 504 errors} arise regardless of whether Apache or NGINX is proxying.
@holta Simplify the problem: try to deal directly with kiwix-serve.
I wrote a puppeteer script to test the kiwix load on an RPi 4 with 2 GB RAM. It issues a set of 10 predetermined search URLs and 10 requests to the random-article endpoint, and does this 6 times, for 60 search and 60 random requests, all issued asynchronously within a couple of seconds over a wired connection.
With kiwix proxied through nginx I start to get gateway timeouts about halfway through the run. If I change the requests to go directly to port 3000 I get no 504 errors.
The 4 kiwix threads are pinned at nearly 100%, but tail off before all the results have been sent, even when going directly to port 3000. When proxied, it took about 40 seconds before any results appeared and the whole run took 2 minutes and 10 seconds. Each thread took 8 to 10% of memory, so I think it is not memory bound.
I tried 8 threads and they ran at about 50% CPU, so I think there is no benefit.
@tim-moody Could you please share your exact testing procedure so that I can try to reproduce it?
Here is the nodejs code:
// setup as non-root
// mkdir iiab-tester
// cd iiab-tester/
// npm init -y
// npm install puppeteer
const puppeteer = require('puppeteer');
//const host = 'http://192.168.3.96';      // via the nginx reverse proxy
const host = 'http://192.168.3.96:3000';   // directly to kiwix-serve
const randomUrl = host + '/kiwix/random?content=wikipedia_es_medicine_maxi_2019-11';
const testUrls = [
    '/kiwix/search?content=wikipedia_es_all_maxi_2019-09&pattern=mexico+',
    '/kiwix/search?content=wikipedia_es_all_maxi_2019-09&pattern=futbol+',
    '/kiwix/search?content=wikipedia_es_all_maxi_2019-09&pattern=amlo+',
    '/kiwix/search?content=wikipedia_es_all_maxi_2019-09&pattern=gobierno+',
    '/kiwix/search?content=wikipedia_es_all_maxi_2019-09&pattern=pais+',
    '/kiwix/search?content=wikipedia_es_all_maxi_2019-09&pattern=nuevo+',
    '/kiwix/search?content=wikipedia_es_all_maxi_2019-09&pattern=juego+',
    '/kiwix/search?content=wikipedia_es_all_maxi_2019-09&pattern=oaxaca+',
    '/kiwix/search?content=wikipedia_es_all_maxi_2019-09&pattern=gastronomia+',
    '/kiwix/search?content=wikipedia_es_all_maxi_2019-09&pattern=pueblo+'
];

process.setMaxListeners(Infinity); // avoid max listeners warning

// Fire 6 sets of requests (60 searches + 60 random articles in total).
function main() {
    loadSet();
    loadSet();
    loadSet();
    loadSet();
    loadSet();
    loadSet();
    //loadSet();
    //loadSet();
}

// One set: 10 search requests plus 10 random-article requests,
// all launched without waiting for the previous ones to finish.
function loadSet() {
    for (let i = 0; i < testUrls.length; i++) {
        console.log(i);
        loadPage(host + testUrls[i]);
        loadPage(randomUrl);
    }
}

// Load one URL in a headless browser, so sub-requests (css, js, images)
// are fetched and the page is rendered, like a real student's browser.
async function loadPage(url) {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    try {
        await page.goto(url, {waitUntil: 'load', timeout: 0});
        const title = await page.title();
        const html = await page.content();
        const pageBody = await page.evaluate(() => document.body.innerHTML);
        //const h1 = await page.evaluate(() => document.body.h1.innerHTML);
        console.log(title);
        //console.log(pageBody); // outputs a lot of text
    }
    catch (err) {
        console.log(err);
    }
    finally {
        await browser.close();
    }
}
main();
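A brief usage note on the script above: with puppeteer installed as in the setup comments at the top, it can be saved under any filename (e.g. kiwix-load-test.js, a name chosen here purely for illustration) and run with node kiwix-load-test.js; switching between the proxied run and the direct-to-port-3000 run is just a matter of which host line is commented out.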
kiwix is running as a service:
ExecStart=/opt/iiab/kiwix/bin/kiwix-serve --daemon --port 3000 --nolibrarybutton --library /library/zims/library.xml --urlRootLocation=/kiwix/
After investigating, there are several things:
Puppeteer is a full browser. When it makes a request to a search page, it also interprets the content and makes sub-requests if needed (images, css, js). Handling this and doing the layout takes time. When measuring the total time needed by all requests (9 of them) to render one page, kiwix-serve takes around 1.3s to handle them all. But when measuring the puppeteer request, it took more than 2s per page. This overhead is not on kiwix-serve and we cannot do a lot about it.
All sub-requests are due to the search bar. If we launch kiwix-serve with --nosearchbar, the search result is a plain HTML page without any sub-elements to download, and the full puppeteer request (download+rendering+...) is reduced by 1.5s. In real life, this is probably already reduced by the browser cache (and an Apache proxy cache if present). With the --nosearchbar option, the difference between what we measure on the kiwix-serve side and on the puppeteer side is under 300ms.
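As a rough way to see this overhead on one's own setup, the following sketch (not the measurement code used above; the host, ZIM name and search pattern simply reuse values from this thread and will need adjusting) times a bare HTTP GET of a search URL against a full puppeteer load of the same URL:

const http = require('http');
const puppeteer = require('puppeteer');

// Placeholder URL built from values seen in this thread; adjust to your setup.
const url = 'http://localhost:8080/kiwix/search?content=wikipedia_es_all_mini_2019-09&pattern=mexico+';

// Time a bare HTTP GET of the search page (no sub-requests, no rendering).
function timeBareRequest() {
    return new Promise((resolve, reject) => {
        const start = Date.now();
        http.get(url, (res) => {
            res.on('data', () => {});                        // drain the body
            res.on('end', () => resolve(Date.now() - start));
        }).on('error', reject);
    });
}

// Time a full puppeteer load of the same page (sub-requests and layout included).
async function timePuppeteerLoad() {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    const start = Date.now();
    await page.goto(url, {waitUntil: 'load', timeout: 0});
    const elapsed = Date.now() - start;
    await browser.close();
    return elapsed;
}

(async () => {
    console.log('bare HTTP GET       :', await timeBareRequest(), 'ms');
    console.log('full puppeteer load :', await timePuppeteerLoad(), 'ms');
})();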
Then came the interesting part.
It seems that most of the time is spent not doing the search, but getting the article content to generate the snippet used in the result HTML page. If we patch libzim to return a dummy text instead of getting the content and generating the snippet, we gain a lot: we go from around 1s per request to 100ms-300ms. So we could reduce the time to generate the page by storing a generic (not search-specific) snippet in the xapian database, but the database would take more space.
Sounds good? Wait, there is more.
Trying not to use puppeteer to do the requests, I rewrote the test script using the request module instead. And then... all request times are divided by at least 2 (and sometimes by 4). And the problem here is that this is the time measured on the kiwix-serve side, not the script side. So yes, the module used to make the request changes the time needed by kiwix-serve to handle it.
Right now, the only explanation I see is that there may be some delay in how puppeteer sends the request, so kiwix-serve spends more time waiting (blocking) for the data. But it definitely needs more investigation.
Here is my test script if you want to test on your side:
const request = require('request');
const host = 'http://localhost:8080';
const baseUrl = host + '/kiwix/search?content=wikipedia_es_all_mini_2019-09&pattern=';
const testUrls = [
    'mexico+',
    'futbol+',
    'amlo+',
    'gobierno+',
    'pais+',
    'nuevo+',
    'juego+',
    'oaxaca+',
    'gastronomia+',
    'pueblo+'
];

process.setMaxListeners(Infinity); // avoid max listeners warning

function main() {
    for (let i = 0; i < 6; i++) {
        loadSet(i);
    }
}

function loadSet(index) {
    for (let i = 0; i < testUrls.length; i++) {
        loadPage(testUrls[i], index);
    }
}

function loadPage(url, index) {
    const options = {
        url: baseUrl + url,
        // Some custom headers to track the requests on kiwix-serve side.
        // headers : {
        //     'x-debuginfo': url+index
        // }
    };
    console.time(url + index);
    request(options, (err, res, body) => {
        if (err)
            console.log(err);
        else
            console.timeEnd(url + index);
    });
}
main();
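To reproduce this, the request module has to be installed first (npm install request, in the same way puppeteer was installed for the earlier script); saving the code as e.g. search-timing.js (an arbitrary filename) and running node search-timing.js prints one console.timeEnd line per request as each response arrives.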
@tim-moody can you give me your setup? (Raspberry Pi version, kiwix-serve version, file on the SD card or an external drive?)
If it helps, in Mexican classrooms in early Dec 2019 where these issues were uncovered:
CLARIF:
I wrote a puppeteer script to test the kiwix load on an RPi 4 with 2 GB RAM.
Storage is an SD card. I think kiwix was, as @holta says, 3.0.1-8.
I started with request, but used puppeteer to simulate a real request with subrequests.
So, I've played a bit with kiwix-serve on an RPi 3. There are a few things that impact kiwix-serve search performance.
First requests on kiwix-serve (RPi just started) indeed take around 40s to finish. But request times slowly decrease to around 10s (in fewer than 10 requests). After that, all requests take around 10s to finish. If you clean the cache (echo 3 > /proc/sys/vm/drop_caches as root), the first requests again take 40s to finish.
There is nothing we can really do here (except not shutting down the RPi). And this cache is useful in a test context (where we have only one zim file), but in a real case the effect is probably reduced.
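A small sequential sketch (again with a placeholder URL reusing values from this thread, and the same kind of timedGet helper as the sketch earlier in this thread) makes the warm-up visible by timing the same search a few times in a row, so the first cold-cache run can be compared with the later ones:

const http = require('http');

// Placeholder URL reusing values from this thread; adjust to your setup.
const url = 'http://localhost:8080/kiwix/search?content=wikipedia_es_all_mini_2019-09&pattern=mexico+';

function timedGet(u) {
    return new Promise((resolve, reject) => {
        const start = Date.now();
        http.get(u, (res) => {
            res.on('data', () => {});                        // drain the body
            res.on('end', () => resolve(Date.now() - start));
        }).on('error', reject);
    });
}

(async () => {
    // The first run hits a cold page cache (after a reboot or after
    // `echo 3 > /proc/sys/vm/drop_caches`); later runs should be faster.
    for (let i = 1; i <= 5; i++) {
        console.log('run ' + i + ':', await timedGet(url), 'ms');
    }
})();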
To generate the snippet we have to read the article content (so uncompress the cluster; most of the time is spent here), parse it to remove the HTML, and give the raw text to xapian to generate a text corresponding to the request. If we don't include snippets in the result page, request times drop to less than a second.
I've tried to increase the internal libzim cache. It has some interesting effect, but only with a large cache (128 instead of 64 cached clusters) and with very few different URLs (around 3s per request instead of 10s). But with a lot of different requests, a large cache increases the request time. There may be a bug here, but increasing the internal cache is not a solution for now.
The number of requests has little impact on performance in itself. Requests are simply queued and kiwix-serve has to finish the previous ones first. It is just that people now see that the search request takes time.
An easy workaround to implement would be to add an option to kiwix-serve to generate the search page without snippets. The search results would contain only the titles of the articles, but it would greatly improve performance for heavily used kiwix-serve instances (library.kiwix.org) or small hardware (RPi).
A more reliable solution is to rethink how we generate the snippet. This probably needs another issue to track it, so I will be succinct here. We can display a snippet not related to the search (the first sentences of the article) or a related one (in which case we need text to give to xapian to do a "selection"). In both cases we need some easily accessible text. We have to figure out where to store it, in the xapian database or in the zim file itself (and where exactly). I don't know which is more efficient (in performance and memory), so we will probably have to implement the different solutions and test them.
We can also think about our cache system. We cache the clusters, but should we cache the articles instead? Should we compress them (lz4 is really fast: https://lz4.github.io/lz4/) to be able to cache more values? Rewrite it to be more efficient in a multi-threaded environment? ...
We could also "simply" improve the main decompression time by using another compression algorithme than lzma (zstd seems to have huge decompression speed x8 for the same compression speed and about the same ratio (+2%))
@mgautierfr To conclude, we have the first few full-text searches taking around 40s and then it slowly comes down to around 10s? All of them complete successfully?
@mgautierfr @holta @tim-moody Thank you very much to all of you for participating in this ticket.
Here are my conclusions:
1) The error message the users see is generated by the reverse proxy when the kiwix-serve backend has not given an answer. It is out-of-topic for this repo.
2) It has been shown that kiwix-serve, on certain devices and under certain conditions, takes up to 40s to give a response, and that having n requests of this kind blocking the whole kiwix-serve HTTP service is normal, even if this is not user friendly.
3) An older kiwix-serve version from June 2019 should be tested just to confirm this.
4) On kiwix-serve open threads, see #106 (probably a topic for Kiwix Hotspot).
Considering that there is no bug and nothing more we can do here, I propose to close this ticket.
Tested: Kiwix Serve 1.2.1 has the same limitation; it is even slower.
Kiwix load-testing stats published, thanks to @tim-moody:
https://github.com/iiab/iiab/issues/2152#issuecomment-581577976
Feedback from @holta
This happened in a few rural Mexican schools after we left, so we will not be able to reproduce this easily. But here's the background if it helps:
Teachers are using (trying to use) the most recent "maxi" Spanish Wikipedia ZIM file.
It only takes a couple kids to access Wikipedia (title search generally, very occasional fulltext search) to lock out all other students.
All kids are blocked when this happens, and teachers become extremely frustrated, as they can no longer depend on Wikipedia for any small group or classroom lessons going forward.
Nobody can use Wikipedia in any way for about 2 minutes or so, as a result of this new "system overload error" that did not occur in the past. (The RPi Internet-in-a-Box server is not actually overloaded as its other services all continue to work, during these times when kiwix-serve fails.)
There is a Kiwix error message they find very alarming, something to the effect of "system is overloaded", even when only a couple of students are using it, i.e. preventing the use of Wikipedia in group/classroom situations.
The language of the error message is very alarming to teachers, who believe everything is broken. The error message should be changed to explain better that access is generally restored after waiting a few minutes.