kiwix / libkiwix

Common code base for all Kiwix ports
https://download.kiwix.org/release/libkiwix/
GNU General Public License v3.0

Multizim (suggestions) does not work at all #479

Closed: kelson42 closed this issue 2 weeks ago

kelson42 commented 3 years ago

If I search for a suggestion on the welcome page, nothing is displayed.

I would like to see the results, and it would be great to have the logo of the ZIM next to each one to show which content the result comes from.

See kiwix/kiwix-tools#385 for the lack of scalability of the multizim fulltext search.

JensKorte commented 3 years ago

> If I search for a suggestion on the welcome page, nothing is displayed.

Strange, it works for me. I use kiwix-serve 3.1.2:

kiwix-tools_linux-x86_64-3.1.2-4$ ./kiwix-serve -V
3.1.2

I don't use a library file; I start the server with "/path/kiwix-serve *zim". The menu line is broken. There are German and English results. Hope it helps. [Screenshot: kiwix-global-search]

kelson42 commented 3 years ago

@JensKorte This is the fulltext search, not the suggestions

JensKorte commented 3 years ago

> @JensKorte This is the fulltext search, not the suggestions

Ahh, yes. When I try to use the suggestions, there are no results, and there is not even any storage I/O.

maneeshpm commented 3 years ago

I tried to reproduce this bug with a single ZIM file. In this case, the error occurs because the request has an empty content argument, which makes the getIdForName() method fail. Hence we get a 404 page via the catch block:
https://github.com/kiwix/kiwix-lib/blob/803cb1c2c5b6c99b53bcc540bf6719b69d3552ad/src/server/internalServer.cpp#L395-L402

This is the generated request: http://localhost:8080/suggest?content=&term=berlin

The solution is to fix the faulty request so that it includes a suitable content value from which bookName can be extracted.
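To make the failure mode concrete, here is a minimal, self-contained sketch of how an empty content argument could end in a 404. It is not the kiwix-lib code: parseQuery, idForName and suggestStatus are hypothetical names, the book table is a plain map, and URL decoding is skipped.

```cpp
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: split "a=1&b=2" into key/value pairs (no URL decoding).
std::map<std::string, std::string> parseQuery(const std::string& query) {
    std::map<std::string, std::string> args;
    std::istringstream stream(query);
    std::string pair;
    while (std::getline(stream, pair, '&')) {
        const auto eq = pair.find('=');
        if (eq == std::string::npos) {
            args[pair] = "";  // key without a value
        } else {
            args[pair.substr(0, eq)] = pair.substr(eq + 1);
        }
    }
    return args;
}

// Hypothetical stand-in for the name-to-id lookup: unknown names throw.
std::string idForName(const std::map<std::string, std::string>& books,
                      const std::string& name) {
    return books.at(name);  // throws std::out_of_range for "" or unknown names
}

// Returns the HTTP status a suggest handler of this shape would produce.
int suggestStatus(const std::map<std::string, std::string>& books,
                  const std::string& query) {
    const auto args = parseQuery(query);
    try {
        const auto it = args.find("content");
        idForName(books, it == args.end() ? "" : it->second);
        return 200;
    } catch (const std::out_of_range&) {
        return 404;  // same outcome as the catch block described above
    }
}
```

With a book table containing only "wikipedia_en", the query "content=&term=berlin" yields 404 under these assumptions, while "content=wikipedia_en&term=berlin" yields 200.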

kelson42 commented 3 years ago

@maneeshpm Sounds good, but we need to think about scalability as well. How can we guarantee a correct and timely response with 2000 ZIM files?

JensKorte commented 3 years ago

This reminds me a little of a meta search engine. A meta search engine queries several search engines and doesn't know when they will finish. In the past, some meta search engines provided an interface with a user-selectable timeout and a list where search engines could be chosen, grouped by category or language.

If you are thinking of a timeout between the HTTP server and the browser, the server could send a line containing a space once in a while until the search is finished. If the URL of the search result page gets an anchor, the empty lines could be skipped by placing the anchor at the beginning of the results.

Caching could be helpful when several people run the same search, e.g. a school class searching during a lesson. It could help a single user too: the first search gets a short timeout, and when the search is repeated the cache serves the full response. Maybe a line with the timeout-avoiding spaces could be placed at the end of a fast search, and when the server finishes the search the user gets a link saying "Reload to see all results".

When the first browser request is made to the server, the server could respond with a "dynamic" start page that preselects the languages the user has activated in the browser, e.g. "DE(-ch), EN(-us)". The user could then enter the search phrase and modify the languages.
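The caching idea could look roughly like the sketch below. It only illustrates the suggestion: SearchCache and CachedSearch are hypothetical names, results are plain strings, and eviction and thread safety are left out.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical cache entry: results plus a flag telling whether the search
// finished or was cut short by a timeout.
struct CachedSearch {
    std::vector<std::string> results;
    bool complete;
};

// Hypothetical cache keyed by the search term (no eviction, not thread-safe).
class SearchCache {
public:
    void store(const std::string& query, std::vector<std::string> results,
               bool complete) {
        const auto it = entries_.find(query);
        // Never let a partial result overwrite a complete one.
        if (it != entries_.end() && it->second.complete && !complete) {
            return;
        }
        entries_[query] = CachedSearch{std::move(results), complete};
    }

    // Returns nullptr when the query has not been cached yet.
    const CachedSearch* lookup(const std::string& query) const {
        const auto it = entries_.find(query);
        return it == entries_.end() ? nullptr : &it->second;
    }

private:
    std::map<std::string, CachedSearch> entries_;
};
```

A handler could serve an incomplete entry together with the "Reload to see all results" link, and replace it once the full search has finished.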

maneeshpm commented 3 years ago

According to this thread on Xapian, Xapian can search over multiple databases with very little overhead compared to a single-database search. For that, all the databases should be added to one database object using the Xapian::Database::add_database() method. This is already implemented in libzim.

IMO the real bottleneck is in retrieving the indexes from the ZIM files. An improvement here would be to go async and load all the title indexes using multiple threads. This way, we might be able to set up a Xapian::Enquire object faster and let it handle the search. The speedup is limited by the CPU of the host machine, but the approach is otherwise fairly general. It must be done as soon as the library is loaded, though, since we can assume that the user is going to use search.

PS: I think ticket openzim/libzim#418 is well written and captures the issue very well. As for suggestions not working, I believe we need to fix that piece of code in kiwix-lib.
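As an illustration of the "go async" idea, the following sketch loads one index per ZIM file in parallel with std::async. It does not use the libzim or Xapian APIs: TitleIndex and loadTitleIndex are hypothetical placeholders, and the loader does no real I/O.

```cpp
#include <cstddef>
#include <future>
#include <string>
#include <vector>

// Hypothetical stand-in for the title index of one ZIM file.
struct TitleIndex {
    std::string zimPath;
    std::size_t entryCount;
};

// Hypothetical loader. A real one would open the ZIM file and extract its
// index; here the entry count is just a placeholder value.
TitleIndex loadTitleIndex(const std::string& zimPath) {
    return TitleIndex{zimPath, zimPath.size()};
}

// Starts one asynchronous load per ZIM file, then collects the results.
std::vector<TitleIndex> loadAllTitleIndexes(
        const std::vector<std::string>& zimPaths) {
    std::vector<std::future<TitleIndex>> pending;
    pending.reserve(zimPaths.size());
    for (const auto& path : zimPaths) {
        pending.push_back(
            std::async(std::launch::async, loadTitleIndex, path));
    }

    std::vector<TitleIndex> indexes;
    indexes.reserve(pending.size());
    for (auto& task : pending) {
        indexes.push_back(task.get());  // results come back in input order
    }
    return indexes;
}
```

This launches one task per file, which is fine for a handful of ZIM files. For something like 2000 files, a bounded worker pool would probably be needed instead.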

kelson42 commented 3 years ago

> retrieving the indexes from the zim

What exactly do you mean here? The I/O overhead? Or simply what is reported in openzim/libzim#418?

maneeshpm commented 3 years ago

I meant the net cost of (reading a zim + getting the index + adding it to databases object)

maneeshpm commented 3 years ago

I think this issue is more suited for kiwix-lib instead of kiwix-tools since the bug is there.

handle_search() and handle_suggest() are somewhat similar routines. Both of them first try to get a bookName from the request object inside a try/catch block. When searching from the input box on the welcome page, both functions rely on the content argument of the request to load a bookName; since that argument is empty, the lookup throws and execution enters the catch block.

handle_search() does nothing in the catch block and has a fallback that gets all open local ZIM files using mp_library->filter(kiwix::Filter().local(true).valid(true)), so it does not raise any error. handle_suggest(), on the other hand, returns a 404 in the catch block, which causes this behavior. We can implement the same fallback in handle_suggest() to fix this issue.

Until the scaling issue is sorted out, I think we should hide this feature from the main page, as it hurts the user experience when there are many ZIM files.
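The proposed fallback can be sketched as follows. This is a simplified model, not the libkiwix code: the Library struct and booksForSuggest are hypothetical, and book names double as ids to keep the example short.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical, heavily simplified stand-in for the library.
struct Library {
    std::vector<std::string> localBookIds;

    // Throws std::out_of_range for an unknown (or empty) name.
    std::string getIdForName(const std::string& name) const {
        for (const auto& id : localBookIds) {
            if (id == name) {
                return id;
            }
        }
        throw std::out_of_range("unknown book: " + name);
    }
};

// Picks the books to run suggestions on. Instead of answering 404 when the
// content argument cannot be resolved, fall back to all local books.
std::vector<std::string> booksForSuggest(const Library& library,
                                         const std::string& content) {
    try {
        return std::vector<std::string>(1, library.getIdForName(content));
    } catch (const std::out_of_range&) {
        return library.localBookIds;  // fallback, mirroring handle_search()
    }
}
```

Note that this sketch also falls back for a non-empty but unknown name. Whether that case should still return a 404 is a design decision the thread does not settle.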

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

maneeshpm commented 3 years ago

@kelson42 We can take it as a fact that once a Xapian database is ready, searching it is quick (even on a huge Xapian DB), and that is something we cannot improve on our side. Our main concern now is how to make the DB ready the first time and how to keep it ready for further searches.

The answer to how to keep it ready is caching, which we have already started looking into in #509.

How to make it ready quickly the first time is a bit more complicated. Currently, on the libkiwix side, we create a zim::Searcher only after receiving a query (we create it on each query, hence the slowness). We could prepare a zim::Searcher as soon as the user opens a multizim, because we can expect them to do at least one search on it.

What should we do while the zim::Searcher is being created? Extracting the Xapian entry from every ZIM file in a multizim takes time. We could show a message "Searcher is preparing" and offer a simpler, stripped-down search using the ZIM index (which is quick) until the searcher is ready.
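One way to picture the "stripped-down search until the searcher is ready" idea is the sketch below. It does not use zim::Searcher: SearchFrontEnd and SearchAnswer are hypothetical, and the search itself is reduced to labelling which path would be taken.

```cpp
#include <atomic>
#include <string>

// Hypothetical answer type: which search path was used, plus an optional
// notice for the user.
struct SearchAnswer {
    std::string mode;    // "full" or "title-only"
    std::string notice;  // empty once the full searcher is ready
};

// Hypothetical front end that chooses a search path based on readiness.
class SearchFrontEnd {
public:
    // Called by whatever prepares the full searcher, e.g. a background task
    // started when the multizim is opened.
    void markSearcherReady() { ready_.store(true); }

    SearchAnswer search(const std::string& /*term*/) const {
        if (ready_.load()) {
            return SearchAnswer{"full", ""};
        }
        // Degraded path: quick title lookup while the searcher is being built.
        return SearchAnswer{"title-only", "Searcher is preparing"};
    }

private:
    std::atomic<bool> ready_{false};
};
```

The atomic flag lets a background thread flip the state without extra locking. How the full searcher is actually built and shared is outside the scope of this sketch.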

kelson42 commented 3 years ago

The topic of the cold start is already covered in https://github.com/openzim/libzim/issues/418. I would keep this topic out of this ticket. That said, I still believe that if kiwix-serve has 2000 ZIM files open, a multizim search won't give an answer within a reasonable time or with reasonable memory consumption. This is IMO mostly what this ticket is about.

kelson42 commented 3 years ago

Here is how I would propose to proceed. First of all, this is quite a large ticket, so I would first propose to split it into the following tasks:

@maneeshpm @mgautierfr Do you agree? Do you have any comments?

kelson42 commented 3 years ago

Depends on #509

kelson42 commented 2 years ago

@maneeshpm Would you mind tackling the multizim problem until we fix the last details of #509? Maybe you have feedback about my last comment?

kelson42 commented 2 years ago

@maneeshpm Any thoughts about the plan? Would you be ready to implement it?

kelson42 commented 2 years ago

@maneeshpm We need to move quickly on this now. Therefore, I have reassigned the ticket to @mgautierfr. I hope this is OK for you?

kelson42 commented 2 years ago

Fulltext multizim search is fixed with #731. The multizim suggestion work remains to be done.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

kelson42 commented 1 year ago

I guess this is a ticket for openzim/libzim by now.

We should fix https://github.com/openzim/libzim/issues/734 first IMO.

kelson42 commented 1 month ago

Moving to openzim/libzim where it belongs.

kelson42 commented 2 weeks ago

Kamino closed and cloned this issue to openzim/libzim