Online library filter/search API calls are not optimal

kelson42 commented 1 year ago

For the moment, a new request of type "https://library.kiwix.org:443/catalog/search?lang=amh&count=-1&tag=_category:gutenberg&notag=_sw:yes" is done each time a filter is changed in the sidebar. Then everything (book details, no pagination) is downloaded from https://library.kiwix.org. The "search" (currently library top bar) is then applied a posteriori on the local result list to have a match.

I'm not satified because that generates a lot of data transfered (slow and expensive) between local computer and library.kiwix.org.

I woud like to propose:

Free text search is handled like any other filter and generates a new API request if changing
We don't download more book details than what we can display on the screen (but we need to know how many results we have in total). We should use result pagination.
Libkiwix should cache properly things
- We should have a cache of requests/response that if we have twice the same requests we identify and avoid downloading twice the results
- Results list should be handled differently from books. Results reonses should only have the list of book id and book details should be requested (and cached) separatly. That way we can cache book details independently from the API request/response (and save bandwidth).

kelson42 commented 1 year ago

@mgautierfr @veloman-yunkan @juuz0 Does that make sense? Is the libkiwix ready for that?

juuz0 commented 1 year ago

We don't download more book details than what we can display on the screen (but we need to know how many results we have in total). We should use result pagination.

One interesting problem here is to handle the sorting of books. Current kiwix-desktop applies sort on all of the list. But obviously, if we use pagination, it's only going to apply on the list available to it and give false positives (eg: sorting in descending order by size will not give the book with biggest size in Global catalog). libkiwix should probably support a parameter to give out sorted data, if we need this.

kelson42 commented 1 year ago

libkiwix should probably support a parameter to give out sorted data, if we need this.

Yes, we have multiple tickets regarding sorting of results. But we could handle this with https://github.com/kiwix/libkiwix/issues/702. Please put the requirements you have in a comment so you can secure Kiwix-Desktop features can continue to work.

juuz0 commented 1 year ago

does the cache here have to be implemented in libkiwix? QT has QNetworkDiskCache which can help in this

kelson42 commented 1 year ago

@mgautierfr We urgently need your feedback here

mgautierfr commented 1 year ago

Free text search is handled like any other filter and generates a new API request if changing

I am somehow against that. By definition a free text filtering will only reduce the number of books and so we could do filtering locally. And so avoid data transfer between the client and the server. That why we implement it this way.

BUUUTTT if we go in the direction of using partial entries and pagination, we cannot do the filtering locally as we don't have the list of all books (and their details). So ... yes, by implementation need, all search/filtering/sorting have to be done on the server.

We don't download more book details than what we can display on the screen (but we need to know how many results we have in total). We should use result pagination.

Agree

Libkiwix should cache properly things We should have a cache of requests/response that if we have twice the same requests we identify and avoid downloading twice the results

Not so easy. And maybe not so useful. First of all, catalog content may changed. So we have to invalidate cache somehow. We should measure real use case. How much (duplicated) requests is made by real user (not us testing the catalog) ? And if we go using partial entries, even if we have duplicated requests, we may not need to spend our time on caching ALL entries.

Results list should be handled differently from books. Results reonses should only have the list of book id and book details should be requested (and cached) separatly. That way we can cache book details independently from the API request/response (and save bandwidth).

That's the purpose of partial_entries (to get list of books only) and entry (to get information about a specific book)

As suggested somewhere (I cannot find where), I think we should introduce a "RemoteLibrary" in libkiwix. This remote library should share the same (subset) api of Library :

getBookById
getBookCount
getBooksLanguages
getBooksLanguagesWithCounts
getBooksCategories
getBooksCreators
getBooksPublishers
getBooksIds
filter (?)
sort (?)
filter_sorted_paginated (to create)

However, the RemoteLibrary would do request to the server for each methods (we almost already have a specific endpoint for each method)

Request and caching strategy:

We will use partial_entries endpoint to get list of books (getBooksIds and filter_sorted_paginated). Theses methods already return bookId only (as partial_entries endpoint, so we are good).

When user call getBookById with a bookId, we do a request to entry endpoint and get the book information and create a Book. The Book itself is cached and so, no duplicate request to entry is made.

About Illustration: OPDS already give a url to the illustration instead of the data (already to reduce the size of the opds stream). Illustration class itself is downloading the data when we call Illustration::getData(). This is somehow duplicated with ThumbnailDownloader implemented in kiwix-desktop by @juuz0 (sorry, I missed that)

This lead to download management:

The problem here is that don't want the download to freeze the ui.

Easy solution: Don't care. If we do multiple small requests (about few kB), the io latency may be negligeable and if it is, fix it later.
Middle solution: Libkiwix don't care. It is up to kiwix-desktop (and other client ?) to call the remote library method in a different thread and not block its ui

Difficult solution: Build a two step system: methods don't return the results directly but return a tuple (url, callback). url being the url to request (by threaded requests). callback being a function to call with the response content and which actually return the "useful results". A blocking application would use :

auto tuple = library.getBooksIds();
auto bookIds = tuple.1(download(tuple.0));
do_stuff_with_bookids(bookIds);

A event driven application would use:

auto tuple = library.getBooksIds();
auto request = Request(tuple.0);
request.on_completion([](response) {
auto bookIds = tuple.1(response.content);
do_stuff_with_bookids(bookIds);
});
request.do();
do_other_things_without_waiting_request_to_finish();

Hard solution: Design a full async/callback api on libkiwix.

I would go with the easy solution and if not enough, move to the difficult solution. (The "Easy remote library" becoming a wrapper around the "difficult remote library" with the "blocking application" code)

I have played a bit with the api:

full catalog opds full entry content is about 5MB
full catalog opds with partial entries only (only the list of books) is 1.4MB
50 entries opds full entry content is 60K
50 entries opds partial entries only is 18K
One full entry 1.2K

(All done without filter)

kiwix / kiwix-desktop

Online library filter/search API calls are not optimal #957