kiwix / libkiwix

Common code base for all Kiwix ports
https://download.kiwix.org/release/libkiwix/
GNU General Public License v3.0

Use OPDS Feed with partial entries #209

Closed mgautierfr closed 3 years ago

mgautierfr commented 5 years ago

For now, we have only one OPDS feed: an "Acquisition Feed" (http://library.kiwix.org/catalog/root.xml). This is what we could call the "Complete Acquisition Feed" (see https://specs.opds.io/opds-1.1.html#Complete_Acquisition_Feeds). The search API (http://library.kiwix.org/catalog/search) is "just" a way to filter the main feed.

The main feed is about 1.5 MB for now. This is less than library_zim.xml (7.5 MB) because we do not include the icons but link to the icon URLs. However, this is still a bit big, especially for people with low internet bandwidth.

The OPDS spec allows us to reduce this size, at the price of a bit more complexity. It is possible to have a feed with partial entries instead of complete entries (https://specs.opds.io/opds-1.1.html#Partial_and_Complete_Catalog_Entries).

The feed would list only the entries' ids and a link to download the rest of the metadata. The catalog would have a new API endpoint (something like "http://library.kiwix.org/catalog/entry/") to get the other metadata (title, description, ...). The whole partial feed (all entries, but partial content) would be around 427 KB.

A client would have to make more requests to get the whole content, but the size would be greatly reduced.
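
As a rough illustration of what a client would do with such a feed, the Python sketch below parses a small hand-written partial feed and maps each entry id to the URL of its complete entry. The sample XML, the UUIDs and the `/catalog/entry/<id>` URLs are made up for the example. Only the `alternate` link with type `application/atom+xml;type=entry;profile=opds-catalog` is taken from the OPDS 1.1 spec.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
# Link type OPDS 1.1 requires on the link from a partial entry to its complete entry.
FULL_ENTRY_TYPE = "application/atom+xml;type=entry;profile=opds-catalog"

# Hand-written sample feed; ids and URLs are hypothetical.
SAMPLE_PARTIAL_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <id>urn:uuid:00000000-0000-0000-0000-000000000000</id>
  <title>Partial catalog (sample)</title>
  <updated>2020-01-01T00:00:00Z</updated>
  <entry>
    <id>urn:uuid:11111111-1111-1111-1111-111111111111</id>
    <title>Sample book A</title>
    <updated>2020-01-01T00:00:00Z</updated>
    <link rel="alternate"
          href="/catalog/entry/11111111-1111-1111-1111-111111111111"
          type="application/atom+xml;type=entry;profile=opds-catalog"/>
  </entry>
  <entry>
    <id>urn:uuid:22222222-2222-2222-2222-222222222222</id>
    <title>Sample book B</title>
    <updated>2020-01-01T00:00:00Z</updated>
    <link rel="alternate"
          href="/catalog/entry/22222222-2222-2222-2222-222222222222"
          type="application/atom+xml;type=entry;profile=opds-catalog"/>
  </entry>
</feed>"""


def full_entry_links(feed_xml):
    """Map each entry id to the href of its complete-entry link."""
    root = ET.fromstring(feed_xml)
    links = {}
    for entry in root.findall(ATOM + "entry"):
        entry_id = entry.findtext(ATOM + "id")
        for link in entry.findall(ATOM + "link"):
            if link.get("rel") == "alternate" and link.get("type") == FULL_ENTRY_TYPE:
                links[entry_id] = link.get("href")
    return links


links = full_entry_links(SAMPLE_PARTIAL_FEED)
for entry_id, href in links.items():
    print(entry_id, "->", href)
```

The client would then request only the `href` values it actually needs.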

427 KB is for all entries. A partial catalog for English ZIM files only would be 60 KB (instead of 232 KB). For French it would be 11 KB (instead of 40 KB).
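
To put these numbers side by side, here is a small Python calculation of the relative saving in each case. It only uses the sizes quoted above, and it treats 1.5 MB as 1500 KB, which is an approximation.

```python
# Sizes in KB as quoted in this comment: (full feed, partial feed).
# 1.5 MB is taken as 1500 KB, which is only an approximation.
SIZES_KB = {
    "all entries": (1500, 427),
    "English": (232, 60),
    "French": (40, 11),
}


def reduction_percent(full_kb, partial_kb):
    """Percentage of bytes saved by the partial feed relative to the full one."""
    return 100.0 * (full_kb - partial_kb) / full_kb


for name, (full_kb, partial_kb) in SIZES_KB.items():
    print(f"{name}: {reduction_percent(full_kb, partial_kb):.1f}% smaller")
```

In all three cases the partial feed comes out at a bit more than a quarter of the size of the full one.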

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

automactic commented 4 years ago

> The feed would list only the entries' id and a link to download the rest of the metadata.

Oh, please don't. A list API that only returns the IDs would be virtually useless, as no user-facing interface would show users these UUIDs. The essential info that helps users decide which books are of interest and worth diving into should be included in the list API. Otherwise, the consumer of the API would essentially need to make 1+N API calls.

automactic commented 4 years ago

We can create an API that returns all IDs in the online library, so that apps that decide to keep a local snapshot of the online library can quickly figure out the diff. But I think we should still have an API that gives the consumer all books and their full metadata (it can be paginated).
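
A minimal Python sketch of that diff, assuming the API simply returns a flat list of IDs and that an updated book shows up under a new ID. The IDs and the `diff_library` helper are placeholders for illustration, not an existing Kiwix API.

```python
def diff_library(local_ids, remote_ids):
    """Return (added, removed): ids to fetch and ids to drop from the local snapshot."""
    local, remote = set(local_ids), set(remote_ids)
    return sorted(remote - local), sorted(local - remote)


# Placeholder ids; real ones would be the books' UUIDs.
local_snapshot = ["book-a", "book-b", "book-c"]
online_library = ["book-b", "book-c", "book-d"]

added, removed = diff_library(local_snapshot, online_library)
print("to fetch:", added)
print("to drop:", removed)
```

Only the books in `added` would need a metadata request.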

kelson42 commented 4 years ago

@mgautierfr Can you please:
1 - Update ticket and ideally close it
2 - Split it in precise topics, this ticket seems far too generic
3 - Answer to @automactic

mgautierfr commented 4 years ago

> 1 - Update ticket and ideally close it.

There is nothing to update nor close.

> 2 - Split it in precise topics, this ticket seems far too generic

Well, there is no real way to split this issue. Having a feed with partial information will be useless without the API to get the full information about a book.

> 3 - Answer to @automactic

Yes, I will.

> Oh, please don't. A list API that only returns the IDs would be virtually useless, as no user facing interface would show user these UUIDs. The essentially info for helping user decide which ones are of interest and are worthy to dive into should be included in the list API. Otherwise, the consumer of the API would essentially need to do 1+N API call.

It would be useless for you. Partial feeds are part of the OPDS standard precisely because they may be useful.

The main use is to allow an efficient cache system. If a client does a search for all books in French and then a new search for all Wikipedia in French, it is inefficient to download a complete feed with all the information twice.

Pagination on search is pretty inefficient. A search has to be run for every request, so if you use N pages to get all the books, you run N searches and get only one page of the result each time. And if you do a single search returning all the books, to avoid several requests, you need to download all the content at once. If you do a search for all books but with partial results, you can then fetch information only for the books you want to display (and so paginate) without having to run a new search.
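
To make the caching and pagination argument concrete, here is a small Python sketch. `EntryCache`, `fake_fetch` and the book ids are invented for the example. `fake_fetch` stands in for an HTTP request to the entry endpoint, and each id list stands in for the result of one search returning partial entries.

```python
class EntryCache:
    """Caches complete entries by id so each one is downloaded at most once."""

    def __init__(self, fetch_entry):
        self._fetch_entry = fetch_entry  # callable: entry id -> complete entry
        self._entries = {}
        self.downloads = 0

    def get(self, entry_id):
        if entry_id not in self._entries:
            self._entries[entry_id] = self._fetch_entry(entry_id)
            self.downloads += 1
        return self._entries[entry_id]

    def page(self, ids, page_number, page_size):
        """Complete entries for one page of an id list returned by a single search."""
        start = page_number * page_size
        return [self.get(i) for i in ids[start:start + page_size]]


def fake_fetch(entry_id):
    # Stand-in for an HTTP request to the (hypothetical) entry endpoint.
    return {"id": entry_id, "title": f"Title of {entry_id}"}


cache = EntryCache(fake_fetch)

# Each list stands for the ids returned by one partial search (made-up ids).
french_books = ["wikipedia_fr", "wiktionary_fr", "gutenberg_fr", "vikidia_fr"]
french_wikipedia = ["wikipedia_fr"]

first_page = cache.page(french_books, 0, 2)         # downloads two entries
second_search = cache.page(french_wikipedia, 0, 2)  # served from the cache
print("entries downloaded:", cache.downloads)
```

Displaying a page costs one search plus the entries on that page, and entries shared between searches are downloaded once.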

On top of that, even if the client does not cache the results, we could set up a proxy that caches the XML information for a single book. This way, even if the client makes N requests, most of them could be served by the proxy and not handled by kiwix-serve.

In any case, I do not plan to remove the API that provides the full feed, so clients that don't use the partial feed will still work.

kelson42 commented 4 years ago

To me it seems to make sense to do it. That said, it is not a top priority, I believe.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

kelson42 commented 3 years ago

I believe it is time to update this ticket because we are going to implement it soon. It is important that the OPDS API is optimized for better response times and a smaller amount of data transferred.

The first point to note, IMO, is that the root.xml and the search end-points have been merged in v2 of the OPDS API. This was done because delivering all the data is only a special case of the search API.

Kiwix iOS is an early adopter of the OPDS feed and uses it as follows: it downloads everything and then does searches locally. This explains why this feature is useless in that use case. It is important that:

Maybe one thing we should consider, to avoid a large amount of HTTP requests, is to provide an entry API able to deal with multiple ZIM ids?
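
A sketch of what the client side of such a multi-id request could look like, in Python. The `ids` query parameter, the comma-separated format and the batch size are pure assumptions for illustration, and the base URL reuses the "something like" entry URL from the original proposal. None of this is an existing or agreed API.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and parameter name; not an existing or agreed API.
ENTRY_URL = "http://library.kiwix.org/catalog/entry/"


def batch_urls(zim_ids, batch_size=50):
    """Build one request URL per batch of ids instead of one URL per id."""
    urls = []
    for start in range(0, len(zim_ids), batch_size):
        batch = zim_ids[start:start + batch_size]
        urls.append(ENTRY_URL + "?" + urlencode({"ids": ",".join(batch)}))
    return urls


ids = [f"id{n}" for n in range(5)]  # placeholder ZIM ids
urls = batch_urls(ids, batch_size=2)
for url in urls:
    print(url)
```

With a batch size of B, a client needing N entries would make about N/B requests instead of N.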

@automactic @mgautierfr @veloman-yunkan @MananJethwani I hope we can go forward with the implementation of this ticket. Otherwise, please put your remarks/comments here quickly.

mgautierfr commented 3 years ago

> For a certain amount of time the v1 root.xml stays like it is

v1 will not be changed. (But it may be removed at some point.)

> The v2 search end-point provides an option (not activated by default) to deliver everything in one feed

OPDS speaks about complete acquisition feeds (https://specs.opds.io/opds-1.2#25-complete-acquisition-feeds). We should follow the spec here.

kelson42 commented 3 years ago

@veloman-yunkan Any progress on this?

veloman-yunkan commented 3 years ago

@kelson42 Not yet. I will start working on this ticket this week.