ukwa / ukwa-services

Deployment configuration for all UKWA services stacks.
Apache License 2.0

Possible ePub performance optimisations #93

Closed anjackson closed 1 year ago

anjackson commented 2 years ago

During testing for https://github.com/ukwa/ukwa-pywb/issues/92, some performance issues arose with ePubs, particularly large ones (the largest being 2.5GB!). It seems this is a known issue and, as per this documentation from the Readium project, it is recommended to serve ePubs in an unpacked form.

Unpacking whole files would be slow when pulling large ePubs from the library store, so we could use e.g. remotezip to create an API that streams out the contents of the ZIPs.
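As a rough illustration of the streaming idea, here is a minimal sketch that reads a single member out of a ZIP through a seekable file object, without extracting the rest of the archive. It uses only the standard library `zipfile` module on an in-memory archive as a stand-in; the assumption (not verified here) is that remotezip's `RemoteZip` offers the same `zipfile.ZipFile` interface over HTTP range requests, so the same pattern would apply to a remote store. The names `stream_member`, `CountingReader`, `build_demo_archive` and the member paths are hypothetical and exist only for this example.

```python
import io
import zipfile
from typing import BinaryIO, Iterator


class CountingReader(io.BytesIO):
    """In-memory file that records how many bytes were actually read.

    Stands in for a remote store, where every byte read costs network time.
    """

    def __init__(self, data: bytes) -> None:
        super().__init__(data)
        self.bytes_read = 0

    def read(self, size: int = -1) -> bytes:
        chunk = super().read(size)
        self.bytes_read += len(chunk)
        return chunk


def stream_member(archive: BinaryIO, member: str,
                  chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield one archive member in chunks, leaving other members untouched."""
    with zipfile.ZipFile(archive) as zf:
        with zf.open(member) as fh:
            while True:
                chunk = fh.read(chunk_size)
                if not chunk:
                    break
                yield chunk


def build_demo_archive() -> bytes:
    """Build a small ePub-like ZIP: one large member and one small page."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr("mimetype", "application/epub+zip")
        zf.writestr("OEBPS/big.bin", b"\0" * 1_000_000)  # hypothetical large asset
        zf.writestr("OEBPS/chapter1.xhtml", b"<html>chapter 1</html>")
    return buf.getvalue()


if __name__ == "__main__":
    reader = CountingReader(build_demo_archive())
    page = b"".join(stream_member(reader, "OEBPS/chapter1.xhtml"))
    print(page.decode())
    print(f"read {reader.bytes_read} of {len(reader.getvalue())} bytes")
```

Only the ZIP's central directory and the requested member need to be read, which is why this approach should avoid pulling a whole multi-gigabyte ePub just to serve one page.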

This seems to be called a 'streamer' in Readium parlance, so there are possible implementations there that could be used instead, e.g. https://github.com/readium/r2-streamer-js (see also remote-epub and the corresponding API).

Alternatively, and with the advantage of perhaps making printing easier, a server-side process could be used to convert the ePub to PDF, like https://manual.calibre-ebook.com/generated/en/ebook-convert.html
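For the conversion route, a minimal sketch of a server-side wrapper might look like the following. It assumes calibre is installed and that `ebook-convert` is on the `PATH`, and it uses only the basic `ebook-convert input_file output_file` form, with no extra options. The function names and the timeout value are hypothetical, and this is not something the prototype is known to do.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_convert_command(epub_path: str, pdf_path: str) -> List[str]:
    """Return the argument list for a basic ePub to PDF conversion."""
    return ["ebook-convert", str(epub_path), str(pdf_path)]


def convert_epub_to_pdf(epub_path: str, pdf_path: str,
                        timeout: int = 600) -> Path:
    """Run calibre's ebook-convert and return the path of the PDF.

    Raises RuntimeError if ebook-convert is not installed, and
    subprocess.CalledProcessError if the conversion itself fails.
    """
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("calibre's ebook-convert was not found on PATH")
    # The timeout (an assumed value) guards against very large ePubs hanging the service.
    subprocess.run(build_convert_command(epub_path, pdf_path),
                   check=True, timeout=timeout)
    return Path(pdf_path)
```

Conversion of large files is likely to be slow, so in practice this would probably run as a batch or background job rather than on each request.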

anjackson commented 2 years ago

Put together a quick prototype ePub streamer here: https://github.com/ukwa/epub-streamer

An alternative that might make more sense would be to extract the viewers themselves from PyWB, and run them as one or more standalone services. e.g. keep the ePub.js front-end and the streamer back-end together?

mtgch3 commented 2 years ago

Quick note on option 3 - John G and I put forward using calibre to always convert ePub to PDF as a solution to a few things a couple of years ago. At the time, that was turned down as it was felt important that patrons viewed the content in its actual format. It's still the case that it would allow us to solve many rendering and print issues in advance, but it would probably need the same discussion as last time. Could bring it up in issues chat this morning and see.

anjackson commented 2 years ago

Thanks @mtgch3 - As I said in the meeting, I lean towards letting our readers choose what version they would like (see e.g. this old blog on user-driven digital preservation). The downside is it means we have to support more software in the access stack, so I'll be curious to see where we end up.

anjackson commented 2 years ago

Current version includes an experimental ePub unpacker/streamer. This makes it possible to use very large ePubs (although it's still a bit sluggish because the index and pages are still pretty large). Will review during deployment.