ukwa / ukwa-pywb

GNU General Public License v3.0
11 stars 3 forks source link

ePub playback lead to download rather than rendering #92

Closed anjackson closed 2 years ago

anjackson commented 2 years ago

Minor config issue I think, but access an ePub from the live system got a download rather than a render. This is because the item has no .epub extension, but rather a application/epub+zip content type. So that's just a config thing.

But also, it got downloaded because everything application/* is allowed to download. Is that needed? Seems like it might let too much leak?

anjackson commented 2 years ago

I need to specify this for every collection, right?

https://github.com/ukwa/ukwa-pywb/blob/ebbb52207d71ceb2a8596652a5e75740e3e99e84/config.yaml#L14-L36

anjackson commented 2 years ago

Hmm, hitting a lot of weird errors. Not sure how best to make them reproducible. Not sure the Docker build is building everything as some things (like ../build/pdf.js) seem to be missing.

anjackson commented 2 years ago

ePub viewing is also failing, something weird happening as it seems to try to open up a container.xml URL rather than get it from the ZIP. It creates a frame like:

/static/viewers/epubviewer/index.html?bookPath=http%3A//dev1.n45.wa.bl.uk%3A8209/doc/ir///staffaccess.dl.bl.uk/ark%3A/81055/vdc_100031420983.0x000001

Which looks about right, but then in that frame, rather than downloading the ePub, it requests:

http://dev1.n45.wa.bl.uk:8209/doc/ir_/staffaccess.dl.bl.uk/ark:/81055/META-INF/container.xml

which does not exist....

anjackson commented 2 years ago

Possibly the ePub thing is this issue: https://github.com/futurepress/epub.js/issues/479

anjackson commented 2 years ago

Ah, I see, pdf.js is in the token-auth branch. Unless you tell me otherwise, I'm going to assume it's okay to merge that branch in with this one....

ikreymer commented 2 years ago

Yes, sorry, that branch was the more up-to-date one! I should have opened a new PR for that

anjackson commented 2 years ago

Ah well, I’ve added the Docker action to this branch so lets keep things here now.

My current problem seems to be that the URL to download the raw item to the viewer is also redirecting to the viewer. Any idea what I’m doing wrong?

EDIT: I noticed the form of the direct access URL is a bit different for each, eg ir_ vs id_, if that matters?

anjackson commented 2 years ago

Yeah, so I'm still struggling. the ePub viewer is refusing to download the ZIP, and the 'contained' parameter does not appear to be part of the ePub.js codebase any more, so I'm not sure what's going on there. This comment indicates the need for some ZIP libraries,but I've not managed to find anything else. I fear it may be necessary to mock-up a .epub extension on the URL!

And the PDF viewer is having a different problem. The viewer page attempts to load the content from:

http://dev1.n45.wa.bl.uk:8209/doc/ir_///staffaccess.dl.bl.uk/ark:/81055/vdc_100047810264.0x000001

But this is not getting the PDF. It's getting a redirect to:

/static/viewers/pdfviewer/web/viewer.html?file=http://dev1.n45.wa.bl.uk:8209/doc/ir///staffaccess.dl.bl.uk/ark:/81055/vdc_100047810264.0x000001

i.e. we're caught in a loop. Perhaps I messed up the config? (see it here).

anjackson commented 2 years ago

Ah, seems like enable prefer = false might be a problem?

anjackson commented 2 years ago

Yay! That fixed PDF playback! ePubs still wonky, apparently due to lack of .epub file extension (!).

anjackson commented 2 years ago

I can verify that if you hack it all around so that there's an .epub extension then it works fine. Sheesh. Behold, the HACK

https://github.com/ukwa/ukwa-services/commit/609bb73e53c9d74fb9e9160ac9cc0a9b82df1ab2

anjackson commented 2 years ago

Although that's a horrible hack, it can probably stay, as I think long-term the best approach is going to be dynamically unpacking the ZIPs via a server-side helper (see ukwa/ukwa-services#93).

However, the current approach is fine for this first release.

The only actual problem I've found is, I think, a CSS issue, in that when in the playback iframe and when printing, the ePub does not show at all (all you can see is a v long scrollbar). If you view the iframe alone, you can print the current page(s) fine.

ikreymer commented 2 years ago

Thanks for taking a closer look at this! Looks like there's a few things to discuss?

Let me know if I missed anything!

anjackson commented 2 years ago

The Prefer system is needed because this is how the server knows to pass the unmodified file back to the viewer:

https://github.com/ukwa/ukwa-pywb/blob/76bb5d8dcf2261b83da6c07f3fb03bf5f0aba967/ukwa_pywb/ukwa_app.py#L198-L200

It works fine now. The only issue I was wondering was why the PDF viewer appeared to use id_ whereas the ePub viewer used ir_, but maybe I was wrong about that.

I think the iframe printing (which affects all playback) is the only outstanding problem.

anjackson commented 2 years ago

In fact, I think I'll close this as done, and have a separate ticket about printing.

ikreymer commented 2 years ago

It works fine now. The only issue I was wondering was why the PDF viewer appeared to use id_ whereas the ePub viewer used ir_, but maybe I was wrong about that.

Hm, it should also be using ir_, just checked. Should be getting set here: https://github.com/ukwa/ukwa-pywb/blob/custom-viewers/ukwa_pywb/ukwa_app.py#L251

The ir_ is just like id_ except the Location: headers are rewritten, so makes it easier to link to (Perhaps that should have been how id_ worked initially but don't want to break backwards compatibility).