kiwix / libkiwix

Common code base for all Kiwix ports
https://download.kiwix.org/release/libkiwix/
GNU General Public License v3.0
121 stars 57 forks source link

PDF content blocked by Chrome #916

Closed rgaudin closed 1 year ago

rgaudin commented 1 year ago

When opening a PDF in a ZIM in the kiwix-serve viewer, on Chrome, there is an unrecoverable error message. This looks like a regression.

Screenshot 2023-03-15 at 14 18 57

Go to https://dev.library.kiwix.org/viewer#lilote_fr_fo_2023-03/xay-va-%C3%A0-la-p%C3%AAche then click the Lire l'histoire button.

rgaudin commented 1 year ago

Might be related to #604

mgautierfr commented 1 year ago

I have no problem with firefox, chrome and chromium.

rgaudin commented 1 year ago

@Popolechien reported it ; I can reproduce on latest Chrome

rgaudin commented 1 year ago

Sorry the link is incorrect ; you have to use the viewer. I need to open a ticket about that as well

rgaudin commented 1 year ago

https://dev.library.kiwix.org/viewer#lilote_fr_fo_2023-03/xay-va-%C3%A0-la-p%C3%AAche

mgautierfr commented 1 year ago

Indeed. Removing the sandbox attribute solves the problem. This is probably introduced by https://github.com/kiwix/libkiwix/pull/906 (issue https://github.com/kiwix/kiwix-tools/issues/604)

Jaifroid commented 1 year ago

The code I give in that issue works around this specific problem with the pdf viewer in Chrome/Edge. It is not recognised by the browser as same origin, hence PDFs must be opened in a new window or tab. https://github.com/kiwix/kiwix-tools/issues/604#issuecomment-1470013026

rgaudin commented 1 year ago

That's quite a consequence in term of UI. If it's not possible to render it in the same iframe, maybe we should use an in-zim pdf.js but that's not as convenient

Jaifroid commented 1 year ago

Personally I think this is a Chromium bug, because the in-browser PDF viewer should clearly be same-origin when loading a PDF from the same origin. Or maybe it's a feature because PDFs can have active content that can contact external servers (?).

I didn't find any other workaround in Kiwix JS than to make click on a PDF open a new window/tab. Having said that, the new window uses the same in-built PDF viewer, so it's not too big a deal. Epubs have to be downloaded anyway because no browser provides a custom viewer.

It's a balance between the security of the iframe (in terms of not leaking info out: a particular concern with Zimit archives) and (in)convenience... Otherwise, there's really nothing to stop a script in the iframe navigating elsewhere.

NB sandboxing the iframe doesn't stop a determined malicious attacker, but it stops accidental redirects, accidental (or well intentioned) attempts by scripts to break out of iframes, and accidental contacting of external sites for font files, images and (potentially) scripts.

mgautierfr commented 1 year ago

See https://github.com/whatwg/html/issues/3958 which give some information.

PDF viewer is implemented as a plugin in chrome and it is deactivated in sandboxed iframe.

danielzgtg commented 1 year ago

Please also add allow-downloads. Some people may have a Firefox setup where "Settings > General > Files and Applications > Applications > Portable Document Format (PDF)" is set to "Save File" instead of "Open in Firefox". I heard there was a recent proposal to use ZIMs for serving .apk files, and this would be necessary to support their use case.

The sandbox attribute needs to stay to stop the Wiktionary zim from breaking. What's wrong with pdf.js? It would only make the experience more consistent across platforms.

veloman-yunkan commented 1 year ago

I guess that a fix addressing #912 but not this ticket doesn't make much sense. On the other hand, I think that a fix to this issue should also automatically fix #912.

Jaifroid commented 1 year ago

This is what we went for in Kiwix JS (for the sandbox attribute of iframe, or can be served as part of CSP response header):

allow-same-origin allow-scripts allow-modals allow-forms allow-popups allow-downloads

Jaifroid commented 1 year ago

@veloman-yunkan Is this sandbox attribute actually in force at library.kiwix.org yet? Because if you go to https://library.kiwix.org/viewer#wiktionary_en_all_nopic_2023-02 , you clearly see that a top-level navigation occurs and the iframe is destroyed..., and if I inspect the iframe just before the offending script runs that breaks out of it, I see what's in the screenshot below.

Now, if the sandbox isn't implemented at library.kiwix.org, it means that the issues with the PDF viewer are not directly down to #906, unless I'm missing something (or my browser cache is VERY persistent, despite clearing it)...

(Though I do think #906 will block PDFs in Chrome, because I already had to patch that in Kiwix JS.) Confused 😖...

image

Jaifroid commented 1 year ago

And, btw, that referrerpolicy should probably be none, rather than same-origin.

veloman-yunkan commented 1 year ago

@Jaifroid https://library.kiwix.org runs the latest release of kiwix-serve. #906 has not been released yet. Its side effects are currently demonstrated at https://dev.library.kiwix.org (to which this ticket refers).

Jaifroid commented 1 year ago

@veloman-yunkan Thank you for clearing that up!

veloman-yunkan commented 1 year ago

If it's not possible to render it in the same iframe, maybe we should use an in-zim pdf.js but that's not as convenient

@rgaudin Another option is to embed pdf.js in kwix-serve. However, I wonder what kind of inconvenience you referred to and whether it applies to the proposed solution too.

Jaifroid commented 1 year ago

Personally, I'd say there's no real inconvenience in the simplest solution of opening the PDF in a new tab or window. In many ways, it's better UX, because the user can keep that tab to read later and carry on browsing in the iframe. It also parallels the experience with EPUBs, which download separately rather than opening in the iframe. But I get that people have different opinions about such things!

rgaudin commented 1 year ago

Bundling pdf.js inside a ZIM means displaying PDF ourselves, via pdf.js. It means creating a host webpage which won't feel exactly as the in-browser pdf.js. Also, for those with other PDF reader configured, it means rendering in PDF.js then potentially downloading (pdf.js allows this) to trigger the default PDF reader. It's far from being as comfortable as the normal situation.

As for including pdf.js in kiwix-serve, I don't see how that would help… Do you want to add a kiwix-serve specific API to use it? Do you want to intercept application/pdf responses and render them as text/html with pdf.js ?

Then you have the SW discussion all over again (reader-ZIM dependency that's not part of the spec)

rgaudin commented 1 year ago

But I get that people have different opinions about such things!

Exactly.

We can look at this from both angles though:

The trick is that kiwix-serve is both a first-class reader and a Web party so boundaries are blurred.

I don't care much about the outcome of the tab thing but I am worried that we may be starting to drop support for regular web features. @kelson42 @Popolechien what do you think?

Jaifroid commented 1 year ago

For anyone coming to this longish thread at this point, the executive summary is as follows:

mgautierfr commented 1 year ago

There is a vulnerability in Kiwix Serve and Kiwix JS related to the use of iframes to display articles: scripts in these articles can accidentally (or on purpose) navigate to remote sites and leak user data to remote servers, or they can break out of the iframe and destroy the app's controls;

No. The vulnerability is that we display a "falsely offline" version of website which can still phone home and leak user data to remote servers. This is how web works and except by blocking all connections to server, it can always append.

Before iframe, kiwix-serve was simply displaying the content in the top frame. "falsely offline" websites were obviously able to phone home. And as we were inserting the top bar in the content of the page, it was even simpler for the website to break out the app's controls. The iframe is the occasion to add some kind of protection against accidentally "falsely offline" website when it was impossible to do before. The change to iframe is not the source of the leak. However, we have exchange the website css breaking the css of the app controls against links with target=_top breaking the app controls.

There is no more user data leak than before. You can use the "no viewier/iframe" with the /content/zim_name/path/to/article to see that website can phone home all the time.

Blocking this requests was never a goal of kiwix-serve. It may change, but if it became a purpose, the solution is probably more in service worker (to control all connection going out of the displayed website) than in iframe.

This is not the behaviour the creator of the ZIM may have intended, so there is a balance to be struck between security and respect for the intentions of the ZIM creator / trusted web resource.

There is a balance between adding a new security level and not inferring in the website content (in chrome browser)


Blocking any link with target=_top (with sandboxed iframe) may not be a good idea neither. A website may be internally composed of several iframes. On iframe may display a menu with links changing the top iframe content. Allowing target=_top link breaks the (viewer) ui, but at least the website is usable. And this is difficulty patchable as other player than kiwix-serve/kiwix-js don't use iframe and so target=_top is totally valid. It would be to the viewer itself to dynamically patch (or intercept) links with target=_top and replace them with target=framename.

However, if the target=_top is not really needed, we can consider that as a bug in the scrappers and they should remove it.

Jaifroid commented 1 year ago

@mgautierfr in https://github.com/kiwix/kiwix-tools/issues/604#issuecomment-1460035994 I suggested that instead of using the sandbox attribute on the iframe, Kiwix Serve can serve all content with a CSP sandbox response header. I agree with you that this would be conceptually more elegant. However, we would still have this issue with PDF content not being rendered in Chromium browsers in the iframe, and also the issue with external links being blocked. Like in Kiwix Android, these will still have to be intercepted and opened in an external window. Adding the sandbox attribute in the response header or adding it in the iframe are exactly equivalent in terms of functionality (I've tested this in the PWA).

EDIT: It might be possible to serve only HTML with a CSP sandbox response header, but not serve PDFs with this header. It would need testing as to whether this would allow them to render in the iframe.

danielzgtg commented 1 year ago

No. The vulnerability

There are multiple versions of the bug. There is more information on Slack.

website css breaking the css of the app controls

Shadow DOM was designed for this problem.

a balance between adding a new security level and not inferring in the website content (in chrome browser)

And between fixing the website content (in Wiktionary)

A website may be internally composed of several iframes

Have we seen how archive.org does it? archive.org edits the HTML of all webpages to insert its header with the controls. It rewrites links, and this is to ensure multimedia paths work and external links are intercepted. With that strategy, we can either use the CSP HTTP header like you suggested, or we can inject a <meta> tag. archive.org does not edit the data of multimedia such as images, so PDFs should be rendered properly if we switch to that strategy.

Jaifroid commented 1 year ago

@danielzgtg I used to use only a <meta> tag CSP in Kiwix JS PWA, but unfortunately you can't use the sandbox attribute in <meta http-equiv...>. See Content-Security-Policy/sandbox:

This directive is not supported in the <meta> element or by the Content-Security-policy-Report-Only header field.

And without sandbox we can't stop top-level navigation. The other disadvantage of CSPs via <meta> is that you have to alter the HTML to inject the tag, whereas adding a CSP response header or iframe sandbox does not require altering the HTML in any way.

Rewriting links isn't necessary if you intercept the user's click on the iframe and inspect the target.

mgautierfr commented 1 year ago

There are multiple versions of the bug. There is more information on Slack.

Please put it somewhere on github (here or on https://github.com/kiwix/kiwix-tools/issues/604) has we need trace of the discussion. Slack is not public.

Shadow DOM was designed for this problem.

Give me more

Have we seen how archive.org does it? archive.org edits the HTML of all webpages to insert its header with the controls. It rewrites links, and this is to ensure multimedia paths work and external links are intercepted. With that strategy, we can either use the CSP HTTP header like you suggested, or we can inject a tag. archive.org does not edit the data of multimedia such as images, so PDFs should be rendered properly if we switch to that strategy.

We were modifying the content in kiwix-serve but we changed that especially to not modifying it. The idea is to interfere the less possible with the content in the zim file as this content should be "just working". (If not, fix the scrapper).

danielzgtg commented 1 year ago

There are multiple versions of the bug

Please put it somewhere on github

The 3 versions I had in mind are:

  1. Wiktionary not working. A zim was changing window.top.location. CLOSED FIXED by adding <iframe sandbox="[everything except allow-top-navigation]">.
  2. poc2 from https://github.com/kiwix/kiwix-js-windows/issues/377 . The <meta> wasn't being inserted if <head> is missing (possibly also affects svg or xslt if someone goes and tests) due to the regex not considering that. CLOSED FIXED by repeating the <meta> in the <head> outside the `