ukwa / ukwa-pywb

GNU General Public License v3.0

Update to new implementation of NPLD limitations #54

Open anjackson opened 4 years ago

anjackson commented 4 years ago

We are moving to a new, simpler implementation of the NPLD limitations. Rather than going through a remote desktop, clients will access Wayback directly. This means we need to do a few things:

  1. Modify the single-concurrent-use locking
  2. Limit how much text can be copied at once
  3. Use cache control headers to limit how much content gets stored on the client
  4. Prevent download of non-web content

Single-Concurrent-Use

There will be no login/logout hooks, so a simple alternative locking mechanism is proposed.

The default behaviour is that all 'top-level' URLs will be locked to a user's cookie session, with the lock set to time out at midnight that day. As before, transcluded items should not be locked. These locks are managed server-side.

To enable the lock to be released earlier, the lock can be polled and repeatedly updated from the Wayback JavaScript client, with a time-out set to a few minutes in the future. While a page is being viewed, it will still be locked to the current user, but once they move on it should time out in a few minutes as the lock is no longer being updated.

This means files that get downloaded will be locked for the whole day, but most pages should be released promptly.
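A minimal sketch of the locking behaviour described above, not the actual ukwa-pywb implementation. The `LockTable` class, its method names, the in-memory dict and the five-minute ping window are all assumptions made for illustration; a real deployment would need a shared store.

```python
from datetime import datetime, timedelta


class LockTable:
    """Hypothetical in-memory lock table: top-level URL -> (session_id, expiry)."""

    def __init__(self, ping_window=timedelta(minutes=5)):
        # Assumption: "a few minutes" is taken to be five minutes here.
        self.ping_window = ping_window
        self.locks = {}

    @staticmethod
    def next_midnight(now):
        # Default expiry: midnight at the end of the current day.
        return datetime.combine(now.date() + timedelta(days=1), datetime.min.time())

    def acquire(self, url, session_id, now):
        """Lock a top-level URL until midnight unless another session holds it."""
        holder = self.locks.get(url)
        if holder and holder[0] != session_id and holder[1] > now:
            return False
        self.locks[url] = (session_id, self.next_midnight(now))
        return True

    def ping(self, url, session_id, now):
        """Called repeatedly by the client: shorten the lock to a few minutes ahead."""
        holder = self.locks.get(url)
        if not holder or holder[0] != session_id or holder[1] <= now:
            return False
        self.locks[url] = (session_id, now + self.ping_window)
        return True
```

In this sketch a page that is never pinged (for example a direct file download) keeps its midnight expiry, while a page that was pinged and then abandoned is released once the ping window passes.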

Limit cut-and-paste

The client-side JavaScript should intervene during cut/copy events and limit the text to a configurable amount.

Limit local caching

The server should add headers to limit local caching, as per https://stackoverflow.com/questions/9884513/avoid-caching-of-the-http-responses -- this may be better done via NGINX?
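If this is done in the application rather than in NGINX, one option is a small WSGI wrapper. This is a sketch only: `NoStoreMiddleware` and `NO_STORE_HEADERS` are hypothetical names, and the exact header values are an assumption based on common practice for discouraging client-side caching.

```python
# Assumed header set; adjust to whatever policy is finally agreed.
NO_STORE_HEADERS = [
    ("Cache-Control", "no-store, no-cache, must-revalidate, max-age=0"),
    ("Pragma", "no-cache"),
    ("Expires", "0"),
]


class NoStoreMiddleware:
    """Hypothetical WSGI middleware that overrides caching headers on every response."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            # Drop any caching headers replayed from the archived response,
            # then append the restrictive ones.
            drop = {"cache-control", "pragma", "expires"}
            kept = [(k, v) for k, v in headers if k.lower() not in drop]
            return start_response(status, kept + NO_STORE_HEADERS, exc_info)

        return self.app(environ, patched_start_response)
```

Any WSGI application could be wrapped as `NoStoreMiddleware(app)`; whether this fits into how pywb builds its responses is not confirmed here.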

Prevent downloads of non-web content

We need to try to prevent content being downloaded to local machines, and use a secondary service for rendering some formats to HTML.

First step is to intercept direct downloads of content other than HTML. These will then either be blocked (probably with a custom 451 error) or passed to an external service for rendering.

We will need some lookup table that maps Content Types to URL templates, e.g.

application/msword, http://service.things.com/url={url}

Or similar. When we hit a non-web type, we should open up the block page, and if there's a mapping, offer to redirect the user to that URL for access. For all types, we should ensure the Content-Disposition header is blocked so downloads can't be forced that way.
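As a rough sketch of that lookup, assuming only `text/html` counts as a web type for now: `WEB_TYPES`, `RENDERERS` and `check_download` are hypothetical names, and the service URL is the placeholder from the example above.

```python
from urllib.parse import quote

# Assumption: a minimal set of types that are served directly.
WEB_TYPES = {"text/html"}

# Content Type -> URL template for an external rendering service.
RENDERERS = {
    "application/msword": "http://service.things.com/url={url}",
}


def check_download(content_type, headers, url):
    """Decide whether to show the block page, and strip Content-Disposition."""
    # Always remove Content-Disposition so a download cannot be forced.
    cleaned = [(k, v) for k, v in headers if k.lower() != "content-disposition"]
    # Ignore parameters such as "; charset=utf-8".
    mime = (content_type or "").split(";", 1)[0].strip().lower()
    if mime in WEB_TYPES:
        return {"blocked": False, "viewer_url": None, "headers": cleaned}
    template = RENDERERS.get(mime)
    viewer_url = template.format(url=quote(url, safe="")) if template else None
    # Non-web type: show the block page, offering the viewer link if one is mapped.
    return {"blocked": True, "viewer_url": viewer_url, "headers": cleaned}
```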

i.e. this is similar to the old Interject idea (source code & tech docs here).

anjackson commented 4 years ago

Updated following clarification of download limits.

ikreymer commented 4 years ago

To clarify, the default behavior is that a resource remains locked, unless it is explicitly unlocked by the same client, right? Otherwise, the default lock will only be for a few minutes, until it is no longer being polled.

E.g. a client with cookie A locks https://example.com/, then visits https://example.com/foobar, moving its lock to that URL and unlocking https://example.com/. If the client then closes the browser, does https://example.com/foobar remain locked until midnight?

OR

https://example.com/ is locked by A, then the client moves to https://example.com/foobar and locks that. The lock for https://example.com/ expires after a few minutes. When the user closes the session, the lock for https://example.com/foobar expires as well. If a user attempts to download a file (which would trigger an interstitial behavior as outlined), the lock is acquired and then expires. But under this approach, the lock would never last until midnight, since it would expire once it is no longer being polled by the client?

Or perhaps I am missing something?

ikreymer commented 4 years ago

Regarding 4), one possible tricky edge case may be if a certain type of resource could be either downloaded or embedded in the page. I guess maybe the only example is PDF, unless there is a custom viewer for ms-word somewhere... We know for certain that if Content-Disposition is present, it is a download; otherwise I think it is not possible to tell for sure...

And for PDFs, I think if you use the default PDF plugin the browser provides, it is still possible to download the PDF from there.. I don't think there's a way to prevent that from the default PDF viewer.

anjackson commented 4 years ago

On locks, it's the second case. The idea was:

So, the lock-till-midnight should rarely happen, as it is simply a fall-back in case for some reason the client-side locking protocol fails.

On (4), I was imagining we'd sniff the Content-Type from the WARC record on the server side and block/redirect if it's not HTML or an embed. If that makes sense?

ikreymer commented 4 years ago

Yep, kept the initial lock mechanism, but ping shortens the lock for the referring url. First pass in the above commit.

Updated docs for the new features: https://github.com/ukwa/ukwa-pywb/blob/2.4.0-beta/docs/locks.md#ping-session-refresh

anjackson commented 4 years ago

Generally, looks good. Unfortunately, I think the content-type block will need some way to act more like an allow list than a block list, e.g. unknown or unspecified formats should not be downloadable, so we'll need some way of saying 'web formats allowed' (html/jpg/css/js/png/etc.).

Which is unpleasant but necessary.

ikreymer commented 4 years ago

How about something like this?

        content_type_redirects:
          # allows
          'text/': 'allow'
          'image/': 'allow'
          'video/': 'allow'
          'audio/': 'allow'
          'application/javascript': 'allow'

          'text/rtf': 'https://example.com/viewer?{query}'
          'application/pdf': 'https://example.com/viewer?{query}'
          'application/': 'https://example.com/blocked?{query}'

          # default redirects
          '<any-download>': 'https://example.com/blocked?{query}'
          '*': 'https://example.com/blocked?{query}'

The content-disposition is checked first so always takes precedence, then exact match, followed by the mime prefix (eg. text/) match, followed by '*' wildcard. With '*' set to redirect, all unlisted mimes will be redirected.

If this makes sense, I can expand it with more mime types. It may be more convenient to move this to a separate file from config.yaml.
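To make the match order concrete, here is a sketch of how such a table might be resolved. The `resolve` function is hypothetical and not taken from the pywb code; it also assumes that any Content-Disposition header at all counts as a download.

```python
BLOCKED = "https://example.com/blocked?{query}"
VIEWER = "https://example.com/viewer?{query}"

# Same rules as the YAML above, as a plain dict.
RULES = {
    "text/": "allow",
    "image/": "allow",
    "video/": "allow",
    "audio/": "allow",
    "application/javascript": "allow",
    "text/rtf": VIEWER,
    "application/pdf": VIEWER,
    "application/": BLOCKED,
    "<any-download>": BLOCKED,
    "*": BLOCKED,
}


def resolve(rules, content_type, content_disposition=None):
    """Return 'allow' or a redirect template for the given response headers."""
    # 1. Content-Disposition is checked first and always takes precedence.
    if content_disposition and "<any-download>" in rules:
        return rules["<any-download>"]
    # Strip parameters such as "; charset=utf-8" before matching.
    mime = (content_type or "").split(";", 1)[0].strip().lower()
    # 2. Exact mime match.
    if mime in rules:
        return rules[mime]
    # 3. Mime prefix match, e.g. "text/".
    prefix = mime.split("/", 1)[0] + "/"
    if prefix in rules:
        return rules[prefix]
    # 4. Wildcard; None if no wildcard rule is configured.
    return rules.get("*")
```

With these rules, `text/html` is allowed via the `text/` prefix, `text/rtf` goes to the viewer because the exact match wins over the prefix, and an unlisted type such as `font/woff2` falls through to the `*` redirect.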