Determine what is missing capability wise (if anything) for good enough rewriting

hypothesis / via

Proxies third-party PDF files and HTML pages with the Hypothesis client embedded, so you can annotate them

https://via.hypothes.is/

BSD 2-Clause "Simplified" License

20 stars 7 forks source link

Determine what is missing capability wise (if anything) for good enough rewriting #154

Closed jon-betts closed 4 years ago

jon-betts commented 4 years ago

Taking the most capable and easiest to extend rewriter produced in the POCs we should:

Perform a visual inspection of each URL
Give a quick usable/un-usable rating for each page
For each un-usable page (if any) we should spend a small amount of time attempting to work out which extra capability would fix it (see https://github.com/hypothesis/via3/issues/152)
If the fix is easy to implement we should try and re-test

Output

A list from our example set of URLs which are not usable with the current best POC
A guess at why, or "unknown" if it can't be determined quickly
A percentage of URLs usable at this point

What counts as usable?

The main text is visible and not obscured in any place
No obnoxious scrolling to find or read the content (horizontal or vertical)
The main content is not hard to distinguish from surrounding content (where it is normally easy)

(edit) I think the "best" solution here is a bit debateable. The most fully featured, which this meant, it probably the lxml solution. But I think the one we'll carry on with is the lxml_streaming, so I'm going to conduct the investigation with that. I might dip into both to get an idea of whether it's easily fixed by something I've already done.

jon-betts commented 4 years ago

Ok, we have a serious cookie problem. We're accumulating cookies like mad as you browse about. Eventually we accumulate so many that the header gets big enough for NGINX to reject it with:

400 Bad Request

Request Header Or Cookie Too Large

I'm not sure where we'd start with this one.

robertknight commented 4 years ago

We're accumulating cookies like mad as you browse about.

Aren't you stripping the Set-Cookie header out of the response? I believe legacy Via strips this header for all requests, so we can do the same as a starting point. We really don't want to have cookies intended for one proxied origin sent to requests made to another proxied origin.

I think we will eventually need to support cookies in order to make certain sites work, but we would need to look into some way of isolating them to the proxied origin, which would very likely involve modifying the cookie (eg. to add Domain= or Path= scoping).

jon-betts commented 4 years ago

The majority of cookies are added by advertising Javascript, so we can't do much about it from the backend. We are now proxying certain cookie headers in the rewriters by appending a modified path to them. The direct NGINX proxy strips certain Hypothesis cookies and allows others (with if statements and regexes... hmm).

To support this the URLs look a bit goofy now. If you have a url at scheme://host/path?params

You end up with a url like: http://localhost:9093/html/scheme/host/path?url=<urlencoded url again>

This allows us to set cookies by path, and is more readable for the user. I don't think we can get away without the full url again, as we have learned from Canvas that there's no way to preserve it accurately once it's been through the wringer of WSGI and friends if it had encoded chars in it. For some URLs we must keep the exact characters as they are signed with an authentication param.

jon-betts commented 4 years ago

So I've had a decent shot at patching this up and the final numbers are in for now.

There's one odd discrepancy which is www.mytimes.com, which loads and looks perfect, but really murders your CPU for a while. I don't know if we count that as usable or not. I'll include both.

There are two numbers presented. One is the direct number of the 100 tested that worked, the other takes into account the frequency with which they are used in assignments at the present time (of the top 100).

Counting mytimes as un-usable

Sites usable: 86%
Assignments usable: 84%

Counting nytimes as usable

Sites usable: 87%
Assignments usable: 91%

(reminder) - I should add the percentage of the top 1000 we have represented here. Add caveats about "medium.com" and small sample size.

jon-betts commented 4 years ago

Some comparatives with original flavour Via. The percentages are percentage count of the top 100 and top 1000.

Legacy Via

ok - Site: 96, Count: 1975 (95.27%) (54.04%)
bust - Site: 3, Count: 80 (3.86%) (2.19%)
other - Site: 1, Count: 18 (0.87%) (0.49%)

New Via

ok - Site: 84, Count: 1712 (82.59%) (46.84%)
bust - Site: 14, Count: 327 (15.77%) (8.95%)
other - Site: 2, Count: 34 (1.64%) (0.93%)

Comparative

better 9 - Both render the page, we render it better (small errors)
same 77
worse 2 - Both render the page, we render it worse (small errors)
newly_broken 11 - Via renders it successfully, we don't
other 1

Of note: Via is seriously broken in dev. I don't know why (SSL?) but it renders absolute gibberish for a lot of sites.

jon-betts commented 4 years ago

Some page specific ideas about what we need

Medium

https://medium.com/@Bhasingursifath/graffiti-art-or-vandalism-4ecda1f780ab

Certain sites (like Medium) construct URLs based on the scheme of the window. In dev we are running with http not https and so the wrong URL is constructed.

This isn't the reason Medium is failing, but it could be for other sites.

NPR

https://www.npr.org/2017/03/21/520820142/soy-marr-n

The initial response from NPR is a redirect. We need to handle this as a redirect ourselves, or the base doesn't match the content etc.

Then it sets cookies with javascript. We don't do anything about this and it's a total mess. We are leaking state between sites as well.

Finally it calls window.location to change the URL, which need to intercept somehow. This is apparently done in Via by actually regexing the JS content to insert a custom function which intercepts the call. This has side effects, as the regexing sometimes rewrites things it shouldn't.

Other sites

Some random things that are failing or won't work. Not sure these are the causes of anything:

Web workers won't register on the wrong domain
Anything that relies on the window location to do routing, like single page apps
We aren't currently rewriting in page CSS so inline fonts will probably fail

jon-betts commented 4 years ago

Super complex sites with no obvious failure reason

A surprisingly large number of sites seem to load fine, and then spot something is wrong and hide their content (e.g. https://www.fastcompany.com/3024273/infographics-lie-heres-how-to-spot-the-bs?) I'm not sure what tips them off.

In general quite a few of these aren't throwing any errors or failing to retrieve any content. They just inspect the general state and decide they don't like it.

This will be a power of guesswork and investigation to work out why. Which is to say, we honestly don't know what the gap we are missing is yet.

robertknight commented 4 years ago

I looked into medium.com and Medium-platform sites and was able to get them working by making window.location/location return the original URL rather than the proxied URL. See https://hypothes-is.slack.com/archives/C4K6M7P5E/p1591614783488800 for details and the html-poc-window-proxying branch for the code.

In that Slack thread I also identified that the principle issue with the npr.org site is the fact that, for EU visitors, it redirects to a GDPR login page and the proxied URL's <base> tag reflects the original URL rather than the redirected URL and hence various JS/CSS assets fail to load. An important thing to be aware of with GDPR banners is that many of those screens/popups won't be shown in production as requests from QA/production Via always come from a US location. It is still worth figuring out how to handle these though in case US privacy law changes result in US visitors seeing these in future.

robertknight commented 4 years ago

A surprisingly large number of sites seem to load fine, and then spot something is wrong and hide their content (e.g. https://www.fastcompany.com/3024273/infographics-lie-heres-how-to-spot-the-bs?) I'm not sure what tips them off.

This URL is now working on the html-poc branch for me. I didn't implement any specific changes to resolve this URL, so one of the other changes (most likely the window/window.location patching) got it working.

jon-betts commented 4 years ago

Some of the previous broken sites are now working, but other previously working sites are now broken. The new approach relies on spotting things with regexes, which is fragile in a way intercepting and monkey patching fetch isn't.

On the plus side, this gives us our first scalp over Via (BBC now works), but I honestly don't have the heart to go through all 100 again. I think the conclusion wouldn't be changed even if I did.

We can do this at the cost of it not currently being reliable, and any attempt to work out if we have made things better or worse is very costly. In general I suspect this pattern of increased complexity and not knowing if we are close or far from the end is likely to continue. This approach of patching could be one tweak away from being perfect, or fundamentally flawed. I doubt you could know until you conclusively got it working for every site (which there is no way to be sure of).

List of missing capabilities

Client side cookie rewriting solution - Required
Location interception - Required - We have something here, but it causes errors
Inline JS rewriting - Required (as above, same solution but needs to be applied inline)
CSS inline rewriting - Nice to have (makes fonts always work)
??? - Unknown, unknowns - We do not know what is required to fix remaining sites, the above could do it, or not?

Unlisted in the above is making everything we have work correctly and be robust. So we "have" many capabilities but also not really: they aren't finished. For example there have been updates to the Python side which I haven't crossed over into the JS side.