Ok, we have a serious cookie problem. We're accumulating cookies like mad as you browse about. Eventually we accumulate so many that the header gets big enough for NGINX to reject it with:
```
400 Bad Request
Request Header Or Cookie Too Large
```
I'm not sure where we'd start with this one.
> We're accumulating cookies like mad as you browse about.
Aren't you stripping the `Set-Cookie` header out of the response? I believe legacy Via strips this header for all requests, so we can do the same as a starting point. We really don't want cookies intended for one proxied origin being sent on requests made to another proxied origin.
I think we will eventually need to support cookies in order to make certain sites work, but we would need to look into some way of isolating them to the proxied origin, which would very likely involve modifying the cookie (e.g. adding `Domain=` or `Path=` scoping).
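To make the `Path=` scoping idea concrete, here's a minimal sketch of rewriting a `Set-Cookie` header, assuming the `/html/<scheme>/<host>/...` URL layout described below. The function name and prefix handling are illustrative only, not actual Via code:

```python
from http.cookies import SimpleCookie

def scope_set_cookie(header_value, proxy_prefix):
    """Rewrite one upstream Set-Cookie header so the cookie only applies
    under the path prefix we serve that origin from (e.g.
    "/html/https/example.com"). Illustrative sketch only.
    """
    cookie = SimpleCookie()
    cookie.load(header_value)

    rewritten = []
    for morsel in cookie.values():
        # Drop any Domain attribute so the cookie can't escape our host.
        morsel["domain"] = ""
        # Scope the Path to the proxied origin's prefix.
        morsel["path"] = proxy_prefix + (morsel["path"] or "/")
        rewritten.append(morsel.OutputString())

    return rewritten


scope_set_cookie("session=abc; Path=/; Domain=.example.com",
                 "/html/https/example.com")
# -> ['session=abc; Path=/html/https/example.com/']
```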
The majority of cookies are added by advertising JavaScript, so we can't do much about it from the backend. We are now proxying certain cookie headers in the rewriters by appending a modified path to them. The direct NGINX proxy strips certain Hypothesis cookies and allows others (with if statements and regexes... hmm).
To support this the URLs look a bit goofy now. If you have a URL at `scheme://host/path?params` you end up with a URL like `http://localhost:9093/html/scheme/host/path?url=<urlencoded url again>`.
This allows us to set cookies by path, and is more readable for the user. I don't think we can get away without including the full URL again, as we have learned from Canvas that there's no way to preserve it accurately once it's been through the wringer of WSGI and friends if it had encoded characters in it. For some URLs we must keep the exact characters, as they are signed with an authentication param.
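For illustration, a rough sketch of building that URL shape; `proxy_url` and `PROXY_ROOT` are placeholder names for this thread, not the real code:

```python
from urllib.parse import quote, urlsplit

PROXY_ROOT = "http://localhost:9093/html"  # the dev instance mentioned above

def proxy_url(original_url):
    """Build a proxied URL of the shape described above:

        <proxy>/html/<scheme>/<host>/<path>?url=<url-encoded original>

    The readable scheme/host/path part is what lets us scope cookies by
    path; the `url` param carries the exact original string, since
    percent-encoded characters (e.g. in signed URLs) must survive intact.
    """
    parts = urlsplit(original_url)
    readable = f"{parts.scheme}/{parts.netloc}{parts.path or '/'}"
    return f"{PROXY_ROOT}/{readable}?url={quote(original_url, safe='')}"


proxy_url("https://example.com/doc?sig=a%2Fb")
# -> 'http://localhost:9093/html/https/example.com/doc?url=https%3A%2F%2Fexample.com%2Fdoc%3Fsig%3Da%252Fb'
```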
So I've had a decent shot at patching this up and the final numbers are in for now.
There's one odd discrepancy, which is www.nytimes.com: it loads and looks perfect, but really murders your CPU for a while. I don't know whether we count that as usable or not. I'll include both.
There are two numbers presented. One is the raw count of the 100 tested sites that worked; the other weights each site by how frequently it is currently used in assignments (within the top 100).
Counting nytimes as unusable
Counting nytimes as usable
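To make the two ways of counting concrete, here's a toy illustration; the per-site assignment counts below are made up, since the real figures aren't reproduced in this thread:

```python
# Made-up example data: site -> (usable?, number of assignments using it).
results = {
    "en.wikipedia.org": (True, 400),
    "www.nytimes.com": (True, 250),
    "medium.com": (False, 150),
}

# Raw metric: what fraction of the tested sites work at all.
raw = sum(ok for ok, _ in results.values()) / len(results)

# Weighted metric: what fraction of actual assignment usage is covered.
weighted = (
    sum(count for ok, count in results.values() if ok)
    / sum(count for _, count in results.values())
)

print(f"raw: {raw:.0%}, usage-weighted: {weighted:.0%}")
# raw: 67%, usage-weighted: 81%
```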
(reminder) - I should add the percentage of the top 1000 we have represented here. Add caveats about "medium.com" and small sample size.
Some comparisons with original-flavour Via. The percentages are counts expressed as a percentage of the top 100 and of the top 1000.
Legacy Via
New Via
Comparative
Of note: Via is seriously broken in dev. I don't know why (SSL?) but it renders absolute gibberish for a lot of sites.
Some page-specific ideas about what we need:
https://medium.com/@Bhasingursifath/graffiti-art-or-vandalism-4ecda1f780ab
Certain sites (like Medium) construct URLs based on the scheme of the window. In dev we are running with `http`, not `https`, and so the wrong URL is constructed.
This isn't the reason Medium is failing, but it could be for other sites.
https://www.npr.org/2017/03/21/520820142/soy-marr-n
The initial response from NPR is a redirect. We need to handle this as a redirect ourselves, or the `<base>` doesn't match the content, etc.
Then it sets cookies with JavaScript. We don't do anything about this and it's a total mess. We are leaking state between sites as well.
Finally it sets `window.location` to change the URL, which we need to intercept somehow. This is apparently done in Via by regexing the JS content to insert a custom function which intercepts the call. This has side effects, as the regexing sometimes rewrites things it shouldn't.
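As a very rough sketch of that regexing approach (much cruder than the real Via/pywb rewriter; the wrapper name is just borrowed from pywb's style and is not our code):

```python
import re

# Crude illustration of the "regex the JS" approach: route references to
# window.location / bare location through an injected wrapper object.
# The real rewriter is far more careful than this single pattern.
LOCATION_RE = re.compile(r"\bwindow\.location\b|(?<![\w.])\blocation\b")

def rewrite_js(js_source):
    return LOCATION_RE.sub("WB_wombat_location", js_source)


rewrite_js("if (consent) { window.location = 'https://www.npr.org/'; }")
# -> "if (consent) { WB_wombat_location = 'https://www.npr.org/'; }"

# ...and the side effects mentioned above: here "location" is just an
# object key, but it gets renamed anyway, breaking the page's own code.
rewrite_js("var geo = {location: 'US'}; report(geo.location);")
# -> "var geo = {WB_wombat_location: 'US'}; report(geo.location);"
```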
Some random things that are failing or won't work. Not sure these are the causes of anything:
A surprisingly large number of sites seem to load fine, and then spot something is wrong and hide their content (e.g. https://www.fastcompany.com/3024273/infographics-lie-heres-how-to-spot-the-bs?) I'm not sure what tips them off.
In general, quite a few of these aren't throwing any errors or failing to retrieve any content; they just inspect the general state and decide they don't like it.
Working out why will take a lot of guesswork and investigation. Which is to say, we honestly don't know yet what gap we are missing.
I looked into medium.com and Medium-platform sites and was able to get them working by making `window.location`/`location` return the original URL rather than the proxied URL. See https://hypothes-is.slack.com/archives/C4K6M7P5E/p1591614783488800 for details and the `html-poc-window-proxying` branch for the code.
In that Slack thread I also identified that the principal issue with the npr.org site is the fact that, for EU visitors, it redirects to a GDPR login page and the proxied URL's `<base>` tag reflects the original URL rather than the redirected URL, and hence various JS/CSS assets fail to load. An important thing to be aware of with GDPR banners is that many of those screens/popups won't be shown in production, as requests from QA/production Via always come from a US location. It is still worth figuring out how to handle these, though, in case US privacy law changes result in US visitors seeing them in future.
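A minimal sketch of handling that on the server side, assuming the Python side fetches pages with `requests` (the function name is illustrative): follow the redirect ourselves, then build the `<base>` tag from the URL we actually ended up at.

```python
import requests

def fetch_for_rewriting(url):
    """Fetch the target document, following redirects server-side so the
    rewriter knows the final URL, not just the one the user asked for.
    """
    response = requests.get(url, allow_redirects=True, timeout=10)

    # response.url is the post-redirect URL (e.g. NPR's GDPR page for EU
    # visitors). The <base> tag and any relative asset URLs should be
    # resolved against this URL.
    return response.text, response.url


# html, base_href = fetch_for_rewriting(
#     "https://www.npr.org/2017/03/21/520820142/soy-marr-n")
```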
> A surprisingly large number of sites seem to load fine, and then spot something is wrong and hide their content (e.g. https://www.fastcompany.com/3024273/infographics-lie-heres-how-to-spot-the-bs?) I'm not sure what tips them off.
This URL is now working on the `html-poc` branch for me. I didn't implement any specific changes to resolve this URL, so one of the other changes (most likely the `window`/`window.location` patching) got it working.
Some of the previously broken sites are now working, but other previously working sites are now broken. The new approach relies on spotting things with regexes, which is fragile in a way that intercepting and monkey-patching `fetch` isn't.
On the plus side, this gives us our first scalp over Via (BBC now works), but I honestly don't have the heart to go through all 100 again. I think the conclusion wouldn't be changed even if I did.
We can do this at the cost of it not currently being reliable, and any attempt to work out if we have made things better or worse is very costly. In general I suspect this pattern of increased complexity and not knowing if we are close or far from the end is likely to continue. This approach of patching could be one tweak away from being perfect, or fundamentally flawed. I doubt you could know until you conclusively got it working for every site (which there is no way to be sure of).
Not listed above is the work of making everything we have correct and robust. So we "have" many capabilities, but not really: they aren't finished. For example, there have been updates on the Python side which I haven't carried over to the JS side.
Taking the most capable and easiest to extend rewriter produced in the POCs we should:
Output
What counts as usable?
(edit) I think the "best" solution here is a bit debatable. The most fully featured, which is what this meant, is probably the `lxml` solution. But I think the one we'll carry on with is the `lxml_streaming` one, so I'm going to conduct the investigation with that. I might dip into both to get an idea of whether it's easily fixed by something I've already done.
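For reference, the core of a whole-document (non-streaming) lxml rewrite might look roughly like this. `make_links_absolute` and `rewrite_links` are real `lxml.html` helpers; `proxy_url` is the illustrative helper sketched earlier in the thread, and none of this is the actual POC code. The `lxml_streaming` flavour presumably does the same attribute rewriting incrementally on parser events so the response can be streamed, rather than building a full tree.

```python
import lxml.html

def rewrite_page(html_text, document_url, proxy_url):
    """Sketch of a whole-document lxml rewrite: resolve relative links
    against the document's URL, then route every link through the proxy.
    """
    doc = lxml.html.document_fromstring(html_text)

    # make_links_absolute / rewrite_links walk href, src, action, etc.
    # attributes for us; we only supply the replacement function.
    doc.make_links_absolute(document_url)
    doc.rewrite_links(proxy_url)

    return lxml.html.tostring(doc, encoding="unicode")
```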