Allow publishers to inform Hypothesis about permanent URL changes for content

robertknight commented 7 years ago

Feature Request Form

Problem you are trying to address with this feature

A publisher moves content from one URL (say /articles/foo) to another (say /foo) and wants annotations made on the original URL to appear on the new URL.

We don't provide clear documentation for publishers on how to achieve this. Hypothesis does support using canonical links to 'teach' the service about the equivalence of different URLs, but it doesn't solve this problem.

Your solution

There is a standard mechanism on the web to inform search engines and browsers about the move if they encounter the old link - an HTTP redirect. Perhaps we can leverage this, at least for publicly accessible web pages:

1. Publisher moves content from URL A to URL B
2. Publisher sets up a 3xx redirect from URL A to URL B
3. Publisher submits URL A to a web-based interface or a Hypothesis API endpoint. Hypothesis then performs a HEAD request against each URL and if it returns a redirect, records a URL equivalence between URL A and URL B.
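
As a rough illustration of the steps above, here is a minimal Python sketch of the redirect check. Nothing here is existing Hypothesis code: the names `head`, `check_redirect` and `submit`, and the in-memory `equivalences` set, are all assumptions made for the example. The `fetch` parameter exists only so the check can be exercised without network access.

```python
import urllib.error
import urllib.request
from urllib.parse import urljoin


class _NoFollow(urllib.request.HTTPRedirectHandler):
    """Redirect handler that refuses to follow redirects."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None makes urllib raise HTTPError for the 3xx response,
        # which lets us inspect the status and Location header ourselves.
        return None


def head(url, timeout=10):
    """Send a HEAD request and return (status, Location header or None)."""
    opener = urllib.request.build_opener(_NoFollow)
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.status, response.headers.get("Location")
    except urllib.error.HTTPError as err:
        return err.code, err.headers.get("Location")


def check_redirect(url, fetch=head):
    """Return the redirect target of `url`, or None if it does not redirect."""
    status, location = fetch(url)
    if status in (301, 302, 303, 307, 308) and location:
        return urljoin(url, location)  # the Location header may be relative
    return None


def submit(url, equivalences, fetch=head):
    """Record (old, new) in `equivalences` when `url` redirects elsewhere."""
    target = check_redirect(url, fetch)
    if target is not None and target != url:
        equivalences.add((url, target))
    return target
```

This only inspects the first hop. A real service would presumably also need to decide how to treat redirect chains and temporary (302/307) redirects.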

Why require a redirect instead of just allowing the publisher to submit equivalences directly? It provides proof that the redirect is valid - ie. that URL A really does redirect to URL B.

seanh commented 7 years ago

Hmm. And in theory, if we ourselves discovered some URL(s) that were now redirecting, we could submit them to this API ourselves.

As a matter of fact there seems to be nothing (other than scale) preventing me from running a script on my laptop that slowly works its way through all annotated documents in Hypothesis fetching each document and, if it gets a 3xx, submitting it to this API.
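
A sketch of what such a script's main loop might look like, in Python. The `resolve` callable is hypothetical: it stands in for whatever checks a URL and returns its redirect target (or `None`), and `find_moved` is a name made up for this example. The delay between requests is the "slowly" part.

```python
import time


def find_moved(urls, resolve, delay_seconds=1.0, sleep=time.sleep):
    """Yield (old_url, new_url) for each URL that `resolve` reports as redirected.

    `resolve(url)` is assumed to return the redirect target, or None when the
    URL does not redirect. Requests are spaced out by `delay_seconds`.
    """
    for index, url in enumerate(urls):
        if index:
            sleep(delay_seconds)  # pause between requests, not before the first
        target = resolve(url)
        if target is not None and target != url:
            yield url, target
```

Each pair it yields could then be submitted to the proposed endpoint.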

Seems like a nice idea.

robertknight commented 7 years ago

And in theory, if we ourselves discovered some URL(s) that were now redirecting, we could submit them to this API ourselves.

Right. Relying on redirects having been set up means that the interface/endpoint for submitting the original URLs doesn't require any authentication - other than perhaps for rate limiting submissions.

judell commented 7 years ago

+1. Very helpful to publishers who can control the redirection from an old to new namespace. What do you think could/should be done for those who can't, for one reason or another, control the redirection? (Prior thinking: https://trello.com/c/pm4l5DqP/17-allow-admins-to-update-document-equivalence-data)

robertknight commented 7 years ago

What do you think could/should be done for those who can't, for one reason or another, control the redirection?

Can you think of some examples of why this might be the case? Relying on redirects is important in allowing us to trust that url A really has moved to url B.

judell commented 7 years ago

Example: A site hosted on wordpress.com.

robertknight commented 7 years ago

Surely the user can set up redirects on Wordpress?

jeremydean commented 7 years ago

can we train the client to recognize these situations and make the equivalences automatically? one issue is that publishers will not always know about hypothes.is annotations and wouldn't then have the foresight to submit changes/equivalences.

judell commented 7 years ago

I am not recommending the following, just noting a possibility for client-side intervention:

https://docs.google.com/document/d/1SrtcRXO_Ib_Ka8ysYY07UqBVUSzSxh33CW72F2JIkak

In theory we could provide a client setting that enables a user to override a stale canonical when the user knows what it should now be. On the one hand, it's crazy to expect anyone would do that. On the other hand, a user who has just lost the use of a batch of annotations is highly motivated to fix that and might be willing to configure H with an old->new domain mapping.

Such a mechanism would not replace the proper solution Robert proposes, but could complement it, and be a fallback for those cases where publishers are unwilling/unable to make adjustments on their end.

sean-roberts commented 7 years ago

This still seems like it would be an admin tool for us to use and is an improvement to the current database update model. Because if we are wanting to make this self-service, how do we know someone is a publisher? And would that work of research + assigning that role to an account be worth it just to make it self-service? I'd say no, but I am interested in thoughts on that.

Why require a redirect instead of just allowing the publisher to submit equivalences directly? It provides proof that the redirect is valid - ie. that URL A really does redirect to URL B.

This approach, which I have no problem with, only shows that a redirection happened. It doesn't actually tell us if the redirect still contains the annotations.

In addition, these approaches are more for "publishers helping their users" which is valid but that's only one side of it. There are users who run into this all the time and that publisher may have no idea what Hypothesis is.

So, since you have given a solution which is pretty simple and a much-needed improvement to manual SQL writes, I'd like to brainstorm a more comprehensive solution. Here's what I am wondering: can we build something that can anchor annotations on a page from the server? If we can do that, we can identify if page X has all annotations we expect to see on it. That would allow us to be better at what you suggested or completely automate this process (ultimately providing a better experience for users and publishers). That automation could be triggered by a few things: a manual request via domain, the client seeing 3xx responses + a mismatched document, or a schedule that checks annotated URLs automatically for 3xx responses.

I have thought about that service a bit. I think a two-phase approach would be good. Phase 1 is a simple request to the document to get the HTML back as a string, then use the anchoring libraries that we have to try to find the annotated quotes. Phase 2, if phase 1 anchors nothing, would submit the url to a phantomjs server that will request the page + inject our annotation layer in. If it finds the annotations then we can have confidence that the change is good. Phase 1 is quick and will likely provide a very high success rate but it will fail to find annotations if the annotated content is rendered by some JS. Phase 2 is much slower but comes at it as if it were an actual user from the client browser.

Having a service like that would be really useful for a lot of reasons but having a tool to make confident decisions (if only for a sample of known affected uris) on changing document locations is huge.

All that said, I am not trying to blow up the scope of your suggestion. Just wondering, since the team seems to like the suggestion you have made, what it would look like to have a more automatable tool.

sean-roberts commented 7 years ago

can we train the client to recognize these situations and make the equivalences automatically? one issue is that publishers will not always know about hypothes.is annotations and wouldn't then have the foresight to submit changes/equivalences.

Just saw that comment, I 100% agree.

judell commented 7 years ago

"can we build something that can anchor annotations on a page from the server?"

I've had that thought too, and agree it is interesting and useful for a number of reasons.

"submit the url to a phantomjs server that will request the page + inject our annotation layer in."

I did a version of that when I first started at H. It used Selenium WebDriver, ran on a client, injected the H app into the client, and would then use JS to compare what the API reported to what actually anchored. It helped us get through a major rewrite of the anchoring subsystem, but wasn't well suited to automation. @robertknight subsequently looked at this and made it much better, but it never found its way into our system. I still think something like it could be valuable.

That said, I also think telemetry from the client could take us a long way, as I think you have elsewhere suggested, if we can arrange to be alerted when, say, a client that was formerly reporting that 74 annotations anchored at a given URL starts reporting that 0 are anchoring.
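
To make the alerting idea concrete, here is a small Python sketch of the kind of check that could run over successive client reports for a URL. The function name and both thresholds are made up for illustration, not tuned values.

```python
def anchoring_collapsed(previous_anchored, current_anchored,
                        min_previous=5, max_ratio=0.1):
    """Return True when the anchored count has dropped sharply.

    `min_previous` ignores URLs that never had many anchored annotations;
    `max_ratio` is the fraction of the earlier count at or below which the
    drop is treated as a collapse. Both values are illustrative assumptions.
    """
    if previous_anchored < min_previous:
        return False
    return current_anchored <= previous_anchored * max_ratio
```

With these defaults, a report going from 74 anchored annotations to 0 would be flagged, while a drop from 74 to 70 would not.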

robertknight commented 7 years ago

This approach, which I have no problem with, only shows that a redirection happened. It doesn't actually tell us if the redirect still contains the annotations.

To be pedantic, you mean that it doesn't tell us if the content returned after following the redirect is the same as, or close enough to, the original annotated content.

That's a good point. A concrete example would be an article that is behind a login form or paywall. If the Hypothesis service fetches the original URL, it may get redirected to the login page which does not contain the annotated content. Let's further assume that the login system is not well designed and the login page URL doesn't include the original requested URL (eg. in the query string).

Setting up server-side rendering of pages and loading of the client into them is probably going to be a substantial task - both to set up and maintain. Performing a sanity check on the content to test whether the quotes from annotations on the original URL appear within the text of the fetched page on the other hand sounds like a good idea and somewhat simpler - providing the content appears in the HTML of the initial URL fetch rather than requiring client-side JS execution.
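
A minimal Python sketch of that sanity check, using only the standard library. It is not the client's anchoring code: it extracts the static text from the fetched HTML and reports what fraction of the annotation quotes appear in it verbatim (after whitespace and case normalization). The names `page_text` and `quote_match_ratio` are assumptions for this example.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def normalize(text):
    """Collapse runs of whitespace and lowercase the text."""
    return " ".join(text.split()).lower()


def page_text(html):
    """Return the normalized static text of an HTML document."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with a space keeps adjacent block elements apart, at the cost
    # of splitting a word that has an inline tag in the middle of it.
    return normalize(" ".join(parser.parts))


def quote_match_ratio(html, quotes):
    """Return the fraction of `quotes` found verbatim in the page text."""
    if not quotes:
        return 0.0
    text = page_text(html)
    found = sum(1 for quote in quotes if normalize(quote) in text)
    return found / len(quotes)
```

A low ratio on the redirect target (for example a login page) would suggest the redirect should not be recorded as an equivalence without review.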

robertknight commented 7 years ago

"can we build something that can anchor annotations on a page from the server?" I've had that thought too, and agree it is interesting and useful for a number of reasons.

Agreed, but I think we would want to approach that in phases - initially just matching quotes against static text extracted from the HTML. Later we could potentially implement a more sophisticated solution that actually executed/rendered the page on the server.

I think both of these are out of scope for an initial implementation of this idea - a much dumber solution would be that the system prints out a list of original and new URLs, a count of annotations on the original URL and asks the user to eyeball the results and approve them.
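
For illustration, the listing could be as simple as the following Python sketch. `review_report` and the tuple layout are assumptions made for this example; where the annotation counts come from is left out.

```python
def review_report(candidates):
    """Format (old_url, new_url, annotation_count) tuples for manual review.

    Rows are sorted so the URLs with the most annotations come first.
    """
    lines = []
    for old_url, new_url, count in sorted(candidates, key=lambda c: -c[2]):
        lines.append(f"{count:>6}  {old_url} -> {new_url}")
    return "\n".join(lines)
```

The reviewer would then approve or reject each row before any equivalence is recorded.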

dwhly commented 6 years ago

a much dumber solution would be that the system prints out a list of original and new URLs, a count of annotations on the original URL and asks the user to eyeball the results and approve them.

Agree w/ this. How about something that just allowed a trusted insider to submit a URL substring to map to another URL substring (since most URL changes preserve at least part of the original path) without needing to fetch the original page at all (since often pages will be behind firewalls, either in school systems or in paywalled journals, media)?

judell commented 6 years ago

An example we heard about today.

http://viatourism.revues.org/{id} redirects to http://journals.openedition.org/viatourism/{id}

For example:

http://viatourism.revues.org/1339 -> http://journals.openedition.org/viatourism/1339

Several batches of annotations on several openedition.org journals are now stranded.
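
The substring mapping suggested earlier in the thread would cover this case. A minimal Python sketch, assuming the simpler variant where the mapped substring is a URL prefix (`remap_url` is a name made up for this example):

```python
def remap_url(url, old_prefix, new_prefix):
    """Rewrite `url` from `old_prefix` to `new_prefix`.

    Returns None when the URL does not start with `old_prefix`, so callers
    can tell "no mapping applies" apart from a rewritten URL.
    """
    if url.startswith(old_prefix):
        return new_prefix + url[len(old_prefix):]
    return None


# Usage with the example above:
new_url = remap_url(
    "http://viatourism.revues.org/1339",
    "http://viatourism.revues.org/",
    "http://journals.openedition.org/viatourism/",
)
```

Here `new_url` is `http://journals.openedition.org/viatourism/1339`.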

jeremydean commented 6 years ago

Here's another example:

digitalpedagogylab.com/hybridped/ --> hybridpedagogy.org

for example:

http://www.digitalpedagogylab.com/hybridped/teaching-in-our-right-minds/

-->

http://hybridpedagogy.org/teaching-in-our-right-minds/

this is the second time that the publisher has changed their domain. previously, christof created this PR to help annotations re-anchor:

https://github.com/hypothesis/h/pull/3416

@robertknight is the above repurposable for this particular (digped/hybridped) update? or is more work required?

klemay commented 6 years ago

There are users who run into this all the time and that publisher may have no idea what Hypothesis is.

Another example: https://hypothesis.zendesk.com/agent/tickets/2283

klemay commented 6 years ago

@judell has built a proof of concept for the extension as described in https://github.com/hypothesis/vision/issues/202:

When navigating to a page with Via or a browser extension, both could be aware of the fact that they have arrived at the current URL via a redirection and they could provide that information to our service.

judell commented 6 years ago

@klemay Note that so far I've only tried the approach in https://github.com/hypothesis/vision/issues/202, which is client-side redirection. It's somewhat helpful, but only when you follow a link from, say, www.digitalpedagogylab.com/hybridped/teaching-in-our-right-minds/, and are redirected to http://hybridpedagogy.org/teaching-in-our-right-minds/. The approach outlined here is more important, because it will yield the result people will expect. Even without the browser seeing a redirect, we'll know about the change and handle it in the server.

The approach proposed here would be implemented in the server. The site owner would point us at, say, www.digitalpedagogylab.com/hybridped/teaching-in-our-right-minds/, we'd see that it redirects, and record the equivalence for later use. This would be the general solution, the issues/202 method would only come into play for people visiting old URLs that had not been updated by this method.

ajpeddakotla commented 5 years ago

Also this issue, and @robertknight's comment in particular: https://github.com/hypothesis/vision/issues/222

Some very good questions! To keep things focused, I'll give a short answer to your initial question based on the status today and some thinking we've done in the past.

Today, if a document moves from one URL to another then Hypothesis will not automatically learn about the change. Documents served from a particular URL can however declare their equivalence to other URLs by including appropriate metadata. Once our backend learns about the equivalence, it will serve up the same annotations regardless of which URL in a set of equivalent URLs is requested.
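
One way to picture "a set of equivalent URLs" is as a union-find structure, sketched below in Python. This is only an illustration of the behaviour described, not how the backend actually stores equivalences. The class, the `annotations_for` helper and the in-memory annotation store are all assumptions.

```python
class UrlEquivalences:
    """Track sets of equivalent URLs with a simple union-find."""

    def __init__(self):
        self._parent = {}

    def _find(self, url):
        self._parent.setdefault(url, url)
        root = url
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every URL on the path directly at the root.
        while self._parent[url] != root:
            self._parent[url], url = root, self._parent[url]
        return root

    def declare_equivalent(self, url_a, url_b):
        """Merge the equivalence sets containing `url_a` and `url_b`."""
        root_a, root_b = self._find(url_a), self._find(url_b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def equivalent_urls(self, url):
        """Return every URL known to be equivalent to `url`, itself included."""
        root = self._find(url)
        return {u for u in list(self._parent) if self._find(u) == root}


def annotations_for(url, equivalences, annotations_by_url):
    """Return annotations stored under any URL equivalent to `url`."""
    found = []
    for equivalent in sorted(equivalences.equivalent_urls(url)):
        found.extend(annotations_by_url.get(equivalent, []))
    return found
```

Once two URLs have been declared equivalent, looking up either one returns the annotations made on both.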

Documents can declare persistent URL-independent identifiers (eg. DOIs) which the client will pick up from standard meta-tags and use for fetching annotations. This allows the same set of annotations to be surfaced when the user visits the same document at different URLs, even if no equivalences between those URLs have previously been established in our database.
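
As a sketch of what picking up such an identifier could look like, the Python below scans a page's meta tags for a DOI. The tag names `citation_doi` and `dc.identifier` are an assumption about which "standard meta-tags" are meant, the `doi:` prefix on the result is an illustrative convention, and `find_doi` is a made-up name. The client's real logic may differ.

```python
from html.parser import HTMLParser


class _MetaCollector(HTMLParser):
    """Collect (name, content) pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attributes = dict(attrs)
            name = (attributes.get("name") or "").lower()
            content = attributes.get("content")
            if name and content:
                self.meta.append((name, content.strip()))


def find_doi(html):
    """Return 'doi:<value>' for the first DOI meta tag found, else None."""
    collector = _MetaCollector()
    collector.feed(html)
    collector.close()
    for name, content in collector.meta:
        if name in ("citation_doi", "dc.identifier"):
            # Accept values with or without a leading "doi:" prefix.
            value = content[4:] if content.lower().startswith("doi:") else content
            if value.startswith("10."):  # DOIs begin with the "10." prefix
                return "doi:" + value
    return None
```

Two pages at different URLs that yield the same identifier can then be treated as the same document when fetching annotations.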

We like to re-use existing features of the web where appropriate, in keeping with our "mission" of enhancing the capabilities of the web. One such approach to handling moved content would be to leverage redirects. The site owner establishes a redirect from the old to the new URL and Hypothesis provides an interface where a user (anyone) can submit a URL and our service will follow the redirect and establish an equivalence between the old and new URLs based on that. There are some challenges to deal with, such as what happens if the redirect is dynamic, but I think those are solvable.

hypothesis / product-backlog