Webmention verification

hvdsomp commented 8 years ago

The requirement to do an HTTP GET on the source and to verify whether it indeed references the target excludes important use cases, for example in web-based scholarly communication. I will explain by means of a very hot topic: linking publications with datasets. Other cases exist.

For publication/dataset linking, the publication (source) would use Webmention to inform the dataset (target) that it is being referenced in the paper. Typically:

Both the publication and the dataset are identified by a DOI, say, respectively http://dx.doi.org/12.34/567 and http://dx.doi.org/76.54/321. These are the URIs one would be inclined to use as source and target.
When dereferencing these URIs and following all redirects, one ends up on a so-called landing page that provides an abstract of the DOI-identified resource. The actual content is "somehow" linked from that page.
In many cases, the publication (source) is a PDF that sits behind a paywall.
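For reference, the notification itself is small. The sketch below builds the form-encoded POST carrying `source` and `target` without sending it. The endpoint URL is a hypothetical placeholder (a real sender would first discover the endpoint from the target), and the DOIs are the illustrative ones used above.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_webmention_request(endpoint: str, source: str, target: str) -> Request:
    """Build (but do not send) the POST that notifies `endpoint` of a mention."""
    # A Webmention is a form-encoded body with exactly these two parameters.
    body = urlencode({"source": source, "target": target}).encode("ascii")
    return Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


req = build_webmention_request(
    "https://data.example.org/webmention",  # hypothetical endpoint, assumed for illustration
    "http://dx.doi.org/12.34/567",          # publication (source)
    "http://dx.doi.org/76.54/321",          # dataset (target)
)
```

Sending it would be a matter of passing `req` to `urllib.request.urlopen`; the receiver is then expected to fetch the source to verify it.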

It would be very hard (or even impossible) to perform "Webmention verification" as described in the spec because:

The receiver would have to determine where the actual content (eg PDF file) can be found when ending up on the landing page
The receiver may not be able to access the actual content (eg PDF file) because of the paywall
If the receiver would be able to penetrate the paywall, it would have to parse the PDF file (not impossible, but hey ...)

Even if one were to use the URIs of the actual content (PDF file, dataset) instead of the DOIs as source/target URIs, two of the above problems would remain.

I very much understand that this problem is to a large extent related to the fact that web-based scholarly communication does not necessarily operate in a manner that aligns very well with the way other pockets of the web do. Then again, I assume paywalls and landing pages exist beyond scholarly communication. And, most importantly, I would love it if Webmention could be used in scholarly communication, see eg slides 45-52 of [http://www.slideshare.net/hvdsomp/reminiscing-about-interoperability].

Hence a suggestion to consider an additional aspect regarding "Webmention verification", which could be along these lines "if the receiver has a trust relationship with the sender, verification is optional".

Cheers

Herbert Van de Sompel Los Alamos National Laboratory

kevinmarks commented 8 years ago

Extending this to PDF is challenging, yes. Many publications of this nature do have HTML versions too, though they are still under access control. For example, http://onlinelibrary.wiley.com/doi/10.1002/asi.21571/abstract (which I found by searching Google Scholar for webmention) has HTML versions with controlled access.

If you are part of an institution that has gateway access to these kinds of publication, you could run the webmention verifier on a server that can use the institutional proxy, and verify the HTML versions of the documents that way.

The previous work here may be useful for you: http://lombardpress.org/2016/04/16/iiif-webmentions/

hvdsomp commented 8 years ago

My comment was about (scholarly) publisher-to-publisher use of Webmention. An institutional subscription has nothing to do with the problem I describe. This is not about a user having access to a paper or not. This is about the receiving publisher not having access to sender publisher content. The PDF issue is kind of secondary in the problem I describe. The core issue is the paywall.

dissolve commented 8 years ago

The requirement for an HTTP GET would just mean that the publisher of the webmention would need to have some page which is not behind a paywall that could simply list the datasets it references. A simple abstract or bibliography page could easily be made available in html without the PDF content. If memory serves correctly, many already do that.

tantek commented 8 years ago

@hvdsomp "core issue is the paywall" - this is an astute observation, and not unique to webmention. Paywalls break all sorts of Web Architecture. Hyperlinks, img src, script src, style sheets, iframes, pretty much all web hypertext / hypermedia. I suggest you consider raising this as an issue with the W3C TAG (@w3ctag), something like "Paywalls break web architecture, what is to be done about this?" should kick-off a good discussion. Perhaps you can convince the Web Payments WG to take on "pay walls" as a use-case as well. It's definitely a reality of current (attempted) use of the web, and something worthy of further cross-group discussion.

hvdsomp commented 8 years ago

Hey tantek, that's one big can of worms you suggest I open ;-) The thing is, I have been an Open Access advocate from the early days of the movement. So, I don't want paywalls in scholarly communication. And I definitely don't want to spend time in that wormhole. But paywalls are a fact of life and I would love to work towards establishing increased web-centric interoperability for scholarly communication (e.g. using Webmention) in the current environment. Hence my "trust" suggestion as an alternative to an actual HTTP GET on the source.

rhiaro commented 8 years ago

Hence a suggestion to consider an additional aspect regarding "Webmention verification", which could be along these lines "if the receiver has a trust relationship with the sender, verification is optional".

I actually +1 this general idea, but having the 'trust relationship' out of band is kind of awkward. ~Or maybe not? As how to do verification is out of scope of the webmention spec, and actually up to the receiver, you could choose to "do verification" by consulting an internal list of domains you trust.~

ActivityPub's method of doing notifications using the ActivityStreams2 vocabulary (summarised here) allows you to include an authentication token of some kind (to be determined I think) in the payload with the notification, so you might not need to GET to verify based on that. It also lets you send more than just the source and the target as part of the notification, so if you can't GET it and you do trust the source, you can take that data at face value and use it to decide how to display it (or whatever else you might want to do with it).

sknebel commented 8 years ago

The paywalls I regularly deal with "only" hide the main content but expose things like citation lists. As long as those are properly linked, Webmentions between the paywall pages via the DOI links could work.

Also, paywall pages are a similar issue to silo pages like Twitter in that they hide content and don't include nice markup. For those there are services like https://brid.gy/ that provide nicely formatted metadata for Webmention endpoints. Something similar could be made for paywalls as well, but that is a stop-gap solution that requires extra work. (It also requires extra trust relationships with those services, but no explicit authentication.)

aaronpk commented 8 years ago

how to do verification is out of scope of the webmention spec

That's incorrect. The spec describes specifically how to verify the link for HTML and JSON documents here: https://www.w3.org/TR/webmention/#webmention-verification

I think @dissolve's suggestion is on the right track. If the publishers show the list of other articles they mention on the "landing page", then normal webmention verification will work.
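As a rough sketch of what that check could look like for an HTML landing page: collect the link attributes in the fetched source and look for an exact match on the target. The class name, function name, and sample page are made up for illustration, and a real receiver would also fetch the page and follow redirects first.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect every href/src attribute value found in an HTML document."""

    def __init__(self):
        super().__init__()
        self.urls = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.urls.add(value)


def mentions_target(source_html: str, target: str) -> bool:
    """Return True if the source document links to exactly `target`."""
    parser = LinkCollector()
    parser.feed(source_html)
    return target in parser.urls


# Hypothetical landing page: the abstract is public, the PDF stays paywalled.
landing_page = """
<html><body>
  <h1>Some paper</h1>
  <p>Abstract ...</p>
  <ul>
    <li><a href="http://dx.doi.org/76.54/321">Dataset used in this paper</a></li>
  </ul>
</body></html>
"""

verified = mentions_target(landing_page, "http://dx.doi.org/76.54/321")
```

Nothing here needs access to the PDF; only the reference list on the landing page has to be public.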

hvdsomp commented 8 years ago

Hi sknebel, yes, some paywalls provide eg reference lists for free, others don't, yet others don't even provide a metadata record describing the content itself without providing credentials. Lots of variations on the theme.

As suggested by dissolve and you, I agree that things could be done to make webmention work even for paywalled environments but that would require extra effort beyond just the implementation of webmention, eg exposing special-purpose resources. I am not feeling it ...

rhiaro commented 8 years ago

@aaronpk Sorry, I guess I was thinking that verification beyond string matching (deciding whether it's a type of mention you're interested in) is out of scope, but you're right that the string matching is actually what matters in this case.

dissolve commented 8 years ago

I could see sending some sort of auth token along with webmention being an extension to webmention. But the fact that webmentions can normally be sent by anyone means it has to be an auth token, not just 'trusting' / whitelisting some other location.

Yes, it would require some extra work, but not much, especially since, in your example case, there is already a landing page which is not actually the PDF. Since the landing page is where the source resolves to, that is where the verification would have to be done, not in the PDF that's behind a paywall.

aaronpk commented 8 years ago

Yeah you'd have to do this as an authenticated request, since any server can post a source and target to you. You can't simply write a rule that trusts a source domain, since any sender could send webmentions with that source domain. I also doubt you want to set up your system to "trust" webmentions sent from a specific IP address.

We actually already added a little note hinting at the potential for authenticated webmention requests here: https://www.w3.org/TR/webmention/#cross-site-request-forgery
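To make the difference between "trusting a domain" and "authenticating a request" concrete, here is a minimal sketch of one possible scheme. It is not part of the Webmention spec: it assumes sender and receiver share a secret out of band and that the sender adds a hypothetical extra `token` parameter holding an HMAC over source and target.

```python
import hashlib
import hmac


def sign_webmention(secret: bytes, source: str, target: str) -> str:
    """Compute the token a trusted sender would attach (hypothetical extension)."""
    message = f"{source}\n{target}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def is_trusted(secret: bytes, source: str, target: str, token: str) -> bool:
    """Check the token in constant time; a matching source domain alone proves nothing."""
    expected = sign_webmention(secret, source, target)
    return hmac.compare_digest(expected, token)


secret = b"shared-out-of-band"  # assumption: agreed between the two parties beforehand
source = "http://dx.doi.org/12.34/567"
target = "http://dx.doi.org/76.54/321"
token = sign_webmention(secret, source, target)
```

Anyone can claim a given source URL, but only a holder of the secret can produce a token that passes `is_trusted`.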

kevinmarks commented 8 years ago

Bridgy really is the best analogy here. You could construct an equivalent service that has authorization to see the papers and makes public proxy web pages that mark them up with citation links. OpenLibrary or Google Scholar could make these. Then you are deciding to trust that service's mapping, not the webmention sender.


sandhawke commented 8 years ago

First off, I think the specific proposal ("if the receiver has a trust relationship with the sender, verification is optional") is okay. I also think it's redundant. Specifications exist in lieu of special agreements between parties. If two parties agree, they are always free to vary a protocol in the privacy of their own data streams. That's just how standards work. It's similar to how specifications say how parties have to behave, but not how they have to implement that behavior.

Second, on the issue of authentication, as we move to an HTTPS-everywhere Web, I wonder if we can't say that if a webmention is performed using the TLS certificate of the source, it need not be verified. I'm 75% sure TLS can be used that way. In some quick searching I was unable to find any reports of it being done, though.

Finally, on the specific use case, it really seems best to point out the advantages of having a public landing page and having that public landing page include the references, with links. By doing that, and supporting webmention, sites will not only provide a better service to end users, but increase pagerank, increase traffic, and draw in customers. The first landing page I tried almost worked, except the references were fetched via ajax, so they don't occur in the HTML by default. There's a flag to fix that ("show on one page"), so http://dl.acm.org/citation.cfm?id=383071&preflayout=flat should work fine with webmention.

hvdsomp commented 8 years ago

Thanks for the feedback, sandhawke.

Regarding (1), things are a bit more complex:

We are not really talking about direct publisher-to-publisher trust. There are just too many academic publishers (many hundreds) and I am pretty sure many don't trust one another ;-) I think that, in this scenario, trust could be derived from the parties involved using DOIs and the ability to look up metadata about the DOI-identified resource via the CrossRef API.
My intention with proposing that bit of language is really about avoiding that publishers have an excuse not to implement Webmention. These publishers will not just implement out of their own desire. It will take community pressure. And, if there is anything in the spec that would suggest a publisher can't implement in a compliant manner, it might be used as an excuse not to implement. Yes, I know, it's a weird world out there.
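One way to picture the DOI-based trust idea from the first point: instead of fetching the paywalled source, the receiver checks registry metadata for the source DOI and looks for the target DOI among its references. The sketch below works on an already-parsed metadata record. The record shape (a `message` object with a `reference` list whose entries may carry a `DOI` key) is an assumption about what such an API returns and is not confirmed in this thread, and the sample record is invented.

```python
def cites_doi(metadata_record: dict, target_doi: str) -> bool:
    """Return True if the record's reference list contains `target_doi`."""
    # Assumed shape: {"message": {"reference": [{"DOI": "..."}, ...]}}
    references = metadata_record.get("message", {}).get("reference", [])
    wanted = target_doi.lower()  # DOIs are compared case-insensitively
    return any(ref.get("DOI", "").lower() == wanted for ref in references)


# Invented record for the illustrative publication DOI 12.34/567.
sample_record = {
    "message": {
        "DOI": "12.34/567",
        "reference": [
            {"key": "ref1", "DOI": "76.54/321"},
            {"key": "ref2", "unstructured": "A reference without a DOI"},
        ],
    }
}

dataset_is_cited = cites_doi(sample_record, "76.54/321")
```

This shifts trust from the sending publisher to the metadata registry, and it only works where publishers actually deposit reference lists.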

Regarding (2): I can't really comment on your proposal. But I do know from implementing HTTPS for some Memento "Web Time Travel" services (actually in the context of a collaboration with the W3C) that it's a rather messy endeavor.

Regarding (3): Obviously I agree with all the benefits you mention regarding implementation of open landing pages with references, etc. It's just that many publishers will not be convinced. Reality. As I mentioned, there are even publishers that don't allow downloading e.g. a BibTeX record describing a paper without the required credentials. Regarding your ACM example: it worked but it actually didn't work, right? And there are many, many more publishers out there, of course. Bottom line: many proposals I have seen in this thread require small or big technical and conceptual changes to publisher platforms in order for them to be able to implement the very simple and very useful Webmention protocol. Let's just say I am utterly skeptical ...

dissolve commented 8 years ago

I was thinking more about this last night. What I would point out is that there is no requirement that the HTTP GET request on source cannot have an auth token or other such data. Indeed, this will be needed when doing any sort of private webmentions.

Perhaps a note in the text specifically calling out that the specifics of that GET are not defined in the webmention spec and may include additional auth mechanisms, etc.
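As a sketch of what that could look like on the receiver side: the verification GET is an ordinary HTTP request, so credentials can ride along in a header. The bearer-token scheme and the function name are assumptions for illustration, since how the receiver obtains a credential is exactly the part left undefined.

```python
from typing import Optional
from urllib.request import Request


def build_source_fetch(source: str, bearer_token: Optional[str] = None) -> Request:
    """Build (but do not send) the GET used to verify `source`."""
    headers = {"Accept": "text/html"}
    if bearer_token:
        # Assumption: the source's publisher accepts a bearer token issued out of band.
        headers["Authorization"] = f"Bearer {bearer_token}"
    return Request(source, headers=headers, method="GET")


public_fetch = build_source_fetch("http://dx.doi.org/12.34/567")
private_fetch = build_source_fetch("http://dx.doi.org/12.34/567", bearer_token="abc123")
```

The rest of verification stays the same; only the fetch step changes.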

kevinmarks commented 8 years ago

@hvdsomp I just realised you may not have got my reference to Bridgy. It's this site: https://brid.gy/ What you do is authenticate with your silo credentials, and then it will map the proprietary APIs and formats into HTML and webmention you with them. So it will map a tweet like this:

https://twitter.com/jlew8/status/735449055485165568

into this

https://brid-gy.appspot.com/post/twitter/kevinmarks/735449055485165568

which can then be parsed and added to the original post:

http://known.kevinmarks.com/2016/according-to-api-docs-you-cant-edit-people-out-of

This approach could work for the academic citation case, if you can create 'library cards' for the papers with abstract and references that send the webmentions.
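A 'library card' in this sense could be as plain as the page produced by the sketch below: title, abstract, and the references as real links, so that ordinary webmention verification has something to find. The function name and page layout are invented for illustration and are not how Bridgy itself renders pages.

```python
from html import escape


def render_library_card(title: str, abstract: str, reference_urls: list) -> str:
    """Render a minimal public page that links to every referenced resource."""
    items = "\n".join(
        f'    <li><a href="{escape(url, quote=True)}">{escape(url)}</a></li>'
        for url in reference_urls
    )
    return (
        "<!DOCTYPE html>\n<html>\n<body>\n"
        f"  <h1>{escape(title)}</h1>\n"
        f"  <p>{escape(abstract)}</p>\n"
        "  <ul>\n"
        f"{items}\n"
        "  </ul>\n"
        "</body>\n</html>\n"
    )


card = render_library_card(
    "Some paper",
    "Abstract ...",
    ["http://dx.doi.org/76.54/321"],
)
```

The proxy service would publish `card` at a stable URL and send webmentions with that URL as the source.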

sandhawke commented 8 years ago

@hvdsomp Sure, I share your scepticism about publishers really participating in the Web. But what else can we do, but make it as easy and rewarding as possible for them, and as far as I can see the current Webmention spec does that. Maybe it could be explained in a way that would resonate with them more, perhaps as a use case in the spec? Like, have a use case that is landing pages for scientific publications, where a paper can learn about citations to it via webmention? I don't think any normative changes would help these folks, though, as I understand the problem.

hvdsomp commented 8 years ago

sandhawke: a "landing page" use case might indeed be a good idea. along the lines of: increase inlinks to your landing page by: (1) making references available in the landing page (2) sending webmentions to referenced papers (3) have referenced papers link to your landing page

aaronpk commented 8 years ago

We discussed this during the f2f meeting and agreed to add a section describing the "landing page" use case. https://www.w3.org/wiki/Socialwg/2016-06-07-minutes#webmention-42-resolution

hvdsomp commented 8 years ago

Thanks! I'm very happy with that resolution.

Greetings

Herbert


w3c / webmention

Webmention verification #42