w3c / webmention

Webmention spec
https://www.w3.org/TR/webmention/

webmention verification should specify using a HEAD request #46

Closed bear closed 8 years ago

bear commented 8 years ago

Section 3.2.2 of https://www.w3.org/TR/webmention/#webmention-verification says

If the receiver is going to use the Webmention in some way, (displaying it as a comment on a post, incrementing a "like" counter, notifying the author of a post), then it must perform an HTTP GET request on source, and follow any HTTP redirects (up to a self-imposed limit such as 20) and confirm that it actually links to the target.

Requiring a GET is wrong IMO because 1) it makes a full resource request when a HEAD request will suffice at this stage, and 2) by requiring a GET, the text effectively rules out my implementation performing a HEAD (which I want for the first reason).

My suggested change would be to say

" If the receiver is going to use the Webmention in some way, (displaying it as a comment on a post, incrementing a "like" counter, notifying the author of a post), then it SHOULD perform an HTTP HEAD request on source, and follow any HTTP redirects (up to a self-imposed limit such as 20) and confirm that it actually links to the target."

wilkie commented 8 years ago

That's interesting. How are you doing link verification with HEAD? Are you just looking for page existence? I'd imagine your implementation will have to support GET since the things linking to you are going to be arbitrary? I'm a bit curious about your use-case.

edit: "link verification" may be a bit overloaded. I mean, "how does one check to see if your url exists on their page with HEAD?"

dissolve commented 8 years ago

Maybe text like section 3.1.1: you MAY perform a HEAD first.

aaronpk commented 8 years ago

It sounds like this might need to be broken into two sentences/phrases. Obviously you can't check whether the source links to the target unless you do a GET request, since links only appear in the body. However an initial check of whether the source exists (e.g. it returns a redirect or HTTP 200) is a good idea to do before making a GET request.

What about this new expanded text, which includes changes suggested in #45?

If the receiver is going to use the Webmention in some way, (displaying it as a comment on a post, incrementing a "like" counter, notifying the author of a post), then it SHOULD verify that the source URL links to the target URL. To verify this, the receiver SHOULD first make an HTTP HEAD request to the source URL to follow redirects and check whether the source URL eventually returns HTTP 200. (The receiver SHOULD limit the number of redirects it follows.) Then, the receiver MUST make a GET request to fetch the body of the document.

I think it's better to start by saying SHOULD make a HEAD request rather than first saying MUST make a GET request and qualifying that with SHOULD make a HEAD request.

wilkie commented 8 years ago

@dissolve that makes sense

bear commented 8 years ago

yes, I was trying to suggest that the source existence check SHOULD first be done with HEAD and then if it is found to be present, followed up with a GET

The reasoning is twofold: HEAD is saner from a server-friendly point of view, and it will follow redirects. So instead of, for example, doing 4 GETs because of redirects, you do 4 HEADs and 1 GET.
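The HEAD-then-GET flow described here could look roughly like the sketch below. It is only an illustration, not anyone's actual implementation: `fetch` is a hypothetical callable taking `(method, url)` and returning `(status, headers, body)`, the headers are assumed to be a dict with a `Location` key, and the substring check is a simplification (a real receiver would parse the markup for the link).

```python
from urllib.parse import urljoin

MAX_REDIRECTS = 20  # self-imposed limit, as the spec text suggests
REDIRECT_CODES = (301, 302, 303, 307, 308)


def resolve_with_head(url, fetch, max_redirects=MAX_REDIRECTS):
    """Follow redirects using HEAD only; return the final URL, or None."""
    for _ in range(max_redirects + 1):
        status, headers, _body = fetch("HEAD", url)
        if status in REDIRECT_CODES and "Location" in headers:
            # Location may be relative, so resolve it against the current URL.
            url = urljoin(url, headers["Location"])
            continue
        return url if status == 200 else None
    return None  # gave up: too many redirects


def verify_mention(source, target, fetch):
    """Cheap existence check with HEAD, then a single GET for the body."""
    final_url = resolve_with_head(source, fetch)
    if final_url is None:
        return False  # source missing; never queue the GET
    status, _headers, body = fetch("GET", final_url)
    # Naive check (assumption): the target URL appears somewhere in the body.
    return status == 200 and target in body
```

With a chain of redirects, this issues one HEAD per hop and exactly one GET at the final URL, and issues no GET at all when the source does not resolve to a 200.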

sandhawke commented 8 years ago

In what situation could doing HEADs possibly save you any work? If there's a redirect, HEAD and GET are exactly the same. If there's no redirect, you have to do a GET to get the content and see if the target URL string is present.

bear commented 8 years ago

If the HEAD request is redirected but the resource is still not present at the final endpoint, then you will have saved n-1 GETs.

At the core of it, IMO, is just plain web server politeness. The "is the URL valid" step is a check for presence and doesn't require retrieving the content (or causing the content to be generated). For my specific use case, I use the HEAD to determine whether the webmention is even added to my worker queue for further processing, as I run a static site and all the real work happens later. Getting a result from a HEAD request means my round-trip time is much faster and doesn't cause task queue churn.

voxpelli commented 8 years ago

So if a HEAD request fails, how should the receiver act? Not all servers implement proper support for HEAD requests, and some may return an error on such requests even though ordinary GET requests are supported. Is there a need for a note on what to do if a HEAD returns, e.g., a 5xx error or a 405 error?

I would prefer wording that says a client MAY try a HEAD request before doing a GET, to note that what @bear does is a valid route while keeping the existing GET route as a valid option too.

I currently do GETs right away in my Webmention endpoint and haven't felt a need to implement HEAD requests so far. I don't see a need to check for "URL validness" prior to fetching that URL. In cases like WordPress and other simple apps, the "URL validness" check will be pretty much as expensive as a GET request, so rather than saving resources it will cause more resources to be consumed.
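On the question of what to do when the HEAD fails, one possible policy (an assumption for illustration, not something the spec text prescribes) is to treat only a clear "not there" answer as final and fall through to the GET for everything else, including 405 and 5xx. The function name and return values below are hypothetical.

```python
def next_step_after_head(status):
    """Decide what to do after an optional HEAD pre-check.

    Returns "drop" to discard the webmention, or "get" to continue
    with the normal GET-based verification.
    """
    if status in (404, 410):
        # The source is clearly missing or gone; no point fetching the body.
        return "drop"
    # 2xx means the source exists. 405, 501 and other 5xx responses may only
    # mean the server mishandles HEAD, so they are inconclusive: let the GET
    # decide instead of rejecting the webmention.
    return "get"
```

Under this policy a server with broken HEAD support costs the receiver one wasted request but never causes a valid webmention to be rejected.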

sandhawke commented 8 years ago

In general, doing a HEAD before a GET is a waste of network resources. It is not "more polite". Doing a HEAD before a GET is like driving to the store to make sure it's still there, then driving home, then driving to the store again to get your groceries. If the store is closed or has moved, it still makes sense to just drive to the store to get your groceries, and at that point maybe encounter the redirect. The only waste is that maybe you brought your shopping bags.

In HTTP, the only added cost of a GET over a HEAD when you end up with a redirect is a body like this:

<html>
<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<center><h1>301 Moved Permanently</h1></center>
<hr><center>nginx/1.9.10</center>
</body>
</html>

which is tiny and much less than the overhead of having to do the GET again if the HEAD returned 200.

As far as I can tell, the only time a HEAD makes sense is when you're probably not going to do a GET. I can see your use case, @bear, although I question it. A GET on tantek.com takes me 320ms on average. A HEAD takes 290ms. A server that does both is using a lot more network resources in total. What do you gain by doing the HEAD immediately, with a sense of time pressure, that you need the result as soon as possible? Why would there be time pressure on partially validating a webmention?

bear commented 8 years ago

IMO it boils down to the fact that RFC 2616 lists one of the reasons for HEAD as

The HEAD method is identical to GET except that the server MUST NOT return a message-body in the response. [...] This method is often used for testing hypertext links for validity, accessibility, and recent modification.

That describes exactly what is happening at this point of a webmention flow, and it is the reason I use the term "more polite": web servers don't return the full body.

So it should be faster than a GET; the fact that one implementation doesn't show gobs of time saved doesn't mean it shouldn't be done. The analogy of driving to the store is a strawman against doing two requests. Meatspace analogies don't make any sense in the realm of bits.

Everyone seems to be saying, in essence, that they don't do HEAD first because they immediately validate webmentions in the code -- great, then the wording can be changed to say that this is an option. I just didn't want the wording to be such that doing a HEAD request is considered invalid for the offline processing reason I presented.

aaronpk commented 8 years ago

Just for clarification since some people seem to be unclear, this issue is specifically about whether to allow/recommend HEAD requests when making the HTTP request to verify that a webmention source URL does in fact link to the target URL.

This is not about the HTTP request used to discover the webmention endpoint in the first place. There are definitely obvious benefits to doing a HEAD request first when discovering the endpoint, since the webmention endpoint may be advertised in the HTTP header, which means that the sender can avoid making a GET request at all.
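For the discovery case, a sender could read the endpoint from the `Link` header of a HEAD response, roughly as sketched below. The parser is deliberately simplified (it does not handle every quoting form the Link header syntax allows), and the function name is made up for this example.

```python
import re
from urllib.parse import urljoin

# One link-value: <target> followed by zero or more ";param" parts.
_LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[^;,]+)*)')
_REL_RE = re.compile(r'rel\s*=\s*"?([^";,]+)"?', re.IGNORECASE)


def endpoint_from_link_header(link_header, base_url):
    """Return the first rel="webmention" target in a Link header, or None."""
    for match in _LINK_RE.finditer(link_header):
        href, params = match.group(1), match.group(2)
        rel = _REL_RE.search(params)
        # rel may hold several space-separated values.
        if rel and "webmention" in rel.group(1).lower().split():
            return urljoin(base_url, href)  # resolve relative endpoints
    return None
```

If this returns `None`, the sender still has to fall back to a GET and look for the endpoint in the document body.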

When verifying the link back, the receiver is going to have to make a GET request eventually, in order to check the document body for the link. The question is whether we should recommend that a receiver first make a HEAD request to the source URL before making the GET request.

sandhawke commented 8 years ago

@bear What does your implementation do differently in the cases where the HEAD returns 200 and 404? How is that different behavior useful? My problem here is I can't think of why that would be useful.

Also, I believe saying clients MAY do a HEAD before they do a GET is vacuous. Isn't that the default for the web? You can always do a HEAD before a GET, so there's no point in saying it.

aaronpk commented 8 years ago

As per https://www.w3.org/wiki/Socialwg/2016-06-07-minutes#webmention-46-resolution I've updated the text to include security considerations clarifying that a receiver is allowed to use a HEAD request during verification, and to suggest including an Accept header when verifying the link.
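As a rough illustration of that last point, a receiver written with Python's standard library might build its verification request as below. The Accept value and the helper name are assumptions chosen for this sketch; the right preference list depends on which content types the receiver can actually parse.

```python
import urllib.request

# Assumed preference list: HTML first, anything else as a last resort.
ACCEPT = "text/html, application/xhtml+xml;q=0.9, */*;q=0.1"


def build_verification_request(source_url, method="GET"):
    """Build (but do not send) the request used to verify a source URL.

    Pass method="HEAD" for the optional existence pre-check.
    """
    return urllib.request.Request(
        source_url,
        headers={"Accept": ACCEPT},
        method=method,
    )
```

The request can then be sent with `urllib.request.urlopen(req, timeout=...)`, which follows redirects on its own up to a built-in limit.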