Do not require rel=self for discovery

cweiske commented 9 years ago

The discovery phase currently requires that a document has two relation links:

rel=hub
rel=self

What is the reason for rel=self?

In my eyes, rel=hub should suffice since rel=self will be the URL itself. It should be made optional.

cc @aaronpk @tantek - http://indiewebcamp.com/irc/2015-03-18#t1426690743557

themel commented 9 years ago

The problem is canonicalization/feed aliasing. Most feeds can be accessed under many URLs (HTTP vs HTTPS, multiple hostnames, infinite spaces of ignored query parameters). The publisher can't/won't ping all of them when there's an update to the feed. The self link is an explicit promise to ping the self link topic when the feed changes, and this is the topic that subscribers should use. If we drop the self link requirement, we can either let subscribers that ended up on a feed via a URL that is not the canonical wait for updates in vain (bad) or make the hub's job much more difficult because it needs to understand that a ping to http://example.com/feed.xml might also affect subscribers to https://example.com/feed.xml?foo=bar. This fits the overall "center complexity in the hub" design approach, but it would probably lead to a worse user experience because it's hard to do this kind of aliasing detection reliably.
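As a rough illustration of the aliasing problem described above, here is a toy sketch, assuming a hub that matches topics by exact string comparison. `ToyHub` and the callback names are hypothetical and not taken from any real hub implementation.

```python
from collections import defaultdict


class ToyHub:
    """Hypothetical hub that matches topics by exact string equality."""

    def __init__(self):
        # topic URL -> set of subscriber callback identifiers
        self.subscribers = defaultdict(set)

    def subscribe(self, topic, callback):
        self.subscribers[topic].add(callback)

    def publish(self, topic):
        # Only subscribers to this exact topic string are notified.
        return sorted(self.subscribers.get(topic, set()))


hub = ToyHub()
hub.subscribe("http://example.com/feed.xml", "cb-canonical")
hub.subscribe("https://example.com/feed.xml?foo=bar", "cb-alias")

# The publisher pings only the canonical topic; the alias subscriber is missed.
notified = hub.publish("http://example.com/feed.xml")
```

Without a self link telling it which topic to use, the second subscriber has no way to learn that it picked an alias.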

I also expect the gains from this simplification to be small since adding two links to a feed is basically the same amount of work as adding one link.



cweiske commented 9 years ago

Actually, adding the hub link in Apache is a single configuration line only:

Header append Link '<http://phubb.cweiske.de/hub.php>; rel="hub"'

Adding the self URL is difficult because it's a dynamic URL. So it's not the same amount of work; quite the contrary.

I understand the issue about the same file being available under multiple URLs. But if there is no self link, the publisher would have to take care that the file is only available under one URL.

tantek commented 9 years ago

I agree with not requiring rel=self.

re: canonicalization - there is prior art here we should be re-using, that is, rel=canonical - which is already well deployed and in use.

Thus here is a specific proposal.

Change: Publishers MUST have a rel=self link at their URL ("the URL")
To: Publishers SHOULD have a rel=self link, but MAY instead:

provide a rel=canonical link (which they might have already) OR
assume rel=self is the same as the URL

Thus consuming code:

looks for a rel=self link, if not found
looks for a rel=canonical link, if not found
uses the current URL
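The lookup order above could be sketched roughly as follows. This is a minimal sketch, assuming the links have already been parsed into (rel, href) pairs; the function name `resolve_topic` and the input shape are assumptions for illustration, not part of the spec.

```python
from typing import Iterable, Tuple


def resolve_topic(links: Iterable[Tuple[str, str]], current_url: str) -> str:
    """Pick the topic URL: rel=self, then rel=canonical, then the current URL."""
    by_rel = {}
    for rel, href in links:
        # Keep the first link seen for each relation type.
        by_rel.setdefault(rel.lower(), href)
    for rel in ("self", "canonical"):
        if rel in by_rel:
            return by_rel[rel]
    # Neither link is present: fall back to the URL that was fetched.
    return current_url
```

For example, a document fetched from https://example.com/feed.xml?foo=bar that carries only rel=hub and rel=canonical links would resolve to the canonical URL, and one carrying only rel=hub would resolve to the fetched URL.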

Regarding: "since adding two links to a feed is basically the same amount of work as adding one link." - absolutely not true in my experience. Example 1: what @cweiske said. Example 2: watching numerous users try to add the TWO links required for OpenID and screwing one of them up (in contrast to people trivially adding the one rel=me link required for IndieAuth).

Basically, requiring two links instead of one for the very common case unnecessarily increases publisher responsibility and fragility of the whole system.

julien51 commented 9 years ago

I'm very strongly against this because it would introduce one more case of silent failure. There's http vs https, there are also case issues and a bunch of other examples. Feedburner is pretty famous for this: if you subscribed to http://feeds.feedburner.com/TechCrunch/ instead of http://feeds.feedburner.com/Techcrunch/, you'd never get pings.

The worst case is redirects: in that specific case, the hub has no way of matching the pinged URL to the actual feed resource.

Again, this is a particularly bad idea because this will silently fail. A subscriber who subscribes to a URL different from the one that is actually pinged to the hub will never receive notifications, and never be able to tell why (because he cannot know which URL is being pinged). THAT makes the protocol fragile.

I'm sorry for anyone working with Apache in general, but I don't think it's a good idea to base a spec on the difficulty of implementing something with a specific web server. I believe most web frameworks will make it trivial to add one Link header vs. two (or 100).
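As one possible illustration of emitting both links from application code, here is a minimal sketch, assuming the publisher knows its canonical origin and that query parameters do not select different feeds. `HUB_URL`, `CANONICAL_ORIGIN`, and `link_header` are hypothetical names, not part of any framework.

```python
from urllib.parse import urlsplit

HUB_URL = "https://hub.example.com/"       # hypothetical hub endpoint
CANONICAL_ORIGIN = "https://example.com"   # hypothetical canonical scheme and host


def link_header(request_url: str) -> str:
    """Build one Link header value with rel=hub and a canonical rel=self."""
    # Assumption: only the path identifies the feed; the request's scheme,
    # host, and query string are dropped so every alias gets the same self link.
    path = urlsplit(request_url).path or "/"
    self_url = CANONICAL_ORIGIN + path
    return f'<{HUB_URL}>; rel="hub", <{self_url}>; rel="self"'
```

Note that the self link is derived from the configured canonical origin rather than echoed from the request URL; echoing the request URL would reproduce the aliasing problem under discussion.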

Now, if the whole debate is to say that "canonical" is better than "self", I'll let you fight over this. We can easily change the spec to tell subscribers:

Use self if there is one
Use canonical if you can't find one

And to tell publishers:

Put either self or canonical.

romkatv commented 9 years ago

On Fri, May 29, 2015 at 9:34 AM, Julien Genestoux notifications@github.com wrote:

Feedburner is pretty famous for this and if you subscribed to this URL http://feeds.feedburner.com/TechCrunch/ instead of this one http://feeds.feedburner.com/Techcrunch/, you'd never get pings.

Minor correction: subscribing to any of these will work:

http://feeds.feedburner.com/Techcrunch
http://feeds.feedburner.com/TechCrunch
https://feeds.feedburner.com/Techcrunch
http://feedproxy.google.com/Techcrunch

etc.

This doesn't invalidate the point Julien is making. Topic aliasing is a real problem. Correct self links are vital for ensuring that subscribers are listening to the exact topics that the publisher is pinging.

Roman.

julien51 commented 9 years ago

I stand corrected, but that was a large pain point for a long time. I'm glad you guys fixed it :)
