w3c / webmention

Webmention spec
https://www.w3.org/TR/webmention/
Consider allowing caching endpoint discovery #113

Open snarfed opened 1 year ago

snarfed commented 1 year ago

Right now, the spec doesn't allow webmention senders to cache discovered endpoints:

3.1.2 Sender discovers receiver Webmention endpoint

The sender MUST fetch the target URL (and follow redirects [FETCH]) and check for an HTTP Link header [RFC5988] with a rel value of webmention. If the content type of the document is HTML, then the sender MUST look for an HTML <link> and <a> element with a rel value of webmention.
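
For readers less familiar with this step, here is a rough Python sketch of the discovery logic in the passage above. It is not the spec's reference algorithm: it assumes the caller has already fetched the target URL and passes in the `Link` header values and the HTML body, the function names are made up for illustration, and the `Link` header parsing is a simplified regex rather than a full RFC parser.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Simplified link-value matcher: `<url>; params`. Not a full RFC parser;
# for example, it would mis-split on a comma inside a URL.
_LINK_VALUE = re.compile(r'<([^>]*)>([^,]*)')
_REL_PARAM = re.compile(r';\s*rel\s*=\s*(?:"([^"]*)"|([^\s;,]+))', re.I)


def endpoint_from_link_header(header_value):
    """Return the first URL whose rel list contains "webmention", else None."""
    for url, params in _LINK_VALUE.findall(header_value):
        match = _REL_PARAM.search(params)
        if match:
            rels = (match.group(1) or match.group(2) or "").lower().split()
            if "webmention" in rels:
                return url
    return None


class _RelFinder(HTMLParser):
    """Remember the href of the first <link> or <a> with rel=webmention."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        if self.href is not None or tag not in ("link", "a"):
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if "webmention" in rels and attr.get("href") is not None:
            self.href = attr["href"]


def discover_endpoint(target_url, link_headers, html=None):
    """Check Link headers first, then the HTML; resolve relative URLs."""
    for value in link_headers:
        url = endpoint_from_link_header(value)
        if url is not None:
            return urljoin(target_url, url)
    if html is not None:
        finder = _RelFinder()
        finder.feed(html)
        if finder.href is not None:
            return urljoin(target_url, finder.href)
    return None
```

The point of the sketch is that every send repeats a fetch plus this parse, which is the cost the caching proposal is trying to avoid.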

We've been talking about relaxing this, eg by changing MUST fetch to SHOULD fetch, and maybe including more explicit language around what kind of caching might be ok.

This generally won't matter to small/medium individual web sites, but it definitely does for large services like Bridgy, which currently sends 1-2 webmentions per minute.

Bridgy currently caches endpoints for 2h, per domain, with home pages as a special case since they sometimes don't advertise an endpoint. Specifically, Bridgy's cache key is [domain, http or https, home page or not].
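
To make the cache key concrete, here is a minimal Python sketch of a cache keyed on [domain, http or https, home page or not] with a 2h lifetime. This is not Bridgy's actual code: the class and function names are hypothetical, and treating "home page" as an empty or `/` path with no query string is my assumption.

```python
import time
from urllib.parse import urlsplit

CACHE_TTL_SECONDS = 2 * 60 * 60  # 2h, the duration mentioned above


def cache_key(target_url):
    """Build a (domain, scheme, is_home_page) key for a target URL."""
    parts = urlsplit(target_url)
    # Assumption: "home page" means an empty or "/" path with no query.
    is_home_page = parts.path in ("", "/") and not parts.query
    return (parts.hostname, parts.scheme, is_home_page)


class EndpointCache:
    """Tiny in-memory TTL cache; `now` is injectable so it can be tested."""

    def __init__(self, ttl=CACHE_TTL_SECONDS, now=time.monotonic):
        self._ttl = ttl
        self._now = now
        self._entries = {}

    def get(self, target_url):
        entry = self._entries.get(cache_key(target_url))
        if entry is None:
            return None
        endpoint, stored_at = entry
        if self._now() - stored_at >= self._ttl:
            return None  # expired; the caller should rerun discovery
        return endpoint

    def put(self, target_url, endpoint):
        self._entries[cache_key(target_url)] = (endpoint, self._now())
```

With this key, an endpoint discovered on one post is reused for every other non-home-page URL on the same domain and scheme, which is exactly the per-domain behavior the rest of this issue debates. The sketch does not distinguish "not cached" from "cached as having no endpoint"; a real sender would probably want to.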

The first small, hopefully non-controversial change we could make here would be to allow caching at all, for individual target URLs. After that, there's a larger conversation around whether and how to allow broader caching, eg by domain. The counterargument is per-URL webmention endpoints. The community has experimented with those, eg to expire them after a brief time window, but that idea didn't pan out, and I don't know that we've found a compelling need for them yet otherwise.

snarfed commented 1 year ago

Great in-depth conversation about this in #dev today, lots of big ideas and details. @tantek @aaronpk @sknebel et al, what are the next step(s) here?

# 09:32
[tantek] speaking of caching, I'm curious what's the latest "best practice" for caching webmention endpoints, either from a particular URL or for a whole domain? Like I believe Bridgy does some degree of caching / inference of wm endpoints it has discovered, and AFAIK such behavior is beyond (outside the bounds of) the Webmention spec
# 09:33
[tantek] I am particularly curious in the context of "large" systems that may want to "turn on" sending Webmentions, but would require them to do potentially LOTS of Webmention discovery traffic, perhaps repeatedly on the same URLs / domains for heavily "shared" links/domains
# 09:34
[tantek] Two part question really: what are current Webmention senders doing in terms of caching discovered wm endpoints (either page or domain level caching or both) ?
# 09:35
[tantek] and second: is there a way to indicate in your Webmention link rel that that discovery may be cached by the discoverer for some amount of time, like a week or a month?
# 09:35
[tantek] and 2a: is there a way to indicate in your Webmention link rel that that discovery may also be re-used for OTHER permalinks (or your entire domain), and cached as such, for some amount of time?
# 09:38
[tantek] ^ obviously both of these questions could be applied to link rel discovery methods in general, however, Webmention is a good specific example of link discovery which has numerous implementations of varying scales that we can learn from
# 09:38
[snarfed] https://github.com/w3c/webmention/issues/113
superkuh joined the channel
# 09:41
[tantek] thoughts on whether / how to generalize caching link rel endpoint discovery in general? e.g. for Micropub, IndieAuth, etc.?
# 09:56
aaronpk would it need to be different than indicating the caching of the rest of the page?
# 09:56
aaronpk e.g. use existing http cache headers
# 09:57
[tantek] yes absolutely. you might edit the contents of a blog post in minutes certainly over hours. whereas it's very unlikely the wm endpoint for that blog post will change in minutes or hours.
# 09:57
[tantek] or even days/weeks/months
# 09:57
aaronpk alternate question, since there are a ton of other link rels, has anyone proposed a caching mechanism for any of the others?
# 09:58
aaronpk e.g. favicon or app manifest and such
# 09:58
btrem Wouldn't you again use http cache headers?
# 09:58
aaronpk can http cache headers target specific parts of the page?
# 09:58
[tantek] btrem ^ already answered, caching behavior for a page vs links a page points to are different things
# 09:58
[tantek] aaronpk, not AFAIK
# 09:59
aaronpk me either, but i am always surprised when i look up the latest state of http stuff
# 09:59
[tantek] hah true
# 09:59
[tantek] aaronpk, it feels more like a need for an HTML attribute on link/a, that indicates a cache directive on the rel-ness of that link/a
# 10:00
aaronpk that was my first thought
# 10:00
aaronpk i'm curious if anyone has done that for other link values
# 10:00
[tantek] it's something that overlaps both HTML and HTTP
# 10:00
[tantek] aaronpk, I don't think anyone has done it for any other link rel discovery values in particular
# 10:02
[tantek] some rel values, like "stylesheet" the consuming code has always already retrieved the source HTML, so there's no need for such a caching directive
# 10:03
[tantek] other rel values, like most XFN values (i.e. all but 'me'), use-cases access them infrequently enough that no caching is needed
# 10:04
btrem For favicon, why wouldn't you rely on the cache headers returned with example.org/favicon.ico? Same for manifest.json?
# 10:05
aaronpk because those cache headers talk about the *content* of the icon, but we're talking about caching the location of the icon
# 10:05
[tantek] favicon doesn't use rel discovery so it's inapplicable to this topic
# 10:05
aaronpk well there's the new favicon link rels
# 10:05
aaronpk that don't use the hardcoded favicon.ico path
# 10:05
[tantek] if you mean rel="icon" sure
# 10:06
[tantek] I wouldn't conflate that with a particular use-case "favorite" which is not even a dominant use-case
# 10:06
btrem If the location changes, then http status codes 30x would be appropriate, no?
# 10:06
aaronpk btrem: that might work for icons on the same site, but doesn't work for webmention endpoints
# 10:06
btrem I only use favicon.ico as an example.
# 10:06
[tantek] again, this is about the link rel discovery step, not the retrieval step
# 10:07
aaronpk e.g. if someone's webmention endpoint is webmention.io, i'm not going to start returning a 301 redirect on that endpoint 😂
# 10:07
[tantek] step 1 retrieve HTML, step 2 link rel discovery in that HTML to find a URL, step 3 process that URL. we are talking about step 2. http status codes, cache headers are about step 3
# 10:07
[tantek] ^ btrem
# 10:08
[tantek] this is about caching the result of steps 1 & 2 so that in the future you can skip those steps and jump straight to step 3
# 10:09
[tantek] and http status codes, cache headers while helpful for step 1, are inapplicable to step 2 because of what I wrote to aaronpk above https://chat.indieweb.org/dev/2023-07-17#t1689613045880400
# 10:09
btrem So the question is what happens when the page changes its endpoint. That seems like the same thing as when the page changes its content.
# 10:09
Loqi [preview] [[tantek]] yes absolutely. you might edit the contents of a blog post in minutes certainly over hours. whereas it's very unlikely the wm endpoint for that blog post will change in minutes or hours.
# 10:09
aaronpk btrem: that was my first question, which tantek gave a good answer to there ^
# 10:10
[tantek] while "technically" it's the same thing, bytes in the page change, in practice no it's not the same thing
# 10:12
btrem If your page changes in minutes, and you expect a ua to retrieve a new copy (using e.g. cache-control: max-age=120 etc.), is there a large cost to looking at the rel attributes again?
# 10:13
aaronpk no this is about someone finding the webmention endpoint and wanting to cache it, e.g. the bridgy issue linked previously
# 10:13
aaronpk the consumer of the webmention link rel doesn't care about the rest of the content of the page
# 10:14
btrem I see. So the problem is how does an author control the cache for wm when using a third-party wm endpoint (like webmention.io)?
# 10:15
aaronpk third party endpoints is probably the 80% problem, but it's not exclusive to that
# 10:15
aaronpk like you can just change your own webmention endpoint URL within your own first party tooling whenever you want
# 10:18
[tantek] btrem, see my steps 1,2,3 above. this is about step 2
# 10:19
[tantek] "your page changes in minutes, and you expect a ua to retrieve a new copy" is about step 1
# 10:23
sknebel also "HTTP caching" has very limited meaning for API endpoints
# 10:23
sknebel like a WM endpoint
# 10:23
sknebel there is no way for an endpoint to signal itself "please check if I'm still the right thing to use every X <time>"
# 10:25
[tantek] also I don't believe you can meaningfully use that for the *existence* of the endpoint which is the problem we are trying to solve here, vs. the *results* of using the endpoint (which is what HTTP caching applies to)
# 10:25
sknebel thats what I mean, yes
# 10:25
[tantek] nevermind that caching the results of a POST are also disallowed I believe
# 10:26
[tantek] e.g. Webmention, Micropub, IndieAuth endpoints
# 10:26
[tantek] so you aren't even allowed to cache "step 3" in those cases
# 10:27
sknebel post responses can be cached if explicitly indicated as such AFAIK, but its rarely implemented
# 10:28
sknebel potentially one could have a pattern to rerun discovery for a cached value if the endpoint starts to return errors, but that would mean the endpoint would need to stop working once it isnt responsible anymore. which means it'll go wrong in practice
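
A minimal Python sketch of the pattern sknebel describes here: try the cached endpoint, and rerun discovery only if it returns an error. The function name and the `discover` / `post` callables are hypothetical stand-ins that the caller would supply, and the cache is a plain dict keyed by target URL. As noted in the message above, this only helps if the old endpoint actually starts failing.

```python
def send_with_cache(source, target, cache, discover, post):
    """Try the cached endpoint first; on an error status, rediscover once.

    cache:    dict mapping target URL -> endpoint URL
    discover: callable(target) -> endpoint URL or None
    post:     callable(endpoint, source, target) -> HTTP status code
    Returns (endpoint_used, status), or (None, status) if none was found.
    """
    cached = cache.get(target)
    status = None
    if cached is not None:
        status = post(cached, source, target)
        if 200 <= status < 300:
            return cached, status
    # Either nothing was cached or the cached endpoint returned an error.
    fresh = discover(target)
    if fresh is None:
        cache.pop(target, None)
        return None, status
    cache[target] = fresh
    if fresh == cached:
        # Discovery agrees with the cache, so the error is not a stale entry.
        return cached, status
    return fresh, post(fresh, source, target)
```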
# 10:29
aaronpk that sounds like activitypub account migration 😂
# 10:29
sknebel ?
# 10:30
aaronpk the account you're migrating from has to send the redirect to the new account
# 10:30
aaronpk so if I wanted to give people a way to migrate their webmention.io endpoint, i'd need to have it return their new endpoint
# 10:30
sknebel but thats not needed, you can rerun discovery
# 10:31
aaronpk or more to your point, i would need to return an error for people who have shut down their hosted account
# 10:31
sknebel yep
# 10:33
sknebel question of the scope of caching is also fun. per-page? - limited use, but ok for things like bridgy probably. per-domain? likely to cause issues as soon as setups get a bit unique
gRegor joined the channel
# 10:40
[snarfed] wish I could hang out for more of this conversation! sadly I'm afk for most of the day. all this ^ sounds reasonable so far. my one concern is, whatever mechanism we end up with, it may only see a small amount of adoption, like anything opt-in
# 10:41
[snarfed] so, the one thing I'd ask for is a provision for _default_ caching, for wm receivers that don't specify this new caching mechanism
# 10:41
[snarfed] eg by default, wm endpoints may be cached for 2h, per [domain or something]
# 10:41
[snarfed] (could be per-protocol)
# 10:42
aaronpk if we end up with something like `<link rel="webmention" href="..." cache="30d">` i would be happy to add that with some reasonable default value to the webmention.io setup docs
# 10:42
aaronpk ultimately a webmention sender might decide to ignore the hint anyway, like bridgy might want to set its own minimum caching time to avoid someone saying only cache this for 1 minute or something
# 10:52
[tantek] [snarfed] agreed that we need to solve this even only for link rel discovery consuming code, without requiring a new mechanism for link rel discovery publishers
# 10:53
[tantek] aaronpk, I like that use of "cache" as an attribute, presumably we can treat it similar to "type" on links which is advisory but not canonical
# 10:54
[tantek] is there an existing cache time string format we can re-use for the attribute value rather than make up something new? e.g. the "type" attribute on link/a uses the same (or a subset of?) HTTP Content-Type
# 10:54
aaronpk html datetime uses the "P*" format
# 10:54
aaronpk e.g. "P2D" "PT15H"
# 10:55
[tantek] that looks oddly like iCal DURATION format
# 10:57
[tantek] oof I had forgotten we asked for / proposed duration in HTML <time> as an option, and lacking anything better I'm fairly certain I would have referenced re-using iCal there
# 10:57
[tantek] I think since this affects HTTP more than content semantics, I'd prefer to re-use HTTP cache header syntax than HTML time element
[jacky] joined the channel
# 11:00
aaronpk is there anything in the set-cookie header that might use a duration syntax?
# 11:01
aaronpk max-age, that's the one
# 11:01
aaronpk looks like it's just an integer seconds
# 11:04
aaronpk simple enough
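
A rough Python sketch of how a sender might combine the ideas so far: read a publisher's cache hint as integer seconds, fall back to a default when the hint is missing or unparseable, and clamp it to the sender's own limits. The `cache` attribute is only a proposal in this conversation and its value format is undecided, so the function name and all three constants below are assumptions for illustration.

```python
DEFAULT_TTL = 2 * 60 * 60        # assumed sender default when there is no hint
MIN_TTL = 60 * 60                # assumed sender floor
MAX_TTL = 30 * 24 * 60 * 60      # assumed sender ceiling


def effective_ttl(hint):
    """Turn a publisher's hint (a string of integer seconds) into a TTL.

    The hint is advisory: missing or malformed values fall back to the
    default, and valid values are clamped to the sender's own bounds.
    """
    if hint is None:
        return DEFAULT_TTL
    text = hint.strip()
    if not text.isascii() or not text.isdigit():
        return DEFAULT_TTL
    return max(MIN_TTL, min(int(text), MAX_TTL))
```

For example, under these assumed limits a hint of `"60"` would be raised to the one-hour floor, and a value like `"30d"` would be ignored in favor of the default.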
# 11:11
[tantek] One question is, is there a need in any of the use-cases for the more complex vocabulary of cache control directives? https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cache-Control
# 11:12
[tantek] and should we use that same one attribute for generalizing discovery to other permalinks or the domain as a whole, or another attribute?
# 11:13
[tantek] as in, yes you may cache this discovery for a week, vs you may re-use this discovery on other posts on this domain, or domain home page (or both) for a week
# 11:19
aaronpk i suppose the way to avoid the opt in problem is to instead recommend some default behavior for senders, and then if a publisher knows the default is wrong for some reason, can use the cache indicator to suggest an alternative
# 11:20
aaronpk like if I know that bridgy (and most others) are caching a webmention endpoint for my whole domain for 24 hours, and I have a particular page where I want to use a separate special endpoint, I could add an attribute that says "actually don't use this endpoint for the rest of the domain"
# 11:21
[tantek] yes, that *and* I believe POST can support redirect too right?
# 11:22
[tantek] so even if a sender does use your default wm endpoint for a "particular page", you can redirect it serverside to the special endpoint as a repair action
# 11:22
aaronpk i forget the status code but one of them is meant for that, where the HTTP client is supposed to re-send the post body parameters to the new url
# 11:22
aaronpk rather than the browser behavior which is to make a GET to the redirected-to location (like after you submit a form you get redirected somewhere)
# 11:24
sknebel 307
# 11:25
aaronpk oh right, i should have remembered that, because there's a specific note to not use that in OAuth https://www.ietf.org/archive/id/draft-ietf-oauth-security-topics-15.html#name-307-redirect
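
For reference, a small Python sketch of the redirect behavior being discussed, following the rules in the Fetch standard: 307 and 308 keep the original method (and body), while 301 and 302 turn a POST into a GET, and 303 turns anything other than GET or HEAD into a GET. The function name is made up for illustration; real HTTP client libraries may differ in the details.

```python
def method_after_redirect(status, method):
    """Return the HTTP method a Fetch-style client uses after a redirect."""
    method = method.upper()
    if status in (307, 308):
        return method  # method and body are preserved
    if status == 303 and method not in ("GET", "HEAD"):
        return "GET"
    if status in (301, 302) and method == "POST":
        return "GET"
    return method
```

So a receiver that wants a webmention POST re-sent to a different endpoint would need to answer with 307 (or 308), not 301 or 302.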