Restructure the report's section hierarchy

From #33 and https://github.com/swicg/activitypub-html-discovery/issues/33#issuecomment-2480722239

section hierarchy should generally follow the expected user flow

section hierarchy should also place related topics or branches at the same level, instead of mixing concerns

general notes

"URL as input" should probably be lifted out of each section and become a new top-level input step. a URL is just https:, and it is not immediately clear what the content type of the payload will be.
"Document as input" should be broken up by the type of the input document, as well as the intended discovery target

this is kind of like a state transition graph or DFA (deterministic finite automaton) where you have the following rough connections

http(s) URL -> AS2 document
http(s) URL -> HTML document
HTML document -> AS2 alternate document
AS2 document -> HTML alternate document

note that, depending on the discovery method, the latter 2 might pass intermediately through "http(s) URL" again. for example, when dereferencing the value of url or attributedTo, or when using the target of a Link header for conneg where the type is not specified.
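
To make the graph concrete, here is a minimal Python sketch of the connections above. It is only an illustration: the state names collapse "AS2 alternate document" into "AS2 document" (and likewise for HTML), the two edges back to "http(s) URL" are my reading of the note about intermediate URLs, and `simple_paths` is a hypothetical helper, not anything the report defines.

```python
from collections import deque

# Edges taken from the rough connections listed above. The edges that lead
# back to "http(s) URL" are an assumption based on the note about passing
# through a URL again (e.g. url, attributedTo, untyped Link targets).
TRANSITIONS = {
    "http(s) URL": ["AS2 document", "HTML document"],
    "HTML document": ["AS2 document", "http(s) URL"],
    "AS2 document": ["HTML document", "http(s) URL"],
}


def simple_paths(start, goal, graph=TRANSITIONS):
    """Return every cycle-free path from start to goal, shortest first."""
    found = []
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        for nxt in graph.get(path[-1], []):
            if nxt in path:
                continue  # never revisit a state within one path
            if nxt == goal:
                found.append(path + [nxt])
            else:
                queue.append(path + [nxt])
    return found
```

Calling `simple_paths("HTML document", "AS2 document")` returns the direct route plus the route that goes through "http(s) URL", which is the intermediate step the note describes.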

possible structure

Given {"an http(s) URL", "an HTML document", "an AS2 document"} for a resource, I want to discover {"an HTML representation", "an AS2 representation"}.
Given {"an http(s) URL", "an HTML document", "an AS2 document"} for a resource, I want to discover the author as {"an HTML representation", "an AS2 representation"}.

hence:

Given an HTTP(S) URL, discover an AS2 document
- Via content negotiation (HTTP Accept header for AS2 type)
- Success: Response is 2xx with AS2
- Failure: Response is 2xx with non-AS2
- Failure: Response is 406 Not Acceptable
- Via web linking (HTTP Link header)
- Success: Headers contain Link rel=alternate type=AS2
- Failure: Headers do not contain Link rel=alternate type=AS2
- Via resource descriptor (WebFinger)
- Success: JRD contains link with rel=self and type=AS2
- Success: JRD contains link with rel=alternate and type=AS2
- Failure: JRD does not contain any relevant links
- Failure: No JRD in response
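
As a rough illustration of the content negotiation and WebFinger outcomes above, the following Python sketch classifies an already-received response and an already-parsed JRD. It makes no network requests. The function names are hypothetical, the "other status" outcome is my addition (the outline only names 2xx and 406), and the two media types are the AS2 types named by the ActivityStreams and ActivityPub specs.

```python
AS2_PROFILE = "https://www.w3.org/ns/activitystreams"


def is_as2_media_type(content_type):
    """True for application/activity+json, or application/ld+json with the AS2 profile."""
    if not content_type:
        return False
    parts = [p.strip() for p in content_type.split(";")]
    base = parts[0].lower()
    if base == "application/activity+json":
        return True
    if base == "application/ld+json":
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "profile":
                # profile may hold several space-separated URIs
                return AS2_PROFILE in value.strip().strip('"').split()
    return False


def classify_conneg(status, content_type):
    """Map a response to the success/failure cases listed in the outline."""
    if 200 <= status < 300:
        if is_as2_media_type(content_type):
            return "success"
        return "failure: 2xx with non-AS2"
    if status == 406:
        return "failure: 406 Not Acceptable"
    return "failure: other status"  # not in the outline; added for completeness


def as2_links_from_jrd(jrd):
    """Return hrefs of JRD links with rel=self or rel=alternate and an AS2 type."""
    if not isinstance(jrd, dict):
        return []  # corresponds to "No JRD in response"
    hrefs = []
    for link in jrd.get("links", []):
        if (
            link.get("rel") in ("self", "alternate")
            and is_as2_media_type(link.get("type"))
            and link.get("href")
        ):
            hrefs.append(link["href"])
    return hrefs
```

An empty list from `as2_links_from_jrd` covers both WebFinger failure cases: no JRD at all, or a JRD with no relevant links.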
Given an HTML document, discover an AS2 document
- Via web linking (HTML <link> tag or <a> tag)
- Success: Document contains link with rel=alternate type=AS2
- Failure: Document does not contain link with rel=alternate type=AS2
- Via embedded JSON-LD (HTML <script> tag)
- Success: Document contains script with type=application/ld+json
- Failure: Document does not contain script with type=application/ld+json
- Via HTTP(S) URL discovery?
- Against <link rel="self">
- Against <link rel="canonical">
- Against <base>
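
The web linking and embedded JSON-LD cases for an HTML document could be checked with something like the sketch below, which uses only Python's standard `html.parser`. The class and function names are hypothetical, the profile check is deliberately loose (a substring match), and the script check only tests for `type=application/ld+json` as the outline does; it does not verify that the embedded JSON-LD is actually AS2.

```python
import json
from html.parser import HTMLParser

AS2_PROFILE = "https://www.w3.org/ns/activitystreams"


def is_as2_media_type(value):
    """Loose check: activity+json, or ld+json whose parameters mention the AS2 profile."""
    base, _, params = (value or "").partition(";")
    base = base.strip().lower()
    if base == "application/activity+json":
        return True
    return base == "application/ld+json" and AS2_PROFILE in params


class AS2Finder(HTMLParser):
    """Collects rel=alternate AS2 links and embedded JSON-LD script bodies."""

    def __init__(self):
        super().__init__()
        self.alternate_hrefs = []
        self.embedded_json = []
        self._in_jsonld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag in ("link", "a"):
            rels = (a.get("rel") or "").lower().split()
            if "alternate" in rels and is_as2_media_type(a.get("type")) and a.get("href"):
                self.alternate_hrefs.append(a["href"])
        elif tag == "script":
            script_type = (a.get("type") or "").split(";")[0].strip().lower()
            if script_type == "application/ld+json":
                self._in_jsonld = True
                self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.embedded_json.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # malformed JSON counts as "no usable script"


def find_as2(html_text):
    """Return (alternate link hrefs, parsed JSON-LD script bodies)."""
    finder = AS2Finder()
    finder.feed(html_text)
    finder.close()
    return finder.alternate_hrefs, finder.embedded_json
```

Two empty lists correspond to the two failure bullets; what to do with `<link rel="self">`, `<link rel="canonical">`, or `<base>` is left open here, as it is in the outline.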
Given an AS2 document, discover an HTML document
- Via url
- Success: url is a Link with mediaType of text/html (explicit or assumed default)
- Failure: url is a Link with a mediaType that is not text/html
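
A small sketch of the `url` case, assuming the AS2 document has already been parsed into a Python dict. In AS2, `url` may be a bare URI string, a Link object, or a list of either. Treating a bare string or a Link without `mediaType` as HTML is the "assumed default" from the outline, not something AS2 guarantees, and `html_urls` is a hypothetical name.

```python
def html_urls(as2_object):
    """Return candidate HTML URLs from the url property of a parsed AS2 object."""
    value = as2_object.get("url")
    if value is None:
        return []
    entries = value if isinstance(value, list) else [value]
    result = []
    for entry in entries:
        if isinstance(entry, str):
            # Bare URI: no mediaType available, so text/html is assumed.
            result.append(entry)
        elif isinstance(entry, dict):
            # Link object: a missing mediaType falls back to the assumed default.
            media_type = entry.get("mediaType") or "text/html"
            base = media_type.split(";")[0].strip().lower()
            if base == "text/html" and entry.get("href"):
                result.append(entry["href"])
    return result
```

An empty result maps to the failure case: every `url` entry carries a `mediaType` other than text/html, or there is no `url` at all.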
Given an HTML document, discover an AS2/AP author
- Via web linking (HTML <link> tag or <a> tag)
- Possible success: Document contains link with rel=author
  - Success: Author is of type=AS2
  - Possible failure: Author is of different or unknown type
  - GO TO: HTTP(S) URL discovery
- Failure: Document does not contain link with rel=author
- Via metadata (HTML <meta> tag)
- OpenGraph properties
  - GO TO: HTTP(S) URL discovery
- fediverse:creator tag
  - convert to acct: URI then GO TO: Via resource descriptor (WebFinger)
- Failure: No relevant author metadata found
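
For the fediverse:creator branch, the "convert to acct: URI then GO TO WebFinger" step might look like the sketch below. It assumes the tag's content is a handle of the form `@user@host`; both function names are hypothetical. The WebFinger endpoint path and the `resource` query parameter come from RFC 7033.

```python
from urllib.parse import urlencode


def creator_to_acct(handle):
    """Turn an '@user@host' handle into an acct: URI."""
    handle = handle.strip()
    if handle.startswith("@"):
        handle = handle[1:]
    user, sep, host = handle.partition("@")
    if not (user and sep and host) or "@" in host:
        raise ValueError(f"not a user@host handle: {handle!r}")
    return f"acct:{user}@{host}"


def webfinger_url(acct_uri):
    """Build the WebFinger query URL for an acct: URI (RFC 7033)."""
    host = acct_uri.rsplit("@", 1)[1]
    # urlencode percent-encodes ':' and '@' in the resource value
    return f"https://{host}/.well-known/webfinger?" + urlencode({"resource": acct_uri})
```

The resulting URL is then handled by the "Via resource descriptor (WebFinger)" step, so the author branch does not need its own copy of the JRD rules.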

expand or build upon this structure as needed to fill in the rest of the report

The document already has a similar structure. It has 3 top-level types of discovery:

HTML to ActivityPub
ActivityPub to HTML
HTML to author ActivityPub

For each, there is a sub-section for starting with a URL, and another sub-section for starting with a document (like in a browser, or when an object is delivered via the ActivityPub protocol). Each of the applicable techniques is then listed in its own sub-sub-section.

Each sub-sub-section describes the technique in detail, gives an example or two, and then lists out some ways it can fail.

The exception is for author discovery. Going from resource HTML to author ActivityPub can be direct, or it can pass through the resource ActivityPub or the author HTML. Each of those other paths has its own sub-section and has examples.

There's a paragraph in the introduction on how to switch from a URL to a document, which is trivial except in the case of an HTML document, where you'll probably need some contextual information, such as from the browser environment.

I appreciate the offer, but it took a long time to get to this level of clarity, and I don't think reorganization at this point would be helpful.

Respectfully, looking at the current report structure, I don't see the level of clarity that you're seeing. I filed this issue because I found it sufficiently confusing. It would be a lot clearer if each section was devoted to a specific goal, and then the subsections were devoted to how to accomplish that goal.

In particular, the "URL as input" and "Document as input" distinction doesn't clearly fit into any of the current sections. You can see this in how the current structure actually duplicates and repeats information:

The key point that arises here is that "URL" is actually a separate class of discovery entirely. Given an arbitrary HTTP(S) URL, you don't know whether the resource at that URL is HTML or AS2. This means that "HTTP(S) URL" ends up having its own considerations before you even get to the part where you have HTML or AS2. It's a sort of "Step 0" in the discovery process, and it is also an intermediate step for several other discovery processes, like taking a link href of unknown or mismatched type and being expected to do... something? with it.

Separating out the URL considerations from the HTML/AS2 considerations significantly simplifies and clarifies the overall structure of the report. Leaving it mixed in with the document considerations is creating a lot of the confusion, because as previously pointed out, a URL is not known to be HTML or AS2 until you actually try to do something with it. It generally doesn't make sense to talk about an "HTML URL" or an "AS2 URL"; the reality is that it is an "HTTP(S) URL" instead.

Doing this kind of restructuring also reduces the complexity because you don't need as many levels of nesting to represent the same information. You can eliminate "Document as input" because it is redundant with your starting point already being an HTML document or an AS2 document. A lot of current headings are 4 levels deep... The report could be mostly 2 levels deep, occasionally dipping into 3.

The key point that arises here is that "URL" is actually a separate class of discovery entirely. Given an arbitrary HTTP(S) URL, you don't know whether the resource at that URL is HTML or AS2.

That's actually not true; you'll often know from context. You know that the URL is for AS2 if it's an ActivityPub id; you know that it's an HTML URL if you get it from document.location within a browser environment.

The purpose of the document is listed at the top: discovery of HTML from ActivityPub, ActivityPub from HTML, and ActivityPub author from HTML. It's not about general discovery on arbitrary URLs.

I think we could add a section on doing discovery when you don't know what kind of URL you have or what direction of discovery you're doing. I'm trying to think of a user story, though. Do you have any ideas?

I also think it's useful to be specific on the direction of discovery. Yes, discovering the HTML from ActivityPub and ActivityPub from HTML is practically the same when you use the Link method, but I really think the text and examples are clearer if each section starts from the purpose (HTML URL -> ActivityPub URL or vice versa) and stays focused on that purpose. It's problem-first, not solution-first.

you'll often know from context

If you have context, you're already further along in the discovery process and have already arrived at a document of some sort.

I sympathize that the report's main focus is on HTML <-> AS2, but the fact that HTTPS URLs are a significant intermediate step means that they would benefit from having their own section, like the introduction does. It's not completely irrelevant, either -- there are 3 discovery methods associated with HTTPS URLs that are "prerequisite information", especially when you land upon a URL as part of the discovery process (e.g. via rel=author or rel=alternate without type= being specified).

It's problem-first, not solution-first.

This is what I'm trying to say as well --

Given {"an http(s) URL", "an HTML document", "an AS2 document"} for a resource, I want to discover {"an HTML representation", "an AS2 representation"}.

Given {"an http(s) URL", "an HTML document", "an AS2 document"} for a resource, I want to discover the author as {"an HTML representation", "an AS2 representation"}.

Essentially, if you want to describe HTTP HEAD, HTTP Accept, or WebFinger, then you need to do this in the context of a URL. Otherwise, the alternative I'd propose is removing "URL as input" entirely from the report. But this seems like useful information to have on hand, so I'm advocating instead to keep it, but promote it to a top-level section. Sure, it's a rehash of what's in ActivityPub or what's in RFC 8288 Web Linking or what's in RFC 7033 WebFinger, but it has illustrative value to the reader rather than forcing them to open 3 other auxiliary documents to get the information they need.

swicg / activitypub-html-discovery

Restructure the report's section hierarchy #35
