w3c / activitystreams

Activity Streams 2.0
https://www.w3.org/TR/activitystreams-core/

Declaring `summary` to have markup other than `text/html`? #620

Open trwnh opened 1 month ago

trwnh commented 1 month ago

Description of issue

name is defined as "A simple, human-readable, plain-text name for the object. HTML markup MUST NOT be included."

summary is defined as "A natural language summarization of the object encoded as HTML."

content includes in its definition that "By default, the value of content is HTML. The mediaType property can be used in the object to indicate a different content type."
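As a rough illustration of how a consumer might read the three quoted definitions together, here is a minimal Python sketch. The helper name `effective_media_type` is hypothetical and not part of any spec or library; it only encodes the rules quoted above (`name` is plain text, `summary` is HTML, and `content` defaults to HTML unless `mediaType` says otherwise).

```python
def effective_media_type(obj: dict, prop: str) -> str:
    """Return the media type a consumer would assume for `prop` on `obj`.

    Hypothetical helper: it only mirrors the quoted definitions.
    """
    if prop == "name":
        # "plain-text ... HTML markup MUST NOT be included"
        return "text/plain"
    if prop == "summary":
        # "encoded as HTML"; mediaType is not defined to apply here
        return "text/html"
    if prop == "content":
        # HTML by default, overridable through mediaType
        return obj.get("mediaType", "text/html")
    raise ValueError(f"unsupported property: {prop}")


note = {
    "type": "Note",
    "mediaType": "text/markdown",
    "content": "*hello*",
    "summary": "<p>a greeting</p>",
}

content_type = effective_media_type(note, "content")
summary_type = effective_media_type(note, "summary")
```

Under this reading, `mediaType` changes how `content` is interpreted but leaves `summary` as HTML, which is the asymmetry this issue is about.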

So to synthesize these three definitions:

But there are cases where a producer might want to signal a different content type for summary; for example, text/plain or text/markdown. Recently, https://github.com/mastodon/mastodon/pull/32538 came up as an example of wanting to produce a summary that is NOT text/html. So the question is, might it make sense to provide a mechanism for declaring that summary is something other than text/html?

Potential solutions

Action items

nightpool commented 1 month ago

I'm not seeing any justification in https://github.com/mastodon/mastodon/pull/32538 for why any content type other than HTML would be useful or preferred. I don't think Claire is expressing any sort of preference for a markdown summary type, just saying that she initially misunderstood its type.

Adding the requirement to have to process different mime types would have made the PR much more complicated, anyway, than just adding HTML sanitization, since in any case—regardless of what you're producing—you'll have to handle incoming HTML.

this just feels like overcomplicating the spec for no additional benefit. I'm still not convinced there's even a good justification for allowing content to be different media types—are there any major non-HTML implementations?

github-actions[bot] commented 4 weeks ago

This issue has been labelled as potentially needing a FEP, and contributors are welcome to submit a FEP on the topic. Note that issues may be closed without the FEP being created; that does not mean that the FEP is no longer needed.

evanp commented 4 weeks ago

So, I think the problem with adding flags to indicate that the summary is not HTML is that it's not backwards compatible; consumers will expect summary to always be HTML as documented.

I agree that a primer page makes sense.

I'd also suggest a FEP for defining a new description or other property that can have different media types. Using a new property instead of summary allows us to define new semantics for that property, that aren't encumbered with the pretty strict requirement that summary be HTML.

evanp commented 4 weeks ago

One thing about the primer page is that there is the question of when an object does not have a name and should have a summary without HTML. Not all plain text is valid HTML; for example, text that uses unescaped characters that are meaningful in HTML like <>'".

trwnh commented 2 weeks ago

> why any content type other than HTML would be useful or preferred.

minimal example where HTML parsing is destructive:

{
  "summary": "I am trying to serialize the RDF statement <Alice> <knows> <bob> into plain-text, but a naive HTML sanitizer is stripping the statement completely"
}

{
  "summary": "I am trying to serialize the RDF statement  into plain-text, but a naive HTML sanitizer is stripping the statement completely"
}

the workaround is to HTML-escape the angle brackets, but not every consumer will unescape them
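A minimal Python sketch of both the problem and the workaround, assuming a deliberately naive sanitizer. The `naive_strip_tags` function and its regex are hypothetical stand-ins, not what any particular implementation does; real sanitizers differ in how they treat unknown tags and whitespace.

```python
import html
import re

# Hypothetical naive sanitizer: removes anything that looks like a tag.
TAG_RE = re.compile(r"<[^>]*>")


def naive_strip_tags(text: str) -> str:
    return TAG_RE.sub("", text)


plain = "the RDF statement <Alice> <knows> <bob> into plain-text"

# Treating the plain text as HTML loses the statement entirely.
stripped = naive_strip_tags(plain)

# Workaround: the producer escapes the angle brackets first.
escaped = html.escape(plain, quote=False)

# The original text survives only if the consumer unescapes the result.
round_tripped = html.unescape(naive_strip_tags(escaped))
```

A consumer that displays `escaped` without unescaping it would show the literal `&lt;Alice&gt;` entities instead, which is the failure mode described above.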