w3c / activitypub

http://w3c.github.io/activitypub/

Appropriate subset of HTML for `content` and `summary` #419

Open evanp opened 10 months ago

evanp commented 10 months ago

Daniel Hernandez asked an interesting question about the HTML content of Activity Streams 2.0 objects.

The only two properties that can contain HTML markup are summary and content.

Mastodon has documentation on HTML sanitization that lists the elements and attributes it supports.

Does it make sense to write additional documentation for this for the entire network?
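For readers new to the thread, here is a minimal sketch of an Activity Streams 2.0 object carrying HTML in those two properties. The `id` URL and the text values are hypothetical placeholders, chosen only for illustration.

```python
import json

# A hypothetical Note; only `summary` and `content` carry HTML markup.
note = {
    "@context": "https://www.w3.org/ns/activitystreams",
    "type": "Note",
    "id": "https://example.com/notes/1",  # hypothetical URL
    "summary": "A short <em>HTML</em> summary",
    "content": "<p>Hello, <strong>fediverse</strong>!</p>",
}

# Serialize the object the way it would travel between servers.
serialized = json.dumps(note, indent=2)
print(serialized)
```

A receiving server would parse this JSON and then decide which of the tags in `summary` and `content` it is willing to render.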

evanp commented 10 months ago

I made this issue for ActivityPub rather than Activity Streams 2.0 because of the need for sharing HTML between untrusted partners. Other uses for AS2, such as archiving, might use a different HTML profile.

snarfed commented 10 months ago

I'd hesitate to ever make this normative, but guidance or a part of a profile could definitely be helpful.

aschrijver commented 10 months ago

In the thread you mention that you would like to see more use of Article. The question here is whether that object type should have a richer recommended subset. That would make sense to me: for example, a Note subset would not support or recommend headings, but an Article subset would, perhaps along with a whole range of semantic HTML tags for formatting articles.

danielhz commented 10 months ago

> I'd hesitate to ever make this normative, but guidance or a part of a profile could definitely be helpful.

@snarfed, why not make this normative? I think it would be easier for developers to know what to expect in the content of each type of object.

Should the HTML elements allowed for the content of each type of object be the same, or should they vary, for example, between an Article and a Note? In RDFS the range of a property, like content, does not depend on the subject type, but on the property itself. So there may be a need to consider different sets of elements by object type, but this contradicts the RDFS design of property ranges.

danielhz commented 10 months ago

In the thread, @evanp says that the Article object was downscaled to a Note object. In this transformation, some elements, such as <h2>, were replaced by other elements, such as <strong>. This suggests that an Article and a Note may allow different sets of elements.
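To make that transformation concrete, here is a rough sketch of downscaling headings using Python's standard `html.parser`. The function name `downscale_headings` is hypothetical, and wrapping the `<strong>` in a `<p>` is my own assumption to keep the result block-level; the thread only says that `<h2>` became `<strong>`. The sketch also assumes the input has already been sanitized, and it drops comments.

```python
from html import escape
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingDownscaler(HTMLParser):
    """Re-emits HTML, replacing heading elements with <p><strong>...</strong></p>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            # Assumption: keep the former heading on its own line via <p>.
            self.out.append("<p><strong>")
            return
        rendered = "".join(
            f" {name}" if value is None else f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
        )
        self.out.append(f"<{tag}{rendered}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> get a start tag only.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in HEADINGS:
            self.out.append("</strong></p>")
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text arrives with entities decoded, so escape it again on output.
        self.out.append(escape(data, quote=False))


def downscale_headings(html_text):
    """Hypothetical helper: Article-style HTML in, Note-style HTML out."""
    parser = HeadingDownscaler()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


print(downscale_headings("<h2>Title</h2><p>Body &amp; more<br/>end</p>"))
```

Whether a sender or a receiver should perform this kind of rewrite is exactly the sort of question a profile could answer.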

danielhz commented 10 months ago

Is it only the set of elements that we want to restrict, or also the way they can be combined? Recall that HTML also restricts how elements can be nested.

snarfed commented 10 months ago

> @snarfed, why not make this normative? I think it would be easier for developers to know what to expect in the content of each type of object.

It would definitely be easier! A profile can definitely help with that. Enshrining it into the normative core spec feels too heavy handed to me though:

nightpool commented 10 months ago

In addition to the reasons Ryan gave (which are all very good reasons), I think there's a more fundamental one, which is that the ActivityPub spec is built to be open to extension. Specifying a normative list of "allowed" HTML tags or attributes would make it impossible for implementations to extend the types of content their users are allowed to publish.

In effect, such a restriction would have no value, since developers would just violate it any time they needed to support a novel type of content. (For example, there's already an FEP for potential MathML support. Such an FEP would violate the core spec if we added such a restriction.) Instead, a whitelist would only serve to produce a spec that is not followed in the real world, which would defeat the purpose of specification.

On Thu, Jan 18, 2024, 1:43 PM Ryan Barrett wrote:

> @snarfed, why not make this normative? I think it would be easier for developers to know what to expect in the content of each type of object.

It would definitely be easier! A profile can definitely help with that. Enshrining it into the normative core spec feels too heavy handed to me though:

  • Implementations will still receive content with HTML tags outside the allowlist, along with HTML that doesn't validate, non-HTML, and even binary, due to bugs, old implementations, attacks, etc. Developers will still need to sanitize incoming activities.
  • Implementations vary in whether/how they can handle or render HTML at all. Most will still need to choose their own set of tags, if any, to sanitize down to.
  • There's a mature existing ecosystem of web apps, CMSes, and related tools that emit HTML content and are gradually adopting ActivityPub. It'd be prohibitive to require all of them to change their output markup.
  • The web and HTML evolve over time. It'd be nice to support that gracefully and not lock out new technologies like web components that use novel tags.
  • Attacks also evolve over time. Tags and markup that are "safe" now may not stay that way. It'd be nice to be agile and address those changes quickly, in guidance or profiles, instead of waiting years for normative spec updates.
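As an illustration of the first two points, here is a minimal sketch of receiver-side allowlist sanitizing with Python's standard `html.parser`. The tag and attribute sets are hypothetical choices for this example; they are not taken from any spec, profile, or Mastodon's documentation, and a production system would more likely rely on a maintained sanitizer library.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlist, for illustration only.
ALLOWED_TAGS = {"p", "br", "a", "strong", "em", "ul", "ol", "li",
                "blockquote", "code", "pre"}
ALLOWED_ATTRS = {"a": {"href"}}
SAFE_URL_PREFIXES = ("http://", "https://")
VOID_TAGS = {"br"}
DROP_WITH_CONTENT = {"script", "style"}  # remove these tags and their text


class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return  # drop the tag itself, keep any text inside it
        kept = []
        for name, value in attrs:
            if value is None or name not in ALLOWED_ATTRS.get(tag, set()):
                continue
            if name == "href" and not value.strip().lower().startswith(SAFE_URL_PREFIXES):
                continue  # rejects javascript:, data:, and relative URLs
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in ALLOWED_TAGS or tag in VOID_TAGS:
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # Entities were decoded by the parser, so escape again on output.
            self.out.append(escape(data, quote=False))


def sanitize(html_text):
    parser = AllowlistSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


dirty = ('<p onclick="x()">Hi <b>there</b> '
         '<a href="javascript:alert(1)">x</a><script>alert(1)</script></p>')
print(sanitize(dirty))  # <p>Hi there <a>x</a></p>
```

Because the allowlist lives in the receiver's own configuration, each implementation can pick its own set of tags and update it quickly as attacks change, which is the flexibility the list above argues for.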


evanp commented 9 months ago

As a start to this process, I'm going to add a page to the ActivityPub Primer with guidance on the best practices for each of the properties summary and content: https://www.w3.org/wiki/ActivityPub/Primer/HTML

evanp commented 9 months ago

I've started the document, but there's still a lot to do. I'm going to self-assign and come back to this in the near future.