w3c / activitypub

http://w3c.github.io/activitypub/

Conflicts with JSON-LD specification regarding object identifiers for anonymous objects #476

Open zotanmew opened 4 weeks ago

zotanmew commented 4 weeks ago

The AP spec states the following:

3.1 Object Identifiers

All Objects in [ActivityStreams] should have unique global identifiers. ActivityPub extends this requirement; all objects distributed by the ActivityPub protocol MUST have unique global identifiers, unless they are intentionally transient (short lived activities that are not intended to be able to be looked up, such as some kinds of chat messages or game notifications). These identifiers must fall into one of the following groups:

  1. Publicly dereferencable URIs, such as HTTPS URIs, with their authority belonging to that of their originating server. (Publicly facing content SHOULD use HTTPS URIs).
  2. An ID explicitly specified as the JSON null object, which implies an anonymous object (a part of its parent context)

The JSON-LD spec (version 1.0) states the following:

7.4.3) If expanded property is @id and value is not a string, an invalid @id value error has been detected and processing is aborted. Otherwise, set expanded value to the result of using the IRI Expansion algorithm, passing active context, value, and true for document relative.

The JSON-LD spec (version 1.1) states the same thing:

13.4.3) If expanded property is @id: 13.4.3.1) If value is not a string, an invalid @id value error has been detected and processing is aborted. When the frameExpansion flag is set, value MAY be an empty map, or an array of one or more strings.

Since these are in conflict, it is not possible to comply with both the JSON-LD specification and the ActivityPub specification simultaneously.
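To make the conflict concrete, here is a minimal Python sketch of just the "value must be a string" rule quoted above. It is not a JSON-LD processor: the function and exception names are hypothetical, and it assumes `id` is treated as an alias of `@id` (as the ActivityStreams context defines it) instead of resolving aliases through a real context.

```python
class InvalidIdValueError(ValueError):
    """Stands in for the JSON-LD "invalid @id value" error (hypothetical name)."""


def check_id_values(node, id_keys=("@id", "id")):
    """Recursively apply the 'value must be a string' rule to every id key.

    Assumption: "id" is handled as an alias of "@id". A real processor would
    resolve aliases from the active context during expansion.
    """
    if isinstance(node, dict):
        for key, value in node.items():
            if key in id_keys:
                if not isinstance(value, str):
                    raise InvalidIdValueError(
                        f"{key} must be a string, got {value!r}"
                    )
            else:
                check_id_values(value, id_keys)
    elif isinstance(node, list):
        for item in node:
            check_id_values(item, id_keys)


# An object with the id omitted passes the check.
check_id_values({"type": "Note", "attributedTo": {"name": "Anonymous"}})

# An explicit null id, as AP section 3.1 currently describes, is rejected.
try:
    check_id_values({"type": "Note", "id": None})
except InvalidIdValueError as error:
    print(error)
```

Under this rule an omitted id is fine while an explicit null is an error, which is the asymmetry the rest of this issue is about.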

This was noticed as AP implementer Akkoma has recently started federating anonymous objects in accordance with the AP specification (explicit nulls), which has broken federation with implementations performing JSON-LD expansion (for example, Iceshrimp.NET).

Some solutions were proposed in this Akkoma PR thread.

trwnh commented 3 weeks ago

pretty sure this is an error in the text of the AP spec. the bit about "ID explicitly specified as the JSON null object" is clearly an error and should be reworded or removed. the correct behavior is to omit the id entirely, which triggers the "anonymous object" or "blank node" behavior that one would expect.

the section 3.1 text should read something like:

All Objects in [ActivityStreams] should have unique global identifiers. ActivityPub extends this requirement; all objects distributed by the ActivityPub protocol MUST have unique global identifiers, unless they are intentionally transient (short lived activities that are not intended to be able to be looked up, such as some kinds of chat messages or game notifications) or otherwise anonymous objects (an embedded node that is part of its parent context). These unique global identifiers SHOULD be HTTPS URIs for publicly facing content that is intended to be publicly dereferenceable. These identifiers must fall into one of the following groups: [...]

side note: @id should always be an IRI or expand to an IRI against the @base. the use of a string as @id (with no corresponding @base) might be fine on the JSON side, but it will cause all triples with that object as the subject to be removed from the output if converted to RDF (since at best it is interpreted as a "relative URI reference", which is not allowed for subjects)
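A small Python sketch of the resolution step this side note refers to. It only approximates IRI resolution with the standard library's `urljoin` and does not perform any RDF conversion; `resolve_id` and the example base URL are hypothetical.

```python
from urllib.parse import urljoin, urlsplit


def resolve_id(value, base=None):
    """Resolve an id against an optional base.

    Returns the resolved string and whether it is absolute (has a scheme).
    This approximates IRI resolution with URI rules from the standard library.
    """
    resolved = urljoin(base, value) if base else value
    return resolved, bool(urlsplit(resolved).scheme)


# With a base, the relative reference becomes an absolute identifier.
print(resolve_id("notes/1", "https://social.example/"))

# Without a base it stays relative, so it cannot serve as an RDF subject.
print(resolve_id("notes/1"))
```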

other side note: this came up in #396 as well regarding "partial updates", except for properties instead of ids.

TheOneric commented 3 weeks ago

the correct behavior is to omit the id entirely, which triggers the "anonymous object" or "blank node" behavior that one would expect.

This removes the ability to distinguish transient from anonymous objects unless they occur on the top-level (cannot be anonymous). I’m fine with this and in fact felt like transient objects only really make sense on the top-level anyway, but to make sure: is there any reason why this distinction should be preserved considering the current spec revision makes an explicit effort for it?

trwnh commented 3 weeks ago

distinguish transient from anonymous objects

“transient” and “anonymous” are different aspects of the same functionality. a “transient activity” is also an “anonymous object”, because activities are objects, and because the thing that makes the activity transient is that it’s anonymous.

example of a transient activity:

{
  "actor": "https://someone.example",
  "type": "InGameNotification",
  "content": "The payload is nearing the checkpoint!"
}

example of embedded anonymous objects for attributedTo and attachment, part of the parent context of the Note:

{
  "id": "https://imageboard.example/19387428939",
  "type": "Note",
  "attributedTo": {
    "name": "Anonymous"
  },
  "content": ">>19387428935 >>19387428938 take a look y’all",
  "inReplyTo": ["https://imageboard.example/19387428935", "https://imageboard.example/19387428938"],
  "attachment": {
    "type": "Image",
    "name": "IMG_4634.jpeg",
    "url": {
      "href": "https://imageboard.example/attachments/3847374.jpg",
      "mediaType": "image/jpeg",
      "width": 375,
      "height": 667
    }
  },
  "tag": [
    {"type": "Mention", "name": ">>19387428935", "href": "https://imageboard.example/19387428935"},
    {"type": "Mention", "name": ">>19387428938", "href": "https://imageboard.example/19387428938"}
  ]
}

TheOneric commented 3 weeks ago

a “transient activity” is also an “anonymous object”, because activities are objects, and because the thing that makes the activity transient is that it’s anonymous.

While their effect for receiving servers may usually amount to the same thing, the way current AP spec describes them they are distinct. Anonymous objects are defined as being "part of its parent context" (and thus not able to be looked up on its own), while transient objects are “short lived activities that are not intended to be able to be looked up”.

The described purpose and intent are different and importantly, anonymous objects cannot exist on the top-level, since there is no parent context to be part of. Your example transient activity therefore is not an anonymous object.
The quoted bit above also suggests only activities can be transient, though later on it also refers to general “transient objects”.

If there’s no reason to ever distinguish between them, I’d suggest to further amend the wording to actually merge “transient” into “anonymous” (E.g. allow omitting the id for anonymous objects and then just mention embedded objects and transient activities as examples of anonymous objects)

trwnh commented 3 weeks ago

the way current AP spec describes them they are distinct

the way current AP spec describes them is wrong and misleading. the “id:null” mechanism is invalid and should never have been written.

the purpose of the paragraph is to require dereferenceability except in cases where you explicitly don’t want this. in such cases, you leave out the id.
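On the producing side, "leave out the id" could look like the following Python sketch. The helper name `serialize_object` is hypothetical; the point is only that the `id` key is absent from the output, not present with a null value.

```python
import json


def serialize_object(obj_type, obj_id=None, **properties):
    """Serialize an object, emitting "id" only when one was assigned.

    When obj_id is None the key is left out entirely; it is never
    written as "id": null.
    """
    doc = {"type": obj_type, **properties}
    if obj_id is not None:
        doc = {"id": obj_id, **doc}
    return json.dumps(doc)


# Transient activity: no id key at all in the output.
print(serialize_object("InGameNotification",
                       content="The payload is nearing the checkpoint!"))

# Identified object: id is a string URI.
print(serialize_object("Note", "https://imageboard.example/19387428939"))
```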

TheOneric commented 3 weeks ago

the “id:null” mechanism is invalid and should never have been written.

But it was written and provided a distinction between transient and anonymous. This distinction is also the only motivation I can come up with why it was written the way it is in the first place. That’s why I’m asking about whether it is safe to drop the ability to distinguish transient and anonymous objects.

If it is safe to drop, note that your proposed wording still keeps "anonymous" and "transient" distinct in purpose even though they’re no longer distinguishable for receivers, thus my suggestion to explicitly merge the description of those categories.

trwnh commented 3 weeks ago

so something like this, then?

All Objects in [ActivityStreams] should have unique global identifiers. ActivityPub extends this requirement; all objects distributed by the ActivityPub protocol MUST have unique global identifiers, unless they are ~~intentionally transient (short lived activities that are not intended to be able to be looked up, such as some kinds of chat messages or game notifications)~~ not intended to be looked up or referred to. In other words, these identifiers must fall into one of the following groups:

  1. Publicly dereferencable URIs, such as HTTPS URIs, with their authority belonging to that of their originating server. (Publicly facing content SHOULD use HTTPS URIs).
  2. An ID ~~explicitly specified as the JSON null object~~ that is explicitly omitted~~, which implies~~; for example, an anonymous object (a part of its parent context) or a transient activity (short lived activities that are not intended to be able to be looked up, such as some kinds of chat messages or game notifications) would omit its ID.

(somewhat un/related, but i think the bit about "authority belonging to that of their originating server" also should be changed, since it's not actually logically implied by "unique global identifier" and it makes all objects owned by an HTTPS server instead of by actors. that's a separate issue, though.)

TheOneric commented 3 weeks ago

seems good; thx

silverpill commented 3 weeks ago

I think "transient activities" should be removed from the spec. It sounds like "looking up" is the only purpose of an identifier, but identifiers can also be used for authentication, authorization, de-duplication of incoming activities and synchronization of collections. There is no good reason for a top-level object to not have an identifier. "Short lived" makes it even more confusing, implying that activities have a duration or a lifetime.
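The de-duplication point can be illustrated with a short Python sketch. The `Inbox` class is hypothetical and far simpler than a real inbox; it only shows that an activity without an id gives the receiver nothing to de-duplicate on.

```python
class Inbox:
    """Toy inbox that drops activities whose id has already been seen."""

    def __init__(self):
        self._seen_ids = set()
        self.accepted = []

    def receive(self, activity):
        """Return True if the activity was accepted, False if it was a duplicate."""
        activity_id = activity.get("id")
        if activity_id is not None:
            if activity_id in self._seen_ids:
                return False
            self._seen_ids.add(activity_id)
        # Activities without an id cannot be matched against earlier
        # deliveries, so a redelivery is accepted again.
        self.accepted.append(activity)
        return True


inbox = Inbox()
identified = {"id": "https://someone.example/activities/1", "type": "Create"}
transient = {"type": "InGameNotification", "actor": "https://someone.example"}

print(inbox.receive(identified), inbox.receive(identified))
print(inbox.receive(transient), inbox.receive(transient))
```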

TallTed commented 3 weeks ago

@zotanmew — Please edit your initial post, and code fence each instance of @id (like `@id`), so that GitHub user isn't spammed with notifications about this discussion in which they did not choose to participate.

zotanmew commented 3 weeks ago

@TallTed I'm told that editing it won't remove the mention, though I'm happy to edit it regardless.

evanp commented 3 weeks ago

I'd like to test this with JSON-LD parsers to see what the actual behaviour is. I'm particularly interested in whether there's any daylight whatsoever between the @id property and the id property that would allow this different behaviour for the latter.

The JSON-LD playground does show a null id value as an error: https://json-ld.org/playground/#startTab=tab-expanded&json-ld=%7B%22%40context%22%3A%22https%3A%2F%2Fwww.w3.org%2Fns%2Factivitystreams%22%2C%22id%22%3Anull%7D

I think we have two possible paths forward:

  1. Publish an erratum that this invalid syntax should never have been specified.
  2. Publish a deprecation, accepting the fact that some of our implementers do not use JSON-LD compliant parsers, so a null value is acceptable for those consumers.

There are a few other ways that we could represent "anonymous" or "transient" or otherwise unidentified objects:

  1. Just don't provide an id value; leave it undefined.
  2. Have a specified term for an anonymous object, such as https://www.w3.org/ns/activitystreams#Anonymous.

evanp commented 3 weeks ago

I think an Erratum is necessary here. Taking out the reference to using null, we could have something like the following:

...all objects distributed by the ActivityPub protocol MUST have unique global identifiers, unless they are intentionally transient or anonymous ([examples]) in which case the identifier MAY be omitted. The identifiers must be publicly dereferencable URIs, such as HTTPS URIs, with their authority belonging to that of their originating server. (Publicly facing content SHOULD use HTTPS URIs).
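Read as a validation rule, the proposed wording might be sketched in Python like this. The function name `id_problem` is hypothetical, and the check is deliberately shallow: it does not test whether the URI actually dereferences, whether its authority matches the originating server, or whether the sender really qualifies for the transient/anonymous exception. HTTPS is not enforced because the text makes it a SHOULD.

```python
from urllib.parse import urlsplit


def id_problem(obj):
    """Return None if the object's id is acceptable, else a short description."""
    if "id" not in obj:
        # Omitted: permitted for transient or anonymous objects.
        return None
    value = obj["id"]
    if not isinstance(value, str):
        return f"id must be a string URI, got {value!r}; omit the property instead"
    parts = urlsplit(value)
    if not parts.scheme or not parts.netloc:
        return "id must be an absolute URI with an authority"
    return None


print(id_problem({"type": "Note"}))
print(id_problem({"type": "Note", "id": None}))
print(id_problem({"type": "Note", "id": "https://imageboard.example/19387428939"}))
```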

zotanmew commented 3 weeks ago

Sounds good to me (I'd vastly prefer the erratum option over a deprecation).

evanp commented 2 weeks ago

We could also add something like this?

Consumers MAY treat a null value for the id property as if the property was not defined. Publishers SHOULD NOT use null for the id property, as it is not valid JSON-LD.
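A consumer-side sketch of that MAY, in Python. `strip_null_ids` is a hypothetical helper that a receiver could run before JSON-LD expansion; it assumes both `id` and `@id` should be handled and removes only keys whose value is null.

```python
def strip_null_ids(node, id_keys=("id", "@id")):
    """Return a copy of the document with null-valued id keys removed.

    Non-null ids and all other properties are preserved, including nulls
    on properties other than id.
    """
    if isinstance(node, dict):
        return {
            key: strip_null_ids(value, id_keys)
            for key, value in node.items()
            if not (key in id_keys and value is None)
        }
    if isinstance(node, list):
        return [strip_null_ids(item, id_keys) for item in node]
    return node


incoming = {
    "id": "https://imageboard.example/19387428939",
    "type": "Note",
    "attributedTo": {"id": None, "name": "Anonymous"},
}
print(strip_null_ids(incoming))
```

Returning a copy leaves the received payload untouched, which may matter if the raw bytes are needed later, for example for signature checks.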

This gives us a little Postel resilience.

evanp commented 2 weeks ago

And, honestly, I hate the "MUST unless you don't want to" phrasing. Is it too late to just do this?

...all objects distributed by the ActivityPub protocol SHOULD have unique global identifiers. The identifiers must be publicly dereferencable URIs, such as HTTPS URIs, with their authority belonging to that of their originating server. (Publicly facing content SHOULD use HTTPS URIs).

zotanmew commented 2 weeks ago

Given that most implementations do not do LD processing for most or even all activities, I’d worry people who are aware of that fact might interpret that as it being fine to send null values, so I’d go for a MUST NOT here, as it makes federation with any implementations that do process activities as JSON-LD impossible when the activity contains such a null @id.

trwnh commented 2 weeks ago

I hate the "MUST unless you don't want to" phrasing. Is it too late to just

i don't think saying "AS2 says you SHOULD have unique global identifiers; AP extends this such that all objects SHOULD have unique global identifiers." makes sense. the thing is, it's already a SHOULD in AS2. why have the language about "extending the requirement" in that case?

by contrast, the "MUST but MAY" is not simply "you don't want to". that's what a SHOULD is. "SHOULD" is "do this unless you have a reason not to." "MUST but MAY" is "do this in every circumstance, but we have the following enumerated exceptions." in other words, the anti-fulfillment argument changes from "i have a good reason not to" and becomes "i specifically qualify for this exception".

Postel-wise, the behavior we're trying to go for here is "don't do this, ever; and if you're doing this right now, stop it." otherwise, it sounds like the bit about how consumers MAY strip null ids would perhaps be good guidance for the primer, but it's pretty clear that on the spec level we should just go ahead and remove this unfortunate error once we have a WG.