w3c / activitypub

http://w3c.github.io/activitypub/

Partially anonymise question and like responses #379

Open penguin42 opened 1 year ago

penguin42 commented 1 year ago

Hi, my understanding is that an ActivityPub server receives responses to 'Like' and 'Question' (i.e. 'poll') activities containing the end user's ID. This would enable the maintainer of a malicious or compromised server to perform analytics on end users based on apparently trivial operations; while posting a contentious post feels like you're expressing public views, a 'Like' or a 'Question' vote doesn't feel like it should be public.

I can think of two partial improvements: a) The user's end server could anonymise the user in the Actor entry in the response (maybe in a way traceable to the owner of the user's end server, but not directly to the owner of the original post). b) A new Activity type could be used to represent the aggregate responses from a given server; i.e. "this server has 15 users who 'like' post ...." or "this server has 5 votes for choice 'Admit Coridan', 4 votes for 'deny Coridan'".
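One possible reading of option (a), as a rough sketch only: the user's end server derives a per-post pseudonym with a keyed hash, so that it can map the pseudonym back to the user while the server hosting the original post sees only an opaque id. Nothing like this is defined in AP; the function name, the secret, and the `https://home.example/anon/` URL layout are all hypothetical.

```python
import hashlib
import hmac


def pseudonymous_actor_id(server_secret: bytes, actor_id: str, object_id: str,
                          base_url: str = "https://home.example/anon/") -> str:
    """Derive a stable pseudonym for (actor, object) under a server-held secret.

    Hypothetical helper: only the holder of server_secret can recompute the
    mapping, and the same user gets a different pseudonym for each object.
    """
    message = f"{actor_id}\n{object_id}".encode("utf-8")
    digest = hmac.new(server_secret, message, hashlib.sha256).hexdigest()
    return base_url + digest[:32]
```

Because the pseudonym depends on the object id as well as the actor id, the receiving server could still de-duplicate responses to one post without being able to link the same user across posts.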

snarfed commented 1 year ago

Hmm! I don't entirely follow. Could you post a specific example of where you see these end user ids?

AP does specify both likes and liked collections that expose each post's and user's likes, respectively, but those are both just MAY in the spec. Implementations are welcome to omit them in practice, and often do, eg afaik Mastodon doesn't provide either.

As for polls (Questions), afaik AP doesn't specify any behavior for them at all, definitely not collections of the individual actor ids who voted for each answer. Mastodon returns total vote counts for each answer in its AS2 Question objects, but afaik not individual voters.
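For illustration, a Mastodon-style Question carries per-option totals roughly like the object below. The field layout (`oneOf`/`anyOf`, `name`, `replies.totalItems`) is written from memory of what Mastodon serves and should be checked against a real response; the ids and numbers are made up.

```python
# Illustrative object only; ids and counts are invented.
question = {
    "type": "Question",
    "id": "https://social.example/users/alice/statuses/1",
    "oneOf": [
        {"type": "Note", "name": "Admit Coridan",
         "replies": {"type": "Collection", "totalItems": 5}},
        {"type": "Note", "name": "Deny Coridan",
         "replies": {"type": "Collection", "totalItems": 4}},
    ],
}


def vote_totals(q: dict) -> dict:
    """Return {option name: total votes} from a Question's oneOf/anyOf list."""
    options = q.get("oneOf") or q.get("anyOf") or []
    return {o["name"]: o.get("replies", {}).get("totalItems", 0) for o in options}
```

Note that nothing in this shape names an individual voter; only the totals are published.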

Higher level, the AP protocol fundamentally isn't architected to prevent analytics or tracking. Most activities (including individual likes and poll responses) are fully public and delivered to inboxes across thousands of instances. We should expect that multiple actors are storing and archiving them all, verbatim. The fediverse does successfully discourage people from publicly announcing projects that do this kind of thing, but that's social, not technical. At this point, imho the analytics/tracking cat is out of the fediverse bag, at least for private uses, and it's too late to prevent it with technical band-aids.

penguin42 commented 1 year ago

> Hmm! I don't entirely follow. Could you post a specific example of where you see these end user ids?

This is by spec reading rather than protocol inspection; so I don't have server-server messages to hand.

> AP does specify both likes and liked collections that expose each post's and user's likes, respectively, but those are both just MAY in the spec. Implementations are welcome to omit them in practice, and often do, eg afaik Mastodon doesn't provide either.

But I do see '... favorited your post' - that's a 'Like' isn't it?

> As for polls (Questions), afaik AP doesn't specify any behavior for them at all, definitely not collections of the individual actor ids who voted for each answer. Mastodon returns total vote counts for each answer in its AS2 Question objects, but afaik not individual voters.

But that information is currently sent to the server that originated the Poll - correct? (Please correct me if I'm wrong) Another AP implementation that wanted to could collate it.

> Higher level, the AP protocol fundamentally isn't architected to prevent analytics or tracking. Most activities (including individual likes and poll responses) are fully public and delivered to inboxes across thousands of instances. We should expect that multiple actors are storing and archiving them all, verbatim. The fediverse does successfully discourage people from publicly announcing projects that do this kind of thing, but that's social, not technical. At this point, imho the analytics/tracking cat is out of the fediverse bag, at least for private uses, and it's too late to prevent it with technical band-aids.

IMHO we have to do anything reasonable to improve privacy - as you say, Mastodon only displays vote counts, not individual voters - so for it, there's no good reason that its AP messages should contain public IDs.

The point I was trying to make in the first paragraph was that, to the end user, a 'poll' response and posting something feel qualitatively different in the privacy they might want from them; if we can give them that extra privacy without too much pain then I think we should - someone somewhere is bound to misuse it if we don't.

snarfed commented 1 year ago

> This is by spec reading rather than protocol inspection; so I don't have server-server messages to hand.

I highly recommend looking at those! Relatively easy to do with curl, eg:

curl -vL -H 'Accept: application/activity+json' https://tech.lgbt/@nelson/110101971009516952

You learn a ton by seeing how the interop happens in practice, arguably more than by reading the spec.
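The same fetch can be scripted, which is handy for poking at several objects in a row. This is a minimal sketch using only the Python standard library; the helper names are made up for the example, and some servers may refuse unsigned requests, in which case this will fail just as the plain curl command would.

```python
import json
import urllib.request

AS2_TYPE = "application/activity+json"


def build_as2_request(url: str) -> urllib.request.Request:
    """Build a GET request that asks for the ActivityStreams 2 representation."""
    return urllib.request.Request(url, headers={"Accept": AS2_TYPE})


def fetch_as2(url: str, timeout: float = 10.0) -> dict:
    """Fetch and parse an AS2 object; urlopen follows redirects, like curl -L."""
    with urllib.request.urlopen(build_as2_request(url), timeout=timeout) as resp:
        return json.load(resp)
```

For example, `fetch_as2("https://tech.lgbt/@nelson/110101971009516952")` should return the post as a dict, assuming it is still online and publicly fetchable.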

> But I do see '... favorited your post' - that's a 'Like' isn't it?

Ah yes! Sorry, I should have elaborated. Mastodon absolutely does expose per-user likes, but via their proprietary API (afaik), not via AP. You could definitely petition them to change that, but not here.

> But that information is currently sent to the server that originated the Poll - correct? (Please correct me if I'm wrong) Another AP implementation that wanted to could collate it.

Both true! The sending server is where the voting user is. It has to know and send the voting user's id to perform the vote, and the receiving server (which hosts the poll) needs to know who sent the vote to validate the request signature, and maybe to de-dupe.
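To make that concrete, here is a rough sketch of the receiving side. It assumes votes arrive the way Mastodon sends them as far as I know (a Create wrapping a Note whose `name` is the chosen option and whose `inReplyTo` is the poll's id); the class and method names are invented for the example, and signature verification is left out entirely.

```python
class PollTally:
    """Hypothetical receiver-side tally that de-dupes votes by actor id."""

    def __init__(self, question_id: str, options: list):
        self.question_id = question_id
        self.counts = {name: 0 for name in options}
        self.voters = set()  # actor ids already counted

    def receive(self, activity: dict) -> bool:
        """Count one vote; return False if it is malformed or a duplicate."""
        obj = activity.get("object")
        actor = activity.get("actor")
        if activity.get("type") != "Create" or not isinstance(obj, dict):
            return False
        if obj.get("inReplyTo") != self.question_id:
            return False
        choice = obj.get("name")
        if choice not in self.counts or actor is None or actor in self.voters:
            return False
        self.voters.add(actor)
        self.counts[choice] += 1
        return True
```

The `voters` set is exactly the per-user data under discussion: the de-dupe step is what makes the hosting server hold on to individual actor ids.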

Other servers aren't likely to see the verbatim vote activity, but you're right that they might! That's due to AP's fundamental design. Very difficult to change with band-aids, as I mentioned.

> The point I was trying to make in the first paragraph was that, to the end user, a 'poll' response and posting something feel qualitatively different in the privacy they might want from them; if we can give them that extra privacy without too much pain then I think we should - someone somewhere is bound to misuse it if we don't.

Agreed! We should expect multiple someones have already been "misusing it" in bulk, more or less from the beginning of the fediverse. That horse has left the barn.

I'm definitely with you in spirit! I'm not questioning the motivation, I'm questioning the method. I don't believe AP protocol tweaks will change its fundamental "activities are public by default, broadcast them far and wide" design. If that's a core goal, we likely need to look to either an AP 2.0 or other protocols like Bluesky's AT Protocol.

(In this case specifically, the AP protocol doesn't require exposing per-object likes, per-user likes, or poll responses - Mastodon exposes likes outside AP and not poll responses at all - so protocol changes wouldn't do anything for those.)

penguin42 commented 1 year ago

> > This is by spec reading rather than protocol inspection; so I don't have server-server messages to hand.

> I highly recommend looking at those! Relatively easy to do with curl, eg:

> curl -vL -H 'Accept: application/activity+json' https://tech.lgbt/@nelson/110101971009516952

> You learn a ton by seeing how the interop happens in practice, arguably more than by reading the spec.

Yeh, I need to set myself up a couple of toy instances so I can see the comms between the instances rather than just between client and instance.

> > But I do see '... favorited your post' - that's a 'Like' isn't it?

> Ah yes! Sorry, I should have elaborated. Mastodon absolutely does expose per-user likes, but via their proprietary API (afaik), not via AP. You could definitely petition them to change that, but not here.

Oh! What does Mastodon do if it receives an AP Like for a post it made? And I assume it will send a 'Like' if someone hits the * on a post originating from a non-Mastodon AP server?

> > But that information is currently sent to the server that originated the Poll - correct? (Please correct me if I'm wrong) Another AP implementation that wanted to could collate it.

> Both true! The sending server is where the voting user is. It has to know and send the voting user's id to perform the vote, and the receiving server (which hosts the poll) needs to know who sent the vote to validate the request signature, and maybe to de-dupe.

OK, so this is where I think I disagree; I'm suggesting that it can do this purely with an aggregate from the server that the voting user is on, without an individual id. There's a requirement there that the voting user's server doesn't flip back and forth; it just sends aggregates. This would also lower the load on both servers if the users are on large instances.
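As a sketch of what I mean, the voting users' server would keep the individual votes to itself and send only totals. The `VoteSummary` type and the `totals` property are hypothetical and not part of AS2 or AP; the function name is made up too.

```python
from collections import Counter


def build_vote_summary(server_actor: str, question_id: str, local_votes: dict) -> dict:
    """Build an aggregate for one poll from {local user id: chosen option}.

    Only the per-option counts leave the server; user ids stay local.
    """
    counts = Counter(local_votes.values())
    return {
        "type": "VoteSummary",   # hypothetical activity type
        "actor": server_actor,   # the server's own actor, not a user
        "inReplyTo": question_id,
        "totals": dict(sorted(counts.items())),  # hypothetical property
    }
```

The receiving server would replace any earlier summary from the same server actor rather than adding to it, so re-sending an updated aggregate can't double-count.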

> Other servers aren't likely to see the verbatim vote activity, but you're right that they might! That's due to AP's fundamental design. Very difficult to change with band-aids, as I mentioned.

I was kind of assuming it wasn't carved into granite! I think my suggestion of using an aggregate (for questions) would work without changing any fundamental of the design. Please explain if you think it causes other problems.

> > The point I was trying to make in the first paragraph was that, to the end user, a 'poll' response and posting something feel qualitatively different in the privacy they might want from them; if we can give them that extra privacy without too much pain then I think we should - someone somewhere is bound to misuse it if we don't.

> Agreed! We should expect multiple someones have already been "misusing it" in bulk, more or less from the beginning of the fediverse. That horse has left the barn.

> I'm definitely with you in spirit! I'm not questioning the motivation, I'm questioning the method. I don't believe AP protocol tweaks will change its fundamental "activities are public by default, broadcast them far and wide" design. If that's a core goal, we likely need to look to either an AP 2.0 or other protocols like Bluesky's AT Protocol.

Thanks, I'll have a look at Bluesky's protocol - but I would appreciate an explanation of why an aggregate for question responses wouldn't work.

> (In this case specifically, the AP protocol doesn't require exposing per-object likes, per-user likes, or poll responses - Mastodon exposes likes outside AP and not poll responses at all - so protocol changes wouldn't do anything for those.)

I think what I'm saying is that if AP doesn't expect something to be exposed to a user, then it's great if we can avoid exposing it to other servers as well; a user's view of privacy is what they see in the UI - so if they see a poll that looks private, they have to think quite hard to realise they're actually exposing their views to the other server; that's what makes it different from the user's experience of posting a normal reply.

snarfed commented 1 year ago

> Oh! What does Mastodon do if it receives an AP Like for a post it made? And I assume it will send a 'Like' if someone hits the * on a post originating from a non-Mastodon AP server?

Right. The receiving Mastodon server stores it locally, increments the post's like count, serves it to users and the (non-AP) API, etc.

> I was kind of assuming it wasn't carved into granite! I think my suggestion of using an aggregate (for questions) would work without changing any fundamental of the design. Please explain if you think it causes other problems.

In a vacuum, sure! De-anonymizing would still be surprisingly easy in many cases - for example, this proposal wouldn't protect the privacy of people on very small or single-user instances, or (probabilistically) for polls posted by users who only have a small following on a given instance, which is the common case - but I definitely understand the idea, and it makes sense in theory.
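A toy illustration of the small-instance problem, under simplifying assumptions: the poll host knows (or can guess) which accounts on a given instance could have seen the poll, and receives only per-option totals from that instance. The function name and inputs are made up for the example.

```python
def exposed_voters(totals: dict, likely_voters: list) -> dict:
    """Return {voter: option} for voters whose choice the aggregate pins down.

    Simplified check: if every likely voter voted and all votes went to a
    single option, the totals reveal each individual's choice.
    """
    cast = sum(totals.values())
    chosen = [option for option, n in totals.items() if n > 0]
    if cast == len(likely_voters) and len(chosen) == 1:
        return {voter: chosen[0] for voter in likely_voters}
    return {}
```

For a single-user instance, `exposed_voters({"Admit Coridan": 1, "Deny Coridan": 0}, ["https://tiny.example/users/bob"])` names bob's choice outright, while a 5/4 split across nine accounts on a large instance returns nothing. Real inference can be probabilistic and much subtler than this exact-match case.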

Here are two key reasons it would be difficult in AP:

> I think what I'm saying is that if AP doesn't expect something to be exposed to a user, then it's great if we can avoid exposing it to other servers as well; a user's view of privacy is what they see in the UI - so if they see a poll that looks private, they have to think quite hard to realise they're actually exposing their views to the other server; that's what makes it different from the user's experience of posting a normal reply.

Yes! This is a well known and widely discussed issue in the fediverse, and it's broader than polls. New users sometimes think their posts are private to their followers, or to their instance, or to the fediverse, and don't understand that they're actually public to the entire internet. That misconception may be more common with polls, but it's definitely not unique to them.

evanp commented 1 year ago

Thanks @snarfed for the great discussion. I think we have a couple of options here:

I think it's an open question whether this guidance should be included in the Primer. I'm going to leave this issue open for now.