w3c / did-core

W3C Decentralized Identifier Specification v1.0
https://www.w3.org/TR/did-core/

Privacy Considerations for service endpoints #324

Closed · agropper closed this 3 years ago

agropper commented 4 years ago

Ahead of the PING review, and linked to our many layering discussions in SDS, our Privacy Considerations section, and 10.4 Herd Privacy in particular, might be worth reviewing.

Section 5.7 says: "One of the primary purposes of a DID document is to enable discovery of service endpoints." The Privacy Considerations section does a good job of discussing correlation risks in general but is light on the risks and mitigations related to service endpoints.

As I understand it, the relationship between pseudonymous DIDs, herd privacy, and service endpoints implies that, in all "Privacy by Design" use-cases, the number of service endpoints SHOULD be one in order to reduce the risk of correlation and enhance the effectiveness of Herd Privacy.

Requesting Parties who discover a pseudonymous DID by whatever means, especially a search of "de-identified" metadata in various directories, SHOULD be directed to some kind of consent-based access control service with a minimum loss of entropy. This is especially important for DID methods that aim to mitigate the GDPR Right to be Forgotten, and in light of the fact that GDPR, CCPA, and HIPAA all allow unlimited use of de-identified data with little enforcement against re-identification.

My suggestion would be to enhance the Privacy Considerations section with a discussion of how access control services could be proxied or otherwise "tumbled" by a herd privacy intermediary, which the DID controller or data subject could choose either independently of the access control service or not. It would then be up to the DID controller to decide whether the one service endpoint in a pseudonymous DID should be a dumb proxy mechanism like Tor or an access control service that itself provides herd privacy.
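To make the single-endpoint suggestion concrete, here is a minimal sketch in TypeScript of what such a DID document could look like; the DID value, the "AuthorizationService" type label, and the mediator URL are hypothetical placeholders, not terms defined by this specification.

```typescript
// Minimal sketch: a pseudonymous DID document exposing exactly one service,
// a consent-based access control proxy chosen by the controller.
// All identifiers, type labels, and URLs below are invented for illustration.
interface Service {
  id: string;
  type: string;
  serviceEndpoint: string;
}

interface DidDocument {
  "@context": string[];
  id: string;
  service: Service[];
}

const pseudonymousDidDocument: DidDocument = {
  "@context": ["https://www.w3.org/ns/did/v1"],
  id: "did:example:z-pseudonymous-123",
  service: [
    {
      // The one and only endpoint: resolution reveals nothing beyond
      // "ask this mediator", minimizing correlation and loss of entropy.
      id: "did:example:z-pseudonymous-123#access",
      type: "AuthorizationService", // hypothetical type label
      serviceEndpoint: "https://mediator.example.com/authorize"
    }
  ]
};
```

Whether that single endpoint is a dumb proxy or a full access control service is the controller's choice; the point is that only one endpoint is discoverable.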

kdenhartog commented 4 years ago

This is what PR #232 was attempting to address as well. I'd say the general consensus of the WG is that these concerns should be considered and addressed, but I am still a bit fuzzy about what language you'd like to add. Would you mind submitting a PR with proposed changes for this, or adding some additional language here in the comments?

agropper commented 4 years ago

The privacy considerations for DIDs are very well described in SIOP, but I am inexperienced at writing spec text to capture them in a PR by myself. I could work together with someone familiar with both SIOP and SDS to strengthen this privacy section.

OR13 commented 4 years ago

Control

Controller Scenario 1

Controller is the Subject

"I create and control my own DID"

Controller Scenario 2

Controller is fiduciary to the Subject

"Someone/thing I trust manages my DID"

Controller Scenario 3

Controller is hostile to the Subject

"Someone who I don't trust manages a DID about me"

Disclosure

Information in method-specific-id

It might be useful to encode information directly into the identifier... for example:

did:example:network:123... here network is used to distinguish between different networks that might support the example did method.

did:example:Ei123... here Ei indicates something about the encoding of the method-specific-id... (self describing).

did:example:MS123... here MS is a tag which might be used to quickly grab all the DIDs assumed to be associated with Microsoft / or to be software packages... https://github.com/decentralized-identity/ion/issues/77

did:github:OR13 in some cases the identifier might be reused intentionally, with obvious privacy / reputation issues.
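As a rough illustration of why such tags undermine herd privacy, here is a sketch in TypeScript of how trivially a crawler could bucket DIDs by prefix; the tag list simply mirrors the hypothetical examples above.

```typescript
// Sketch: bucketing DIDs by tags embedded in the method-specific-id.
// The prefixes mirror the hypothetical examples above; any observer with a
// short list of known tags can correlate DIDs with networks, encodings, or
// vendors using nothing more than string matching.
const KNOWN_TAGS = ["network:", "Ei", "MS"];

function classifyDid(did: string): string {
  // did = "did" : method-name : method-specific-id
  const methodSpecificId = did.split(":").slice(2).join(":");
  for (const tag of KNOWN_TAGS) {
    if (methodSpecificId.startsWith(tag)) {
      return tag; // correlated, with no resolution step required
    }
  }
  return "untagged";
}

console.log(classifyDid("did:example:network:123")); // "network:"
console.log(classifyDid("did:example:MS123"));       // "MS"
console.log(classifyDid("did:example:123"));         // "untagged"
```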

Information in Fragments

did:example:123#key-0 and did:example:123#fingerprint seem fine, but what about did:example:123#physical-address-front-door-key ?

Information in Services

Personal websites / blogs / social media accounts... there are some really obvious reasons why you might want to have this all be public and crawl-able...

Information in Verification Methods

Information can leak either by reusing keys that have been used elsewhere, by including keys that carry metadata (such as email addresses), or by directly encoding information in fields that are supposed to look "random".
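A hedged sketch in TypeScript contrasting an opaque verification method with one that leaks through its fragment name and reused key material; all DIDs, fragments, and key values here are invented placeholders.

```typescript
// Sketch: the same DID document shape can be privacy-preserving or leaky
// depending purely on what the controller puts in it. Values are placeholders.
interface VerificationMethod {
  id: string;
  type: string;
  controller: string;
  publicKeyMultibase: string;
}

// Opaque: the fragment says only "key 0" and the key is fresh, used nowhere else.
const opaqueKey: VerificationMethod = {
  id: "did:example:123#key-0",
  type: "Ed25519VerificationKey2020",
  controller: "did:example:123",
  publicKeyMultibase: "z6Mk...fresh-key" // placeholder for a newly generated key
};

// Leaky: the fragment describes a physical asset, and the key material is the
// same key published elsewhere (e.g. a PGP key whose user ID embeds an email
// address), letting observers correlate this DID with that other identity.
const leakyKey: VerificationMethod = {
  id: "did:example:123#physical-address-front-door-key",
  type: "Ed25519VerificationKey2020",
  controller: "did:example:123",
  publicKeyMultibase: "z6Mk...reused-key" // placeholder for a key reused elsewhere
};
```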

Summary

Depending on how Control and Disclosure are combined, different properties emerge... it's worth noting that the bad guys are not going to read the normative statements or privacy recommendations sections... As we've seen with social network botnets, they will simply appropriate high-volume information channels for lulz and profit.

Example:

As a software developer, I publish packages with DIDs... did:example:<npm>123... and the service endpoint is used to link to content in IPFS... here npm could be any set of bytes that is used, by convention, to convey "software package"...

As a botnet developer, I publish command and control configuration with DIDs... did:example:<npm>003... I mine for identifiers that are slightly longer than known tags, creating my own tag, and ensuring that my DID documents are indexed as software packages... I then use the service endpoint section to host encrypted connection and authentication material... network observers will believe that any software resolving these DIDs is "looking for software packages"... and I might even be able to make use of public indexes like search engines or other "package lookup services".

Here is an example of some old malware that did something very similar:

Why was MiniDuke using Twitter for C2?

Because reading public information seems like honest network traffic...

Does MiniDuke get any benefit from using Twitter over MySpace?

Yes, two benefits... hiding in a larger crowd (sorry MySpace), and making use of faster infrastructure... Twitter can't provide really fast answers to "give me all the cat pictures" without also providing the attacker with "give me all my compromised hosts"... the index is a neutral thing, but we live in a world where people hide malware and C2 in cat pictures.

Recommendations

As a WG we can provide guidance on embedding information in did documents (and we do)...

IMO, don't do it... keep PII / encrypted PII / security through obscurity out of the DID document... seriously, don't do it... but we can't stop you.

As a WG we can provide guidance on whether we think indexing should be promoted in the verifiable data registry... this would allow for valuable features for both the "good" and "bad" guys.

IMO, recommend making it expensive to build indexes of decentralized infrastructure... especially if there is no trust in the index... prefer systems that are designed not to be indexed (did:peer / Monero / Tor / I2P) over those that are (Bitcoin).

In computing, linked data (often capitalized as Linked Data) is structured data which is interlinked with other data so it becomes more useful through semantic queries. It builds upon standard Web technologies such as HTTP, RDF and URIs, but rather than using them to serve web pages only for human readers, it extends them to share information in a way that can be read automatically by computers. Part of the vision of linked data is for the Internet to become a global database.

Google and Microsoft make money off the fact that the internet is filled with "untrustworthy" data, and only with massive compute and machine learning can they distill the noise into "truth"... inviting readers of this specification to populate the decentralized web with a similar level of noise is inviting the same computational attack and information asymmetry.

IMO, we should recommend against including in DID documents any unauthenticated semantic information that is not related to cryptographic self-certification, authentication, or authorization.

If you want to build an index of decentralized content, get consent first.

OR13 commented 4 years ago

@talltree I would love to hear your thoughts on how Privacy and Governance are related.

agropper commented 4 years ago

@OR13 says:

If you want to build an index of decentralized content, get consent first.

and frames the issue in terms of information asymmetry. +1

It's up to us in this WG, however we decide to handle the rubrics of decentralization, to make it clear that asking the subject for authorization to index is the same as asking for authorization to access. Let Moore's law, intersecting with the value of high-quality personal information, take care of the decreasing relative cost of this added friction.

agropper commented 4 years ago

A survey for everyone, please: https://docs.google.com/forms/d/e/1FAIpQLSc8Z8FklORke1iPRoyo90GNWqqXkmdbgQLNvHvU-v4XvLxO0A/viewform?usp=sf_link

Also, a draft PING document is now available and will be put on the agenda for broader discussion soon.

talltree commented 4 years ago

@talltree I would love to hear your thoughts on how Privacy and Governance are related.

@OR13 Whew, that is a seriously deep topic. I share your concern about publicly writable verifiable data registries (VDRs) and the potential that they could be: a) overwhelmed with spam DID documents, or b) suffer "GDPR attacks" by having DID documents with personal data written to them that can then mire the VDRs in GDPR erasure requests. (For a deep discussion of the latter, see the Sovrin Foundation white paper on Data Privacy Regulation and Distributed Ledger Technology.)

I wish I had a magic wand to wave over these problems, but they are very real and to the best of my knowledge there is no magic bullet. So I think we just have to point out these challenges in our Privacy and Security Considerations sections.

On the 2020-07-28 DID WG call, we also discussed the possibility of the WG authoring a separate note about these specific concerns, since they may be fundamental to broad adoption of DID infrastructure. Provided we do that after we go to CR, I'm willing to volunteer to help with that paper.

kdenhartog commented 4 years ago

There was extensive discussion on the DID WG public mailing list on this topic as well. It seems there's additional language that we can add at this point. Last I remember, @agropper was looking for additional language above and beyond #232 and would be willing to work with others to draft that language, but was hoping someone else could step in and help with submitting the PR to make the edit or to write the note as mentioned in the comment right above this.

agropper commented 4 years ago

@talltree I took a scan of the Sovrin Foundation paper. It seems intended as a compliance aid rather than a rubric for decentralization. Nothing wrong with that, but it makes privacy analysis difficult and generally as ethically speculative as the pablum of "Privacy by Design".

I hope we're designing DID Core not just for compliance with today's GDPR / PIPEDA / CCPA but for a world where individuals have a less asymmetric relationship with our benevolent platform operators. One way to do that is to design DID Core for Privacy by Default.

One strategy for achieving Privacy by Default is to keep all data worth "registering" behind a mediator and/or an authorization server. I understand that this strict definition of DID Core might be unacceptable to operators of private DLTs and some federations. The question then becomes: how do we define interoperability (across methods and federations), and can we word DID Core to achieve both Privacy by Default and satisfy the full range of methods and federations?
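For what it's worth, here is a sketch in TypeScript of the requesting-party side of that strategy; the resolver stub, the "AuthorizationService" type label, and the endpoint URL are assumptions for illustration only, not an API defined by DID Core.

```typescript
// Sketch of "Privacy by Default" from the requesting party's point of view:
// resolving the DID yields nothing but an authorization endpoint, and every
// request for subject data has to state a purpose and obtain consent there.
// The resolver stub and service type below are invented for illustration.
type Service = { id: string; type: string; serviceEndpoint: string };
type DidDocument = { id: string; service?: Service[] };

// Stub standing in for a real DID resolver.
async function resolveDid(did: string): Promise<DidDocument> {
  return {
    id: did,
    service: [
      {
        id: `${did}#access`,
        type: "AuthorizationService", // hypothetical type label
        serviceEndpoint: "https://mediator.example.com/authorize"
      }
    ]
  };
}

async function fetchSubjectData(did: string, purpose: string): Promise<Response> {
  const doc = await resolveDid(did);
  const auth = doc.service?.find((s) => s.type === "AuthorizationService");
  if (!auth) {
    throw new Error("No authorization endpoint published for this DID");
  }
  // The DID document disclosed nothing else; consent is negotiated at the
  // mediator / authorization server before any personal data is released.
  return fetch(auth.serviceEndpoint, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ did, purpose })
  });
}
```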

agropper commented 4 years ago

See also #370 and #382

msporny commented 3 years ago

We're getting ready to move into CR... which means this issue needs to result in proposed spec text and a PR very soon now, or it'll be deferred until a later specification. Someone needs to take an action to write a PR.

jandrieu commented 3 years ago

I'll get some text proposed shortly.

msporny commented 3 years ago

PR #616 has been raised to address this issue. This issue will be closed once PR #616 has been merged.

msporny commented 3 years ago

PR #616 has been merged, closing.