This would boil down to the vc-di-bbs cryptosuite electing one of the two ciphersuites defined in the BBS spec and explaining how the "Map to Scalar as Hash" procedure is used.
I believe all implementations thus far support both, so I don't think it matters which one. If the cryptosuite doesn't make a choice, it would need to introduce an additional parameter in the proof indicating which BBS ciphersuite is used.
Relevant issue https://github.com/decentralized-identity/bbs-signature/issues/278
@tmarkovski no, this applies to how to get RDF messages to the BBS algorithm, for example
https://github.com/transmute-industries/vc-di-sd/blob/main/src/di-sd/urdna-2015/canonize.ts
It's how to take an RDF document and get a list of n-quads that have had their subjects blinded with an HMAC.
It's currently not defined in the spec, but obviously people have ways to do it... none of them are "standard".
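For illustration only, here is a minimal sketch (plain JavaScript, not the linked implementation and not anything defined in a spec) of the kind of HMAC blinding being described: replace the canonical blank node labels in a list of canonicalized N-Quads with HMAC-derived labels. The function name, key handling, and regex here are assumptions for the example.

import crypto from 'node:crypto';

function hmacBlindBlankNodes(nquads, hmacKey) {
  // derive a random-looking label from a canonical blank node label
  const hmacLabel = label =>
    '_:u' + crypto.createHmac('sha256', hmacKey).update(label).digest('base64url');
  // replace every `_:c14nN` label produced by canonicalization with its HMAC'd counterpart
  return nquads.map(q => q.replace(/_:(c14n[0-9]+)/g, (_, label) => hmacLabel(label)));
}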
Once PR https://github.com/w3c/vc-di-ecdsa/pull/19 is merged, this can be addressed by adding text that calls appropriate functions from it.
@dlongley can you describe which functions from https://w3c.github.io/vc-di-ecdsa/#selective-disclosure-functions would be used and what their privacy trade-offs would be?
Example n-quads would go a long way as well.
@brianorwhatever,
There's a possibility that @Wind4Greg can help out with examples or more information here from a draft implementation he's been experimenting with. Greg -- is there something you can share around this?
Hi all, I've generated test vectors for ECDSA-SD and you can find the code here. As part of this effort I quickly modified the procedures for use with BBS; see the sub-directory on BBS. I used my own JavaScript-only BBS implementation based on the noble/curves library (BLS12-381).
Now there are a number of places where "hashing" is used:
My concern is only with item 4 when it comes to unlinkability, but that is only in how it's currently done for ECDSA-SD. In the current implementation an HMAC (keyed PRF) is performed on each blank node id to produce a new blank node id that is very random looking. The HMAC key is shared between the issuer and the holder so the holder can recreate these new blank node ids. The holder then comes up with a derived document based on selectively disclosing portions of the original document. The holder then sends the "signed derived document" to the verifier. To verify the signature the verifier needs to know the "randomized blank node ids" that correspond to blank nodes in the derived document. These are very unique and hence linkable across verifiers.
However, this seems like it can be readily changed. @dlongley what do you think? Note that unlinkability is something that needs to be considered from the highest layers on down. At the lowest layer BBS can handle very long lists of messages but when going from the holder to the verifier the list of disclosed message indexes must be sent which is something that could be used to track (fingerprint). We (BBS folks) will be adding more guidance in the IETF document.
Hope this helps. Happy to see BBS and VCs coming together.
Yes, we mentioned the desire for a more unlinkable label mapping function (vs. a less data leakage version, i.e., you have to choose) for BBS. The base subfunctions / primitives are designed to allow a different label mapping function to be swapped in. We can add a very simple one for BBS that does the same thing as the one used with ECDSA-SD, but instead of using the HMAC results directly, it sorts and assigns labels using a short prefix and small counter.
So, instead of labels like this: u<big random HMAC output>, we get this: b<small int>, e.g.:
_:u3Lv2QpFgo-YAegc1cQQKWJFW2sEjQF6FfuZ0VEoMKHg => _:b0
_:u4YIOZn1MHES1Z4Ij2hWZG3R4dEYBqg5fHTyDEvYhC38 => _:b1
Thanks @dlongley. So using the full long HMAC outputs would prevent information leakage by completely obscuring the number of blank node ids generated? That is, not leak the total number of blank node ids. While in the BBS case we want a very restricted range to prevent linking/fingerprinting. In the BBS case the indexes of the selected "BBS messages" (non-mandatory n-quads) must be sent from holder to verifier anyway, so the verifiers will have some rough idea of the number of "BBS messages".
So I need to come up with a replacement for createHmacIdLabelMapFunction, or do you already have one in mind? I like the HMAC (PRF) method for randomizing order, so I could keep all that stuff and just add a counter/mapping piece...
@Wind4Greg,
I was thinking the new function would be something like createShuffledIdLabelMapFunction (name can be bikeshedded) and it would take an hmac API just like createHmacIdLabelMapFunction. It would return a function that would call createHmacIdLabelMapFunction internally, sort the resulting map, and relabel using a simple prefix and a counter instead.
Pseudocode, something like this:
function createShuffledIdLabelMapFunction(hmac) {
  return function(canonicalIdMap) {
    // get HMAC ID map (canonical label => HMAC'd label)
    const hmacIdMap = createHmacIdLabelMapFunction(hmac)(canonicalIdMap);
    // reverse map: HMAC'd label => canonical label
    const reversedMap = new Map([...hmacIdMap].map(([label, hmacLabel]) => [hmacLabel, label]));
    // sort the HMAC'd labels
    const sortedKeys = [...reversedMap.keys()].sort();
    // output new map using `'b' + counter` labels in order of the sorted HMAC'd labels
    const bnodeIdMap = new Map();
    let counter = 0;
    for(const key of sortedKeys) {
      bnodeIdMap.set(reversedMap.get(key), 'b' + counter++);
    }
    return bnodeIdMap;
  };
}
Thanks @dlongley, will give this a try with my BBS-related test vector code.
Well that went well! I needed to modify the verify-label-map/compress-label-map stuff in the derive and derive-verify steps in very simple ways, but now the "disclosure data" from my test vectors example is clear of easily linkable information. See below. The bbsProof is guaranteed unlinkable. The adjSelectiveIndexes are required by BBS. This leaves the mandatoryIndexes and the labelMap as potential indirect points of linking/fingerprinting.
{
"bbsProof": "869b8a61b7b6bede36ce69d8c78401e6f5556ce53876d7fbae204da1f6e3ea31a3b15c1aef27e95ab081637cf7ee1778a35f1f94dafc4d6634a39ab7b7369eea24f24d26cc0e0c06825b30152eebb58f91fb6d0c72bb595018ad832c882c9a73333421fcd40a7ba332ffe3376964e52b6d3b71c4ad272f86a994dfbb478775a559938b79c3a420e3a9d1f054bfdbd1c2bc59daf85fa31514c3db371f97f9bf6270596a3b8f9df780712b8fee9621b8087017d66423406293515eb880eadfbf6911c7444eb53f003135cbf3a6ad4d560e969ea09118429e37cf0fe80477978d202e4f066f2a89f2ffaca2615c7ee3142ae729d6b42bc94217dc2800543261d6c169e03b8562030e4ad8b86d26a01eda0df35bfb97e5f07c48f796fd552933e9a66d3ac9d7cef976e834f7184499d74fe01d13e7ee0eb93b67823052d4c53df10054939641269ba827dc31f33bdd6ae197b9039a308fbb91d4213fd3120b3cd43201998fb0211a6bebd2e3d967c8746bc1ff76cf091ac328250f436dcd915a3d6a16874ac1cbf28b9cba4858d2d7d9a13a2deec5770c8351ef081d7226a3ac274b1656983fa22e882e95130405506a913c5e65c7e5d8f5e9b80fcbdd9e991c8c36",
"labelMap": {
"dataType": "Map",
"value": [ ["c14n0", "b1"], ["c14n1", "b5" ], ["c14n2", "b3" ], ["c14n3", "b6" ], ["c14n4", "b2"] ]
},
"mandatoryIndexes": [2, 3, 4, 5, 9, 10, 12, 13, 14, 15, 16, 17],
"adjSelectiveIndexes": [3, 4, 5, 6, 7, 11 ]
}
@dlongley every time the holder produces a new presentation they get a new independent BBS proof value. Could they also reorder their JSON-LD derived document to produce a different labelMap and indexes? Just thinking out loud... Would this really matter if verifiers collude to share the revealed data too?
It turns out my example VC/VP isn't a terribly good one for unlinkability (windsurf racing scenario), as it requires a highly visible unique identifier: a windsurfer's sail number has to be registered and visible across the water. In addition, long lists of equipment used and such would tend to produce rather unique artifacts in the selective index list.
@Wind4Greg isn't the label map unnecessary, since the HMAC + sorting + indexing to rename the blank nodes can be done by the issuer as well as the holder to ensure the blank node names are applied consistently?
@Wind4Greg,
@dlongley every time the holder produces a new presentation they get a new independent BBS proof value. Could they also reorder their JSON-LD derived document to produce a different labelMap and indexes? Just thinking out loud... Would this really matter if verifiers collude to share the revealed data too?
In ECDSA-SD, the indexes are relative to what is revealed so it changes per reveal in that way, but my understanding is that BBS requires the individualized messages to be mapped to some specific indexes using a BBS-specific message structure, i.e., meaning I don't think this stuff could change there. Is that right? I don't think we can do better than what is required by the low-level BBS primitives.
@brianorwhatever,
@Wind4Greg isn't the label map unnecessary, since the HMAC + sorting + indexing to rename the blank nodes can be done by the issuer as well as the holder to ensure the blank node names are applied consistently?
The holder does recompute those things, but they have to send them (or a "reveal-version" of them) to the verifier as the verifier does not (and should not) have access to the HMAC or all of the other data required to produce the proper mapping/indexes.
Hi all, good conversation. I've been thinking about two general issues and trying to come up with good definitions of them:
1. Different schemes may start at different baselines with respect to information disclosed. For example, ECDSA-SD will never leak the number of statements (n-quads). One of the ways this could have happened is via blank node ids, but these are randomized via an HMAC (PRF). In BBS the verifier knows the number of "BBS messages" (n-quads) in order to verify the signature (they get the disclosed messages and commitments for the undisclosed messages). One can add dummy "BBS messages" to obscure this if it is a problem for a specific application.
2. BBS proofs (what the verifier receives) are unlinkable "artifacts"; each is guaranteed to be independent of the others when BBS is correctly implemented (this presented a challenge for BBS proof test vectors). However, other artifacts, or the information disclosed, can completely link or partially link (correlate) a holder across verifiers. For example, due to their strong cryptographic properties, when properly used, EdDSA and ECDSA signatures are just like UUIDs (universally unique identifiers), and similarly for HMAC outputs. Hence why we want to scrub any traces of them going into BBS.
Now here is where things seem to get a bit tricky to me and where we may need to provide guidance to issuers. The ability of online entities to bring together information to "fingerprint" an online visitor is well known and quite powerful; for an overview see Cover Your Tracks. If the issuer runs an HMAC to randomize the order of the n-quads (statements), it seems that this somewhat unique reordering gets reflected in the mandatory index list and the selective disclosure index list (required by BBS). Intuitively, the longer these lists, the more this would come through. Even without HMAC uniqueness, a long selective disclosure index list can act as a fingerprint.
@dlongley there was a discussion of info leakage via blank node identifiers before I was active with verifiable credentials. Can you point us to that and help us understand if/when we would or would not want to run the HMAC randomization in the BBS case?
@brianorwhatever I thought the label map was required between holder and verifier even without HMAC randomization. Maybe it could be combined with the selectively disclosed indexes. I'll take a look.
@Wind4Greg,
there was a discussion of info leakage via blank node identifiers before I was active with verifiable credentials. Can you point us to that and help us understand if/when we would or would not want to run the HMAC randomization in the BBS case?
There are other issues in this repo that have some discussion on that -- however, the HMAC is necessary to decouple the generation of blank node identifiers from the information used to generate them (some of which may not be selectively disclosed during reveal), i.e., it's always necessary. In the case of BBS, we just want an additional step to add unlinkability by using a smaller value space (a small counter) that is much more likely to be shared amongst many users than the "large" HMAC digest is.
I thought the label map was required between holder and verifier even without HMAC randomization.
Yes, because the identifiers for the blank nodes are first generated from all of the data itself (some of which may not be revealed) to ensure that every implementation starts from the same identifiers. Then, they are reshuffled by an HMAC to remove this linkage to the data. These two processes ensure, first, the same base identifiers will be used by any implementation and then, second, these identifiers are transformed in a confidential way (not revealed to the verifier) such that they are unlinked from the data. For BBS, a third process maps the HMAC'd labels to a smaller value space to add unlinkability prior to revealing to the verifier.
Regardless, some kind of mapping must be given to the verifier for this -- to map the identifiers they produce by running a canonicalization algorithm on only the selectively revealed data to what the identifiers would be had all of the data been input into the algorithm.
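To make that concrete, here is a minimal, hypothetical sketch (not from the spec) of how a verifier might apply such a mapping, assuming the map is keyed by the canonical labels the verifier computes from the partially disclosed document:

function relabelQuads(nquads, labelMap) {
  // labelMap: Map of canonical label (e.g. 'c14n0') => signed label (e.g. 'b1')
  return nquads.map(q =>
    q.replace(/_:(c14n[0-9]+)/g, (match, label) =>
      labelMap.has(label) ? `_:${labelMap.get(label)}` : match));
}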
@dlongley I believe that means only the blank nodes relating to the currently disclosed claims would need to be included in the labelMap that is sent to the verifier?
I think I was missing something in my mental map, and I think I might have closed this loop. It wasn't clear to me why the map was required to be sent to the verifier, as I thought they would be able to verify based on receiving just the new labels. But I guess what I was missing is that the signature is on the claims while they are still in their blank node form, NOT on the newly labeled version. So the verifier needs to know how to reverse the process. Is that correct?
If my above reasoning is correct, why can't the issuer just sign on top of the new labels, so that the process would simply be to verify on the verifier side?
@brianorwhatever,
@dlongley I believe that means only the blank nodes relating to the currently disclosed claims would need to be included in the labelMap that is sent to the verifier?
Yes, that's right.
I think I was missing something in my mental map, and I think I might have closed this loop. It wasn't clear to me why the map was required to be sent to the verifier, as I thought they would be able to verify based on receiving just the new labels. But I guess what I was missing is that the signature is on the claims while they are still in their blank node form, NOT on the newly labeled version. So the verifier needs to know how to reverse the process. Is that correct?
Let me know if this matches what you're saying above or helps clarify:
The verifier will receive some JSON-LD that is a partial disclosure of the original data. They will run this through canonicalization which will produce some blank node labels (if there are any blank nodes) based on only the data they have, which will not be all of the data since it's a partial disclosure. This can result in blank node labels that are different from what would be produced if they had all of the data. Since the data was signed using labels that were generated starting with all of the data, the verifier will need to know what the labels should be so they can verify the signature (or the zero-knowledge proof that such a signature exists in the case of BBS).
Therefore, the holder also canonicalizes the partial disclosure (the same operation the verifier will perform), but also produces a selective mapping of the labels the verifier will produce back to the signed labels. The holder transmits this selective mapping along with the partial disclosure, enabling the verifier to map the labels they produce to the signed ones -- and to verify the signature (zkp of the signature).
There's another little bit that modifies the canonical labels to reduce data leakage and then, in the case of BBS, add unlinkability, but that's an orthogonal problem to ensuring the verifier can produce the proper labels for the blank nodes that are revealed to them.
@brianorwhatever,
If my above reasoning is correct, why can't the issuer just sign on top of the new labels, so that the process would simply be to verify on the verifier side?
The verifier receives JSON-LD (or potentially even some other syntax in the future) and they need to know what to map to what. To do what you're suggesting above would require the issuer to send over the N-Quads syntax directly, locking that syntax in and changing the data format for the purpose of security, i.e., this approach eliminates transforms to simplify the crypto lib parts, but adds complexity everywhere else in the ecosystem.
This is the tradeoff data integrity makes -- it decouples VCs from the data structures they contain (including security proofs) such that they will be "more flexible in the face of future evolution" and decentralization (e.g., acceptance of different proofs by different parties or decentralized and distributed ecosystem upgrades).
The alternative is to preference the security layer and security developers over other layers and other (greater in number) developers for simplicity by restricting security to a "transport envelope". However, each such envelope that is invented is a new format that must be understood by consumers, including consumers that do not have the means nor need to verify the security. This approach is more ideally suited to scenarios where the identity of the signer is known by the software, there is a specific target for the signed message, and the envelope can be discarded once checked. VCs have core use cases where the target is not known, VCs are persistent or long-lived, and where digital wallets may not know the signer's identity or may not even be able to understand or process every possible security envelope now or into the future -- yet they must enable indexing and processing of the data.
Yep, this closes the loop for me, thanks @dlongley. It's unfortunate, but I don't think it overly complicates the proof, and it allows the content to be JSON like the rest of the data integrity suites.
Hi all, I've still got some concerns about the HMAC and am trying to quantify things (in a worst-case sense) a bit. In trying to analyze "linkability" I like the approach of Cover Your Tracks in assessing the uniqueness of browser fingerprints. In our case it is the uniqueness of the artifacts that go with a BBS proof and credential.
For BBS what is leaked or disclosed by the algorithm is: n (number of messages) and selected indexes. How unique? Issuer can pad out n to prevent this from being as unique. The holder chooses selected indexes. If they choose to reveal k messages there are, worst case, C(n, k) possible choices so a particular choice would represent a 1 out of C(n,k) uniqueness.
For example, in the windsurf racing scenario we had 14 non-mandatory messages (n-quads), so n = 14. This gives us the following C(14, i) as i ranges from 1 to 13:
[14, 91, 364, 1001, 2002, 3003, 3432, 3003, 2002, 1001, 364, 91, 14]
This leads to simple rules to reduce uniqueness: keep n small, and either reveal a little (small k) or a lot (k close to n). This is a fundamental limit of BBS; see BBS issue 256.
When we apply the HMAC to the blank node ids we are producing a permutation of the BN blank node ids, which in the worst case is 1 out of BN!. In my windsurf racing example there were only 6 blank nodes, so we have 1-in-720 uniqueness. Note the permutation isn't directly revealed but shows up in three exposed artifacts: (a) mandatory indexes, (b) label map, (c) selective indexes.
Because permutations could lead to such extreme uniqueness (factorial growth), I wanted to understand clearly what information might get leaked if we don't reshuffle via HMAC, and what alternatives could be used to prevent this, such as the issuer assigning node ids instead of leaving nodes blank. @dlongley does this make more sense as to why I'm so concerned about the HMAC of the node ids?
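For anyone following along, a quick arithmetic check of the figures quoted above (plain JavaScript, just to verify the numbers; nothing here is part of any spec):

// binomial coefficient C(n, k)
const binom = (n, k) => {
  let r = 1;
  for (let i = 1; i <= k; i++) r = (r * (n - i + 1)) / i;
  return r;
};
const factorial = n => (n <= 1 ? 1 : n * factorial(n - 1));

console.log(Array.from({ length: 13 }, (_, i) => binom(14, i + 1)));
// => [14, 91, 364, 1001, 2002, 3003, 3432, 3003, 2002, 1001, 364, 91, 14]
console.log(factorial(6)); // => 720 possible orderings for 6 blank nodes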
It seems to me that having good informative text explaining how to achieve "unlinkability" and the limits of the "unlinkability" is key to a successful spec and enables cryptographic review. At the last BBS WG call the importance of adding helpful informative text to that draft was noted too.
@Wind4Greg,
For BBS what is leaked or disclosed by the algorithm is: n (number of messages) and selected indexes. How unique? Issuer can pad out n to prevent this from being as unique.
I think the VC-DI-BBS cryptosuite should take this as an input parameter to ensure that there are N-many BBS "messages", automatically padding up to whatever number is provided and throwing an error if there are less than that number. This should be as easy as possible for issuers to use; for a given credential type, they should be setting an upper bound and using the same number for every VC (of that type) signed. They should not have to get involved in the BBS details in any way. We must allow for implementers to build APIs that will make this very easy for issuers, with as few dials as possible, otherwise we should expect failure (in a variety of ways).
If it's reasonable, some default upper bound could also be recommended by VC-DI-BBS if it doesn't create additional issues. If a VC doesn't fit in those bounds, an error or warning could be raised indicating that unlinkability is likely not possible.
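Purely as a sketch of the padding behavior described above (padMessages, targetCount, and padValue are placeholders, not anything defined in the cryptosuite):

function padMessages(messages, targetCount, padValue = '') {
  if (messages.length > targetCount) {
    // more real messages than the configured upper bound
    throw new Error(`Too many BBS messages: ${messages.length} > ${targetCount}`);
  }
  // append dummy messages onto the end so the verifier only learns targetCount
  return messages.concat(new Array(targetCount - messages.length).fill(padValue));
}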
The holder chooses selected indexes. If they choose to reveal k messages there are, worst case, C(n, k) possible choices so a particular choice would represent a 1 out of C(n,k) uniqueness.
Is there any problem with what I proposed above -- and where the "automatically generated padding messages" are just added onto the end? Do they need to be shuffled in some way that would then require more information to be transmitted to the holder about which to use? Ideally not.
When we apply the HMAC to the blank node ids we are producing a permutation of the BN blank node ids, which in the worst case is 1 out of BN!. In my windsurf racing example there were only 6 blank nodes, so we have 1-in-720 uniqueness. Note the permutation isn't directly revealed but shows up in three exposed artifacts: (a) mandatory indexes, (b) label map, (c) selective indexes.
Because permutations could lead to such extreme uniqueness (factorial growth), I wanted to understand clearly what information might get leaked if we don't reshuffle via HMAC, and what alternatives could be used to prevent this, such as the issuer assigning node ids instead of leaving nodes blank. @dlongley does this make more sense as to why I'm so concerned about the HMAC of the node ids?
Yes, but I think the way to think about it is: What is the "total herd size"? The greater the number of blank nodes, the greater the herd size (expected number of individuals to receive a credential of a particular type) that is required -- and it grows factorially as you mentioned.
The alternative to the HMAC approach that has been discussed is to use a template with "dummy values" that will be the same for all VCs of a certain type to produce a particular blank node ordering. Then the dummy values would be replaced with real ones, followed by recanonicalization to produce the necessary label mapping. However, this has the disadvantage of maximizing data leakage: e.g., if you reveal blank node label _:b4, then that necessarily means you have three children, or three bank accounts, or you are one of three people who completed a particular set of courses at some university in the year 2000.
This same problem already happens with the BBS low-level primitives themselves (regardless of blank nodes), i.e., if the message index order is fixed for all users. Similarly, if there are too many possible ways particular BBS messages could be ordered you create linkability problems, again, regardless of blank nodes. So it's better for there to be N-many possible orderings such that N will sufficiently hide a particular member within the expected herd size.
It seems that the HMAC approach, coupled with keeping the number of blank nodes low (and keeping the possible number of orderings of BBS messages low), is the most likely to preserve unlinkability and minimize data leakage. Additionally, if you're revealing enough statements to push any limits here, you're probably already linked by the revealed information itself. In other words, unlinkability is quite "fragile", as you've mentioned before.
It seems to me that having good informative text explaining how to achieve "unlinkability" and the limits of the "unlinkability" is key to a successful spec and enables cryptographic review. At the last BBS WG call the importance of adding helpful informative text to that draft was noted too.
+1
It may be that APIs should be designed to take in a predicted "herd size" from the issuer and then the implementation can compute, based on the VC itself, whether that herd size is sufficiently large to protect its members when they are trying to reveal small sets of statements (throwing an error or warning if not). I imagine a utility function could also indicate the required herd size from a given VC.
@dlongley said:
Yes, but I think the way to think about it is: What is the "total herd size"? The greater the number of blank nodes, the greater the herd size (expected number of individuals to receive a credential of a particular type) that is required -- and it grows factorially as you mentioned.
I don't see the "herd size" being related to the number of blank nodes but to the number of individuals with credentials of a particular type that visit a verifier who may be colluding with other verifiers or the issuer. In my windsurf racing example used for the ECDSA-SD test vectors there was a four-element array of windsurf sail objects and a two-element array of board objects, which led to the 6 blank node ids.
The reality of the windsurfing case in my garage would be four boards (two foil, 1 slalom, 1 wave) and 10 sails, leading to 14 blank nodes for BN! = 8.71e10, i.e., the reordering of blank nodes is completely unique and seems like it can leak through, leading to complete fingerprinting. Hence my concern about the shuffling if it's not necessary.
Another thing to consider is how disclosure impacts linkability, or disassociability (@Wind4Greg commented that NIST uses this term in https://nvlpubs.nist.gov/nistpubs/ir/2017/NIST.IR.8062.pdf in another channel).
SD-JWT treats fields as mandatory to disclose unless they are annotated. DataIntegrityProofs treat fields as optional to disclose unless they are annotated.
These postures also interact with normative statements; for example, you would expect to see /issuer always disclosed based on normative statements.
You might expect validFrom to always be optional to disclose based on normative statements.
If your validFrom includes milliseconds and it is mandatory, that can erode unlinkability.
If your credential includes an @id, that can erode unlinkability.
Hi @OR13 and all, the NIST document NIST IR 8062 defines the concept of disassociability and mentions unlinkability as a roughly equivalent term. Some BBS folks met with some NIST folks and they pointed us to a paper that analyzes two identification systems from the privacy perspective, emphasizing unlinkability in the way we've been using the term. Neither of these talks about "erosion" of unlinkability due to artifacts.
Hence why I still like the approach that the (anti-)browser-fingerprinting folks take, where for each "artifact" exposed they assign a "uniqueness" rating (factor) and, as a worst-case analysis, we multiply all those results together and compare against the "herd" size, as discussed in Verifiable Credentials Status List, to assess whether there is meaningful unlinkability. We have artifacts revealed to the verifier at each layer:
We should be able to provide meaningful advice or numbers for layers 1 and 2 and cautions for layer 3 if we are to have a good specification. Other opinions on this methodology?
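As a rough sketch of that worst-case methodology (placeholder names, and only the factors already discussed in this thread -- not a vetted analysis):

// each exposed artifact gets a "uniqueness" factor: the number of equally
// likely values it could have taken (e.g. C(n, k) for the selective indexes,
// BN! for a blank node permutation)
function meaningfullyUnlinkable(artifactFactors, herdSize) {
  const combinedUniqueness = artifactFactors.reduce((a, b) => a * b, 1);
  // worst case: expected number of herd members sharing this exact combination of artifacts
  return herdSize / combinedUniqueness > 1;
}

// e.g. revealing 6 of 14 messages (C(14, 6) = 3003) with a 6-blank-node permutation (6! = 720)
// would, by this rough measure, call for a herd on the order of 3003 * 720 ≈ 2.2 million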
@Wind4Greg,
The reality of the windsurfing case in my garage would be four boards (two foil, 1 slalom, 1 wave) and 10 sails leading to 14 blank nodes for BN! = 8.71e10...
I believe what matters is the number of blank nodes you reveal. In other words, unlinkability decreases as you reveal more blank nodes. It's revealing the total ordering (or a sufficiently large subset of it) that creates linkability. If you only reveal 1 blank node out of 100, the verifier has no idea what your total ordering is, despite the fact that it would be very unique.
Notably, the more blank nodes you reveal, the more other information you also reveal -- such that you're likely to lose unlinkability because of that other information anyway.
In your scenario, if you reveal just one statement about a particular board and it includes blank node label b1... then there are up to (N-1)! possible combinations for your other blank node labels -- and the verifier does not know which one you have. Every other VC issued with that same board has a 1/N chance of being associated with that same blank node label.
Note that if the blank node labels are instead kept constant for all users, then revealing blank node label b7 would always reveal that you have >= 7 boards (for example). It's true that the number of statements is always revealed with BBS, but as discussed above, this could be obscured somewhat by having the cryptosuite layer pad this. You could similarly pad the blank node IDs and shuffle them around (but in the same way for every VC) -- but a verifier would understand that the blank node labels are fixed and that if they see label X, then it carries additional meaning beyond what is in the statement itself. It seems that more individualized shuffling can provide a decent mixture of unlinkability and data leakage minimization.
The total possible orderings could also be reduced with a more sophisticated sorting algorithm, such that once the blank node labels have been HMAC'd, the N-Quads in which they appear could be sorted ignoring data values and blank node labels -- and whatever ordering is produced by the issuer could then be used to relabel the blank nodes. These labels would have to be included in a label map transmitted in the base proof. The HMAC would not need to be transmitted as it would only be used as an intermediate step, and a total ordering (based on a reduction of total possible orderings) would be "picked" by the issuer. However, this brings in more of the data leakage problem I just described above.
Again, I think unlinkability and data leakage are in opposition to one another -- you have to choose a bit more of one or the other. It seems that the unlinkability use cases generally call for revealing very little data -- which implies that very few blank nodes would also be revealed. This means that if the total ordering for a given user is independent of the data in the VC, minimal data should be leaked -- and the total ordering should similarly not be known to the verifier.
Great points @dlongley, now everyone knows how many boards and sails I have, or at least the sum of the two if I use a simple reshuffling. But that will be changing since orders for 2024 are due soon ;-) Let's start with an example that I think I've made fairly unlinkable with respect to lower-level artifacts. Here is an id card for a tree in North America:
{
"@context": [
"https://www.w3.org/ns/credentials/v2",
{"@vocab": "https://windsurf.grotto-networking.com/selective#"}
],
"type": [
"VerifiableCredential"
],
"FirstName": "Sequoia",
"LastName": "Sempervirens",
"Coordinates": "(41.795621, -124.106312)",
"Park": "Jedediah Smith Redwoods State Park",
"County": "Del Norte",
"State": "California",
"DoB": "1200/03/21",
"Over200Yrs": true,
"Height": "296ft",
"Eyes": "None",
"Hair": "Brown bark, green needles"
}
Mandatory reveal is ["/State"] for some reason; to get into a bar (age and water rationing) they need to selectively reveal ["/Over200Yrs", "/County"]. If we run this through the BBS/SD-Primitives (see the BBS dir of my ECDSA-SD-TestVector repo) we get the following "disclosure data" (prior to serialization):
{
"bbsProof": "8a0657903cea7e282c796abc3febc8b5db8657f077260f313f014b27181907fa04ffac03cb59be0be30656ff3eb17ea4a949ba2251cc798fecc142d2c30c9eec5737f29edbe3a3d115f577e77a91247c50483c66191c4b87b5394f91115434ee4d82c2edf7df2e6db35a1e5a6fee96eb5fe5e13cfe9cebcc832b91a265258b7e363036e74962abe0ca9eeb6a2e2e57af46a942da708720074bb3e0daaebfa18068fc213a10993c69a9fdeb07ba0bcb7498d454b563710cdc209f1fa4c1a31cbf027fc18ae5a7e9ec729f52e76f4231e1acb9dc3a5afa986c768e160f37ad364506c89305723054d37ff54fd7373111aa74e482763e708666cf93b93b9fa4f00e13208df8465883f8e45488569106f221df8cdf7578c7304a90da2c09ca7a8dff068a8434cf5e74a55e84d63bccaee54eb3c5189a40cee86d6295ddcfdc72977d2c6ac9fedb1500866c16ececc1d23f2e0f69f57ca3309b50f0deaa8a0d161d7937d5fe2b897f56b9808a53fe0d940a5cb9ccd954cfb0f1b84a6c952e89228e3e34445fd2a182cd9b9b36e016c56acb74997163f38b171bd16817dd4e9a20dc5c1099cc7183c9091721619a296905200a2beeb53fb84bed510a87a7d8490347ed",
"labelMap": { "dataType": "Map", "value": [ ["c14n0", "b0"]]},
"mandatoryIndexes": [0, 3],
"adjSelectiveIndexes": [1, 8]
}
The label map reveals nothing since there is only one blank node for the entire document; because there is no reshuffling, the mandatory indexes are the same for every tree in North America, revealing nothing; and because there is no reshuffling, the selective indexes are the same for every tree visiting a bar (if the starting base documents are all structured exactly the same). The BBS proof value lets the verifier infer n, the number of selectable statements (when combined with the selected indexes). This produces the signed derived document shown below:
{
"@context": [
"https://www.w3.org/ns/credentials/v2",
{
"@vocab": "https://windsurf.grotto-networking.com/selective#"
}
],
"type": [
"VerifiableCredential"
],
"State": "California",
"Over200Yrs": true,
"County": "Del Norte",
"proof": {
"type": "DataIntegrityProof",
"cryptosuite": "ecdsa-sd-2023",
"created": "2023-08-15T23:36:38Z",
"verificationMethod": "did:key:zUC7DerdEmfZ8f4pFajXgGwJoMkV1ofMTmEG5UoNvnWiPiLuGKNeqgRpLH2TV4Xe5mJ2cXV76gRN7LFQwapF1VFu6x2yrr5ci1mXqC1WNUrnHnLgvfZfMH7h6xP6qsf9EKRQrPQ#zUC7DerdEmfZ8f4pFajXgGwJoMkV1ofMTmEG5UoNvnWiPiLuGKNeqgRpLH2TV4Xe5mJ2cXV76gRN7LFQwapF1VFu6x2yrr5ci1mXqC1WNUrnHnLgvfZfMH7h6xP6qsf9EKRQrPQ",
"proofPurpose": "assertionMethod",
"proofValue": "u2V0BhNhAWQHAigZXkDzqfigseWq8P-vItduGV_B3Jg8xPwFLJxgZB_oE_6wDy1m-C-MGVv8-sX6kqUm6IlHMeY_swULSwwye7Fc38p7b46PRFfV353qRJHxQSDxmGRxLh7U5T5ERVDTuTYLC7fffLm2zWh5ab-6W61_l4Tz-nOvMgyuRomUli342MDbnSWKr4Mqe62ouLlevRqlC2nCHIAdLs-Darr-hgGj8IToQmTxpqf3rB7oLy3SY1FS1Y3EM3CCfH6TBoxy_An_BiuWn6exyn1Lnb0Ix4ay53Dpa-phsdo4WDzetNkUGyJMFcjBU03_1T9c3MRGqdOSCdj5whmbPk7k7n6TwDhMgjfhGWIP45FSIVpEG8iHfjN91eMcwSpDaLAnKeo3_BoqENM9edKVehNY7zK7lTrPFGJpAzuhtYpXdz9xyl30sasn-2xUAhmwW7OzB0j8uD2n1fKMwm1Dw3qqKDRYdeTfV_iuJf1a5gIpT_g2UCly5zNlUz7DxuEpslS6JIo4-NERf0qGCzZubNuAWxWrLdJlxY_OLFxvRaBfdTpog3FwQmcxxg8kJFyFhmilpBSAKK-61P7hL7VEKh6fYSQNH7aEAAIIAA4IBCA"
}
}
As @OR13 pointed out in the above, we can see that the created timestamp can be problematic, but wouldn't this be under the control of the issuer?
@Wind4Greg,
The example above makes sense (it does need a little tweaking, however, to be VCDM conformant: the credential subject properties should be nested under credentialSubject).
As @OR13 pointed out in the above, we can see that the created timestamp can be problematic, but wouldn't this be under the control of the issuer?
It's under the control of the issuer, yes -- and could just be omitted as well. It's an optional field.
As an aside, the JSON Pointers needed for mandatory disclosure to ensure conformance to normative statements should probably be provided in this spec. Here is an example of what I mean:
https://transmute-industries.github.io/vc-jwt-sd/#example-mandatory-to-disclose-json-pointers
AFAIK, you can't selectively disclose attributes on the proof objects (there can be multiple proofs)... but I could be wrong about that.
Hashing procedures have been added to the updated document. See https://w3c.github.io/vc-di-bbs/#base-proof-hashing-bbs-2023. For a full discussion of leakage and unlinkability see PR https://github.com/w3c/vc-di-bbs/pull/101.
Describe if / how hashing is relevant to this ciphersuite.