trustoverip / tswg-acdc-specification-archived

Authentic Chained Data Containers (ACDC)

Possible ambiguity of usage `oneOf` from JSON schema #8

Closed · blelump closed this issue 1 year ago

blelump commented 2 years ago

Sam,

in the spec https://github.com/SmithSamuelM/Papers/blob/master/whitepapers/ACDC_Spec.md#attribute-section there is a proposal to use the oneOf JSON Schema keyword. To provide context, below is a copy-pasted snippet:

"a": 
{
  "description": "attribute section",
  "oneOf":
  [
    {
      "description": "attribute SAID",
      "type": "string"
    },
    {
      "description": "uncompacted attribute section",
      "type": "object",
      "required": 
      [
        "d",
        "i",
        "score",
        "name"
      ],
      "properties": 
      {
        "d": 
        {
          "description": "attribute SAID",
          "type": "string"
        },
        "i": 
        {
          "description": "Issuee AID",
          "type": "string"
        },
        "score": 
        {
          "description": "test score",
          "type": "integer"
        },
        "name": 
        {
          "description": "test taker full name",
          "type": "string"
        }
      }
    }
  ],
  "additionalProperties": false
}

I get the reasoning behind the above snippet, so that part is clear. What may be problematic is how consumers use oneOf. Imagine a consumer sees such an option and constructs their schema in the following way:

"a": 
{
  "description": "attribute section",
  "oneOf":
  [
    {
      "description": "plane",
      "type": "object",
      "required": [ ... ],
      "properties": { ... }
    },
    {
      "description": "car",
      "type": "object",
      "required": [ ... ],
      "properties": { ... }
    },
    {
      "description": "bike",
      "type": "object",
      "required": [ ... ],
      "properties": { ... }
    }
  ],
  "additionalProperties": false
}

It is at least problematic to reason about whether the data is a car, a plane, or a bike. The oneOf opens the door to context mixing across various data types.

blelump commented 2 years ago

Found the answer in https://github.com/SmithSamuelM/Papers/blob/master/whitepapers/ACDC_Spec.md#chain-link-confidentiality-protection-with-composable-json-schema .

SmithSamuelM commented 2 years ago

Well, of course any composition operator can be abused, misused, or confused. That is why ACDCs are best used within an ecosystem governance framework (EGF). That EGF defines the structure of the credentials that are relevant to its ecosystem, which means that the confusion would never occur, because only ACDCs that comply in structure with the EGF would be used. But even then, any pair of entities planning to engage in a transaction has to agree on what data they want to exchange within the context of that exchange. For example, GLEIF defines its ecosystem of ACDCs, aka vLEIs. This EGF includes defining the contextual semantics and syntax for their credentials. Likewise, other groups of issuers define the ecosystem of VCs that apply to their transaction set. Given that set, the use of oneOf to enable composition for the purpose of selective disclosure becomes clearly defined.

What the ACDC spec is doing is showing how to leverage a universally adoptable set of tools like JSON Schema validators, digital signatures, and cryptographic digests to enable selectively disclosed and chain-link confidential exchanges of data without writing that tooling from scratch. This aligns with the KERI ethos of minimally sufficient means.

It is good to remember that a universally applicable data semantic construct is not possible. Unfortunately, this impossibility is overlooked by many who wish for a world where such a thing is possible. But the hard problem of polysemy and other forms of uncertainty make it impossible. Practically speaking, all data semantics are most useful within the narrow context defined by an ecosystem of transactions. The contextualized transaction set defines the applicable semantic set, not the other way around.

By default, every ACDC should be structured to enable selective disclosure for chain-link confidentiality, which means the top-level attribute block always includes a oneOf operator for that purpose. Within that oneOf array, a given element may use a nested oneOf for other purposes. But if so, it will be clearly spelled out by the EGF for that ACDC.

SmithSamuelM commented 2 years ago

Imagine, for example, that some Issuer issues an ACDC that has in it an attribute labeled "licensed to kill" with a value of true, along with a schema that defines the type and value of the credential. What possible meaning could this credential have? The possible meanings are unbounded. Does it mean that the identified Issuee may indeed kill without breaking any laws? No such simple bare construct is immune from such polysemy.

In any practical application, the credential is only understood in the context of an ecosystem that may be defined or explained by potentially hundreds or thousands of pages of documentation. These may include governance rules and recourse rules in the event of a dispute over the interpretation of that documentation. Absent such a rich ecosystem that brings contextual meaning, bare credentials, including schema, are useless. The toy examples we so often see promulgated by credential technologists belie the fact that schema cannot by themselves provide sufficient contextual meaning for decision making in the real world (i.e., actionable semantics). Credential schema provide syntactical structure that places the credential within a context, but the schema by itself cannot define the context. Only humans can do this.

The fiction of knowledge graphs is that they are self-contained. The best automated reasoning tools only work within a narrowly defined context. The context must always be encompassed by humans who supervise the tool's activities. A knowledge graph is just another form of automated reasoning. It must be supervised. The only intelligence we know of that can, practically speaking, solve the polysemy problem is a human, and most humans do so very poorly. That is why we have human recourse mechanisms.

SmithSamuelM commented 2 years ago

Another way of looking at polysemy is from mathematical logic. https://en.wikipedia.org/wiki/Gödel%27s_incompleteness_theorems

Gödel's incompleteness theorem, when applied to a knowledge graph, means that no knowledge graph can be self-contained. There must be an axiomatic context outside the graph that cannot be expressed in or by the graph itself. There is no recursively defined completeness of any logical system. This is due to something called syntactic incompleteness. Given a set of axioms by which a set of rules may be expressed, those rules may be semantically complete, i.e., well defined with respect to the axioms, but the set of semantic rules is incapable of expressing the axioms that the rules depend on.

Adding the complexity of polysemy in the meaning of the set language itself only compounds the problem of incompleteness.

blelump commented 2 years ago

Thanks @SmithSamuelM .

Could you elaborate on this comment https://github.com/trustoverip/tswg-acdc-specification/issues/8#issuecomment-1082447787 , especially on the first paragraph there:

[...] Likewise, other groups of issuers define the ecosystem of VCs that apply to their transaction set. 
Given that set, the use of oneOf to enable composition for the purpose of selective disclosure becomes clearly defined.

In particular, how does oneOf address the various transaction sets mentioned there? Assuming a given EGF utilizes vLEIs and other VCs, how does oneOf help here?

In the spec https://github.com/SmithSamuelM/Papers/blob/master/whitepapers/ACDC_Spec.md#composable-json-schema , oneOf is introduced to support chain-link confidentiality. The oneOf offers either the compact or the full schema. How might disclosing the full schema harm the discloser in the first place?

SmithSamuelM commented 2 years ago

@blelump Disclosing schema does not harm the discloser or disclosee. The problem is making commitments. The issuer, upstream, makes a commitment to the schema by signing the ACDC that includes a reference to the schema. Because the schema is static, only one schema can be committed to. So composable schema enables one commitment by the issuer to the fully composed schema, so that there are no security issues arising from dynamic schema, while the validator, downstream, can decompose the schema to ask and answer differentiated questions about the data provided via partial, full, or selective disclosure. It solves a problem that otherwise requires something complicated like a cryptographic accumulator.

SmithSamuelM commented 2 years ago

For example, a fully compact ACDC validates against a different (decomposed) schema expression than a partially compact expression, which in turn differs from a fully uncompacted expression. So instead of the Issuer committing to (by signing) each of all the different possible decomposed variants of a schema, the Issuer only needs to sign the composed version. The validator may then decompose as needed to validate against any variant of a compact or uncompacted disclosure of the ACDC itself.
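
As an illustrative sketch (not taken from the spec), the decomposition the validator performs amounts to keeping just one branch of the Issuer's oneOf. For the composed attribute-section schema in the first snippet above, the compact-only decomposed variant would reduce to:

"a": 
{
  "description": "attribute SAID",
  "type": "string"
}

while the fully uncompacted variant would keep only the object branch with its required d, i, score, and name fields. A compact ACDC validates against the former, an uncompacted one against the latter, and both validate against the composed schema the Issuer actually signed.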

SmithSamuelM commented 2 years ago

For example, a oneOf operator that allows the compact form may have nested within it an anyOf composition operator that allows the attributes of a given ACDC to be disclosed in any of the 24 EU official languages. This is done by providing a composed schema that includes copies of the attribute block in an anyOf array nested inside one of the oneOf blocks, where each copy has its field labels translated into one of the 24 languages. A validator that wants only a given language can then request disclosure of the attribute block that uses that particular language and verify the schema by validating against a decomposed version of the schema that removes the anyOf operator with the language options and instead has only the one language the validator requires. The Issuer simultaneously commits to any of the variants when it commits to the composed schema. The validator gets to enforce a decomposed version. The process is secure because the Issuer commits to a single static composable schema that allows any of the variants. The ecosystem governance framework defines how validators may safely decompose and still be compliant. It's up to the validator to enforce their own compliance, as it should be and frankly as it only can be. The issuer certainly can't enforce it.
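
A minimal sketch of what such a composed schema fragment might look like, with only two of the 24 language variants shown and the field labels (name, nom) as hypothetical placeholders:

"a":
{
  "description": "attribute section",
  "oneOf":
  [
    {
      "description": "attribute section SAID",
      "type": "string"
    },
    {
      "description": "uncompacted attribute section, one language variant",
      "anyOf":
      [
        {
          "description": "English field labels",
          "type": "object",
          "required": ["d", "name"],
          "properties":
          {
            "d": { "description": "attribute SAID", "type": "string" },
            "name": { "description": "test taker full name", "type": "string" }
          },
          "additionalProperties": false
        },
        {
          "description": "French field labels",
          "type": "object",
          "required": ["d", "nom"],
          "properties":
          {
            "d": { "description": "attribute SAID", "type": "string" },
            "nom": { "description": "nom complet du candidat", "type": "string" }
          },
          "additionalProperties": false
        }
      ]
    }
  ]
}

A validator that only accepts English would then validate against a decomposed variant that drops the anyOf and keeps only the English object.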

mitfik commented 2 years ago

@SmithSamuelM do you have any real use cases (an example of the usage of oneOf in a real-life credential/ACDC) where selective disclosure would be used in the way it is described in the spec? I have a hard time finding a situation where oneOf helps with selective disclosure of the semantics without revealing too much about what is inside.

Why so:

blelump commented 2 years ago

@SmithSamuelM , to add to what Robert just said: in this reply the oneOf makes sense, so that when the issuer has issued the compacted version and the validator has the uncompacted version, both will verify. This is correct, but where do you see the advantage of such a feature? In particular, where does it help?

Unless you think about something like:

"a": 
{
  "description": "attribute section",
  "oneOf":
  [
    {
      "description": "person",
      "type": "object"
    },
    {
      "description": "employee",
      "type": "object"
    }
  ]
}

so that you selectively disclose only partial information about the object? In the above example, you can disclose the person schema, which contains person-related attributes, or eventually the employee schema, which contains only the attributes about the work environment of this person.

mitfik commented 2 years ago

For example, a oneOf operator that allows the compact form may have nested within it an anyOf composition operator that allows the attributes of a given ACDC to be disclosed in any of the 24 EU official languages. This is done by providing a composed schema that includes copies of the attribute block in an anyOf array nested inside one of the oneOf blocks, where each copy has its field labels translated into one of the 24 languages.

I don't see much benefit in using anyOf for the translation case:

If we speak about the translation use case, I think the most powerful/practical approach is to decouple the capture base (the meaning of the data/context, the semantics) from the presentation layer. What does that give us? You are able to get a credential which can live far longer without any need for revocation or reissuance in case a new language appears.

Speaking of this example of the 24 EU official languages: if Ukraine joins the EU, shall we revoke all credentials and reissue them to add the new language? Or would it be better to say that the credential is still valid and that there is now a new translation layer released in Ukrainian, stamped by the EGF, which anyone can use against those specific credentials? Another advantage of such an approach is that you can delegate responsibility for such semantics to multiple issuers, where one can be responsible for the capture base and others for the presentation layers.

SmithSamuelM commented 2 years ago

It depends on whether you care about selective disclosure and correlation across presentation contexts. The language translation example may not be the best example for that, but it does point out that there is a difference, and understanding that difference is important. I agree that if one issues a credential that one wants to have an unbounded dynamic ability to retranslate in the future, then the selective disclosure use does not solve that problem. But passports and drivers licenses are not issued indefinitely, so the problem is not anytime in the future but anytime in the next 5 or 10 years. I think you are missing the point that we need to be able to do things in a cryptographically secure way, for security and/or correlation minimization, under harder constraints. You can wait indefinitely to solve these hard constraints, or you can solve them now with tooling you have now. You can layer OCA above ACDCs. Treat ACDCs as proof of authorship and proof of authority at a lower layer. What is proven can be an undifferentiated blob from the perspective of proving authorship or proving authority. You just need a digest of the blob to be in the ACDC. Then you can explode that blob and process it as OCA but refer back to the ACDC as your proof. You can store the ACDC to provide proof of authorship and proof of authority at rest vis-a-vis the layers that sit on top of it.

SmithSamuelM commented 2 years ago

Part of the problem in these discussions is that you want to push OCA before it's ready and before the tooling is ready. And we NEED to deliver vLEIs built on ACDCs now. So it's just too soon to be having these discussions. We committed as a community to JSON Schema months ago. So although I appreciate having a broader discussion, it's not timely.

SmithSamuelM commented 2 years ago

We are using JSON Schema tooling to provide a limited degree of flexibility and extensibility in ACDCs to enable chaining and basic use cases. A future version of ACDC could use other mechanisms such as OCA, but that is a much bigger lift and a bigger question, so it seems entirely out of place to even introduce or suggest OCA instead of JSON Schema at this point in time.

SmithSamuelM commented 2 years ago

So please keep the discussion focused on how we can leverage JSON Schema to accomplish the version 1 features of ACDCs, and not keep suggesting we use OCA.

SmithSamuelM commented 2 years ago

in most cases semantics are public objects, meaning that if an issuer creates a schema with oneOf, he needs to very strongly protect the "undisclosed variant" semantics so that people who are not allowed to see that part never get it at any point in time. At the same time, this semantics needs to be distributed to those who should be able to see that part. Taking into consideration that any holder is able to reveal that semantics publicly, what is the point of even assuming that this semantics would not be disclosed?

Let's not confuse syntax with semantics. Just because JSON and JSON Schema use the same syntax does not make them the same! It is very easy to confuse field labels as semantics versus correlatable values. One of the attractive reasons for using a (label, value) tuple in a field map is that one can infer semantics from a descriptive label without needing a separate schema. This is because, absent a separate schema, a descriptive label may convey semantics in and of itself.

But that is a very weak form of semantics. When we are explicitly using schema for semantics, then the primary purpose of a field label is syntactical: to identify a field, to distinguish it from some other field. To clarify, when using separated schema as type information (aka semantics), the semantics of the field is not conveyed by the field label; the semantics is conveyed by the specific sub-schema identified by the syntactical element that is the field label.

So either we mix semantics and syntax by not having schema at all, or we cleanly separate the semantics into the separated schema and let the schema be the semantics. This means that field labels are syntactical elements that may leak correlatable information.

So if I care about correlating field labels across presentation contexts, because the language of the field label leaks information about the context in which I present (such as how many different places in Europe I presented the PII contained in my passport), then I want to minimize the correlation of field labels as syntactical elements. So the field labels matter in this case. If I don't care about correlating to field labels, then I make them fully public and not selectively presentable.

So I completely agree that in most cases correlating language may not be important. But I used a tricky example to illustrate that when we care about leaking correlatable information, language is correlatable to place and therefore may be a concern when protecting against unpermissioned exploitation. It also illustrates the common misconception that field labels in ACDC are primarily semantics when indeed they are primarily syntax. The selective disclosure mechanism for ACDC hides (blinds) the field labels behind a cryptographically secure digest. This allows a presenter to not leak information via syntax (aka field labels).

"a":
[
  {
    "d": "ErzwLIr9Bf7V_NHwY1lkFrn9y2PYgveY4-9XgOcLxUde",
    "u": "0AqHcgNghkDaG7OY1wjaDAE0",
    "i": "did:keri:EpZfFk66jpf3uFv7vklXKhzBrAqjsKAn2EDIPmkPreYA"
  },
  {
    "d": "ELIr9Bf7V_NHwY1lkgveY4-Frn9y2PY9XgOcLxUderzw",
    "u": "0AG7OY1wjaDAE0qHcgNghkDa",
    "score": 96
  },
  {
    "d": "E9XgOcLxUderzwLIr9Bf7V_NHwY1lkFrn9y2PYgveY4-",
    "u": "0AghkDaG7OY1wjaDAE0qHcgN",
    "name": "Jane Doe"
  }
]

with semantics

"a": 
{
  "description": "attribute section",
  "oneOf":
  [
    {
      "description": "attribute section SAID",
      "type": "string"
    },
    {
      "description": "attribute details",
      "type": "array",
      "uniqueItems": true,
      "items": 
      {
        "anyOf":
        [
          {
            "description": "issuer attribute",
            "type": "object",
            "properties":
            "required":
            [
              "d",
              "u",
              "i"
            ],
            "properties":
            {
              "d": 
              {
                "description": "attribute SAID",
                "type": "string"
              },
              "u": 
              {
                "description": "attribute UUID",
                "type": "string"
              },
              "i": 
              {
                "description": "issuer SAID",
                "type": "string"
              },
            },
            "additionalProperties": false
          },
          {
            "description": "score attribute",
            "type": "object",
            "properties":
            "required":
            [
              "d",
              "u",
              "score"
            ],
            "properties":
            {
              "d": 
              {
                "description": "attribute SAID",
                "type": "string"
              },
              "u": 
              {
                "description": "attribute UUID",
                "type": "string"
              },
              "i": 
              {
                "description": "score value",
                "type": "integer"
              },
            },
            "additionalProperties": false
          },
          {
            "description": "name attribute",
            "type": "object",
            "properties":
            "required":
            [
              "d",
              "u",
              "name"
            ],
            "properties":
            {
              "d": 
              {
                "description": "attribute SAID",
                "type": "string"
              },
              "u": 
              {
                "description": "attribute UUID",
                "type": "string"
              },
              "i": 
              {
                "description": "name value",
                "type": "string"
              },
            },
            "additionalProperties": false
          }
        ]      
      }
    }
  ]
  "additionalProperties": false,
}

The array mechanism with anyOf is intentional because it enables the property that the field labels used in a given selective disclosure are also selectively disclosed. This provides herd privacy by default to field labels. It matters not that the set of field labels appears as a set in the public composed schema. They are merely a set of optional syntactic elements. Their appearance as a set does not indicate which members of the set of syntactic elements were actually used in a presentation. This is the essence of selective disclosure: to unbundle the members of a set, where the set was committed to by the issuer but the actual members disclosed are committed to by the presenter. A correlator can't correlate from the public schema what was presented across different contexts.
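
To make that concrete with the example above (an illustrative sketch of the shape of a disclosure, not a normative presentation format): a presenter who only needs to prove the score would present just that one element of the attribute array,

"a":
[
  {
    "d": "ELIr9Bf7V_NHwY1lkgveY4-Frn9y2PY9XgOcLxUderzw",
    "u": "0AG7OY1wjaDAE0qHcgNghkDa",
    "score": 96
  }
]

which still validates against the composed schema committed to by the Issuer (the items anyOf admits any subset of the attribute elements), while the undisclosed elements, field labels included, remain hidden behind the aggregated blinded commitment.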

These tricky issues about security and privacy, especially with regard to semantics and syntax, are hard to isolate. I am not surprised that they get confused. Hopefully the design of ACDCs is not so confused.

blelump commented 2 years ago

@SmithSamuelM ,

the usage of anyOf basically imposes on the schema designer that whatever selective disclosure will be available on this schema in the future must be designed upfront. So the schema designer must define all the possible branches beforehand, as was actually already mentioned in the translations case.

While translations are not that critical (i.e., some may be missing), what if the designed branches for information disclosure don't fit the actual requirements from the discloser's perspective? Shall we always impose the EGF first and then simply forward any eventual discloser claims to the EGF?

SmithSamuelM commented 2 years ago

@blelump

so that you selectively disclose only partial information about the object? In the above example, you can disclose the person schema, which contains person-related attributes, or eventually the employee schema, which contains only the attributes about the work environment of this person.

Yes, exactly, that is the whole purpose of selective disclosure vs. partial disclosure. Selective disclosure enables the discloser to unbundle attributes from the issuer and only disclose the subset needed to enable a transaction. So if the credential includes name, address, phone number, place of work, ethnicity, religion, etc. but only the address is relevant, then selective disclosure enables disclosure of just the address without correlating to the other attributes bundled in the credential. This is the primary feature of AnonCreds1 in Sovrin/Indy. They use a cryptographic accumulator. Here we just use an aggregated blinded commitment, which shares many of the features of a more sophisticated accumulator. We can't do range proofs like AnonCreds1, for example. But basic selective disclosure of multiple attributes in the same credential we can do.
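
A hypothetical sketch of that unbundling (the address field label and all values are invented for illustration): the credential's attribute array carries one blinded attribute block per field, and the discloser presents only the block that is relevant,

"a":
[
  {
    "d": "<attribute SAID>",
    "u": "<attribute UUID>",
    "address": "123 Main Street, Anytown"
  }
]

leaving the name, phone number, ethnicity, and other blocks undisclosed.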

SmithSamuelM commented 2 years ago

@blelump

the usage of anyOf basically imposes on the schema designer that whatever selective disclosure will be available on this schema in the future

Yes, of course. Except for range proofs, that is how selective disclosure works. The Issuer, based on reasonable expectations, can granularize the attributes in the ACDC up front. This fits many, many use cases in real-world applications, except for credentials that are unbounded documents in size. If a credential has more than a few attributes, it's probably mis-designed in the first place. Unbounded attributes are an anti-pattern for credentials. Credentials should not be books. If they are, then we don't need to differentiate the elements in such a way; just issue a new credential adaptively as needed. The CESR-path proof provides a way to refer, after the fact, to a document and issue a credential on some part of that document, on demand as needed.

Such destructuring to "data-lake" a document as a single ACDC is an anti-pattern for ACDC. If the purpose of the ACDC is to provide proof of authorship of the document, then the document itself is a blob from the perspective of the credential. One can "data-lake" the blob after the fact without impinging on the mechanisms of the ACDC.

If, on the other hand, the credential is a proof of authority, then it needs to be focused, small, and tight, and so should only have a few fields that can be granularized ahead of time.

Wanting to do both at the same time is the anti-pattern. One should pick: is it proof of authorship or proof of authority? If the former, then the whole document does not need to be differentiated in structure. If the latter, then there should only ever be a few a priori well-designed, well-understood attributes.

We don't need to pre-structure all documents. Just provide tooling to allow on-demand issuance of credentials as situations change, in the rare exceptional conditions where some new commitment needs to be made. Chaining helps here. Chaining allows reuse of credentials. The holder of a credential can issue a new chained credential on the fly by re-chaining or treeing credentials it already has.

It is nonsensical to believe that the environment and context in which credentials are used is such that we must design for every possible contingency. Ecosystems evolve very slowly relative to the cost of issuing new credentials as needed. If we automate the hard parts of authenticity, which KERI does, then the cost of issuing a new credential goes to zero, which means credentials can become much more bespoke. The hard part of issuing credentials is the authentication up front to establish the AID. Once that is established, a simple API can be used to issue new credentials from whatever information is at the issuer's disposal. It really depends on the ecosystem. Some ecosystems evolve very, very slowly; some much faster. Careful design means not over-designing for all ecosystems by using the most expensive approach when only some ecosystems demand it. Too much flexibility and extensibility comes at a high cost in all the tooling and maintenance.

SmithSamuelM commented 2 years ago

All that said, this discussion is off base. Selective disclosure mechanisms in ACDC are meant to solve two corner cases. They are not the primary case. The primary use case is solved with a private attribute ACDC, not a selectively disclosable attribute ACDC. The combination of chain-link confidentiality on a graph of chained private attribute ACDCs (i.e., partial disclosure) is sufficient to protect privacy. Selective disclosure (unbundling) is not needed when the use case is designed to exploit a graph of private attribute ACDCs.

The corner cases for selective disclosure ACDCs are as follows:

1) An adoption vector for legacy credentials that follow the legacy paper credential anti-pattern of mixing data needed to establish authority with data not needed to establish authority, or mixing data needed for enforcement (forensics) with data needed to establish authority.

2) Multi-context use cases where mix and match of selectively disclosable attributes from a single ACDC needed to prove authority is a better fit than a graph of private attribute ACDCs.

The goal is for ACDCs to provide a complete solution, for adoptability reasons. Rarely is mix-and-match selective disclosure a good idea other than for legacy systems.

The use case that is being confused is proof of authorship. If a proof-of-authority credential needs to be translated and the verifier has a trusted translator, then two credentials are needed, not one. The first credential is the proof of authority (authorization) issued by the entity delegating that authority (such as a license or passport). The second credential is a proof of authorship, by the trusted translator, of the translation of the first credential. The proof of authorship need only reference the first credential and treat the translation as a blob from the standpoint of proof of authorship. It does not need to recreate the proof of authority. So the semantics of the two are different and do not need to be mixed in the same semantic construct.

A property graph, though, allows one to mix semantic constructs, because the edges have properties that can isolate sub-graphs and establish a hierarchy of different types of sub-graphs in one super graph. So proofs-of-authority can be linked to proofs-of-authorship. Attempting to collapse or flatten the sub-graphs into one graph with undifferentiated edge/node types is entirely problematic.

SmithSamuelM commented 2 years ago

IMHO there is a fundamental disconnect in understanding between the field of automated reasoning and what I call data-laking, for lack of a similarly precise term. Both depend on abstractions for knowledge representation. But in the former case the decision-making drives the knowledge representation, whereas in the latter the ability to mix data in a lake with some type of universal semantic representation drives the knowledge representation. The latter makes decision making more difficult, not easier. So if the purpose of data-laking is to support automated decision making, it seems to be largely counterproductive. Automated decision making needs highly contextualized, provenanced, traceable, metrified information, the opposite of universal decontextualized semantics. And real-world decision making is ALWAYS in an environment of extreme uncertainty, so any semantic construct that does not include uncertainty as a first order property is DOA for real-world decision making except in very narrow contexts. Which is the antithesis of a data-lake. We want a warehouse of data buckets, not a data-lake.

SmithSamuelM commented 2 years ago

What we need are systems that are hybrids, that combine uncertainty with symbolics. Pure gradient-descent deep learning is not abstractable. Pure abstractions are erroneous, un-ground-truthable, un-tunable. This paper provides a history of the disconnect and contention in the machine learning world. A similar, related disconnect has existed in the knowledge representation world for almost the same amount of time. https://nautil.us/deep-learning-is-hitting-a-wall-14467/

It didn't help that the semantic web ignored uncertainty entirely. This spawned a generation of technologists who felt enabled to practice without ever feeling any obligation to understand even basic concepts of reasoning under uncertainty.

SmithSamuelM commented 2 years ago

If you look at what OCA is trying to accomplish with overlays on a capture base, it has merit for providing multiple downstream processing alternatives for the capture base. But with the exception of "sensitive data" there is no concept of partial, full, or selective disclosure for privacy control at the time of disclosure. The idea of sensitive data is post-disclosure, after it's too late. One could have an overlay that extracts the rules section and applies it to an overlay for chain-link confidentiality to subsequent users of the data, but there is no conception of a Ricardian contract in OCA. There is no overlay that manages uncertainty, or graph-based processing between different ACDCs. These are all missing features that overlays of a given capture base do not address. I am not trying to discredit the good work of overlays. But OCA overlay goals are not the same as what ACDCs need to do upfront.

Finally, as I have said multiple times, it would be trivial to create a one-to-one mapping between the full disclosure variant of an attribute section of an ACDC expressed in JSON, together with its JSON Schema, and an OCA capture base. As you know, an OCA capture base is simply a map of base attribute labels and types, but one that uses a newly invented syntax for expressing those attribute labels and types. So IMHO it makes perfect sense for the proponents of OCA, as an adoption vector, to leverage the existing tooling for JSON Schema by providing an adapter that maps the ACDC attribute section expressed as JSON Schema one-to-one to the OCA capture base syntax. Then there is fundamental alignment between the existing ACDC spec and future use of ACDC attribute sections mapped to their OCA overlay capture base equivalents. This can happen after the fact, by any verifier, once disclosure has happened. It doesn't capture the relationships between ACDCs as nodes that the edges provide, but it does enable overlays of any given node.

But OCA does not provide compact transmission via partial disclosure, nor protection via chain-link confidentiality with partial disclosure support, nor selective disclosure support, nor support for the distributed property graphs needed for chaining of credentials. OCA's principal value is as a semantic overlay after the fact of a verifiably authentic full disclosure of the attribute section, not as a mechanism for such disclosure. So its fundamental purpose is clearly not the same as ACDC; it's only a slice. It's just a misapplication. It fits best at a layer above ACDC, not integral to ACDC. It feels like shoehorning something that solves one problem well downstream (data-lake enabling semantic overlays of a given set of attributes) onto something it doesn't solve at all well (granular chained (treed) proofs of authorship and/or authority, in a correlation-protecting manner, of a given set of attributes). The latter is a precursor to the former.

SmithSamuelM commented 2 years ago

I believe that it is entirely possible to combine the concept of semantic overlays with property graph based reasoning. But that is a future topic and will take some time to develop. In the meantime we have to, as the saying goes, either fish or cut bait. We need to fish, not cut bait.

SmithSamuelM commented 2 years ago

To be very specific. The capture base of an OCA is itself an overlay on the uncompacted variant of an ACDC attribute section. The OCA capture base as overlay needs a mapping function that maps its syntax to the syntax of the decomposed JSON schema that specifies the uncompacted variant of the attribute section.

This also works for selectively disclosable attribute sections. Because each selective disclosure is itself a decomposed variant. So for each decomposed variant there is a one-to-one mapping to an OCA capture base for that variant.

This resolves all the issues of interoperability. Downstream consumers of ACDC attributes can use an OCA capture base overlay to enable other OCA overlays.

blelump commented 2 years ago

@SmithSamuelM

First of all, let me apologize, as I was thinking a is about selective disclosure, whereas it is about A. Let me also continue on this topic and share some more thoughts. Perhaps the misunderstanding of the oneOf application is the consequence of having a different way of thinking.

Aside question:

SmithSamuelM commented 2 years ago

@blelump

The ACDC a attribute along with the oneOf application is considered as shown here in the snippet. This is a mix of compacted and uncompacted schema that is basically about the same schema. What would be the use case for such an approach?

ACDCs support something that, in the latest version of the spec, I called "graduated disclosure". They also support something called "contractually protected disclosure". The latter, "contractually protected disclosure", requires a type of graduated disclosure called partial disclosure. These are definitions specific to the ACDC nomenclature, for lack of better terminology. Partial disclosure is different from selective disclosure. The purpose of partial disclosure is to enable full disclosure AFTER contractual protection is put in place. There are two parties in contractually protected disclosure. The first party is the Discloser (i.e., the one making the disclosure). The second party is the Disclosee (i.e., the recipient of the disclosure). The Discloser wants to minimize the amount of information disclosed to the Disclosee until after the Disclosee has agreed (by signing) to the terms of the disclosure. The Disclosee needs proof that what the Discloser is about to disclose was actually issued by the Issuer (the Issuer may not be the same as the Discloser in a presentation). The oneOf composition in the schema committed to by the Issuer (not the Discloser) enables the Discloser to prove the structure of both a partial and a full disclosure, as committed to by the Issuer, both before (partial disclosure) and after full disclosure. It bundles a commitment to different forms of the schema that the verifier (recipient) can unbundle, without requiring a separate commitment for each composition of schema. So composition with oneOf is an essential property that enables contractually protected disclosure. In a very real sense, composition of schema for selective application is analogous to the selective disclosure of attributes, but it is the selective application of schema. Via composition the Issuer can bundle and make cryptographic commitments (signing) to composed schema variants, which may be securely unbundled (selectively applied) later by the presenter or verifier.

One type of contractually protected disclosure is "chain-link confidentiality". This provides comprehensive privacy protection after agreement. But to achieve such comprehensive protection the potential Discloser must graduate the disclosure of information: first metadata, then details. Because schema is metadata, and because we require schema validation as part of both partial and full disclosure, we need composition of both schemas, partial and full, at the time of issuance, not merely at the time of presentation. The time of presentation is too late; the Issuer is no longer involved.

Furthermore, chain-link confidentiality is not just applied to the first recipient (Disclosee) of a disclosure but to all subsequent recipients (Disclosees) in a successive chain of disclosures. So the schema composition originally committed to by the Issuer (head of the chain) is applied to each subsequent chain-link confidentiality exchange (offer, agree, allow), where the compact (partial) schema variant of the oneOf is used in the offer, and the full schema variant of the oneOf is used in the allow.
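
An illustrative sketch of the two data forms involved (values are placeholders, field labels taken from the first example in this thread): in the offer, the attribute section is disclosed only in compact form, as its SAID,

"a": "<attribute section SAID>"

and in the allow, after the Disclosee has signed its agreement, the same section is disclosed in full,

"a":
{
  "d": "<attribute section SAID>",
  "i": "<Issuee AID>",
  "score": 96,
  "name": "Jane Doe"
}

Both forms validate against the same Issuer-committed composed schema: the first against the string branch of the oneOf, the second against the object branch.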

SmithSamuelM commented 2 years ago

@blelump

why does e not allow for a list of nodes and instead impose a 1:1 relationship? In the above ID example, what if the ID ACDC had edges to more than one other ACDC?

Short answer: it does. I just didn't provide any examples of this in the spec so far. I have been working on the syntax for how to combine multiple edges in the same e section beyond a simple logical AND. I will update the spec before the Tuesday ACDC meeting with multiple-edge examples and syntax.

SmithSamuelM commented 2 years ago

@blelump

Your example of using chaining to reduce the need for selective disclosure is accurate. Unfortunately, there are Issuers of legacy paper credentials that do not appreciate the advantages of graphed (chained) credentials, and therefore ACDCs provide a simple selective disclosure mechanism to support those use cases of converting existing paper credentials into verifiable ones.

More important is support for contractually protected disclosure which requires graduated disclosure and composed schema.

Compact ACDCs also rely on graduated disclosure and composed schema.

SmithSamuelM commented 2 years ago

One reason for composed schema is to reduce the complexity of presentation exchanges and schema management. A single composed schema has one SAID. It's like a master schema. Both the compact (partial) and the uncompacted (full) disclosure variants are included in the master using the oneOf composition operator. Any message in an exchange can reference the SAID of the master schema, retrieve the associated schema from a schema database, and apply the master schema to any variant of the actual data presented, and it will pass validation IF and ONLY IF the data presented is compatible with an allowed variant in the master composed schema. The presentation cannot introduce any additional complexity. The tooling for each step of an exchange becomes simplified because the schema committed to by the Issuer is the master composed schema.

Suppose, as an alternative, that the master schema was not a composition using oneOf but only the uncompacted (full) variant. Obviously the presenter could construct its own compact variant on the fly from the master uncompacted schema that would be semantically compatible with the master. But now the validator has to recognize that the presenter correctly constructed an on-the-fly compatible variant. The validator has to verify that variant to ensure that it was indeed a "compatible" variant; it can't trust the presenter. This extra semantic processing and verification can be complex, especially because the presenter may have a virtually unlimited set of possible "compatible" on-the-fly variants it may choose to construct. These provide an attack vector where a malicious presenter may construct a variant that the validator does not correctly evaluate as compatible or incompatible, as it were. This would be a type of transaction malleability attack.

Whereas with the Issuer-composed schema approach proposed by the ACDC specification, the presenter is not allowed to introduce on-the-fly variants that require extra logic on behalf of the validator to evaluate as compatible. The only party doing decomposition is the validator, not the presenter. The validator knows ahead of time what the Issuer's master composed schema looks like and also knows what decomposed variant it wants to test for. It doesn't have to be able to semantically verify any possible decomposition, only the ones the validator cares about. So it can construct its own tooling to securely do that decomposition from a known, fixed, composed master created by the Issuer. It doesn't have to do anything special to account for on-the-fly variants created by the presenter, because the presenter is not allowed to do on-the-fly constructions or decompositions. So the presenter cannot mount a malleability attack on the validator, and importantly the validator does not have to protect itself from such an attack. Simply validating the master composed schema against any presentation by the presenter will either validate against one of the Issuer-allowed variants or it won't. No special logic required.

Indeed, in many cases the Validator can simplify its own validation. It may not need to decompose the schema at all. It just needs to check for the presence of the desired attributes in the presented data after it validates the presented data against the master composed schema. If the desired attributes are present, then they MUST have complied with a known variant that specified them. If they are not present, then the data MUST have complied with a known variant that did not include them. But in either case the presenter can only provide data that is compliant with a known variant committed to by the Issuer. In the latter case, where the expected data is not presented but the schema validates against the presented data (i.e., it is using a known compact variant), the validator KNOWS that the source of the problem is specifically that the presenter did not make the promised (offered) disclosure, and the validator can then re-request it as an incomplete presentation. This simplifies the presentation logic because the response can be simplified to an error code such as incomplete disclosure, or can include the field labels of the expected attributes that were not disclosed. But the structure and possibilities of such error responses are known a priori by the validator, because any and all presentations must be compatible with one of the variants in the composed master schema, otherwise they would not pass the schema validation.

In the vast majority of partial disclosure cases, for well designed attributes and schema, there should only be two variants at any nesting level of an attribute section field map: the compact variant with only the SAID, and the uncompacted variant with the full field map. If any field values are themselves field maps, then each nesting level may include its own oneOf composition of the two variants, compacted and uncompacted. Then any presentation using any combination of compacted and uncompacted field maps at any level of nesting will be an allowed variant, as the sketch below illustrates.
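
A minimal sketch of that nesting pattern (the address and city field labels are hypothetical): each nested field map carries its own two-variant oneOf, so every combination of compacted and uncompacted inner blocks is an Issuer-allowed variant.

"a":
{
  "description": "attribute section",
  "oneOf":
  [
    {
      "description": "attribute section SAID",
      "type": "string"
    },
    {
      "description": "uncompacted attribute section",
      "type": "object",
      "required": ["d", "address"],
      "properties":
      {
        "d": { "description": "attribute section SAID", "type": "string" },
        "address":
        {
          "description": "nested address field map",
          "oneOf":
          [
            { "description": "address block SAID", "type": "string" },
            {
              "description": "uncompacted address block",
              "type": "object",
              "required": ["d", "city"],
              "properties":
              {
                "d": { "description": "address block SAID", "type": "string" },
                "city": { "description": "city name", "type": "string" }
              },
              "additionalProperties": false
            }
          ]
        }
      },
      "additionalProperties": false
    }
  ]
}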

In the event that some custom or bespoke variant of the Issued data truly needs to be presented, but is not possible as an allowed variant of the original master composed schema, then the correct approach is for the presenter to create its own disclosure-specific bespoke ACDC. The presenter thereby becomes the Issuer of that ACDC, with its own associated master composed schema that may include or reference attributes via an edge pointing to the original ACDC. The presenter is effectively providing the equivalent of a custom decomposed schema, but doing so in a way that does not require any different presentation validation logic, complexity, or tooling on the part of the validator.

To clarify, in this latter approach the tooling for validation of a presentation is just the normal tooling. The validator does not need to account for on-the-fly decompositions by the presenter of some other Issuer's schema. Instead, the bespoke ACDC created by the presenter-as-Issuer has allowed schema variants that are validated in the normal way but satisfy the presenter's need to use a customized presentation. This makes schema presentations extensible without needing to build support for complex presentation logic. Every presentation is validated the same way. It's just validating a chained ACDC using only Issuer-allowed variants of composed schema for each ACDC in the chain.

Recall that the purpose of ACDCs is to convey proof of authorship (authenticity) and sometimes additionally proof-of-authority (authorization), via a provenanced tree of chained ACDCs, for the data payload(s) so contained. Additional business logic may need to be applied to that data payload after validation of authenticity and authority. But that additional business logic SHOULD NOT be part of the presentation logic. That breaks the layering that ACDCs are designed to provide. ACDCs act at the presentation layer of the 7-layer OSI model, not the application layer.