
Overlays approach for semantic #9

Closed mitfik closed 1 year ago

mitfik commented 2 years ago

Following the discussion in #6, I would like to elaborate on the topic of Overlays Capture Architecture (OCA) and the consequences it has for the overall data architecture.

As mentioned in #6, we need to address three problems:

What we would like to have is:

To address the above issues we need:

OCA was designed exactly for all of the above, due to the lack of existing solutions on the market. To clarify: OCA is an architecture; it tells you how to compose objects (and their characteristics) to achieve the desired outcome. It could potentially be used with any existing language or serialization format.

In principle OCA does not conflict with JSON Schema in any way; we can easily represent OCA objects using JSON Schema semantics. The problem is much more on the conceptual, architectural level. JSON Schema was not designed to deal with all of the above problems. Forcing ACDC users to use only JSON Schema could steer authentic data in the wrong direction and cause the same problems we experience on Web 2.0.

For reference, here is an example of a capture base using the OCA JSON serialization:

{
    "type": "spec/capture_base/1.0",
    "classification": "GICS:45102010",
    "attributes": {
        "dateOfBirth": "Date",
        "dateOfExpiry": "Date",
        "dateOfIssue": "Date",
        "documentCode": "Text",
        "documentNumber": "Text",
        "documentType": "Text",
        "fullName": "Text",
        "issuedBy": "Text",
        "issuingState": "Text",
        "issuingStateCode": "Text",
        "nationality": "Text",
        "optionalData": "Text",
        "optionalDocumentData": "Text",
        "optionalPersonalData": "Text",
        "personalNumber": "Text",
        "photoImage": "Binary",
        "placeOfBirth": "Text",
        "primaryIdentifier": "Text",
        "secondaryIdentifier": "Text",
        "sex": "Text",
        "signatureImage": "Binary"
    },
    "flagged_attribute": [
        "documentNumber",
        "fullName",
        "primaryIdentifier",
        "secondaryIdentifier",
        "dateOfBirth",
        "personalNumber",
        "placeOfBirth",
        "optionalPersonalData",
        "optionalDocumentData",
        "signatureImage",
        "photoImage",
        "optionalData"
    ]
}

For a reference on how the other layers are linked, take a look here.
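As an illustrative sketch only (the exact type string and field names are assumptions modeled on the serialization above, and the capture_base value is a placeholder for the capture base's SAID), a label overlay linking back to the capture base could look something like:

{
    "type": "spec/overlay/label/1.0",
    "capture_base": "<SAID-of-the-capture-base-above>",
    "language": "en",
    "attribute_labels": {
        "dateOfBirth": "Date of birth",
        "documentNumber": "Document number",
        "fullName": "Full name"
    }
}

Because the overlay references the capture base by a content-addressable digest rather than embedding it, each layer can evolve and be verified independently.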

Here is the same capture base represented using JSON Schema:

{
    "$id": "EPMaG1h2hVxKCZ5_3KoNNwgAyd4Eq8zrxK3xgaaRsz2M",
    "description": "Capture base - JSON Schema for Passport",
    "title": "Passport",
    "$schema": "http://json-schema.org/draft-04/schema#",
    "classification": "GICS:45102010",
    "type": "object",
    "properties": {
      "dateOfBirth": {
        "type": "string",
        "flagged_attribute": true,
        "format": "date"
      },
      "dateOfExpiry": {
        "type": "string",
        "format": "date"
      },
      "dateOfIssue": {
        "type": "string",
        "format": "date"
      },
      "documentCode": {
        "type": "string"
      },
      "documentNumber": {
        "type": "string",
        "flagged_attribute": true
      },
      "documentType": {
        "type": "string"
      },
      "fullName": {
        "type": "string",
        "flagged_attribute": true
      },
      "issuedBy": {
        "type": "string"
      },
      "issuingState": {
        "type": "string"
      },
      "issuingStateCode": {
        "type": "string"
      },
      "nationality": {
        "type": "string"
      },
      "optionalData": {
        "type": "string",
        "flagged_attribute": true
      },
      "optionalDocumentData": {
        "type": "string",
        "flagged_attribute": true
      },
      "optionalPersonalData": {
        "type": "string",
        "flagged_attribute": true
      },
      "personalNumber": {
        "type": "string",
        "flagged_attribute": true
      },
      "photoImage": {
        "type": "string",
        "flagged_attribute": true,
        "format": "binary"
      },
      "placeOfBirth": {
        "type": "string",
        "flagged_attribute": true
      },
      "primaryIdentifier": {
        "type": "string",
        "flagged_attribute": true
      },
      "secondaryIdentifier": {
        "type": "string",
        "flagged_attribute": true
      },
      "sex": {
        "type": "string"
      },
      "signatureImage": {
        "type": "string",
        "flagged_attribute": true,
        "format": "binary"
      }
    }
}

Now, since JSON Schema does not have a built-in mechanism for linking objects (e.g. linking translations as a separate object, or formatting as a separate object), people would define all of that inline in one schema. This would make it hard to address the issues mentioned at the beginning of this post.
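To illustrate the problem (a hypothetical sketch; the x-prefixed keys are invented here purely to show where such data would have to live), inlining a translation and display formatting into the schema itself would look something like:

{
    "properties": {
        "dateOfBirth": {
            "type": "string",
            "format": "date",
            "title": "Date of birth",
            "x-label-pl": "Data urodzenia",
            "x-format-display": "DD/MM/YYYY"
        }
    }
}

Every new language or presentation concern would grow this single object, instead of living in a separately addressable, separately governed layer.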

To summarize:

| Feature | OCA | JSON Schema |
| --- | --- | --- |
| minimalistic capture base | Yes | Yes, but not enforced |
| separate layers | Yes | No |
| content-addressable linking mechanism | Yes | Partially (only for $ref); does not support layers |
| bundle identifier | Yes | No |

The semantic problem is more complex than what was mentioned here. But based on Human Colossus research, we found that certain characteristics are required to even try to do it right. This is why I would suggest expanding on that matter in the ACDC specification so that we don't limit ACDC to using only JSON Schema, but instead define the characteristics of any semantics allowed, so that authentic data containers can address the mentioned issues. We have that agility in all other aspects of ACDC, like the crypto for SAIDs and SCIDs; why not allow it for semantics as well? I get the argument that too much flexibility creates adoption issues, but without it we close the door to huge communities that need to address the above issues, and JSON Schema does not help them do so.

To make that clear: I am not looking for OCA adoption vectors; OCA will be fine. I am more worried that ACDC would become a narrow use case for verifiable credentials and won't be adopted outside that space, since we already know that JSON Schema does not solve the problem of data harmonization overall. The "data" world is much bigger than "verifiable credentials", so I am after ACDC adoption vectors more than anything else.

Happy to hear your thoughts on that.

SmithSamuelM commented 2 years ago

It would be helpful to combine this issue with the other one already opened, as there is significant overlap with https://github.com/trustoverip/tswg-acdc-specification/issues/8

So I will reproduce here the answers from that issue that specifically address your question.

@SmithSamuelM

If you look at what OCA is trying to accomplish with overlays on a capture base, it has merit for providing multiple downstream processing alternatives of the capture base. But with the exception of "sensitive data" there is no concept of partial, full, or selective disclosure for privacy control at the time of disclosure. The idea of sensitive data is post-disclosure, after it's too late. One could have an overlay that extracts the rules section and applies it as an overlay for chain-link confidentiality to subsequent users of the data, but there is no conception of a Ricardian contract in OCA. There is no overlay that manages uncertainty, or graph-based processing between different ACDCs. These are all missing features that overlays of a given capture base do not address. I am not trying to discredit the good work of overlays. But OCA overlay goals are not the same as ACDC goals.

@mitfik

In principle OCA does not conflict with JSON Schema in any way; we can easily represent OCA objects using JSON Schema semantics.

You have it backwards. The point is that you can represent a fully disclosed ACDC attribute section as an OCA capture base overlay of the JSON Schema of the fully disclosed ACDC attribute section.

Finally, as I have said multiple times, it would be trivial to create a one-to-one mapping between the full disclosure variant of an attribute section of an ACDC, expressed in JSON together with its JSON Schema, and an OCA capture base. As you know, an OCA capture base is simply a map of base attribute labels and types, but one that uses a newly invented syntax for expressing those attribute labels and types. So IMHO it makes perfect sense for the proponents of OCA, as an adoption vector, to leverage the existing tooling for JSON Schema by providing an adapter that maps the ACDC attribute section expressed as JSON Schema one-to-one to the OCA capture base syntax. Then there is fundamental alignment between the existing ACDC spec and future use of ACDC attribute sections mapped to their OCA overlay capture base equivalents. This can happen after the fact, by any verifier of an ACDC, once disclosure has happened. It doesn't capture the relationships between ACDCs as nodes that the edges provide, but it does enable overlays of the attributes within a given node.
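To make the mapping concrete (a sketch only, drawn from the two examples above): a JSON Schema property such as

{
    "dateOfBirth": {
        "type": "string",
        "format": "date"
    }
}

maps one-to-one to the OCA capture base entry

{
    "dateOfBirth": "Date"
}

so an adapter need only translate (type, format) pairs into OCA attribute types, and vice versa.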

This enables all the architectural goodness of OCA for downstream processing of that node. This is clean compatibility in a layered sense.

But OCA does not provide compact transmission via partial disclosure, nor protection via chain-link confidentiality with partial disclosure support, nor selective disclosure support, nor support for the distributed property graphs needed for chaining of credentials. OCA's principal value is as a semantic overlay after the fact of verifiably authentic full disclosure of the attribute section, not as a mechanism for such disclosure. So its fundamental purpose is clearly not the same as ACDC's; it's only a slice. It's just a misapplication. It fits best at a layer above ACDC, not integral to ACDC. It feels like shoehorning something that solves one problem well (downstream, data-lake-enabling semantic overlays of a given set of attributes) onto something it doesn't solve at all well (granular, chained (treed) proofs of authorship and/or authority, in a correlation-protecting manner, over a given set of attributes). The latter is a precursor to the former.

To be very specific: the capture base of an OCA is itself an overlay on the uncompacted variant of an ACDC attribute section. The OCA capture base, as an overlay, needs a mapping function that maps its syntax to the syntax of the decomposed JSON Schema that specifies the uncompacted variant of the attribute section.

This also works for selectively disclosable attribute sections, because each selective disclosure is itself a decomposed variant. So for each decomposed variant there is a one-to-one mapping to an OCA capture base for that variant.

This resolves all the issues of interoperability. Downstream consumers of ACDC attributes can use an OCA capture base overlay to enable other OCA overlays.

So all the goodness of OCA architecture works as an overlay. Starting with using the OCA capture base as an overlay itself of an ACDC attribute section.

More generally, any ACDC could include as an attribute the SAID of an external OCA capture base. The ACDC itself is just making an authenticatable, authorizable commitment to that capture base via its SAID. The actual exploded capture base can then be attached to the ACDC or cached. The ACDC provides proof of authorship and/or proof of authority over the referenced (via SAID) capture base detail, as sketched below.
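As a sketch (using the compact "a"/"d" field labels in the style of the ACDC spec; the captureBase label itself is made up here for illustration), the attribute section would need nothing more than:

{
    "a": {
        "d": "<SAID-of-this-attribute-block>",
        "captureBase": "<SAID-of-the-external-OCA-capture-base>"
    }
}

The verifier authenticates the commitment; any downstream consumer dereferences the SAID to obtain the capture base and whatever overlays hang off of it.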

Thus there is no incompatibility here. This provides clean layering. Clean separation of concerns. It allows ACDCs to do what they do best and allows an OCA to do what it does best. This is how the architecture should work.

In this more general approach, the only thing that the JSON Schema of the ACDC cares about is that there is a field whose value is the SAID of an OCA capture base. The ACDC itself is a support for the OCA, literally a container that provides an authenticated, disclosed commitment to an OCA capture base in one field. The container itself is opaque to the OCA capture base and any downstream processing. Likewise, the OCA capture base is opaque to the container, as it should be. That is what containers are meant to be: their payloads are opaque. So there is no schema compatibility conflict. There is no need to schematize the structure of the ACDC attribute section within the OCA other than to map the one field that includes the SAID of the OCA. This is trivial to do. There is also no need to EVER schematize the OCA capture base inside the ACDC attribute section. The OCA structure is completely opaque to the ACDC.

This thin layering fits the hourglass model. I have suggested this layering multiple times in these conversations, and I have yet to see a specific response that indicates any reason why this does not work.

Pointing out features of OCA that JSON Schema does not support for OCA overlays does not explain why using ACDC as a layer below OCA is a problem. The CONTAINER in Authentic Chained Data Containers is there for a reason. Your proposal to de-containerize an ACDC is going backwards. It's de-layering. It feels like going in circles to me.

SmithSamuelM commented 2 years ago

I am going to take one more pass at this in hopes of flipping the switch in your mind.

In the 7-layer OSI stack, authentication usually happens at the presentation layer. The application payload is not processed until the application layer and is completely opaque to the presentation header, which provides all the information needed for authentication. ACDCs provide a function analogous to the presentation layer; the application layer is opaque to them. OCA works at the application layer, not the presentation layer. The mistake is to confuse the separation of these two layers. I am not surprised at the confusion, because the W3C VC data model is confused about the separation between those two layers, and that has confused the community. The reason I started ACDC was largely to restore the separation between the authentication (presentation) layer, which is the ACDC, and the application layers that consume the payload of an ACDC.

An ACDC does not need to have a payload. It could serve the analogous role of an access token: pure authentication, no payload, no application layer. It is just the presentation header.

Or the ACDC could have a payload which means it is a presentation header wrapper to a payload. The payload layer itself may consist of multiple layers. This is the case with OCA, but the application layer(s) MUST not be confused with the presentation layer or any other layers below the application layer.

When the ACDC is acting as a proof-of-authorship, the proof is about the authorship of some data. The data payload could be nothing more than a SAID of the payload. This is analogous to a token that includes a hash of some other data that the token is authorizing.
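In the same illustrative style as the sketch above (the payload label is invented here), such an attribute section reduces to a bare digest commitment:

{
    "a": {
        "d": "<SAID-of-this-attribute-block>",
        "payload": "<SAID-of-the-authored-data>"
    }
}

The authored data itself travels or is cached out-of-band; the ACDC only proves who committed to it.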

When the ACDC is acting as a proof-of-authority, then there may not be a payload either; the attributes in the ACDC itself are just syntactic sugar that characterizes the type of authority being proven. Or the attributes could also include a reference to a payload, so that the ACDC provides both proof-of-authority and/or proof-of-authorship for its payload.

This last use case is where the confusion sets in. A proof-of-authorship just needs a signature (and the information needed to look up the key state to verify the signature, i.e. KERI), whereas a proof-of-authority requires some extra information besides merely a signature (and the information needed to look up the key state to verify the signature, i.e. KERI): it needs information to characterize the "authority" or "authorization". In addition to the authority-characterizing information, the ACDC may or may not have OTHER information that is purely payload.

This authority-characterizing information is often confused with the OTHER information in the payload. It's not. It's part of the authentication (presentation) layer.

The problem with paper credentials is that they combine multiple types of information. For example, a driver's license mixes proof-of-authority, i.e. an authenticatable authorization to drive, with the forensic information needed to enforce that proof-of-authority. Downstream processing is almost always on the forensic enforcement information, not on the authorization to drive. The former is a payload; the latter is an authorization. The VC data model was influenced too strongly by the paper credential data model and ends up making the same mistake of mixing proofs of authorship/authority with the payloads to which such proofs apply. (Application of a proof == application layer; the proof itself == presentation (authentication) layer.)

The word credential in English means proof of entitlement, permission, right, etc., or, as I have used it here, proof-of-authority to do something. The "to do something" is the characterization of that proof. But historically such proofs come with other data that is not part of the proof itself. Take the supply chain example: a proof-of-authority is used to authorize a trading partner to sell a given product. The proof-of-authority is not the product. The product details, its constituent parts, may be attached to the proof-of-authority as a manifest, i.e. the data payload is the manifest. So the OCA applies to the manifest, not the proof-of-authority. If I want to selectively disclose the manifest, then the selective disclosure mechanism sits outside the manifest; it is not part of the manifest itself! Is this not abundantly clear?

Think about product import documentation. There is a proof-of-tariff that authorizes the import, and there is a manifest of what's inside the box being imported. Don't be confused by the fact that they are both pieces of paper. The two pieces of paper act at different layers of processing and have different purposes and different properties for processing. The fact that we have tools to model them both as documents is the problem. We succumb to the false idea that they are the same just because they are both data and can be processed with data processing tools such as schema validators. This is a dangerous confusion, and I believe it to be the fundamental confusion behind this conversation. I am going to great lengths to disabuse you of the idea that merely because a proof-of-authority may be expressed as an item of data, and may be processed with data processing tools, it is the same as the payload attached to such a proof.

When I use HTTP to send a JSON document in the body of an HTTP message, the HTTP headers, the TCP headers, the IP headers, the MAC layer headers, the Ethernet headers are all included in the data that is sent along with the JSON body. But it would be entirely foolish, from a data modeling and functional layering perspective, to assume that we should schematize all those headers as part of one universal schema simply because they are wrappers around the JSON body. Just because we can does not mean we should. Universality comes at a cost. Each layer is in a different trade space. The separation between layers allows granular optimization of each layer's trade space.

Likewise, an ACDC is a header to its payload, or it may have no payload at all, in which case it is just a header. The ACDC's trade space is different from its payload's trade space. Clearly, therefore, the ACDC MUST be processed in its own layer, not its payload's layer, and should optimize its tooling for its own layer, not its payload's layer.

OCA is designed to shape the payload for different applications of the payload in each of the OCA overlays. Think: different overlay == different shape. It is NOT designed to operate on the layer-specific header data that is stripped off in all the lower layers that support the payload. Don't confuse the tool (OCA syntax) with the application of the tool.

OCA is a layered data-shaping tool, not a protocol layering tool. Protocol layering and data-shape layering are two different types of layering. The use of the term "layer" in both does not make them even remotely close to the same thing, and the tooling for each has significantly different properties.

I hope this flips the switch.