
ACDC Semantic - Why not to use JSON Schema #6

Closed mitfik closed 1 year ago

mitfik commented 2 years ago

Based on the discussion during the recent ACDC call, I want to raise what I consider an important topic related to the semantic part of the proposed specification.

As we all agreed, ACDCs need to be end-verifiable and security comes first. This is why any reference needs to be made through a SAID (or a similar content-addressable identifier) to make sure that the cryptographic commitment remains valid.

The current specification relies solely on JSON Schema, and I would suggest rethinking that, since in my opinion it could harm ACDC development and adoption.

Some of the arguments in favor of JSON Schema which I was able to capture are:

- a vast ecosystem of tools and libraries
- an adoption vector, since JSON Schema is used in many areas
- human-readable
- simple to use
- ... others?

But there are some significant issues related to it:

- First of all, to make it clear, JSON Schema is "just an IETF draft" (Expires: June 11, 2021). That is not much of an issue for me, I just want to point it out in case someone would like to use "adopted specification" as an argument.
- Lack of support for common types. JSON Schema supports just basic data types, which makes it hard to model complex data, e.g. medical data. It does support format annotation, but this is just annotation, which means it does not have to be implemented and the implementation can vary; we can treat it as a custom feature in most cases. This means that if we want to have a date, we cannot choose between dd/mm/yyyy and mm/dd/yyyy. Even if we enforce format in the validation step, which is optional, we are limited to what the given tool implements, without a way to pass that information all the way down with the ACDC.
- $schema keyword. Although not mandatory, it is recommended; the attribute allows defining which version of the schema we are referring to. Unfortunately, it is a URL. Sounds familiar? JSON-LD? We can't allow that keyword, since we can't assure the immutability of objects pointed to by location. So we can either skip it completely or enforce using a SAID instead, but then most validators would fail on it and we would need a custom one. Why it is important to address this: people will use it, since it is out of the box. Here is an example from vLEI: { "$id": "ExBYRwKdVGTWFq1M3IrewjKRhKusW9p9fdsdD0aSTWQI", "$schema": "http://json-schema.org/draft-07/schema#", "title": "GLEIF vLEI Credential", ... }. In some cases that does no harm, but changing the meaning of the semantics can change the meaning of the data (e.g. a change from boolean to string, which could validate as true even if the value were false). If it is allowed to do it wrong, people will do so.
- $id keyword. Almost the same problem as above: according to the spec, the $id needs to be a URI. A SAID is not a URI by definition, so we would need to create one to stay compliant. As you can guess, the vLEI schema above won't validate against a standard JSON Schema validator. According to the spec: "The value of $id is a URI-reference without a fragment that resolves against the Retrieval URI." Which again leads us to "JSON-LD hell" if we are not based on content-addressable identifiers. Why does it need to be a resolvable URI? Because a $id embedded in a subschema can use its base URI to resolve against.
- $ref keyword. Optional, but can be used as a reference to another schema. This allows us to have some sort of composability, but if we don't use content-addressable identifiers we lose security. The URI-references in $ref resolve against the schema's base URI.
- JSON Schema is used in OAS. The problem is that even OAS is not using JSON Schema as-is; they use an extended subset (not even a subset, but extended) of the JSON Schema Specification Wright Draft 00, modified quite a bit to fit their needs. This means that even if you want to leverage the JSON Schema ecosystem, you exclude OAS (which is probably way bigger), and if you want to stay compliant with OAS you need to ditch the standard spec. I would even argue that OAS would be better than JSON Schema, but it won't solve the problem either.

JSON Schema was designed with the web in mind, which is why it is heavily influenced by design patterns for a location-based network.

With ACDC we are moving towards a content-based network, and this is why we should be careful picking which related specifications to use, since they can break a lot of the security aspects of our architecture.

In addition, ACDC is agnostic to the serialization format; we can have it in JSON, XML, etc. What if I would like to have an XML serialization, do I still need to use JSON Schema to parse my XML ACDC?

To summarize, ACDC should not reference JSON Schema as its main semantic description; it does not have the properties that we need:

- support for complex data types, to be able to describe simple as well as very complex data structures
- immutable objects across the whole chain: no URIs or URLs, it needs to be mandatory to use content-addressable identifiers
- capturing the meaning of data (a rich, layered architecture to be able to achieve that)
- reusability of schema to increase interoperability of the data

As you know, I am one of the co-authors of OCA (Overlays Capture Architecture), where over the last 4 years we have been building an alternative to existing semantic solutions. We did quite some research in that space and tried to aggregate all the important characteristics of the semantics needed to capture the meaning of data. A few important takeaways from that effort are:

- you can NOT convince everyone to name things the same way as you do
- you have to go with a layered architecture to decouple responsibility and increase interoperability of the objects

I am not saying that ACDC should force the use of OCA (although that would be nice), but it should define the specification in a way that focuses on the characteristics of such a solution and not on a specific implementation. I know that this could make the implementation more complex, but through adoption we would see which one is best and people would gravitate towards it anyway.

We could apply the same principle as we have with SAID or SCID, where the first byte lets you know not only which hashing algorithm is used but also which type of semantics is behind it.
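A minimal sketch of that idea, with hypothetical one-character codes (illustrative values only, not real SAID/CESR derivation codes):

```python
# Hypothetical one-character prefix codes: each code identifies both the digest
# algorithm and the kind of semantic object the digest commits to.
PREFIX_CODES = {
    "E": ("blake2b-256", "json-schema"),
    "F": ("blake2b-256", "oca-bundle"),
    "G": ("sha3-256", "json-schema"),
}

def identify(identifier: str) -> tuple[str, str]:
    """Return the (digest algorithm, semantic type) encoded by the first byte."""
    return PREFIX_CODES[identifier[0]]

# A verifier can dispatch to the right hash function and the right semantic tooling
# from the first character alone, before fetching or parsing anything.
print(identify("Fabc123placeholder"))  # ('blake2b-256', 'oca-bundle')
```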

If you would like to learn more about these characteristics, I recommend taking a look at OCA: https://the-human-colossus-foundation.github.io/oca-spec/

Happy to hear your opinions on the above points.

SmithSamuelM commented 2 years ago

Thanks for the comments. I will go through them in detail tomorrow.

My initial reaction is that JSON Schema is sufficiently expressive that we can create overlays that support OCA while leveraging the extensive, already universally adopted tooling for JSON Schema. We just have to define a static profile of JSON Schema to ensure security. Most users of JSON Schema use static schema, so it does not pose a problem to leverage the tooling this way. One of the biggest adoption barriers is tooling, and given that apparently OCA will be an open spec but the HCF implementation will be closed (non-FLOSS), that is a strong reason not to use OCA unless it is an overlay on top of open JSON Schema tooling. I strongly object to the development of standards where the primary (reference) implementation is not at least as open as the standard spec itself.
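To make the "static profile" idea concrete, here is a minimal sketch (not the normative ACDC profile): the schema is a fixed, content-addressed document whose $id is a digest of the schema itself, and validation uses off-the-shelf JSON Schema tooling. SHA-256 stands in for a real CESR-encoded SAID, the jsonschema Python package is assumed to be installed, and the field names and LEI value are illustrative.

```python
import hashlib
import json

import jsonschema


def digest_id(schema: dict) -> str:
    """Compute a stand-in content identifier over the schema with $id blanked out."""
    tmp = dict(schema)
    tmp["$id"] = ""
    raw = json.dumps(tmp, sort_keys=True, separators=(",", ":")).encode()
    return "sha256:" + hashlib.sha256(raw).hexdigest()


schema = {
    "$id": "",  # filled in below with the schema's own content digest
    "type": "object",
    "properties": {"lei": {"type": "string"}, "legalName": {"type": "string"}},
    "required": ["lei"],
    "additionalProperties": False,
}
schema["$id"] = digest_id(schema)

# A verifier recomputes the digest to confirm this is the schema that was committed to,
# then validates the payload with standard JSON Schema tooling.
assert schema["$id"] == digest_id(schema)
jsonschema.validate(instance={"lei": "5493001KJTIIGC8Y1R12"}, schema=schema)
```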


pknowl commented 2 years ago

The key benefit of OCA is that different actors from different institutions, departments, etc., can control specific task-oriented objects within the same OCA bundle. In other words, different actors may have dynamic control over assigned overlays rather than the entire semantic structure. Object interoperability is essential in a Dynamic Data Economy where multiple actors from various institutions participate in complex use cases, supply chains, and data flows, supported by multi-stakeholder data governance administrations and frameworks.

SmithSamuelM commented 2 years ago

@pknowl ACDC is an over-the-wire protocol for exchanging authentic data. Once that data has been exchanged, any number of downstream semantic overlays may be imposed on the over-the-wire data. So semantic overlays are good for downstream purposes. There is no incompatibility here. A one-to-one overlay enables over-the-wire compactness and security, and semantic interoperability in post-processing. The key is to use an OCA overlay over the JSON Schema that is used over the wire, and not to replace JSON Schema for over-the-wire use. JSON Schema is the over-the-wire "capture base", and then one defines a one-to-one OCA overlay that becomes the OCA capture base. This is how other protocols solve the problem, especially in resource-constrained environments. The over-the-wire syntax is constrained. The problem is trying to force downstream semantics upstream onto the over-the-wire constrained syntax. We want the universal tooling already available for JSON Schema. And JSON Schema is sufficiently rich to provide the over-the-wire syntactical field type structure. A given ACDC JSON Schema can then be mapped one-to-one to a given OCA overlay. This provides clean separation of concerns.
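A rough illustration of that separation (hypothetical structures; neither actual ACDC wire syntax nor actual OCA syntax): the JSON Schema is the over-the-wire syntax, and separate overlay objects add human-oriented semantics per attribute without touching the schema.

```python
# Over-the-wire syntax: a plain, static JSON Schema for an ACDC attribute section
# ("said:..." is a placeholder for the schema's content digest).
wire_schema = {
    "$id": "said:placeholder-schema-digest",
    "type": "object",
    "properties": {
        "dob": {"type": "string"},
        "licenseClass": {"type": "string"},
    },
    "required": ["dob", "licenseClass"],
}

# Downstream, after exchange and verification: hypothetical OCA-style overlays keyed
# one-to-one to the schema's digest, adding semantics without changing the wire schema.
label_overlay_en = {
    "capture_base": wire_schema["$id"],
    "language": "en",
    "attribute_labels": {"dob": "Date of birth", "licenseClass": "License class"},
}
format_overlay = {
    "capture_base": wire_schema["$id"],
    "attribute_formats": {"dob": "DD/MM/YYYY"},
}
```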

SmithSamuelM commented 2 years ago

@pknowl JSON Schema supports regular expressions https://json-schema.org/understanding-json-schema/reference/regular_expressions.html This means that, practically speaking, any arbitrarily complex field type may be expressed as JSON Schema. Therefore there is no reason an equivalent for any OCA capture base cannot be expressed as a JSON Schema. Therefore one can build one-to-one mapping overlays between JSON Schema and OCA. The advantage of using JSON Schema is that the tooling is already supported in practically every language on every OS and every environment. The adoption battle for JSON Schema has already been fought and JSON Schema won. ;) Let's not fight that battle again by trying to replace JSON Schema with something that should be an overlay.
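For example (a sketch using the Python jsonschema package; the field names and patterns are illustrative, not normative ACDC types):

```python
import jsonschema

# "Complex" field types expressed with JSON Schema keywords: a regular expression
# constraining an ISO 8601 calendar date, plus an enumerated unit field.
schema = {
    "type": "object",
    "properties": {
        "examDate": {"type": "string", "pattern": "^[0-9]{4}-[0-9]{2}-[0-9]{2}$"},
        "bloodPressureUnit": {"type": "string", "enum": ["mmHg", "kPa"]},
    },
}

jsonschema.validate({"examDate": "2022-03-24", "bloodPressureUnit": "mmHg"}, schema)    # passes
print(jsonschema.Draft7Validator(schema).is_valid({"examDate": "24/03/2022"}))          # False
```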

SmithSamuelM commented 2 years ago

@mitfik With regards to JSON Schema metadata fields. Note: schema metadata fields are not schema fields. The security implications of nonlocal URIs in metadata are significantly different from those of dynamically generated schema caused by nonlocal URI references in the schema itself. There are still some security implications, but they are not equivalent, and it is an invalid equivocation to compare JSON-LD @context, which directly and dynamically generates the schema, with JSON Schema metadata fields.

The $id field in ACDC JSON Schema is set to the SAID of that schema. It is not a URI reference. This has been true for some time.

The version field in JSON Schema is not normative, it is informative. It indicates that a given version is expected but is not used to dynamically switch versions, at least not in the tooling we use. If that were so then it would be a problem. But AFAIK the URI in the version field is treated as a string and is not dereferenced. Nothing is looked up, so it is not dynamic. It's just a fixed version string expressed as a URI. The version of the tooling called by the validator is normative.

Let me elaborate: JSON Schema metadata are not schema and should not pose the same security risk that JSON-LD @context does. @context is not schema metadata; it is used to dynamically derive the actual schema itself. If there is any possibility of JSON Schema metadata posing a security risk, we will constrain how it is used in the ACDC profile. Non-relative URLs in metadata do not contribute to schema. We already forbid non-relative URIs in schema, except for namespaced references that include the SAID. So they are locked down.

It's also a stretch to state that leaky semantics of metadata result in leaks in the actual semantics of the static schema. The key security issue for semantic drift of static schema is not the metadata but the fidelity or "correctness" of the tooling with respect to a given version. But like any tooling, there is always the problem of ensuring that the tooling is a correct implementation. I suspect this applies to OCA implementations as well. Merely locking down versioning expressed in schema metadata does not solve this tooling correctness problem, so it's a red herring to point to non-relative URLs in metadata and infer semantic drift in the actual schema. The semantic drift in the tooling may occur independently of the metadata.

To restate, this is unlike schema.org @context, which is not metadata at all but is used to derive the schema itself. Its insecurity properties are therefore fundamental. It's not leaky semantics of schema metadata that is the problem with JSON-LD; it's leaky schematic syntax and leaky schematic semantics that are problematic with JSON-LD.

Similarly, overly expressive schema start to suffer from the inability to verify the "correctness" of the tooling that supports them. This is the fundamental problem with overly expressive smart contract languages. It becomes pathologically difficult to guarantee functionally correct code even with formal verification methods, DO-178C, etc. Too much expressiveness makes formal verification more difficult, not less. So we will want a much more limited profile of JSON Schema for ACDCs to better ensure we can assert "correctness" of an implementation. The one complexity we allow is composed schema, which allows compact and selective-disclosure versions. Each alternate version is still simple. The composition operator oneOf adds complexity in a very specific way that we can manage.
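A minimal sketch of that composition pattern (illustrative values, not the normative ACDC schema profile): oneOf lets a single schema accept either a compact variant, where the attribute section is just a SAID string, or the fully expanded variant.

```python
import jsonschema

# Composed schema: the attribute section "a" is either a SAID string (compact /
# selective-disclosure variant) or the fully expanded attribute block.
schema = {
    "type": "object",
    "properties": {
        "a": {
            "oneOf": [
                {"type": "string"},  # compact: just the SAID of the attribute section
                {                    # expanded: the actual attributes
                    "type": "object",
                    "properties": {"lei": {"type": "string"}},
                    "required": ["lei"],
                },
            ]
        }
    },
    "required": ["a"],
}

jsonschema.validate({"a": "EH3dCdoFOLBe71iheqcywJcnjtJtQIYPvAu6DZ_example"}, schema)  # compact form
jsonschema.validate({"a": {"lei": "5493001KJTIIGC8Y1R12"}}, schema)                   # expanded form
```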

SmithSamuelM commented 2 years ago

Let me suggest both a Forest and Trees analogy and a Goldilocks analogy. From an abstract data modeling perspective, an ACDC is best modeled as a graph fragment of a directed, edge-labeled property graph (LPG). The edges allow chaining and treeing of ACDCs into graph-based semantic structures. The properties in the ACDC attribute section provide the node properties of the fragment. The edges and edge properties of the ACDC edge section provide the edges of the graph fragment. Complexity of data expression in an LPG should be at the graph level, not the node level. You want simple nodes and simple edges composed into LPGs. So where does Goldilocks fit in? If you have nodes and edges that are TOO simple, like JSON-LD/RDF graphs, then reasoning and modeling at the graph level becomes too hard. Even simple concepts become cumbersome because nodes are singular properties and edges are not differentiated by properties. So in Goldilocks terms, the JSON-LD/RDF chair is too small, or TOO simple. The other end of the spectrum is to abuse LPGs by making the nodes complex nested documents with arbitrarily complex nested field maps for the properties of the node. Indeed, in this case each node is its own subgraph, but this subgraph is not part of the semantics of the LPG to which the ACDC graph fragment belongs. That chair is too big, TOO complex. The just-right chair is simple nodes with enough properties to represent a rich semantic node, and enough properties on the edges to represent rich connections between rich nodes, but not so complex that reasoning about them as fragments of an LPG is problematic.

We are not leveraging the LPG model if we are making individual ACDCs complex documents in their own right. They need to be fragments of a graph, and the graph semantics are what we care about. So JSON Schema is more than adequate for JUST RIGHT nodes and edges.

So the right place, in my opinion, for a complex semantic overlay is not at the node level but at the graph level. A semantic overlay that provides semantics over LPGs becomes really useful. Anything lower level than that is a mismatch with the abstract data model.

Now for the Forest and Trees analogy. An LPG graph model is the forest. An individual ACDC is a tree. Each tree has its own sourcing and its own root of trust, and is easy to verify, secure, and provenance. The forest is where complexity is expressed, not in each tree. If we have complex trees then we go from a forest of trees to an aspen grove where the roots of trust are all mixed together. This is difficult to securely verify. This is the node-as-complex-document model.

Another analogy: the soup vs. à la carte model for presentations of credentials. The soup model of credentials means that properties from multiple sources and multiple roots of trust are mixed together and served up as a soup or stew of properties. In contrast, the à la carte model means that properties from multiple credentials are chained together in a graph of individual courses that can be individually secured and provenanced, and served up as discrete courses in a meal, not mixed together in a soup/stew that is a single-course meal.

So in the à la carte model, semantic overlays are expressed over graphs, not over documents. If an ACDC node is so complex that I need extremely complex schema overlays to understand it, then I am truly missing the forest for the trees, and sitting in an uncomfortable chair slurping stew instead of sitting in a comfortable chair enjoying a multi-course meal.

SmithSamuelM commented 2 years ago

The GLEIF vLEI set of ACDCs follows this graph model of à la carte, chained, provenanced credentials, instead of a soup model or a complex node-as-document model.

GLEIF -> QVI -> LEI -> ECR -> verifier

Each credential in that chain is simple. The graph of that chain is rich in meaning and can be recomposed in numerous ways without creating new credentials that each require a new complex schema for each variant. The important semantics are expressed at the level of a graph of ACDCs, not at the level of a single ACDC.

A graph-based semantic overlay is expressed in abstracted graph composition elements which are decomposable into already verified, cached, provenanced fragments. So the overlay is expressed at a higher level of abstraction. The details are hidden below the graph abstraction layer. Decisions are made at this higher level of abstraction, thus reducing apparent complexity for the decision maker and simplifying semantic overlay constraints on the lower layers.

((GLEIF -> QVI -> LEI) + (GLEIF -> QVI -> LEI -> OOR)) -> verifier

((GLEIF -> QVI -> LEI) + (GLEIF -> QVI -> LEI -> ECR)) -> verifier

Composition of graph fragments provides reusability at scale. Composition of already verified graph fragments provides verifiability at scale.

We can leverage the burgeoning off-the-shelf market support for LPG Decision Making and semantic composition. This includes LPG databases with Graph Query languages that naturally express semantic overlays at the graph level. Not to mention support for machine learning using weighted directed edge LPG.

We can do complex graph composition to generate overlays using graph languages that already exist and have broad commercial support. We don't have to reinvent them. We just need an over-the-wire protocol that secures them. That is where they are lacking.

ACDCs provide securely attributed fragments of distributed LPGs.

All we need is to map those securely attributed (authentic) fragments to graph databases and we can leverage the full suite of semantic overlay goodness already available in the marketplace.
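A rough sketch of that mapping, using the networkx package as a stand-in for an LPG database (the fragment structure and field names are illustrative, not the ACDC wire format): each verified ACDC becomes a node keyed by its SAID, and each entry in its edge section becomes a labeled, directed edge.

```python
import networkx as nx

# Already-verified ACDC fragments keyed by (placeholder) SAIDs.
fragments = {
    "SAID_GLEIF": {"attrs": {"role": "root of trust"}, "edges": {}},
    "SAID_QVI":   {"attrs": {"lei": "QVI-LEI-EXAMPLE"}, "edges": {"issuer": "SAID_GLEIF"}},
    "SAID_LEI":   {"attrs": {"lei": "LE-LEI-EXAMPLE"}, "edges": {"qvi": "SAID_QVI"}},
    "SAID_ECR":   {"attrs": {"role": "engagement context role"}, "edges": {"le": "SAID_LEI"}},
}

graph = nx.DiGraph()
for said, acdc in fragments.items():
    graph.add_node(said, **acdc["attrs"])            # node properties from the attribute section
    for label, far_said in acdc["edges"].items():
        graph.add_edge(said, far_said, label=label)  # labeled edge from the edge section

# Graph-level reasoning over the composed fragments, e.g. provenance back to the root.
print(nx.shortest_path(graph, "SAID_ECR", "SAID_GLEIF"))
```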

SmithSamuelM commented 2 years ago

I suggest that if the ACDC working group is serious about semantic overlays, it should figure out why the emerging GQL standard, or any of the dozen other already available graph languages, would be inadequate as a semantic overlay to ACDCs using JSON Schema. Until then, let's stay focused on securing LPG fragments (AKA ACDCs), which will provide a zero-trust architecture to feed into these graph-language overlays. https://en.wikipedia.org/wiki/Graph_Query_Language. And if someone wants to feed them into an OCA overlay, so be it. But not at the cost of precluding or delaying or replacing the already broad support for both JSON Schema tooling and LPG tooling. Tooling is the most expensive barrier to adoption. KERI is already a big lift by itself. If it were not for the fact that there is no portable (not ledger-locked), securely attributable, decentralized identifier protocol alternative to KERI, it would be hard to justify KERI.

SmithSamuelM commented 2 years ago

@mitfik

$ref keyword. Optional, but can be used as a reference to another schema. This allows us to have some sort of composability, but if we don't use content-addressable identifiers we lose security. The URI-references in $ref resolve against the schema's base URI.

Non-relative URI references are forbidden in ACDC schema. One may use namespaced SAIDs such as a DIDURL that includes a SAID. But these are verifiable against the enclosed SAID so are not dynamic resources but are distributed static references.

JSON Schema supports just basic data types, which makes it hard to model complex data, e.g. medical data.

See above: JSON Schema supports regular expressions, which can express any practically useful complex data type.

$schema keyword. Although not mandatory, it is recommended; the attribute allows defining which version of the schema we are referring to. Unfortunately, it is a URL.

See above: URI references in schema metadata are not equivalent to JSON-LD @context URI references to dynamic schema. As mentioned above, $schema is not normative and is not dereferenced in tooling (at least the tooling we are familiar with) but is used as a static string.

JSON Schema is used in OAS. The problem is that even OAS is not using JSON Schema as-is; they use an extended subset (not even a subset, but extended) of the JSON Schema Specification Wright Draft 00, modified quite a bit to fit their needs.

The most important thing we are leveraging from the JSON Schema ecosystem is its tooling. Tooling provides the syntax of schema expression. We do not need to leverage schema libraries. So when we say we use JSON Schema 2020-12 we mean the syntactical elements that the tooling supports. These syntactical elements enable us to express static JSON schema. Those static schema are universally parseable and validatable by any/all implementations of that tooling version. So we get instant universal adaptability via the tooling. This is leverage that counts the most.

Indeed leveraging schema libraries may be entirely problematic because schema libraries often are contextualized. Each ACDC ecosystem will most likely need its own ecosystem contextualization that requires its own library of schema. These schema must be static and statically referenced with SAIDs to ensure they are secure. As I have explained in other venues, the idea of universal semantics that span contexts is entirely problematic due to polysemy and uncertainty limitations on semantic chaining inference distance. We want narrow contexts with narrowly defined schema libraries.

In addition, ACDC is agnostic to the serialization format; we can have it in JSON, XML, etc. What if I would like to have an XML serialization, do I still need to use JSON Schema to parse my XML ACDC?

ACDCs only support the following four serializations: JSON, CBOR, MessagePack, and CESR. And for CBOR, only the subset that is expressible in JSON is supported. ACDC does not support XML and likely never will because of XML's dynamic type dictionary. Once you remove the dynamic type features of XML, there is little reason to use it over JSON. The JSON-vs-XML adoption battle has already been fought and JSON won. So other than legacy vestigial support, XML is a dead-end technology as an adoption vector.

Thus JSON Schema is sufficient to support schema for all four formats, because all four formats are mappable to dictionaries, i.e. field maps of (label, value) pairs, or nested field maps (or arrays of such), all of which are supportable by JSON Schema (see above on the problems of correctness of TOO-rich schema).
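A small sketch of that point (assuming the cbor2, msgpack, and jsonschema Python packages; the field labels are illustrative): the same field map round-tripped through three serializations decodes back to one dictionary shape, which a single JSON Schema validates.

```python
import json

import cbor2      # third-party: cbor2
import jsonschema
import msgpack    # third-party: msgpack

schema = {
    "type": "object",
    "properties": {"d": {"type": "string"}, "a": {"type": "object"}},
    "required": ["d", "a"],
}

record = {"d": "SAID-placeholder", "a": {"lei": "5493001KJTIIGC8Y1R12"}}

# The same field map round-tripped through three serializations...
decoded_forms = [
    json.loads(json.dumps(record)),
    cbor2.loads(cbor2.dumps(record)),
    msgpack.unpackb(msgpack.packb(record)),
]

# ...comes back as the same dictionary shape, so one JSON Schema covers all of them.
for decoded in decoded_forms:
    jsonschema.validate(decoded, schema)
```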

Because schema do not include any cryptographic primitives, schema do not benefit from the compactness and composability properties of CESR, which is why KERI and ACDCs use CESR for primitives. Verbosity of the schema overlay mapping is not a problem, so there is no benefit to embedding support for multiple schema types. We just pick one and standardize on it. JSON Schema tooling adoption and support makes it the winner. We only need to use a subset that fits the ACDC profile requirements, so any extra fluff in JSON Schema we can safely ignore. But the most important thing is that we don't have to write new tooling, or support tooling for multiple schema formats in ACDC. If someone wants to support OCA, then they just provide tooling that maps JSON Schema to OCA, rather than requiring everyone to support multiple schema types that include OCA.

I get the fact that you want an adoption vector for OCA. But it doesn't solve an essential problem for ACDCs that JSON schema does not already solve.

pknowl commented 2 years ago

Thanks for your answers, @SmithSamuelM, but as OCA works from the first principles of computing, we won't compromise on either the concept or the implementation unless it is fundamentally evident that a pre-existing solution performs all of our semantic requirements with absolute integrity. At Human Colossus, we are in the throes of getting OCA version 1.0 out the door, so it is a distraction for us to spend too much energy on the "JSON Schema versus OCA" argument until we get through the release. As we start to delve into complex use cases, I'm sure community members will begin to understand why we have constructed the architecture as we have.

I appreciate the need to get things out the door from the KERI/ACDC side. We have the same pressures for OCA. We treat objectual integrity as our Layer 1 at HCF. In other words, semantics comes before authentication in our stack. Over the past four years, OCA has become a foundational cornerstone at HCF to demonstrate structural and contextual harmonisation and integrity. Our main priority is to get v1.0 out the door so that people can start to have a play. At that stage, we'll be happy to revisit the "JSON Schema versus OCA" argument with anyone who wants to roll their sleeves up.

OCA has to integrate with any existing data model or data representation format as a natural course. In that regard, it'll be able to on-ramp JSON Schema in the semantic harmonisation process. As you can appreciate, deep-stack semantics experts are (and have always been) our sounding board for OCA, and ToIP is not the right community in that regard. We've already picked at the seams in the FAIR data communities and look forward to bringing OCA back to ToIP already vetted and ready for complex use case implementations. Good luck with all of the KERI/ACDC stuff. Leave OCA to the HCF community. We'll touch base when the time is right.

mitfik commented 2 years ago

Most users of JSON Schema use static schema, so it does not pose a problem to leverage the tooling this way. One of the biggest adoption barriers is tooling, and given that apparently OCA will be an open spec but the HCF implementation will be closed (non-FLOSS), that is a strong reason not to use OCA unless it is an overlay on top of open JSON Schema tooling. I strongly object to the development of standards where the primary (reference) implementation is not at least as open as the standard spec itself.

I think we can exclude that argument, since the OCA spec, same as its main reference implementation, is developed under EUPL-1.2, which is definitely a FLOSS license.

mitfik commented 2 years ago

@pknowl ACDC is an over-the-wire protocol for exchanging authentic data. Once that data has been exchanged, any number of downstream semantic overlays may be imposed on the over-the-wire data. So semantic overlays are good for downstream purposes. There is no incompatibility here. A one-to-one overlay enables over-the-wire compactness and security, and semantic interoperability in post-processing. The key is to use an OCA overlay over the JSON Schema that is used over the wire, and not to replace JSON Schema for over-the-wire use. JSON Schema is the over-the-wire "capture base", and then one defines a one-to-one OCA overlay that becomes the OCA capture base.

I think what we can agree on here is that we have a few problems which need to be solved separately:

- data capture: defining the meaning of the data (the capture base)
- data processing: defining how the data is validated and transformed according to local rules (e.g. formats)
- data presentation: defining how the data is shown to people (languages, labels, branding)

Now, if we tried to combine all that information into one static schema, we would lose quite a lot of flexibility for the overall ecosystem. The above problems can be addressed in many ways, but we are entering a decentralized (dynamic) data economy where the intention is to let data flow. That means interoperability is one of the key characteristics of this ecosystem.

To increase the interoperability of the semantic parts, we need to create a system that encourages people to reuse schema as much as possible and gives them a chance to adapt the schema to their needs, without having to build complex data transformation pipelines or integrations between wallets or systems just because, for example, someone named an attribute differently.

In my opinion, one way to achieve that would be to define the system so that the 3 challenges above are decoupled and addressed separately on different layers.

Let's look at an example: the EU Commission, with other partners, defines a capture base that captures the meaning of the data in the context of a driving license. They publish the capture base as an immutable object (SAID + signature of the Governance Framework Authority (GFA)). Then EU states or partners can reuse it and apply rules on the data processing layer according to their local needs. Let's say Switzerland defines that the expiration date is captured in the format dd/mm/yyyy, where the US captures the same information in the format mm/dd/yyyy. Next, local jurisdictions in Switzerland and US states prepare presentation layers on which they define how it should look (e.g. German, French, English) and how it should be presented (logo, colors, etc.).

Now let's say that someone from Switzerland visits the USA (Arizona); a police officer stops him for a control and asks for his driving license credential. He scans it, validates it, and a few things can happen:

If we think about JSON Schema as the over-the-wire solution, we would need to take into consideration at least the two layers defined above, data capture and data processing; without that, the "other side" would not be able to know how to understand what s/he sees and how to process it. Unfortunately, it would be hard to cleanly decouple those layers in JSON Schema to facilitate interoperability of the objects, multiple issuers, or even different jurisdictions of the captured data.
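A sketch of how that decoupling might look as data (hypothetical, loosely OCA-inspired structures; not actual OCA syntax): one shared, immutable capture base, with jurisdiction-specific processing and presentation overlays layered on top.

```python
# Shared, immutable capture base published by the EU Commission (illustrative structure;
# the "said" value is a placeholder for the real content digest).
capture_base = {
    "said": "said-of-capture-base",
    "attributes": {"name": "Text", "dateOfBirth": "Date", "expirationDate": "Date"},
}

# Data-processing layer: each jurisdiction fixes its own formats without touching the base.
format_overlay_ch = {"capture_base": capture_base["said"],
                     "formats": {"expirationDate": "DD/MM/YYYY"}}
format_overlay_us = {"capture_base": capture_base["said"],
                     "formats": {"expirationDate": "MM/DD/YYYY"}}

# Data-presentation layer: local languages and branding, again layered on the same base.
label_overlay_fr = {"capture_base": capture_base["said"], "language": "fr",
                    "labels": {"expirationDate": "Date d'expiration"}}

# A verifier in Arizona can validate the data against the shared capture base, then apply
# the issuer's format overlay to interpret "expirationDate" correctly before displaying it.
```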

SmithSamuelM commented 2 years ago

@SmithSamuelM

I strongly object to the development of standards where the primary (reference) implementation is not at least as open as the standard spec itself.

@mitfik

I think we can exclude that argument, since the OCA spec, same as its main reference implementation, is developed under EUPL-1.2, which is definitely a FLOSS license.

It would be helpful to better define terms in order to resolve this argument. Historically the term "permissive" was used to refer to freely licensed open source (FLOSS), which usually meant that there were no downstream encumbrances on the use of the software. This typically meant that a user could embed, combine, or refactor the software into new software that was proprietary. So, for example, Apache2, BSD, and MIT were "permissive" and GPL or any copyleft license was not "permissive". More recently the OSI (Open Source Initiative) has nuanced their definitions of the terminology and now uses the terms "reciprocal" and "non-reciprocal" instead of non-permissive and permissive (respectively) to refer to licenses that have downstream encumbrances that prevent use in proprietary software.

See the following discussion Permissive and Copyleft Are Not Antonyms

Therefore we can better characterize the differences as: Apache2 is a non-reciprocal permissive open source license. EUPL is a reciprocal permissive open source license.

EUPL Reciprocity

The EUPL is a reciprocal (or copyleft) licence, meaning that distributed contributions and improvements (called "derivatives") will be provided back or shared with the licensor and all other users. At the same time (and unlike other copyleft licences like the GPL or AGPL), the EUPL is compatible with most other open reciprocal licences and is interoperable.

Therefore to clarify my argument using more up-to-date OSI terminology, I will restate it. The primary reference implementation of an open non-reciprocal specification should have no more restrictive reciprocity than the specification.

The reason is that implementors start with a reference implementation but if the reference implementation is reciprocal then any derivative version of an implementation that is proprietary can't use the reference implementation as a starting point. This poses an adoption barrier that is at the core of the debate between the types of open source software such as copyleft or "reciprocal" versus non-copyleft or non-reciprocal.

Therefore OCA under EUPL is incompatible with ACDC as a normative requirement because ACDC is fully non-reciprocal and the only implementation of OCA (HCF) is reciprocal.

That is what I meant in the first place, but I was clearly using an old definition of permissive or freely licensed open source.

SmithSamuelM commented 1 year ago

This discussion appears to have been resolved.

SmithSamuelM commented 1 year ago

If more discussion is needed, we can reopen.