Introduction

This issue discusses how data is internally organized within the shared tree. Data model is distinct from the runtime objects created by the API layer to expose the working tree to the application. In particular, API layer implementations may expose their own data models, which the API layer internally maps to the underlying SharedTree data model.

Data model is closely related to the serialization format used for storage, as it is essential that the data model losslessly and efficiently round-trips to and from the serialization format. In practice, the data model is conceptually equivalent to the serialization format, but with some details related to physical storage, metadata, extensibility points, and compression abstracted away.

Data model is also related to the algebra of operations used to mutate the SharedTree, as mutation operations are interpreted in the context of the data model to which the operations are applied. However, we exclude API level concerns (e.g., indices, cursors, etc.) from the discussion unless a choice of data model precludes one.

Finally, data model and schema are intimately related, as schema defines a permitted subset of what is expressible in the underlying data model. This is particularly true with schema-on-write, where out of schema commits are rejected by the service.

Requirements

JSON Interoperability

We take it as a given that the underlying data model for the SharedTree is tree-structured data. We also agree that JSON is the modern lingua franca of the web and services. We therefore begin with the following requirements:

JSON must efficiently and losslessly round-trip to and from the underlying SharedTree data model.
Deviations from JSON in the underlying SharedTree data model must be well justified.

Consequently, we require that the underlying SharedTree data model can express the following in a natural way:

Object-like records of named properties
Array-like sequences of consecutive items (non-sparse / index agnostic)
null, true/false, finite numbers (f64), and strings

Type Information

All SharedTree stakeholders desire the ability to associate type information with data in the tree.

Identity

Multiple customers require the ability to obtain stable unique identifiers for data inserted into the tree. These identifiers are used for creating share links and maintaining graph-like references.

Discussion

The active discussion regarding data model is a tension between a desire to add additional expressiveness to the data model and a desire to choose a data model that minimizes complexity/overhead when integrating with specific languages and services. At the heart of this issue is a debate regarding schema on read vs. write.

Schema-On-Write

In the schema-on-write approach each SharedTree instance has an explicit schema, and the active service enforces that only commits that conform to the schema are accepted. This approach increases the COGS of writing to the tree, but the work to validate conformance to schema is performed exactly once.

Having an explicitly enforced schema simplifies the task of maintaining data integrity across diverse clients and systems integrating with SharedTree storage. Because commits are checked for conformance before writing to the tree, schema-on-write also offers additional defense against a class of data corruption bugs.

The primary downside of schema-on-write is the cost and complexity of performing data migrations. Advocates of the schema-on-read approach would additionally argue that schema-on-write is overly rigid in that it prohibits applications and services from dynamically extending the schema. Note that one mitigation to the cost of data migrations is to borrow from the 'schema-on-read' approach and explicitly allow multiple variants (either temporarily or indefinitely.)

Schema-On-Read

In the schema-on-read approach there is an implied shared schema that is assumed by clients and backend systems, but the schema is not validated by the service. Instead, clients are responsible for maintaining the integrity of the data by locally ensuring their commits conform to a common schema.

Because the 'common schema' is enforced by code running in individual applications and services, the 'schema-on-read' approach affords the possibility of applications and services having different but compatible definitions of schema, where each application/service ignores (but preserves) portions of the SharedTree that are outside its local schema.

The flexibility to dynamically augment the schema is powerful, but also contains pitfalls with respect to data corruption and loss when writers operate on the tree with divergent understandings of the current schema.

Impact to Data Model

With a schema-on-write mindset, one views schematized data as a strict subset of what is expressible in the underlying data model. Therefore, one tends to want a more rigid type system and a 1:1 correspondence between the underlying data model and the projected runtime objects exposed to the application.

Some things I associate with the schema-on-write approach:

Distinguishing between T and T*
ArrayNode as the exclusive way of creating T*
Distinguishing uint8/16/32, int8/16/32, and float32/64

With a schema-on-read mindset, one views schematized data as a dynamic reinterpretation of the underlying data model. Because of the need to tolerate and preserve out of schema data, one tends favor to uniformity and additional structural degrees of freedom so that applications can unobtrusively attached private data at points in the tree where it won't be lost, orphaned, or overwritten.

Some things I associate with the schema-on-read approach:

Unique ids as trait labels
Every trait is an implicit sequence (i.e., T* everywhere)
Primitive values like 'true' may also have traits (e.g., 'signedOffBy', 'expirationDate', etc.)
Values as opaque byte blobs with reinterpret casts occurring at read time.

A few of us (@yann-achard-MS , @CraigMacomber, @PaulKwMicrosoft) met today to chat about the tree abstraction. We have general agreement on the following lists, though this has not yet been discussed outside of that group yet and should be reviewed to ensure it takes into full consideration the requirements found above.

In the schema-aware application-facing API:

We agree that:
- All nodes can have children, unless specified otherwise by schema.
- We want to be able to support all of the primitive types of JavaScript (and any others out there of interest). It should be easy for us to add support for any primitives we want, but do not have a plan for an extensibility model.

In the underlying data model:

We agree that:
- Allow primitive nodes to have children.
- Don't bake in indexing at a deep level.
- All nodes can have children.
We need more discussion about:
- All nodes have values
- There are a number of open questions mentioned under "Tree Abstraction" here that we need to discuss.

In both of the above:

We agree that:
- For now, we are creating a tree, not a forest or a DAG. This also implies "not a graph", but we are supporting graph-like references, so the point is really the idea that there is a special set of relationships that form a tree, and "parent" relationships that invert these.
- All nodes can have identity and type/definition.
- We want to support labeled structured children, like JSON and unlike XML.
- We should support arbitrary user-defined string trait labels.
- We should support efficient globally unique trait labels.
- We should support the ability to have implicit sequences.
- All primitives exist as nodes.

We'd like to chat more about the following in the 2/25 meeting:

Better understand how PropertyDDS' schema-on-write system handles schema changes.
How a schema-aware editing API could provide guarantees of schema compliance. Specifically, we suspect that the issues mentioned in this Epic can be fully mitigated, and that the major win in flexibility would be worth the complexity overhead.

QQ regarding schema-on-write.

Schema here includes the ability to to specify the multiplicity of a trait, such as containing at most 0 or 1 items?

What happens if two clients insert into this trait? Locally their edits are valid, but when merged, one will be invalid.

Are there a general class of edits that are well-typed when constructed, but do not remain well-typed through merge? Is this a difference between the notion of 'schema' in PropertyDDS and SharedTree?

Schema here includes the ability to to specify the multiplicity of a trait, such as containing at most 0 or 1 items?

The schema on read approach should allow you to use any schema system you want, which validates any particular invariants you have, including this one.

The specific schema system I have been working on focuses on 3 multiplicities (0 or 1, exactly 1, and 0 or more), but if there is demand, could support other restrictions (such as specific counts).

What happens if two clients insert into this trait? Locally their edits are valid, but when merged, one will be invalid.

The client makes the edit through a schema aware editing API that knows the trait is optional (0 or 1 child), and thus has at least these two choices for how to encode the edit, both of which will result in in-schema merges:

Replace all items with the child sequence with the new value (last write wins)
Include a constraint that the child sequence is empty before its modified (causing the second edit to conflict, resulting in first write wins)

Are there a general class of edits that are well-typed when constructed, but do not remain well-typed through merge? Is this a difference between the notion of 'schema' in PropertyDDS and SharedTree?

Yes, assuming shard tree goes with schema-on-read (like the existing experimental version).

The idea is that the DDS does not use the schema, and can perform arbitrary tree edits, but for clients/apps to use a schema aware API for making edits can't cause schema violations on merge. We leave it up to the app authors to make sure all their clients have compatible schema, but also provide them with optional tools to handle schema violations that occur (ex: format version changes, issues from bugs etc: it let you be a bit more fast and loose with schema changes, third party apps with different schema etc).

It would also be possible to add an optional schema validation component to shared-tree (ex: via another DDS sharing most of the code). The main idea here is not to require that: but it would still be possible to support optionally (though likely not in the initial version, and only if we find a need for it).

One challenge with the way we're currently talking about 'schema-on-read' is that most customers who want schema are looking for a guarantee that their applications/services cannot encounter out-of-schema data. To them, discussing 'schema error handlers' that fix up violations after the fact feels the same as working with no schema.

I realize that this is inaccurate, as the intention of the 'schema error handler' is to is to be a hook that enables implementing schema in a decentralized fashion. Decentralized/augmentable schema is an idea I like a lot from a service COGS perspective, but I think we need some notion of explicit schema to offer stronger assurance that:

Non-conformant edits are rejected.
Two conformant edits, when merged, produce a conformant edit (e.g., JSON values should not unexpectedly morph into arrays due to concurrent inserts.)

While still allowing for the idea of applications augmenting the tree with private data and many of the other benefits of the 'schematize' approach.

@CraigMacomber - I was wondering if you could help me bootstrap that discussion by brainstorming ideas to bolster our integrity guarantees while keeping the features of schematize that we like.

One angle to is add an implicit constraint that requires that writers must use a compatible schema, and then focus on the problem of defining and detecting "incompatible schemas".

We could check that the result of an edit conforms to the schema of the mutated node's type definitions. (e.g., you can't remove the 'x' of a node with type 'point'.)
- Riffing on @jack-williams static analysis / type system approach, we could:
  - Trust clients to provide accurate type annotations
  - Verify that when a type definition is altered, it is transitively permissible
- We could verify that writes do not replace newer schema versions with older versions (i.e., disallow downgrades.)
- Or alternatively, expose newer schemas to older readers as read-only.

Another angle is to place restrictions on how schemas may evolve to ensure that there is always a path forward. I think there are two areas to explore here:

Require breaking changes be expressed as transformations from the previous version that can be applied 'on-the-fly'.
Reduce the need for breaking changes with data model features that enable applications to unobtrusively attach additional data in a non-breaking way.

Note that with respect to M1, understanding the data model impact is the priority, but I understand that it's hard to detangle that from the broader questions about schema.

Some things I associate with the schema-on-write approach:

Distinguishing between T and T ArrayNode as the exclusive way of creating T Distinguishing uint8/16/32, int8/16/32, and float32/64

I think these are not necessary strictly properties of a schema on write approach. If I understand the schema on read proposal correctly, there would still be definitions on the primitive nodes that could identify the data-types.

In PropertyDDS, we did distinguish between T and T*, in the sense that only collections (arrays, maps and sets) allowed the use of polymorphism, but this is more a technical limitation than a fundamental property of this appraoch. It could also be allowed for other members of a structure, e.g. via some specifiers that relate to the semver annotations / allow inheriting members (the main reason we didn't yet introduce this is, that we did no yet have type change commands in the changeset syntax and thus only could change a type via a remove insert).

Schema here includes the ability to to specify the multiplicity of a trait, such as containing at most 0 or 1 items?

In PropertyDDS, this was possible in the schema language (multiple entires were declared via collections, single entries were listed directly). These two cases mapped to two separate representations in the changeSet (e.g. for the collection a nested modification of the collection was created) and thus these two cases could not be violated in a merge (even without the merging algorithm knowing the schema)

In the following, I'll try to structure the differences / similarities between the two approaches in a table, to make it easier to compare them		Schema on Read
Types / Definition	All nodes can have a definition (including primitive nodes where it encodes data-type)	All nodes have a type (either identifying the associated schema or an inbuilt type)
Relations between definitions	Does this system use any mechanism to relate definitions?	There are two different ways, in which types are related SemVer Versions (major means breaking change, either an existing field was removed/its type changed or there is a semantic difference in the meaning of a field (e.g. price changed from USD to EUR value). Minor means that additional fields were added. Patch indicates no data change, only annotations changed in the schema (e.g. documentation). Inheritance The schemas can declare that one schema inherits from another schema (this implies data compatibilty, i.e. the inheriting schema can only add members, not remove existing entries).
Extensible Schemas	It is always possible for an app to add additional fields that do not exist in the schema. Collisions are avoided by using unique keys for fields in the schema.	If the schema for a node declares that this node type is extensible, it is acceptable to add additional fields. If strings are used as keys, this could lead to collisions when migrating the node to a newer schema version, could also be avoided via unique IDs for the fields.
Schemaless Data	Possible	Possible
Schema Storage	Schemas are only known to applications.	Schemas are stored centrally / within the document. Applications may contain schema definitions in their code, which are used to initially create the data and to check whether the schemas within document are the same than the ones known to the app. If not an error is triggered or an out-of-schema migration would be needed.
Adding fields to a schema	Possible, the definition remains unchanged. Apps using the old schema will just ignore the new fields. Apps expecting the new schema will trigger an out-of-schema handler.	Possible, requires creation of a new schema with an increased minor version. Apps that expect the old schema will know that a minor version increase is no problem and will continue working. Apps that support the newer schema, can decide based on the minor version, how to handle the missing fields. This could also be done via automatic migration code that converts between schemas (e.g. by initializing with default values).
Breaking changes to a schema	Can be done in two ways. Definition remains unchanged. In that case an out of schema handler will be invoked and might for example detect that a field is missing and has to treat that case. Definition is changed The new definition relates the node to a new Schema and tells the application that the semantics has changed. Is there any mechanism to tell applications that the new definition refers to the same type of object, just a different version?	A new schema with an increased major version is created. The application can decide based on the major version which code to use to access the object. Alternatively, an automatic migration handler could be registered that knows how to convert between major versions.
Polymorphism	Schemas only specify the definition of nested nodes. As long as a schema migration handler is able to map from the stored data to the schema the application expects, deviations from schema are allowed. *Do we have any mechanism to allow for different definitions than expected, e.g. inheritance or definition changing breaking changes to schemas?	The schemas can use a semver like scheme to specify whether changes to major, minor, patch version would be allowed for a specific child. In addition, inheritance can be used to enable polymorphism.
Out-of-schema data	Can occur. An out-of-schema handler will be invoked and has to handle the situation. No enforcement to prevent out of schema data.	Should not occur (depends on the enforcement mechanism, see below). If out of schema data appears, it would be considered an error in the document. Either the application has to revert to an older version or try to repair. However, there can be schemas unknown to the application in the document (i.e. completely unknown or with a major version that is larger than supported). In such a case, the application either has to ignore the unknown parts of the document (e.g. by showing a hint to the user) or it could try to download migration code for the automatic migration system.
Schema enforcement	Schemas are not enforced. Out of schema data is generally allowed and can be produced via the official API, but clients should only produce out-of-schema data if this is intended (e.g. because of a schema migration between application versions).	Clients using the DDS will always produce in schema data, out-of-schema data could only be produced by malitious clients writing without using the DDS code (or via bugs in the code). Additionally schemas may be enforced. Two possible approaches: Server Side If a change contains out-of-schema data, it will be rejected by the server. Client Side If clients detect that a change produces out of schema data, they will ignore this change. The misbehaving client would get out-of-sync with all other clients. We therefore would need some mechanism to throw the misbehaving client out of the session. We also need a mechanism to make sure, the out-of-schema data does not become part of the summary. If we generate summaries server side, the summary generation code would need to check for out of schema data. Otherwise, there is a possibility to have corrupted summaries. Those could be repaired by replaying operations from an earlier summary and ignoring out-of-summary changes (which would produce the same result as server-side rejection).
Merge handler	If all clients use a certain subset of editing primitives (e.g. replacing instead of inserting, adding constraints), there will be no out-of-schema merge resolutions.	The merge handler must not produce out-of-schema data for valid inputs.
Interaction with other server side systems (e.g. data-bases, full-text search, eventing systems, a REST interface)	If the data is exposed to other systems that expect schema compliant data, an application specific, sever side, out-of-schema handler must be invoked that first translates out-schema-data into schema compliant data.	Data is always in schema and can sent to downstream systems as is (out of schema data would be considered an error). Potentially, an application dependent migration handler might be needed, if the downstream systems don't support all possible schema versions.
Migration	Handled by the out-of-schema handler. It will get an arbitray set of traits (identified via unique identities) and will have to convert this to schema compliant data.	Two options for an application: the application contains code that supports multiple major versions of a schema. An automatic migration handler is registered, that can translate between different major versions of the same schema.
Efficient Data Encoding	The system would internally detect repeating patterns in the data to derive efficient encodings for those, but these would not be directly tied to the schemas.	Schemas could directly be used to encode the data compactly. For out of schema data, we could infer encodings in a similar way to the proposal for "schema-on-read".

Supporting both approaches at the same time would probably not be too much additional effort.

Schemas could be declared as "on-read" vs "on-write". For on read schemas, only the definition and the flag that it is "on-read" would be stored on the server. For "on-write" schemas, the whole schema would be stored.
Client side, there would be a mechanism to register a hook per schema. For "on-read" schemas, this would handle the "out-of-schema" case, for "on-write" schemas, it would only have to perform migrations between different schema versions, but would not have to handle the general "out-of-schema" case.
Enforcement would only happen for schemas that have been declared as "on-write".

Allow primitive nodes to have children.

Does this mean that for example a number could have children? How would this be exposed in a JSON API? Would the number be represented as a JS Object and the value would be accessible via a special member (e.g. via a symbol value?). So if a child is added to a primitive node, its type would change in JS from number to Object?

Could you give examples, where this additional degree of freedom would be useful (compared to just adding an additional primitive child called value or one using a unique label as key)?

JSON Interoperability

We take it as a given that the underlying data model for the SharedTree is tree-structured data. We also agree that JSON is the modern lingua franca of the web and services. We therefore begin with the following requirements:

JSON must efficiently and losslessly round-trip to and from the underlying SharedTree data model.

Deviations from JSON in the underlying SharedTree data model must be well justified.

Consequently, we require that the underlying SharedTree data model can express the following in a natural way:

Object-like records of named properties

Array-like sequences of consecutive items (non-sparse / index agnostic)

null, true/false, finite numbers (f64), and strings

Which level of guarantee do we provide to users using the JSON APIs? Will they be functionally equivalent to other APIs?

How do we expose data that does not directly correspond to the JSON data model via the JSON API? (e.g. sequences without an explicit array node in the tree, primitive values stored in inner nodes)?
Do we guarantee that the topology of the JSON doesn't unexpectedly change, if clients modify the tree with non-JSON APIs (e.g. empty arrays disappearing, primitive values changing type into an array)?
If an user would like to enforce JSON compliance could this be done on the whole DDS level (e.g. by setting a flag when the DDS is created)? Or is this only locally enforced via the schema on specific nodes?

The below strikes me as a good balance between data integrity and dynamic flexibility:

Data Integrity

There is an explicit schema for the tree expressed as the type of the root node
Nodes always transitively conform to the schema as expressed by the tree's node types
- Implies that merges that produce non-conformant nodes are conflicts
Conflicts must be detected and rejected prior to applying to remote clients (or storage service)
- Consequently, only the sender will ever need to roll back their local branch.

Dynamic Flexibility

Applications can unobtrusively annotate the strongly typed tree with application-specific data.
Adding new annotations must not require modifying the schema (i.e., node types are unchanged)
- We want to support a heterogeneous collection of applications that collaborate on shared data.
- We want to be able to dynamically add new applications w/o a schema migration
Annotations must not collide (with each other or the explicit schema)
- Implies that annotations use unique ids or namespaced keys
Annotations must be preserved by edits (including edits by unaware apps, schema migrations, etc.)
- Implies that annotations are dynamic/extended properties of the node (so they are moved and deleted along with the node to which they are attached)

For further discussion:

How to implement Data Integrity guarantees w/o additional latency & service costs
- One idea: #9282
What data integrity guarantees apply to the annotations?
- Applications can employ strong typing to their annotations by grouping them under a typed aggregator node.
- @PaulKwMicrosoft asks "can/should we add concepts like 'A or B, but not both' without introducing an aggregator?"

It could also be allowed for other members of a structure, e.g. via some specifiers that relate to the semver annotations / allow inheriting members (the main reason we didn't yet introduce this is, that we did no yet have type change commands in the changeset syntax and thus only could change a type via a remove insert).

We should be cautious about introducing a type change command, especially given augmented schemas. Would this just be for convenience? Would it be intended just for conversion to newer versions of given schemas? It might be safest to ensure we have very precise requirements before we head in this direction, and then create a very targeted API.

Schemas could be declared as "on-read" vs "on-write". For on read schemas, only the definition and the flag that it is "on-read" would be stored on the server. For "on-write" schemas, the whole schema would be stored.

This feels like it might be an unnecessary degree of freedom. If we engineer server-side validation, it’s unclear why an application wouldn’t want to leverage it.

Does this mean that for example a number could have children? How would this be exposed in a JSON API? Would the number be represented as a JS Object and the value would be accessible via a special member (e.g. via a symbol value?). So if a child is added to a primitive node, its type would change in JS from number to Object?

Could you give examples, where this additional degree of freedom would be useful (compared to just adding an additional primitive child called value or one using a unique label as key)?

There are a few parts to this treatment of values as nodes.

Do we actually need to pay for an object when only a value is needed? No, the implementation should definitely use just a value in the common case.
In the common case of reading a primitive property, is it necessary to call .valueOf() or something similar? No: if the property’s type in the schema is specifically a primitive, e.g. “number”, then in the API reading the property should be as simple as e.g. somePoint.x. Behind the scenes, somePoint.x might do a simple property read or it might call a getter that extracts the value from a node object, but this is implementation detail. (Note that if a trait can contains Sequence, we’ll always have an explicit sequence object, even if there’s only one element in the sequence, to guarantee that one can always read [Symbol.iterator] from a trait’s contents. Node can implement [Symbol.iterator] on itself, so this is only if the trait contains a single primitive. Alternately, we could store the non-virtual node form of the primitive.)
Does this mean that in the tree abstraction the primitive is a direct child under the trait? No, in the model, there is a node present there, with its own type (e.g. NumberNode) and identity. It should be possible to access this node, which will often require a temporary “virtual node” object to be generated on-demand, though we don’t need to implement support for this right away.
Why make primitives nodes in the model? Uniformity and flexibility. Consider polymorphic fields: if one has a trait of type “number | boolean | Foo”, instead of returning an object or a primitive and requiring the client to test the result using typeof, one can switch on result[type]. If an application author decides to allow bookmarks/references to primitives, their identity is there to be leveraged. If an application author decides to allow movement of primitives, we get correct merge outcomes. If another app decides to augment a primitive to have children (e.g. comments), nothing breaks. Most importantly, by making somePoint.x a convenience wrapper that in principle calls .valueOf(), we get a level of indirection and the potential for abstraction. This abstraction could be a behavior built into SharedTree itself, e.g. to postpone resolution of a conflict we could replace any subtree (including a primitive) with a tree specifying the states in the relevant revisions, but it is also something that could be leveraged by applications (sample use: computed expressions), even in a world of heterogeneous documents, all by establishing this simple convention at the outset.
Why allow a node with a value to have children? For uniformity and forward-compatibility. The presence of a node with a given type represents a choice by the user, and choices can be annotated, have metadata, and so forth. If some nodes/entities were not allowed to have children, an application that wanted to truly “future-proof” its schema (since schema migration is expensive, one way or another) would have to defensively wrap each such node in an aggregator, “just in case”, resulting in a lot of pairs in the tree. This pattern with virtual nodes effectively does this, but doubly-efficiently, by making the node virtual and the value-reading transparent, so only the true primitive value is needed in most instances. In particular, it enables augmentation by other applications, which cannot be anticipated.

I personally think that requiring a unary + to extract the value of a number property, or ‘’+ for a string property (or, somewhat weirdly, !!+ for a boolean property), is a small price for a truly uniform API), but for true JSON compatibility, we need to support the somePoint.x syntax. I think the best approach is to support both “NumberPrimitive” and “NumberNode” as types in schemas, with the former yielding an actual number from somePoint.x and requiring a distinct syntax (somePoint[Point.x]?) to get the node, and the latter requiring +somePoint.x etc. to get the value. NumberNode would allow use of [type] in polymorphic contexts, or contexts that might become polymorphic in future versions.

Which level of guarantee do we provide to users using the JSON APIs? Will they be functionally equivalent to other APIs?

We don’t necessarily need separate APIs, just a JSON schema (and a type system that allows one to derive a schema from the JSON schema while retaining all of its constraints).

How do we expose data that does not directly correspond to the JSON data model via the JSON API? (e.g. sequences without an explicit array node in the tree, primitive values stored in inner nodes)?

See above for primitive values stored in (possibly virtual) inner nodes. As for implicit sequences, let’s first focus on the “Array-ness” of a sequence. For full JSON compatibility, we need to support the [] syntax, which means we’re going to need to potentially either devirtualize and copy the children to an actual Array object, or return a proxy. Do we always want to pay these run-time costs for clients that want to use the Sequence interface instead? We could support this by having distinct Sequence and JavaScriptArray types in schemas, with the latter returning either the proxy or array (perhaps making the choice dynamically).

Now, code expecting the JSON model could either restrict itself to reading JSON schemas only or, if it wanted to read non-JSON data, “opt in” to the [] syntax by calling .toArray() on a sequence trait. By making Node implement .toArray(), this also happens to take care of the implicit sequence, ensuring that the JSON client will always see an array. If we make a read of a missing sequence trait return an “empty sequence” object (and I’m thinking we should), then .toArray() on that can of course return an empty array. (Returning to the primitives discussion, we might want foo[children].bar return an empty sequence object if the bar trait is empty – it’s an interesting question whether this would be the most convenient outcome for this method, given that it’s not targeting clients that expect JSON.)

Do we guarantee that the topology of the JSON doesn't unexpectedly change, if clients modify the tree with non-JSON APIs (e.g. empty arrays disappearing, primitive values changing type into an array)?

Given the above, I don’t think this is an issue – any transitions are hidden from clients expecting the JSON model.

If an user would like to enforce JSON compliance could this be done on the whole DDS level (e.g. by setting a flag when the DDS is created)? Or is this only locally enforced via the schema on specific nodes?

Working this into the type system would give us more flexibility and not require a special mechanism. The “base” JSON schema would presumably want “JSON all the way down”, but one might want to write a schema derived from it that had properties that permitted non-JSON.

For the sake of uniformity (which I agree with the above is worth perusing given that schema can be used to project out as primitives) is it worth introducing annotations and annotation types as further concepts?

At what point are annotations not just qualified properties, and strongly typed annotations are schemas that only apply to the annotations qualifier?

For the schema strongly applied to the tree at all times, are we assuming that this schema language is different (simpler) than the domain-oriented schema typically associated with on-read / dynamic handlers?

One aspect of schema-on-write / strongly typed trees that I want to unpick is to distinguish the difference between having the data always conforming to a schema, and having the data associated with a schema in an application-agnostic way.

It would certainly be possible to have a schema published with a dds, that any application could read and query, but still need to validate. Just wondering what the key requirements are for strong enforcement of typing.

Regarding M3: Schema and Constraints / Deliverables / Schema / Repsository from here

I see a clear distinction between the 'application schema', 'container schema', and 'container data'.

Premise: it is not possible to guarantee the application schema (compile time) is the same or a super set of the container schema. For reference, Property DDS will fail if a container has data of a type which has not been registered. Applications should always be able to load containers and then choose to stop or continue based on an evaluation of application schema, container schema, and container data.

The question now becomes, where is the container schema stored?

I would like to advocate that the container schema be stored in the container itself and be supported by the SharedTree API.

An alternative would be to store the container schema remote to the container (e.g., in a centralized schema registry).

An embedded schema seems simpler and more developer-friendly. The entire schema need not be recorded. Only types which are instantiated are needed. An embedded schema seems to simplify the scenario where new a application adds a new type while an old application is still connected to the same container.

I see a clear distinction between the 'application schema', 'container schema', and 'container data'.

Thats a nice set of terminology to express this. Here is my current vision using those terms:

definitions

'container schema' : a set of constraints it is valid to assume the 'container data' meets, and that must be maintained when editing (including through conflicts).
'application schema' : a set of constraints the application wants to make about the data.

Container schema options

There are several ways to express 'container schema'. I'm going to break this up into 2 parts: places we can store schema information, and places that can refer to it.

Places we can store container schema information:

Inline in whatever place is referring to it (see list below)
As container data
Hard coded into shared tree (constraints like the tree being a tree not a DAG fall into this)
Injected via a shared-tree subclass or other fluid configuration (schema data / constraints shipped as code)
In some external repository: the repository defines an append only namespace of schema. This repository is known about by shared-tree somehow (any of the other items in this list could contain the reference to the repository)

Places that can refer to/apply container schema:

Code can special case the root (ex: shared-tree, a subclass, or its configuration, can apply some rules, possibly recursively, to the root)
Declaratively for a specific type: nodes in the tree are typed, so similar to the above, the tree can be configured to apply specific schema based on type of the nodes. This forces all nodes of the same type to have these same schema: its not contextual, but easily handles open polymorphism (ex: a child thats allowed to be anything, as long as its in schema for its type)
In other schema: for example, a schema can apply specific schema to its children. This allows for contextual schema (under one parent the same type might have different rules compared to under another parent)

Note that its possible to refer to schema in a way that's unambiguous, but code handing the data might not always have the schema. For example document could refer to a schema by its hash, or name in a append only namespace. This can have interesting implications for updates to new schema (ex: one client adds data using a schema shipped as code that another client does not have). Some of the options do not have this issue (inline, central repository (assuming you are ok going down if it goes down and schema are not cached), and in container data)

What options to support.

Thats a lot of options, and I think there are good reasons for all of them, so long term, we likely want all of them to be options to some extent. But I think we can pick a good subset for the now, and leave open the option of extending it in the future.

I think we want to at least:

Make adoption of different constraint/schema systems optional and incremental: all of these different options should be opt in, and easy to add to and existing application if desired.
Provide a schema-on-read system (ex: schematize) that can give an application a nice interface to the data, including validation and error handling. Anytime the application schema and container schema are different, this can allow the app to function as well as possible (detect and handle out of schema data, both on read and on edit, as localized as permitted by the provided error handlers).
Provide at least one way to customize the container schema (schema-on-write), which can at least do type based constraints.
Have at least one good design-pattern apps can use to handle schema migrations for each way they have to enforce/specify schema (make sure this supports a roll out process were clients have mixed code versions, and old documents will be supportable forever)

Expected usage patterns

I suspect different users will mostly fall into 2 groups:

pure schema-on-read: no customization of container schema.
almost pure schema-on-write: only using schema-on-read to assist with schema migration so their application schema doesn't have to deal with old formats (which unavoidably pile up in the container schema). They may additionally use schema-on-read to enforce/fix some invariants the container schema system isn't expressive enough to declare (referential integrity for graph like references, numeric ranges, and any other appellation level invariants that are desired)

Schedule

What work we need to do for each milestone in #8273

M0

Reach consensus that on the set of options we may want to support in the future which we think covers all future use-cases (ex: lists above).
Account for these as extension points in the architecture if needed.
Agree on a proposed underlying tree format (ex: the one in this thread), including determining that its suitable for constraining in these ways

M1

The architecture should account future options when doing implementation in M1.

M2

Design and document patterns for handling schema migrations (so apps can start having compatibility for old documents).

M2 or M3

Since the schema on-read-system for application schema is used in both expected usage patterns, and impacts the APIs Apps read the data with (we want to give them a schema aware API that's possible to make performant which requires incremental schema validation and error handling), I think we may want to start building it into the API in M2. I recommend starting with a subset of the schema language that declaratively expresses constraints that will be possible to support in in container schema, even if we only support use as application schema for now.

Thus we might want to pull this subset of the work into M2.

M3

Extend the above schema to be usable as container schema in at least one way (even if that's just putting schema hashes for each type in the document so we can enforce applications schema are not mismatched on the same data).

Polish up and finalize schema language, schema aware APIs, and make sure APIs result in data staying in schema (including fuzz testing, and maybe some proofs).

Sometime after M3

Schema/Shape optimized storage formats, and optimize schematize for these formats and for known container schema.

Support additional functionality (other options for container schema, imperative extensions to application schema)

@PaulKwMicrosoft

It could also be allowed for other members of a structure, e.g. via some specifiers that relate to the semver annotations / allow inheriting members (the main reason we didn't yet introduce this is, that we did no yet have type change commands in the changeset syntax and thus only could change a type via a remove insert).

We should be cautious about introducing a type change command, especially given augmented schemas. Would this just be for convenience? Would it be intended just for conversion to newer versions of given schemas? It might be safest to ensure we have very precise requirements before we head in this direction, and then create a very targeted API.

It is not only for convenience but also for reducing load when executing upgrade of the Tree DDS data to the new type version. Change Command would allow us to apply for example Add Member without : loading the whole type instance to memory, adding the member, removing it from Tree DDS and inserting the upgraded version. We would simply send the Add Member Change Command and it would be applied where needed (for example b-tree+ persistence, partially checked out type etc). I believe that especially partial checkout can benefit from it. And also any further external persistence such as full-text search facility.

This issue has been automatically marked as stale because it has had no activity for 180 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!

microsoft / FluidFramework

SharedTree: Data Model and Schema #9234

Introduction

Requirements

JSON Interoperability

Type Information

Identity

Discussion

Schema-On-Write

Schema-On-Read

Impact to Data Model

definitions

Container schema options

What options to support.

Expected usage patterns

Schedule

M0

M1

M2

M2 or M3

M3

Sometime after M3