Help clarify how JSON Schema can be interpreted from validation rules to data definitions, and how those data definitions can be represented in any programming language
After looking at the individual issues, I would like to share a few general and overarching thoughts. I feel scattering them around the individual issues will lead to confusion. Hence, I decided to create a new issue.
For context, I am currently working on an IDL and my thoughts have been shaped by that.
TL;DR: I think algebraic data types (ADTs) would be a good foundation for the IDL vocabulary.

I am looking forward to your feedback and comments. :)
General Approach
According to the philosophy outlined by @gregsdennis in issue #47:
“[We] must start with those languages by enumerating their features, and, for each language feature, find a way to represent it in JSON Schema using either existing keywords or defining new ones.”
I think this is exactly the right approach. In my mind, it means defining a type system that has all the relevant features and then finding a way to map between the types of that system and JSON Schema extended with the IDL vocabulary.
My suggestion is to turn to type theory for a foundational understanding of the different features of programming languages. In particular, I think we should start from algebraic data types (ADTs) and see how those map to JSON Schema.
Why ADTs?
First of all, I would like to point out that ADTs are supported by most languages. In my opinion, their widespread availability alone makes them a perfect fit for the type system of an IDL and, thereby, for the IDL vocabulary. If we can map a JSON schema to a set of ADTs, then we can also easily generate code for a wide variety of languages.
At the same time, ADTs are very general and, in my experience, sufficient to describe any data model one could want. I also take it to be important for the IDL vocabulary that JSON formats already existing in the wild can be faithfully captured. Note that I do not mean that any existing JSON schema can simply be annotated to support code generation; it has already been pointed out that this does not make much sense (see #47). Instead, I mean that I can (re)write a schema for some format to accommodate code generation. While this may not be possible in all cases, it is nevertheless a goal worth having. I am quite confident that, due to their generality, ADTs as a foundation would allow capturing many existing formats.
To summarize, I think ADTs are a good starting point for the IDL vocabulary because (a) they are supported by most languages (or can be encoded), and (b) they are very general, covering most data modeling needs.
ADTs and JSON
To give you an idea about how ADTs map to JSON, let's have a look at an example for two-dimensional coordinates (in Rust):
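Since the original snippet is not shown here, the following is only a sketch of what such a coordinate type could look like as a Rust sum type. The variant and field names (in particular those of `Polar`) are illustrative assumptions, and the hand-written encoder merely makes one possible JSON mapping concrete; a real implementation would likely use serde instead.

```rust
// Sketch of a two-dimensional coordinate as a Rust sum type.
// Variant and field names are assumptions, not the original definition.
enum Coordinate {
    Cartesian { x: f64, y: f64 },
    Polar { radius: f64, angle: f64 },
}

// One possible JSON encoding (the variant is implicit in the field names),
// written by hand to keep the example dependency-free.
fn to_json(c: &Coordinate) -> String {
    match c {
        Coordinate::Cartesian { x, y } => format!("{{ \"x\": {}, \"y\": {} }}", x, y),
        Coordinate::Polar { radius, angle } => {
            format!("{{ \"radius\": {}, \"angle\": {} }}", radius, angle)
        }
    }
}

fn main() {
    let c = Coordinate::Cartesian { x: 4.0, y: 5.0 };
    println!("{}", to_json(&c));
}
```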
There are multiple options to map such coordinates to JSON. Here are three options[^1] for the coordinate $(4, 5)$:

{ "x": 4, "y": 5 } (implicitly tagged)
{ "type": "Cartesian", "x": 4, "y": 5 } (internally tagged)
{ "Cartesian": { "x": 4, "y": 5 } } (externally tagged)

[^1]: I took some inspiration from Serde here.

I guess you are all able to build the respective JSON schemas in your head. ;)
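To make one of them explicit anyway: a possible schema for the internally tagged encoding could look like the following. This is just one way to write it, and the fields of the `Polar` variant are my assumption.

```json
{
  "oneOf": [
    {
      "type": "object",
      "properties": {
        "type": { "const": "Cartesian" },
        "x": { "type": "number" },
        "y": { "type": "number" }
      },
      "required": ["type", "x", "y"]
    },
    {
      "type": "object",
      "properties": {
        "type": { "const": "Polar" },
        "radius": { "type": "number" },
        "angle": { "type": "number" }
      },
      "required": ["type", "radius", "angle"]
    }
  ]
}
```

Note how the sum-type structure is still visible in the `oneOf`, but the fact that `type` is the variant tag is no longer explicit.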
Note that enums are a special case of sum types, namely sum types without data. Here is another example:
enum TrafficLightColor {
    Red,
    Yellow,
    Green,
}
Again, there are multiple options to encode the color of a traffic light. For instance, "Red", "RED", 0, and { "color": "Red" } are all imaginable encodings of the variant Red of the TrafficLightColor type.
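As schemas, such data-less sum types are straightforward; for instance, the "Red" encoding could be captured with the `enum` keyword (the integer encoding would analogously be `{ "enum": [0, 1, 2] }`):

```json
{ "enum": ["Red", "Yellow", "Green"] }
```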
How to proceed?
I suggest we focus the effort on mapping ADTs to JSON Schema and vice versa. To this end, we may create a list of possible JSON encodings of ADTs and their respective JSON schemas, i.e., identify the encoding patterns. The examples I have shown cover simple objects (#46), enumerations (#43), sum types (#48), and, to some extent, polymorphism[^2] (#49). Reconstructing ADTs from a JSON schema of a particular representation without any further annotations is challenging, to say the least. This is where I see the role of the IDL vocabulary: to provide the necessary information.
[^2]: Sum types enable polymorphism. For instance, in Java one may use a sealed interface to implement them. For the example, one could envision a function taking a Coordinate which in OOP land may be either an instance of CartesianCoordinate or PolarCoordinate.
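To illustrate what "providing the necessary information" could mean, here is a purely hypothetical sketch. None of the `x-idl-*` keywords below exist anywhere; they are placeholders for whatever the IDL vocabulary would eventually define:

```json
{
  "x-idl-type": "sum",
  "x-idl-tagging": "internal",
  "x-idl-tag-property": "type",
  "oneOf": [
    { "$ref": "#/$defs/Cartesian" },
    { "$ref": "#/$defs/Polar" }
  ]
}
```

With annotations along these lines, a code generator could reconstruct the original sum type directly instead of guessing it from the shape of the `oneOf`.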
Tooling
As I said, I am working on an IDL and, as part of that, on generating JSON Schema from the types of the IDL. I could, in principle, come up with some keywords to preserve the type information when doing the mapping. I also see this as a kind of test bed for the different encodings. It is already possible to specify different JSON encodings by annotating the type definitions in the IDL; this is also how I managed to define the structure of JSON Schema itself in the IDL.