Background
The VariableDef proto right now is as follows:
message VariableDef {
  // The Type of the variable.
  VariableType type = 1;
  // The name of the variable.
  string name = 2;
  // Optional default value if the variable isn't set; for example, in a ThreadRun
  // if you start a ThreadRun or WfRun without passing a variable in, then this is
  // used.
  optional VariableValue default_value = 3;
}
The only typing information we have is the VariableType, which is simply an enum and carries no schema information. We lack the following functionality:
Regex validation for STR variables. For example, enforcing that a value is a valid email address.
Schema for BYTES variables. For example, we could include information denoting that a BYTES field is a protobuf, an Avro record, etc.
Any form of JSON Schema for JSON_OBJ or JSON_ARR variables. Currently, we don't even validate that the value is well-formed JSON, much less check for the presence of any fields.
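To make the gap concrete, here is a minimal sketch in Python (not LittleHorse code). It assumes JSON variables arrive as strings, and every name in it (`type_only_check`, `is_email`, `is_json_obj_with_fields`, the deliberately simplistic email pattern) is hypothetical. It contrasts a check that knows only the enum with the checks a schema would allow.

```python
import json
import re

# Hypothetical, deliberately simplistic email pattern used only for illustration.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def type_only_check(var_type: str, value: object) -> bool:
    """Mimics validation that knows only the VariableType enum."""
    # Assumption: JSON_OBJ and JSON_ARR values are carried as JSON strings.
    expected = {"STR": str, "BYTES": bytes, "JSON_OBJ": str, "JSON_ARR": str}
    return isinstance(value, expected[var_type])


def is_email(value: str) -> bool:
    """Regex validation that a STR schema could enforce."""
    return EMAIL_PATTERN.fullmatch(value) is not None


def is_json_obj_with_fields(value: str, required: set) -> bool:
    """Checks that the string is well-formed JSON, is an object, and has the required keys."""
    try:
        parsed = json.loads(value)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required.issubset(parsed)
```

With only the enum, `type_only_check("STR", "not-an-email")` and `type_only_check("JSON_OBJ", "{broken")` both pass, while the schema-aware checks reject the same values.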
Proposal
I propose to introduce a new Tenant-scoped GlobalGetable object called VariableSchema. It would look like this:
message VariableSchemaId {
  string name = 1;
  int32 version = 2;
}
message VariableSchema {
  // Id of the schema
  VariableSchemaId id = 1;
  // human-readable description
  string description = 2;
  oneof schema {
    // An Open-API v3 Schema
    OpenApiV3Schema open_api = 3;
    // Protobuf schema
    ProtoBufSchema proto_schema = 4;
    // String Regex
    StringRegex string_regex_schema = 5;
  }
}
Protobuf already has a well-defined "Proto Descriptor" API for sharing protobuf schemas. For OpenAPIv3, we can use the ApiCurio Data Models Library. Lastly, String regexes should be easy enough to match using a Pattern.
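As a rough illustration of how validation might dispatch on the `oneof schema`, the Python sketch below models the two messages as dataclasses and implements only the string-regex case. The field types chosen for `open_api` and `proto_schema`, and the `validate` function itself, are assumptions made for this example; the server would presumably use Java's `Pattern`, for which `re.fullmatch` is the closest Python analogue of a whole-string match.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VariableSchemaId:
    name: str
    version: int


@dataclass(frozen=True)
class VariableSchema:
    id: VariableSchemaId
    description: str = ""
    # At most one of these should be set, mirroring the proto `oneof schema`.
    # The Python types here are placeholders, not the proposed wire types.
    open_api: Optional[str] = None
    proto_schema: Optional[bytes] = None
    string_regex_schema: Optional[str] = None


def validate(schema: VariableSchema, value: object) -> bool:
    """Hypothetical validator; only the string-regex branch is sketched."""
    if schema.string_regex_schema is not None:
        # The whole value must match, not just a prefix or substring.
        return (
            isinstance(value, str)
            and re.fullmatch(schema.string_regex_schema, value) is not None
        )
    raise NotImplementedError("only the string-regex case is sketched here")
```

The OpenAPI and protobuf branches are left unimplemented on purpose, since they depend on the libraries named above.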
Discussion
This section takes it for granted that LittleHorse should adopt some form of schema management. It focuses instead on the specific implementation proposed above.
Benefits
The two most crucial benefits of what is proposed above are:
No additional external dependencies are introduced to the server.
Because the schemas can be managed inside the global KTable, there is no performance penalty (compared with an external Schema Registry, for which we would have to make external network calls).
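The performance argument rests on schema lookups being local reads. The sketch below uses an in-memory Python dictionary as a stand-in for a locally readable store such as the global KTable. The class name, its methods, and the rule that versions start at 0 and increase by one are all assumptions for illustration, not part of the proposal.

```python
from typing import Dict, Optional, Tuple


class LocalSchemaStore:
    """In-memory stand-in for a store that can be read without a network call."""

    def __init__(self) -> None:
        # Keyed by (name, version), mirroring the fields of VariableSchemaId.
        self._schemas: Dict[Tuple[str, int], str] = {}
        self._latest: Dict[str, int] = {}

    def put(self, name: str, definition: str) -> int:
        """Stores a new version and returns its number (assumed to start at 0)."""
        version = self._latest.get(name, -1) + 1
        self._schemas[(name, version)] = definition
        self._latest[name] = version
        return version

    def get(self, name: str, version: int) -> Optional[str]:
        """A plain dictionary lookup; returns None if the schema is unknown."""
        return self._schemas.get((name, version))
```

A command processor validating a variable would call something like `get` on every validation, which is why keeping that call local matters.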
Further benefits are that:
The schemas can be viewed inside our dashboard.
[omitted internal LH Cloud reasons]
Drawbacks
Implementing schema management like this will take considerable effort.
If our users have their own Schema Registries, they will have to maintain their schemas in two places.
The first concern isn't a huge problem given the quality of engineers that work on LittleHorse. For the second concern, vendors or community members could write adapters that keep the LittleHorse VariableSchema objects in sync with an external Schema Registry.
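One possible shape for such an adapter is sketched below in Python. It assumes the external registry can be read as a mapping from schema name to its current definition, and it models the LittleHorse side as a mapping from name to a list of definitions whose index doubles as the version. Both representations and the function name are hypothetical.

```python
from typing import Dict, List


def sync_from_registry(
    external: Dict[str, str], local: Dict[str, List[str]]
) -> List[str]:
    """Copies changed definitions into `local` and returns the names that changed."""
    updated = []
    for name, definition in sorted(external.items()):
        versions = local.setdefault(name, [])
        # Append a new version only when the definition differs from the latest one,
        # so that repeated syncs are idempotent.
        if not versions or versions[-1] != definition:
            versions.append(definition)
            updated.append(name)
    return updated
```

Running the sync twice against an unchanged registry creates no new versions, which keeps a periodic sync job from inflating version numbers.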
Alternatives
An alternative is to have a hard integration with an external Schema Registry such as Confluent Schema Registry or ApiCurio.
This would be nice because:
Some users may already store schemas in these systems and would not want to duplicate the schema.
We have less work to do in our server.
However, there are some drawbacks:
We have another hard dependency that is required for certain WfSpecs to work. If a user doesn't deploy ApiCurio or Confluent Schema Registry, then they cannot use all of the desired functionality in our WfSpecs.
Schema validation requires making external calls inside our command processors, which will hurt performance.
There are several popular schema registry implementations with varying levels of incompatibility with each other, and it is unclear which one to pick. We would need to take into account adoption, licensing, functionality, ecosystem, and project stability.