Algebra of schema transformations

We call "schema transformation" all the operations one can apply to a schema that maintain the guaranty that data written by the original schema can be read using the new one (backward compatibility) and vice versa (forward compatibility).

These transformations are implemented in Avro, although there are probably not identified as an algebra by Avro's author.

So the two steps for solving this issue can be:

look at Avro's source code and documentation to list the candidate operations
formalise these findings to come up with a more fundamental abstraction (there is a fair chance we'll end up with a category of "schemas+transformations" (the so-called schema transformations being the morphisms of that category).

Alternatively, searching through the academic literature should also yield interesting results.

This probably needs proper definition of forward/backward compatibility.

Backward compatibility is achieved when processes using a newer version of a schema remains able to process data that was written using an older version of the same schema. Similarly, forward compatibility is achieved when processes using an older version of a schema keep being able to process data emitted by processes that use a newer version of the said schema.

More precisely, for any pair of functors F and G (F being covariant and G contravariant) then, given s0, s1, ... , sn the successive versions of a schema for type A, and F_si (resp. G_si) the instance of F[A] (resp. G[A]) derived from si:

backward compatibility: forall i,j, i<=j, forall a:A, F_sj(G_si(a)) = a.
forward compatibility: forall i,j, i>=j, forall a:A, F_sj(G_si(a)) = a

Let me take a look at what I can find about this

Note that in some cases, upgrading data under older schema is all that's necessary. It's not necessary to downgrade data from the latest schema to a predecessor.

As for operations, I'd look for a small, orthogonal, economical set of operations and test them with specific use cases.

Here are some ideas:

add term to product (with default value)
remove term from product (this is the dual of 1, seen from the opposite side)
add term to sum
remove term from sum (with default value)
promote primitive to singleton flat product / sum
demote singleton flat product / sum to primitive
pull term in product in product to top level
push term in product to term in product in product
pull term in sum in sum to top level
push term in sum to term in sum in sum

(1) - (4) cover simple additions and removals; (5) - (6) cover elaborations / de-elaborations, (7) - (10) cover moving information around, without changing it.

I have a feeling this is not quite right, something's fishy about (5) - (6). Maybe there's something more general or fundamental there. The other ones seem more solid.

I think there's a more fundamental idea here, namely that of a normal form.

So my idea is that we formulate a set of reversible algebraic operations. Some of those are given by the fact that Product and Sumtypes form at least a commutative semiring (as long as you have some sort of unique tag for each term in the product/sum).

Having these operations I think we could define a normal form such that if Schema[A] and Schema[B] have the same normal form there exists an Iso[Schema[A],Schema[B]] that is constructible by applying the steps from Schema[A] to the normal form composed with the reverse of Schema[B] to the normalform.

I have the hunch that the steps that @jdegoes listed could be covered by adding additional, maybe non-reversible, operations to that set of algebraic transformations. This would mean that Iso[Schema[A],Schema[B]]would becomeSchema[A] => Schema[B]` I think.

However I am massively lacking in the required theoretical knowledge to verify that any of the above pans out.

Eg. 1 & 3) Those are the primitives as this boils down to applying product or sum respectively to two terms. one being your existing term and the other is the new term. But this how we construct schemas anyway

2 & 4) not sure about these ones yet.

5 & 6) are application of identity law. if S[_,_], S0, P[_, _] and P1 are our sumtype constructor (\/), sum identity type (Nothing), producttype constructor (Tuple2) and product type identity (Unit) it should be possible to define def sumIdentity[SX]:Iso[S[SX,S0],SX] etc.

7 - 10) I think that would be application of commutativity and associativity?

"Reversible" is too strong, it would prevent you from removing information.

You need up.down.up.down == up.down, and down.up.down.up == down.up. But definitely not up.down == id or down.up == id (they are too strong).

Remember the concrete problem:

A user has, let's say, a versioned set of JSON documents
Every time a change is made to the schema, they produce a new schema version
On reading older JSON data formats, they need to upgrade them to later versions
There may be business input into the format of the JSON documents, but even if there isn't, they cannot change old schemas because documents may exist "in the wild" with those schemas

Migration is generally a fit for self-describing formats like Avro, JSON, etc. Avro has some of this built in, at least the basics, but you can't do structural modifications that preserve information.

If you look at these requirements you can see that Iso on schemas is not going to be able to handle them. You can handle an upgrade as type MigrateUp[A, B] = Schema[A] => Schema[B]. This type guarantees upgrades are possible. However some machinery is missing around versioning.

Possibly you can get that with type MigrateUp[E, A, B] = Schema[A] => Either[E, Schema[B]]. Now upgraders can fail so you can try different ones.

But this is not ideal. It means your schema definitions, which might consist of 100+ different structures, has to be replicated on every major version. You have the "old" schemas, when you produce a new version, you have the "new" schemas. This is not really that much savings from copy/pasting your whole data model into a new package, and writing conversions where necessary. Both are troubling from a maintenance perspective.

Now, at least with the Schema approach, you copy / paste the schemas but in theory because there's isolation between the schema and the ADT, you don't need a new ADT. So you modify your ADT but keep the old schemas the same. Maybe you version all the schemas independently so you only have to copy/paste the schemas that change. That starts to look like savings.

Now even better is if you can take a schema and describe changes to that schema, and then dynamically produce a new schema based on the change list. In this case, you can imagine an append-only log of changes to a base schema, inserting version annotations as appropriate; and no schemas would ever have to be copy/pasted. Your "changelog" schema would only have additions to some sort of list-like structure, and it can always be used to materialize the latest version given any older version.

That's in the realm of magic. It's easy from a maintenance perspective. And with proper types, you have guarantees you aren't breaking backward compatibility. Your data model involves as it needs to and you just add entries to the changelog when you change the format of the data.

I've drawn a sketchy diagram to illustrate backward/forward compatibility.

diagram

Diagram explanations

A0 and A1 are two versions of a given ADT.

D0 and D1 are the "wire formats" of respectively A0 and A1 (in practice both are the same type, we need to make them different to get a commutative diagram but we'll just mention both as D from now on).

u, d and u', d' are two pairs of "imaginary" morphisms¹ between these types.

We have two target functors R and W defined by their action on Ai:

R is a decoder: it can be seen as a collection of morphisms ri : D => Ai
W is an encoder, equivalent to a collection of morphisms wi : Ai => D
R and W are inverse of one another, ie. for all i, fi . gi = id[Ai] and gi . fi = id[Di]

In this setting, we name backward compatibility the ability to derive a morphism up : D => A1 and forward compatibility the ability to derive a morphism down: D => A0, such that the whole diagram commutes².

That is, the following identities must hold:

u = up . w0 = r1 . u' . w0
d = down . w1 = r0 . d' . w1

up is an "upgrading reader" able to read data written with an older format (w0), 1. means that writing a value of typeA0 and then reading it with up produces the same result as applying u to that value, and that writing the same A0 value, then transforming the obtained data with u' and finally reading that with r1 also yields the same result.

Similarly, down is a "downgrading reader", able to read data written with a newer format (w1), etc.

A few things are worth noticing in this definition:

backward/forward compatibility is mostly (if not exclusively) a matter of deriving readers**³**.
forward compatibility requires the ability to fetch schemas at run time (A1 didn't exist when the code using A0 was written).
We want to achieve b/f compatibility without any transformation on the wire format (so no Di => Dj morphism is allowed). It would be possible – and even easier, provided that F and G are functors – to implement b/f compatibility using such wire-data transformations, but that would be unacceptably inefficient.

Implementation ideas

The above definition stresses the fact that successive versions of the ADT never coexist in the codebase and that we don't want to modify the wire-data.

But modifying schemas is still allowed.

Maybe b/f compatibility can be achieved by simply modifying the "current" schema in such a way that the F derived from the modified schema is our actual up (resp. down).

Imagine for example that I have the following SA1 schema for the type A1 above:

val sA1 = "age" -*>: prim(ScalaInt) :*: "name" -*>: prim(ScalaString) :*: "active" -*>: prim(ScalaBoolean)

and that I know that is is equal to the result of adding an age field of type Int with default value 42 to the schema of the (old) type A0 (that had only name and active fields).

I can then (automatically) come up with a sA0up schema, representing "upgraded A0" values:

val sA0up = iso("name" -*>: prim(ScalaString) :*: "active" -*>: prim(ScalaBoolean), Iso[(String, Boolean), (Int, (String, Boolean))](p => (42, p))(t => t._2))

The instance of R[A1] derived "as usual" from sA0up will behave exactly as the up we are looking for.

With the exact same information, we can also derive the downgrading version:

val sA0down = iso(sA1, Iso[(Int, (String, Boolean)), (String, Boolean)](t => t._2)(p => (42, p)))

Likewise, the R[A0] derived from sA0down will behave exactly like the down we're looking for.

Conclusion

I think that coming up with a solution for this issue is rather easy after all. It is "just" a matter of defining an Transformation ADT and two "morphisms" up and down of "type" (Schema, Transformation) => Schema.

But I also think that the provided solution will be quite hard to verify. By that I mean that it would be:

difficult to test: despite this looooong comment, the properties of up and down are still quite vaguely defined, and they are so in terms of their relationship with many other morphisms (R, W and the imagniary u, u', d and d').
almost impossible to check for completeness: the above definition doesn't provide any mean to make sure that a proposed Transformation ADT is complete (ie. that we covered all the possible transformations from which we can derive up and down).

¹ : They are "imaginary" in the sense that they will never be implemented in production code. u and d don't make sense because A0 and A1 should never be present simultaneously in any state of the code base ; u' and d' would make sense (because D0 and D1 are actually the same type in practice), but wouldn't be efficient (we want to achieve backward/forward compatibility without transforming the wire data).

² : There are actually two diagrams overlayed on one another here. One could define backward and forward compatibility independently and draw two commuting diagrams, one with only up, u and u' and one with only down, d and d' respectively.

³ : Although it would be possible to define upgrading or downgrading writers (eg some A0 => D1 or A1 => D0), that wouldn't make much sense in practice. In general, you can know "who" has written a given piece of wire-data, but you cannot know in advance "who" will read the data you write. So upgrading/downgrading writers would only make sense in the case of peer-to-peer communication when the node emitting data knows the schema used by the receiving node, but that case can also be handled with upgrading/downgrading readers.

Random thoughts:

Agreed, I don't think we need to worry (for now) about writers. The most common use case I would guess is to be able to read data from previous version of the schema. (A full general solution would be very interesting tho).
Down is useful if it can be automatically derived, meaning I don't need to change running code to be able to understand future schemas.
Restricting the possible changes to the schema simplifies the problem: e.g. adding and removing fields. IIUC, schemas are trees, and changes that end up rearranging subtrees are more difficult to calculate (I still need to look at that generic tree diff algo)
Not sure how the "user interface"/API will look like... are we going to describe the new schema as original schema + changes or just define a brand new one and we'll derive the difference?

Not sure how the "user interface"/API will look like... are we going to describe the new schema as original schema + changes or just define a brand new one and we'll derive the difference?

I think this would deserve a whole discussion/issue on its own.

My first intuition would be to aim for something like :

// In a JVM running the (new) version where A1 is defined but not A0
val a1 : Schema[A1] = ???
val transfo: Transformation = ???
val upgradingA0: Schema[A1] = Schema.upgradingVia(transfo).to(a1)
val readA0asA1 = upgradingA0.to[Reads]

// in a JVM running the (old) version where A0 is defined but not A1
val a0: Schema[A0] = ???
val transfo: Transformation = ??? // the same as above, but here it must be obtained at run time
val downgradingA1: Schema[A0] = Schema.downgradingVia(transfo).to(a0)
val readAsA0 = downgradingA1.to[Reads]

Note that in each case, the version of A that isn't known locally (as well as its schema) is never mentioned.

Hi, your talk at scalar was very interesting. Just some random thoughts I had, to be honest I don't fully understand your current solution, so just ignore anything irrelevant.

I think forward compatibility would be very interesting in time especially with streaming applications. With only backwards compatibility, all consumers have to be updated before the producer is updated. This is quite painful. As mentioned above forward compatibility needs some way to get schemas at runtime. The two approaches I'm aware of are embedding them in the record (protobuf) or fetching them from a versioned repository (confluent schema registry for avro).

For doing the actual migration from the writers schema to the readers schema ( so up.down.??? ) we would need some way to retrieve the writers schema from serialized data without being able to fully deserialize it. This probably has to be something format specific like a version field in JSON or a magic byte in binary formats. I think this would tie in nicely with the migrations step approach mentioned above, where every migration rule would map to a new version. So given a history of (version, migration, data schema):

(1, _, {"f0": {"type": "Int", "default": 1}}) (2, rename("f0", "f1"), {"f1": {"type": "Int", "default": 1}}) (3, removeDefault("f1"), {"f1": {"type": "Int"}})

We should be able to deserialize something like {"version": 0} in an application running version 3 as {"version": 3, "f1": 1} with the compiler guaranteeing correctness for it. This is an example avro struggles with https://github.com/confluentinc/schema-registry/issues/209

spartanz / schemaz