Apache Avro schema tools for Tarantool, implemented from scratch in Lua.
To install the module, use:
tarantoolctl rocks install avro-schema
Then load it:
avro_schema = require('avro_schema')
ok, schema = avro_schema.create {
type = "record",
name = "Frob",
fields = {
{ name = "foo", type = "int", default = 42 },
{ name = "bar", type = "string" }
}
}
Creates a schema object (ok == true). If there was a syntax error, returns false and the error message.
ok, normalized_data_copy = avro_schema.validate(schema, { bar = "Hello, world!" })
Returns true if the data was valid. Otherwise, returns false and the error message.
The avro_schema.validate() function creates a normalized copy of the data. Normalization means filling in default values for missing fields. For example, because the "foo" field has a default value of 42, the result from the above example will be { foo = 42, bar = "Hello, world!" }.
To facilitate data evolution, Avro defines certain schema mapping rules. If schemas A and B are compatible, then one can convert data from A to B.
ok = avro_schema.are_compatible(schema1, schema2)
ok = avro_schema.are_compatible(schema2, schema1, "downgrade")
Allowed modifications include:
- renaming types and record fields (provided that aliases are correctly set);
- type promotions (for example, int is compatible with long, but not vice versa).
Let's assume:
- B is newer than A;
- A defines Apple (a record type);
- B renames it to Banana.
Upgrading data from A to B works, since Banana is marked as an alias of Apple.
However, downgrading data from B to A does not work, since in A the record type
Apple has no aliases. To make it work we implement a downgrade mode.
In downgrade mode, name mapping rules take into account the aliases in the source schema,
and ignore the aliases in the target schema.
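As an illustration, the Apple/Banana scenario above might look like this (the record definitions are hypothetical sketches, not taken from this document):

```lua
avro_schema = require('avro_schema')
-- Schema A defines the record "Apple".
ok, schema_a = avro_schema.create {
    type = "record", name = "Apple",
    fields = { { name = "seeds", type = "int" } }
}
-- Schema B renames it to "Banana", keeping "Apple" as an alias.
ok, schema_b = avro_schema.create {
    type = "record", name = "Banana", aliases = { "Apple" },
    fields = { { name = "seeds", type = "int" } }
}
-- Upgrade (A -> B): the alias in the target schema makes this compatible.
ok = avro_schema.are_compatible(schema_a, schema_b)
-- Downgrade (B -> A): works only in "downgrade" mode, which reads the
-- aliases from the source schema instead of the target.
ok = avro_schema.are_compatible(schema_b, schema_a, "downgrade")
```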
avro_schema.is(object)
avro_schema.get_names(schema [, service-fields])
avro_schema.get_types(schema [, service-fields])
The first argument must be a schema object, such as the one created in the Creating a schema example above. The optional second argument is a table with names of types, such as {'string', 'int'}. The result will be a Lua table of field names (for the get_names method) or a Lua table of field types (for the get_types method). The order will match the field order in the flat representation.
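As a sketch, using the "Frob" schema from the Creating a schema example, the calls might look like:

```lua
avro_schema = require('avro_schema')
ok, schema = avro_schema.create {
    type = "record", name = "Frob",
    fields = {
        { name = "foo", type = "int", default = 42 },
        { name = "bar", type = "string" }
    }
}
-- Field names and field types, in flat (tuple) order:
avro_schema.get_names(schema)   -- {'foo', 'bar'}
avro_schema.get_types(schema)   -- {'int', 'string'}
```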
Compiling a schema creates optimized data conversion routines (runtime code generation).
ok, methods = avro_schema.compile(schema)
ok, methods = avro_schema.compile({schema1, schema2})
If two schemas are provided, then the generated routines consume data in schema1 and produce results in schema2.
What if the schema1 source and the schema2 destination are not adjacent revisions, i.e. there were some revisions in between? While going from source to destination directly is fast, it sometimes alters the results. Performing the conversion step by step, using all the in-between revisions, always yields correct results, but it is slow. There is a third option: let compile generate routines that are fast yet produce the correct results.
A few options affecting compilation are recognized.
Enabling downgrade mode (see avro_schema.are_compatible for details):
ok, methods = avro_schema.compile({schema1, schema2, downgrade = true})
Dumping generated code for inspection:
ok, methods = avro_schema.compile({schema1, schema2, dump_src = "output.lua"})
Troubleshooting code generation issues:
ok, methods = avro_schema.compile({schema1, schema2, debug = true, dump_il = "output.il"})
Add service fields (which are part of a tuple, but are not part of an object):
ok, methods = avro_schema.compile({schema, service_fields = {'string', 'int'}})
Compile produces the following routines (returned in a Lua table):
flatten
unflatten
xflatten
flatten_msgpack
unflatten_msgpack
xflatten_msgpack
get_types
get_names
Here is an example which uses the avro schema that we described in
the section Creating a schema, a Tarantool database space,
and the methods that compile
produces. This is a script that you
can paste into a client of a Tarantool server; the comments explain
what the results look like and what they mean.
-- Create a Tarantool database, an index, and a tuple
box.schema.space.create('T')
box.space.T:create_index('I')
box.space.T:insert{1, 'string-value'}
-- Let tuple_1 = a tuple from the database space
tuple_1 = box.space.T:get(1)
-- Load the module
avro_schema = require('avro_schema')
-- Load avro_schema and create a schema as described earlier
ok, schema = avro_schema.create {
type = "record",
name = "Frob",
fields = {
{ name = "foo", type = "int", default = 42 },
{ name = "bar", type = "string" }
}
}
-- Compile, so that "methods" will have the generated routines
ok, methods = avro_schema.compile(schema)
-- Invoke unflatten(). The result will look like this:
-- - {'foo': 1, 'bar': 'string-value'}
-- That is: unflattening can turn tuples into avro-schema objects.
ok, result = methods.unflatten(tuple_1)
result
-- Invoke flatten(). The result can be inserted into the database.
-- The value of the newly inserted tuple will look like this:
-- - [1, 'string-value']
-- That is, flattening can turn avro-schema objects into tuples.
ok, tuple_2 = methods.flatten(result)
box.space.T:truncate()
box.space.T:insert(tuple_2)
-- Make an avro_schema object; validate() fills in foo's default,
-- so the result is {foo=42, bar='Hello, world!'}
ok, normalized_data_copy = avro_schema.validate(schema, { bar = "Hello, world!" })
-- Invoke xflatten(). The result will look like this:
-- - [['=', 1, 42], ['=', 2, 'Hello, world!']]
ok, result = methods.xflatten(normalized_data_copy)
result
-- That is, the format of an xflatten() result is exactly
-- what a Tarantool "update" request looks like.
-- Therefore let's put it in an update request ...
box.space.T:update({42},result)
-- And the result looks like:
-- - [1, 'Hello, world!']
So: with flatten() for inserting, xflatten() for updating, and unflatten() for getting, we have ways to use avro_schema objects as tuples in Tarantool databases.
With the other three methods that work with transformations of avro_schema objects -- flatten_msgpack(), xflatten_msgpack(), and unflatten_msgpack() -- we have similar functionality, except that the transformations are to and from MsgPack objects. (The ..._msgpack() methods are usually faster because they do not need to encode or decode internally.)
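A sketch of the MsgPack variants, assuming the "Frob" schema and the methods table from the compile example above, and assuming the ..._msgpack() routines take and return msgpack-encoded strings:

```lua
msgpack = require('msgpack')
-- Encode an avro-schema object as msgpack, then flatten it.
mp_object = msgpack.encode({ foo = 1, bar = 'string-value' })
ok, mp_tuple = methods.flatten_msgpack(mp_object)
-- mp_tuple should be a msgpack string encoding the flat tuple;
-- decode it to inspect the result:
msgpack.decode(mp_tuple)
```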
The final two methods -- get_types() and get_names() -- have almost the same effect as the get_types() and get_names() described in the earlier section Querying a schema's field names or field types. (The main difference is that the optional "service_fields" argument is unnecessary if methods is the result of a compile done with the service_fields option.) For example:
tarantool> methods.get_names()
---
- - foo
- bar
...
tarantool> methods.get_types()
---
- - int
- string
...
Named types are ones that have mandatory name fields in their definitions: record, fixed, and enum.
Named types can be referenced after the first definition (in depth-first, left-to-right traversal).
Example:
{
name = 'user',
type = 'record',
fields = {
{name = 'uid', type = 'long'},
{
name = 'nested',
type = {
type = 'record',
name = 'nested_record',
fields = {
{name = 'x', type = 'long'},
{name = 'y', type = 'long'}
}
}
},
{
name = 'another_nested',
type = 'nested_record'
}
}
}
The problem: in database management systems NULL is a value, not a type. So it should be possible, for example, to have a "long integer" type that can contain both NULL and integers.
One can try to handle this with a union such as {'null', 'long'}, which can have both null and {long = 42}. What really is necessary, though, is that a single field, whose name determines the type, can contain both null and 42 as valid values (see the JSON Encoding section of the avro-schema standard). This problem -- expressing a single type that accepts both null and 42 -- is the problem that the nullability extension solves.
A type can be marked as nullable by adding an asterisk ("*") at the end of the type name:
{
name = 'user',
type = 'record',
fields = {
{name = 'uid', type = 'long'},
{name = 'first_name', type = 'string'},
{name = 'middle_name', type = 'string*'},
{name = 'last_name', type = 'string'}
}
}
The following types can be marked as nullable:
Note: one can still use a union such as {'null', ...} without an asterisk to make a nullable union type.
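As a sketch of the difference (the field name below is illustrative, not from this document):

```lua
-- 1. A classic Avro union: valid values are null and {long = 42}.
{ name = 'maybe_uid', type = { 'null', 'long' } }
-- 2. The nullability extension: valid values are null and 42 directly.
{ name = 'maybe_uid', type = 'long*' }
```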
Default values are substituted in two cases:
Example:
local schema = {
type = "record", name = "Frob", fields = {
{ name = "foo", default = {f1=1, f2={f2_1=2}}, type =
{ type = "record", name = "default_1", fields = {
{name = "f1", type = "int"},
{name = "f2", default = {f2_1=21}, type =
{type = "record", name = "default_2", fields = {
{name = "f2_1", type = "int"}}
}}
}}},
{ name = "bar", type = "int"}
}
}
ok, handle = avro_schema.create(schema)
ok, methods = avro_schema.compile(handle)
ok, flat = methods.flatten({bar = 11})
-- returns {1, 2, 11}
ok, flat = methods.flatten({foo = {f1 = 3}, bar = 11})
-- returns {3, 21, 11}