hirrolot / datatype99

Algebraic data types for C99
MIT License
1.38k stars 23 forks source link
adt algebraic algebraic-data-types c99 derive introspection metalang99 metaprogramming pattern-matching reflection-library sum-types tagged-unions type-system variant

Datatype99

Safe, intuitive [algebraic data types] with exhaustive pattern matching & compile-time introspection facilities. No external tools required, pure C99.

Highlights

Installation

Datatype99 consists of one header file datatype99.h and one dependency Metalang99. To use it in your project, you need to:

  1. Add datatype99 and metalang99/include to your include directories.
  2. Specify -ftrack-macro-expansion=0 (GCC) or -fmacro-backtrace-limit=1 (Clang) to avoid useless macro expansion errors.

If you use CMake, the recommended way is FetchContent:

include(FetchContent)

FetchContent_Declare(
    datatype99
    URL https://github.com/hirrolot/datatype99/archive/refs/tags/v1.2.3.tar.gz # v1.2.3
)

FetchContent_MakeAvailable(datatype99)

target_link_libraries(MyProject datatype99)

# Disable full macro expansion backtraces for Metalang99.
if(CMAKE_C_COMPILER_ID STREQUAL "Clang")
  target_compile_options(MyProject PRIVATE -fmacro-backtrace-limit=1)
elseif(CMAKE_C_COMPILER_ID STREQUAL "GNU")
  target_compile_options(MyProject PRIVATE -ftrack-macro-expansion=0)
endif()

(By default, datatype99/CMakeLists.txt downloads Metalang99 v1.13.2 from the GitHub releases; if you want to override this behaviour, you can do so by invoking FetchContent_Declare earlier.)

Optionally, you can precompile headers in your project that rely on Datatype99. This will decrease compilation time, because the headers will not be compiled each time they are included.

Happy hacking!

Usage

Put simply, Datatype99 is just a syntax sugar over tagged unions; the only difference is that it is more safe and concise. For example, to represent a binary tree, you would normally write something like this:

typedef struct {
    struct BinaryTree *lhs;
    int x;
    struct BinaryTree *rhs;
} BinaryTreeNode;

typedef struct {
    enum { Leaf, Node } tag;
    union {
        int leaf;
        BinaryTreeNode node;
    } data;
} BinaryTree;

To avoid this boilerplate, you can use Datatype99:

datatype(
    BinaryTree,
    (Leaf, int),
    (Node, BinaryTree *, int, BinaryTree *)
);

Say you want to sum all nodes and leafs in your binary tree. Then you may write something like this:

int sum(const BinaryTree *tree) {
    switch (tree->tag) {
    case Leaf:
        return tree->data.leaf;
    case Node:
        return sum(tree->data.node.lhs) + tree->data.node.x + sum(tree->data.node.rhs);
    }

    // Invalid input (no such variant).
    return -1;
}

... but what if you accidentally access tree->data.node after case Leaf:? Your compiler would not warn you, thus resulting in a business logic bug.

With Datatype99, you can rewrite sum as follows, using a technique called pattern matching:

int sum(const BinaryTree *tree) {
    match(*tree) {
        of(Leaf, x) return *x;
        of(Node, lhs, x, rhs) return sum(*lhs) + *x + sum(*rhs);
    }

    // Invalid input (no such variant).
    return -1;
}

of gives you variables called bindings: x, lhs, or rhs. This design has a few neat aspects:

The last thing unmentioned is how you construct variants. Internally, Datatype99 generates inline static functions called value constructors; you can use them as follows:

BinaryTree leaf5 = Leaf(5);
BinaryTree leaf7 = Leaf(7);
BinaryTree node = Node(&leaf5, 123, &leaf7);

Finally, just a few brief notes about pattern matching:

Congratulations, this is all you need to know to write most of the stuff! If you feel fancy, you can also introspect your types at compile-time; see examples/derive/ for the examples.

Syntax and semantics

Having a well-defined semantics of the macros, you can write an FFI which is quite common in C.

EBNF syntax

<datatype>      ::= "datatype(" [ <derive-clause> "," ] <datatype-name> { "," <variant> }+ ")" ;
<record>        ::= "record("   [ <derive-clause> "," ] <record-name>   { "," <field>   }* ")" ;
<datatype-name> ::= <ident> ;
<record-name>   ::= <ident> ;

<variant>       ::= "(" <variant-name> { "," <type> }* ")" ;
<field>         ::= "(" <type> "," <field-name> ")" ;
<variant-name>  ::= <ident> ;
<field-name>    ::= <ident> ;

<derive-clause> ::= "derive(" <deriver-name> { "," <deriver-name> }* ")" ;
<deriver-name>  ::= <ident> ;

<match>         ::= "match(" <lvalue> ") {" { <of> }* [ <otherwise> ] "}" ;
<matches>       ::= "MATCHES(" <expr> "," <ident> ")" ;
<if-let>        ::= "ifLet(" <lvalue> "," <variant-name> "," <ident> { "," <ident> }* ")" <stmt> ;
<of>            ::= "of(" <variant-name> { "," <ident> }* ")" <stmt> ;
<otherwise>     ::= "otherwise" <stmt> ;
Note: shortened vs. postfixed versions Each listed identifier in the above grammar corresponds to a macro name defined by default -- these are called _shortened versions_. On the other hand, there are also _postfixed versions_ (`match99`, `of99`, `derive99`, etc.), which are defined unconditionally. If you want to avoid name clashes caused by shortened versions, define `DATATYPE99_NO_ALIASES` before including `datatype99.h`. Library headers are strongly advised to use the postfixed macros, but without resorting to `DATATYPE99_NO_ALIASES`.

Semantics

(It might be helpful to look at the generated data layout of examples/binary_tree.c.)

datatype

  1. Before everything, the following type definition is generated:
typedef struct <datatype-name> <datatype-name>;
  1. For each non-empty variant, the following type definition is generated (the metavariable <type> ranges over a corresponding variant's types):
typedef struct <datatype-name><variant-name> {
    <type>0 _0;
    ...
    <type>N _N;
} <datatype-name><variant-name>;
  1. For each non-empty variant, the following type definitions to types of each field of <datatype-name><variant-name> are generated:
typedef <type>0 <variant-name>_0;
...
typedef <type>N <variant-name>_N;
  1. For each variant, the following type definition to a corresponding sum type is generated:
typedef struct <datatype-name> <variant-name>SumT;
  1. For each sum type, the following tagged union is generated (inside the union, only fields to structures of non-empty variants are generated):
typedef enum <datatype-name>Tag {
    <variant-name>0Tag, ..., <variant-name>NTag
} <datatype-name>Tag;

typedef union <datatype-name>Variants {
    char dummy;

    <datatype-name><variant-name>0 <variant-name>0;
    ...
    <datatype-name><variant-name>N <variant-name>N;
} <datatype-name>Variants;

struct <datatype-name> {
    <datatype-name>Tag tag;
    <datatype-name>Variants data;
};
Note on char dummy; `char dummy;` is needed to make the union contain at least one item, according to the standard, even if all variants are empty. Such a `datatype` would enforce strict type checking unlike plain C `enum`s.
  1. For each variant, the following function called a value constructor is generated:
inline static <datatype-name> <variant-name>(/* ... */) { /* ... */ }

If the variant has no parameters, this function will take void and initialise .data.dummy to '\0'; otherwise, it will take the corresponding variant parameters and initialise the result value as expected.

  1. Now, when a sum type is fully generated, the derivation process takes place. Each deriver taken from derive(...) is invoked sequentially, from left to right, as
ML99_call(DATATYPE99_DERIVE_##<deriver-name>I, v(<datatype-name>), variants...)

where

Put simply, a deriver is meant to generate something global for a sum type, like interface implementations or almost any other stuff. In terms of Rust, you can think of it as of the derive attribute.

record

record represents a record type: it is simply a struct for which the derivation process is defined.

  1. The following structure is generated:
typedef struct <record-name> {
    // Only if <record-name> has no fields:
    char dummy;

    <type>0 <field-name>0;
    ...
    <type>N <field-name>N;
} <record-name>;
Note on char dummy; `char dummy;` is needed to make the structure contain at least one item, according to the standard. Such `record(Foo)` can be used to implement interfaces for it (see [Interface99]).
  1. Each deriver taken from derive(...) is invoked sequentially, from left to right, as
ML99_call(DATATYPE99_RECORD_DERIVE_##<deriver-name>I, v(<record-name>), fields...)

where

match

match has the expected semantics: it sequentially tries to match the given instance of a sum type against the given variants, and, if a match has succeeded, it executes the corresponding statement and moves down to the next instruction (match(val) { ... } next-instruction;). If all the matches have failed, it executes the statement after otherwise and moves down to the next instruction.

A complete match construct results in a single C statement.

of

of accepts a matched variant name as a first argument and the rest of arguments comprise a comma-separated list of bindings.

There can be more than one _ binding, however, non-_ bindings must be distinct.

To match an empty variant, write of(Bar).

MATCHES

MATCHES just tests an instance of a sum type for a given variant. If the given instance corresponds to the given variant, it expands to truthfulness, otherwise it expands to falsehood.

matches

DEPRECATED: use MATCHES instead.

ifLet

ifLet tries to match the given instance of a sum type against the given variant, and, if a match has succeeded, it executes the corresponding statement.

Think of ifLet(<expr>, <variant-name>, vars...) { /* ... */ } as of an abbreviation of

match(<expr>) {
    of(<variant-name>, vars...) { /* ... */ }
    otherwise {}
}

A complete ifLet construct results in a single C statement.

Unit type

The unit type UnitT99 represents the type of a single value, unit_v99 (it should not be assigned to anything else). These are defined as follows:

typedef char UnitT99;
static const UnitT99 unit_v99 = '\0';

If DATATYPE99_NO_ALIASES remains undefined prior to #include <datatype99.h>, UnitT99 and unit_v99 are also accessible through object-like macros UnitT & unit_v.

Derive helper attributes

You can pass named arguments to a deriver; these are called derive helper attributes. They must be specified as object-like macros of the form:

#define <variant-name>_<namespace>_<attribute-name> attr(/* attribute value */)

where <namespace> is either <datatype-name>/<record-name> or <variant-name>/<field-name> for datatype/record-specific and variant/field-specific attributes, respectively.

To manipulate derive helper attributes, there are a few predefined macros:

(The naming convention here is the same as of Metalang99.)

Miscellaneous

Macro Metalang99-compliant counterpart
datatype DATATYPE99_datatype
record DATATYPE99_record
of DATATYPE99_of
ifLet DATATYPE99_ifLet

(An arity specifier and desugaring macro are provided for each of the above macros.)

Guidelines

Clang-Format issues

If you use Clang-Format, cancel formatting for a datatype definition using // clang-format off & // clang-format on to make it look prettier, as in the examples.

#undef derive helper attributes

Always #undef derive helper attributes after a corresponding datatype definition not to pollute your namespace.

Descriptive names

If the meaning of variant parameters is not clear from the context, give them descriptive names. This can be achieved in several ways:

// 1. Define type aliases to variant parameters.
typedef double XCoordinate;
typedef double YCoordinate;

typedef double Width;
typedef double Height;

datatype(
    Shape,
    (Point, XCoordinate, YCoordinate),
    (Rectangle, Width, Height)
);

// 2. Define separate structures.
typedef struct {
    double x, y;
} Point;

typedef struct {
    double width, height;
} Rectangle;

datatype(
    Shape,
    (MkPoint, Point),
    (MkRectangle, Rectangle)
);

Comparison:

Pitfalls

Top-level break/continue

Do not use break/continue inside a statement provided to of/ifLet but outside of any for/while loops in that statement. For example, this code is fine:

match(x) {
    of(Foo, a, b, c) {
        for (int i = 0; i < 10; i++) {
            continue;
        }
    }
}

But this code is not fine:

for (int i = 0; i < 10; i++) {
    match(x) {
        of(Foo, a, b, c) {
            if (a == 7) { break; }
            continue;
        }
    }
}

To make it valid, you can rewrite it as follows:

for (int i = 0; i < 10; i++) {
    match(x) {
        of(Foo, a, b, c) {
            if (a == 7) { goto my_break; }
            goto my_continue;
        }
    }

    // Datatype99 prohibits top-level `break`/`continue`.
    my_continue:;
}
my_break:;

Array as a variant parameter

To specify an array as a variant parameter, you must put it into a separate struct; see examples/array_in_variant.c.

Mutable bindings

Bindings introduced by of are always mutable, so make sure you do not mutate them if the value passed to match is qualified as const.

Credits

Thanks to Rust and ML for their implementations of sum types.

Publications

Release procedure

  1. Update DATATYPE99_MAJOR, DATATYPE99_MINOR, and DATATYPE99_PATCH in datatype99.h.
  2. Update CHANGELOG.md.
  3. Release the project in GitHub Releases.

FAQ

Q: Why use C instead of Rust/Zig/whatever else?

A: There is a lot of software written in plain C that can benefit from Datatype99; C is #1 programming language as of 2020, according to TIOBE. People use C due to technical and social reasons:

See also:

Overall, if you can afford a more modern/high-level language, I encourage you to do so instead of using old C. However, many people do not have this possibility (or it would be too costly).

Q: Why not third-party code generators?

A: See Metalang99's README >>.

Q: How does it work?

A: In short, datatype expands to a tagged union with value constructors; match expands to a switch statement. To generate all this stuff, Metalang99 is used, a preprocessor metaprogramming library.

More on it in "Compiling Algebraic Data Types in Pure C99".

Q: Does it work on C++?

A: Yes, C++11 and onwards is supported.

Q: What is the difference between Datatype99 and Metalang99?

A: Metalang99 is a functional language for metaprogramming, whereas Datatype99 is an implementation of algebraic data types written in this language.

Q: What about compile-time errors?

A: Some kinds of syntactic errors are detected by the library itself:

Error: Bar(int) instead of (Bar, int)

[playground.c]

datatype(A, (Foo, int), Bar(int));

[/bin/sh]

$ gcc playground.c -Imetalang99/include -Idatatype99 -ftrack-macro-expansion=0
playground.c:3:1: error: static assertion failed: "ML99_assertIsTuple: Bar(int) must be (x1, ..., xN)"
    3 | datatype(A, (Foo, int), Bar(int));
      | ^~~~~~~~

Error: Missing comma

[playground.c]

datatype(A, (Foo, int) (Bar, int));

[/bin/sh]

$ gcc playground.c -Imetalang99/include -Idatatype99 -ftrack-macro-expansion=0
playground.c:3:1: error: static assertion failed: "ML99_assertIsTuple: (Foo, int) (Bar, int) must be (x1, ..., xN), did you miss a comma?"
    3 | datatype(A, (Foo, int) (Bar, int));
      | ^~~~~~~~

Error: Trailing comma is prohibited

[playground.c]

datatype(A, (Foo, int), (Bar, int), /* trailing comma is prohibited */);

[/bin/sh]

$ gcc playground.c -Imetalang99/include -Idatatype99 -ftrack-macro-expansion=0
playground.c:3:1: error: static assertion failed: "ML99_assertIsTuple: must be (x1, ..., xN)"
    3 | datatype(A, (Foo, int), (Bar, int), /* trailing comma is prohibited */);
      | ^~~~~~~~

(For better diagnostics, use the latest Metalang99.)

The others are understandable as well:

Error: unknown type name specified in datatype

[playground.c]

datatype(Foo, (FooA, NonExistingType));

[/bin/sh]

playground.c:3:1: error: unknown type name ‘NonExistingType’
    3 | datatype(
      | ^~~~~~~~
playground.c:3:1: error: unknown type name ‘NonExistingType’
playground.c:3:1: error: unknown type name ‘NonExistingType’

Error: non-exhaustive match

[playground.c]

match(*tree) {
    of(Leaf, x) return *x;
    // of(Node, lhs, x, rhs) return sum(*lhs) + *x + sum(*rhs);
}

[/bin/sh]

playground.c: In function ‘sum’:
playground.c:6:5: warning: enumeration value ‘NodeTag’ not handled in switch [-Wswitch]
    6 |     match(*tree) {
      |     ^~~~~

Error: excess binders in of

[playground.c]

match(*tree) {
    of(Leaf, x, excess) return *x;
    of(Node, lhs, x, rhs) return sum(*lhs) + *x + sum(*rhs);
}

[/bin/sh]

playground.c: In function ‘sum’:
playground.c:15:9: error: unknown type name ‘Leaf_1’; did you mean ‘Leaf_0’?
   15 |         of(Leaf, x, excess) return *x;
      |         ^~
      |         Leaf_0
playground.c:15:9: error: ‘BinaryTreeLeaf’ has no member named ‘_1’; did you mean ‘_0’?
   15 |         of(Leaf, x, excess) return *x;
      |         ^~
      |         _0

Error: improperly typed variant arguments

[playground.c]

BinaryTree tree = Leaf("hello world");

[/bin/sh]

playground.c: In function ‘main’:
playground.c:18:28: warning: passing argument 1 of ‘Leaf’ makes integer from pointer without a cast [-Wint-conversion]
   18 |     BinaryTree tree = Leaf("hello world");
      |                            ^~~~~~~~~~~~~
      |                            |
      |                            char *
playground.c:6:1: note: expected ‘int’ but argument is of type ‘char *’
    6 | datatype(
      | ^~~~~~~~

Error: an undereferenced binder

[playground.c]

int sum(const BinaryTree *tree) {
    match(*tree) {
        of(Leaf, x) return x; // x is int *
        of(Node, lhs, x, rhs) return sum(*lhs) + *x + sum(*rhs);
    }
}

[/bin/sh]

playground.c: In function ‘sum’:
playground.c:17:28: warning: returning ‘Leaf_0 *’ {aka ‘int *’} from a function with return type ‘int’ makes integer from pointer without a cast [-Wint-conversion]
   17 |         of(Leaf, x) return x; // x is int *
      |                            ^

From my experience, nearly 95% of errors make sense.

If an error is not comprehensible at all, try to look at generated code (-E). Hopefully, the code generation semantics is formally defined so normally you will not see something unexpected.

Q: What about IDE support?

A: VS Code automatically enables suggestions of generated types but, of course, it does not support macro syntax highlighting.

Q: Which compilers are tested?

A: Datatype99 is known to work on these compilers:

Troubleshooting

warning: control reaches end of non-void function [-Wreturn-type]

This warning happens when you try to return control from within a match statement, and your compiler thinks that not all hypothetical variants are handled. For example:

datatype(MyType, (Foo), (Bar));

int handle(MyType val) {
    match(val) {
        of(Foo) return 5;
        of(Bar) return 7;
    }
}

The above code may seem perfect at first glance, but in fact, it is not. The reason is this: match(val) boils down to switch(val.tag) under the hood, with val.tag being an ordinary C enumeration consisting of the variants Foo and Bar. But what if a caller provides us with neither Foo nor Bar, but with something like 42 (not a valid variant)? Since enum is merely another way to give integers names, a compiler would not complain on the caller site. However, on the callee site, we would have the warning:

test.c: In function ‘handle’:
test.c:10:1: warning: control reaches end of non-void function [-Wreturn-type]
   10 | }
      | ^

The solution is to either panic or return some error-signaling code, like this:

int handle(MyType val) {
    match(val) {
        of(Foo) return 5;
        of(Bar) return 7;
    }

    // Invalid input (no such variant).
    return -1;
}

See issue #9.