Syntax for adding annotations and metadata to fields of structs

aiverson commented 5 years ago

It is possible to add arbitrary key/value pairs as extra information onto structs and fields of structs using the Lua API. Adding data to a struct with the Lua API is very straightforward, as is adding it to an entry in a struct which was created with the Lua API. However, adding data to an entry of a struct created using the syntactic sugar is inconvenient. The obvious tool to do this is to have a bit of syntax for adding arbitrary key/values to structs in the syntax for struct declarations.

Having such metadata is useful for a variety of applications, for example in a serialization library to indicate whether an int8 should be serialized as a number or a single character string.

I have been unable to locate such a syntax in the documentation or tests and would like to write the extension if it doesn't exist yet.

Do any more active developers have suggestions or preferences for this syntax before I start implementing it, or would it not fit well as a part of the core of Terra?

aiverson commented 5 years ago

Possible syntax:

struct example {
    fieldA: int64 @JSONserialized("string")
    fieldB: int8 @{jsonbehavior = "char", csvbehavior = "char"}
    fieldC: int @[boundschecked(-10, 10), lazy(`fieldA / fieldB / fieldB)]
}

A type declaration followed by an @, by analogy to Python's decorators and Java's annotations, starts a metatagging expression. A meta expression may start with a name, in which case the expression evaluates to tagging data, or with an open curly brace, in which case the table literal is used as tagging data that is merged into the field data as part of the metatype. A square bracket starts a comma-separated list of tagging expressions which are merged or composed into the final tagging.

This seems like a workable syntax.

elliottslaughter commented 5 years ago

Just as a counterpoint, what about a syntax like:

struct example [whatever] {
  fieldA [something] : int64,
  fieldB [another] : int8,
  fieldC [thing] : int,
}

The main advantage is that [] is already the escape operator, so it's hopefully relatively obvious that you can use arbitrary Lua syntax here. Because it's on the side of the name (vs the type) it should parse unambiguously. It doesn't involve introducing any novel syntax, only a new position where escapes can be introduced. (You want a list? Just write fieldName [{a, b, c}] like you do in a normal escape.)

Personally, I don't have a super strong need for this. I'm sure @aiverson is aware, but just for anyone else reading this, you can already do:

struct example { ... }
example.annotation = whatever
example:getentries()[1].annotation = something

Or:

local example = terralib.types.newstruct("example")
example.annotation = whatever
example.entries = {
  { field = "fieldA", type = int64, annotation = something },
  ...
}

Because structs are just Lua objects there is already quite a bit of flexibility in what you can do with them; it's mainly a question of what you want the syntax to look like.

aiverson commented 5 years ago

The main problem I have with the first API example you give is that it depends on the order of the fields. This violates the DRY principle by having the ordering of the fields be specified in multiple places that aren't guaranteed to always be updated simultaneously, and it may cause hard-to-diagnose bugs. If there were an API call to get an entry by name, that risk would be lessened, but it still wouldn't be as safe as syntax support that fuses the declaration and annotation. The second API example is nice from a DRY perspective.
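A minimal sketch of what a lookup by name could look like, written in plain Lua so it runs without Terra. The entries list is modeled as an ordinary table with `field` and `type` keys, the type names are placeholder strings, and `entry_by_name` is a hypothetical helper, not an existing terralib function.

```lua
-- Plain-Lua model of a struct's entries list. In Terra the list would come
-- from example:getentries(); the string type names here are placeholders.
local entries = {
  { field = "fieldA", type = "int64" },
  { field = "fieldB", type = "int8" },
  { field = "fieldC", type = "int" },
}

-- Hypothetical helper (not part of terralib): find an entry by field name
-- instead of by position, so reordering the declaration cannot silently
-- move the annotation to a different field.
local function entry_by_name(list, name)
  for _, entry in ipairs(list) do
    if entry.field == name then
      return entry
    end
  end
  error("no field named '" .. tostring(name) .. "'", 2)
end

-- Annotate fieldB without relying on its index in the list.
entry_by_name(entries, "fieldB").jsonbehavior = "char"
```

This removes the dependence on field order, but the field name is still written twice, so a rename can leave the annotation pointing at a name that no longer exists. The helper raises an error in that case rather than failing silently.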

The example of using square brackets for the annotations is nice and clean, but I see two conceptual types of annotations: tables, which get merged with the entry to add simple key/value pairs to it, and functions, which get called with the entry as an argument and are capable of more complex logic to inspect or alter the entry, for example to use the name of the field as a key in a hashmap, or to produce a member _field and make .field have getter/setter behaviors backed by _field. This supports both the use case of annotations for field-associated data in metaprogramming and the equivalent of Kotlin's by keyword for delegated properties as userspace libraries. So while the square bracket syntax is much nicer than the @-based syntax I mentioned, it doesn't capture all of that concept space. The simple solution is to require that every annotation is either a single function that takes the entry as an argument or a table of such functions, and then add a function to terralib that takes a table of key/value pairs and produces such a function. This would appear as follows, and I think it is a much better syntax than the one I initially proposed.

struct example {
    fieldA [JSONserialized("string")]: int64
    fieldB [terralib.annotate{jsonbehavior = "char", csvbehavior = "char"}]: int8
    fieldC [{boundschecked(-10, 10), lazy(`fieldA / fieldB / fieldB)}]: int
}
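The rule described above (every annotation is a function of the entry, or a table of such functions, with a helper that turns a key/value table into such a function) can be sketched in plain Lua. Everything here is an assumption for illustration: `annotate` stands in for the proposed terralib.annotate, `apply_annotations` and `backed` are hypothetical names, and the entry is a plain table rather than a real Terra entry.

```lua
-- Hypothetical stand-in for the proposed terralib.annotate: wrap a table of
-- key/value pairs in a function that merges them into a field entry.
local function annotate(pairs_to_add)
  return function(entry)
    for key, value in pairs(pairs_to_add) do
      entry[key] = value
    end
    return entry
  end
end

-- Apply either a single annotation function or a list of them, in order.
local function apply_annotations(entry, annotations)
  if type(annotations) == "function" then
    annotations = { annotations }
  end
  for _, fn in ipairs(annotations) do
    fn(entry)
  end
  return entry
end

-- Illustrative function-style annotation that inspects the entry: it records
-- a backing-field name derived from the field's own name.
local function backed(entry)
  entry.backing = "_" .. entry.field
end

local fieldB = apply_annotations(
  { field = "fieldB", type = "int8" },
  { annotate{ jsonbehavior = "char", csvbehavior = "char" }, backed }
)
```

With this shape the parser only ever has to handle one kind of value, a function, and plain key/value data becomes a special case built on top of it.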

elliottslaughter commented 5 years ago

Your comment about functions being called made me start thinking that it may be more useful for this behavior to apply to the whole struct, because the function might want to add or remove fields, not just edit existing fields. E.g. mimicking Rust's serde interface:

struct example [serialized] {
  fieldA [serialize_as("field-A")] : int64,
}

Is essentially equivalent to:

struct example { ... }
example = serialized(example)

Maybe this adds a version number to the struct, or something.
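As a rough illustration of a struct-level function that adds a version number, here is a plain-Lua sketch. The struct is modeled as a table with a name and an entries list, and `serialized` is a hypothetical annotation written for this example, not something terralib provides.

```lua
-- Plain-Lua model of a struct: a name plus an ordered list of entries.
local example = {
  name = "example",
  entries = { { field = "fieldA", type = "int64" } },
}

-- Hypothetical struct-level annotation: it receives the whole struct, so it
-- can add fields as well as edit existing ones. Here it prepends a version
-- field and marks the struct as serialized.
local function serialized(struct)
  table.insert(struct.entries, 1, { field = "version", type = "int32" })
  struct.serialized = true
  return struct
end

-- What `struct example [serialized] { ... }` would desugar to under this reading.
example = serialized(example)
```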

I'm a bit worried about the behavior of functions being non-obvious, because nothing at the use site indicates that a function is being called. terralib.annotation would help somewhat, but it marks this only at the definition of the callee. On the other hand, I also want to minimize the amount of syntax we add, because that has a burden too, even if it's unambiguous in the grammar.

It's probably worth thinking about this more before committing to a specific syntax.

aiverson commented 5 years ago

What I was thinking about for functions being called was that one argument is the field data itself and the other is the entire structure. That way the function could not just modify the field definition but add additional ones and even interrupt and cancel the addition of the current field.
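A sketch of that two-argument calling convention in plain Lua, under stated assumptions: the struct is modeled as a table, annotations are called as fn(entry, struct), and returning false is the (assumed) way to cancel the addition of the current field. `build_struct` and `property` are hypothetical names.

```lua
-- Build a struct model from declared fields. Each annotation is called as
-- fn(entry, struct); returning false cancels the addition of that field.
-- This calling convention is an assumption made for the sketch.
local function build_struct(name, declarations)
  local struct = { name = name, entries = {} }
  for _, decl in ipairs(declarations) do
    local entry = { field = decl.field, type = decl.type }
    local keep = true
    for _, fn in ipairs(decl.annotations or {}) do
      if fn(entry, struct) == false then
        keep = false
        break
      end
    end
    if keep then
      table.insert(struct.entries, entry)
    end
  end
  return struct
end

-- Illustrative annotation: add a hidden "_name" field to the struct and
-- cancel the public one (a getter/setter pair would be generated elsewhere).
local function property(entry, struct)
  table.insert(struct.entries, { field = "_" .. entry.field, type = entry.type })
  return false
end

local s = build_struct("example", {
  { field = "fieldA", type = "int64" },
  { field = "fieldB", type = "int8", annotations = { property } },
})
```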

Isn't there already a syntax for

struct example (serialized) { ... }

I've seen that in some old example code and it worked the last time I checked.

capr commented 5 years ago

@aiverson I saw that too in tests/class*.t

aiverson commented 5 years ago

Now that I have finally gotten back around to this, I've read through all the relevant source code and know the changes that need to be made. I think that it would be worth revisiting this discussion. I think the simplest and cleanest syntax for this addition that is consistent with the existing syntax and semantics is to just make both the field annotations and the struct annotations be parenthesized comma separated lists of lua expressions that evaluate to functions. This would look like

struct example (serializeable, deserializable, Object) {
    fieldA (JSONserialized("string")): int64
    fieldB (terralib.annotate{jsonbehavior = "char", csvbehavior = "char"}): int8
    fieldC (boundschecked(-10, 10), lazy(`fieldA / fieldB / fieldB)): int
}

The main alternative I see is to make those square brackets instead of parentheses, which makes them look like a splice operator, but they don't behave quite like any other splice operator. I'm writing the parenthesis version for now, but I invite any other thoughts and opinions. It is easy to change that bit of the syntax right up until anyone writes third party code using it.

OvermindDL1 commented 5 years ago

What about something like [@Annotation(args)] instead? The [@ as a starting value should parse unambiguously I think, in all locations that it might want to be used (there are a lot of places an annotation could go).

elliottslaughter commented 5 years ago

My initial inclination is that I'm more comfortable with (...) than [@...] since that syntax already exists in at least some places, but if we can demonstrate that [@...] is more generally applicable (with concrete examples) maybe that would be fine too.

Probably good to get @zdevito's input on this too at some point.

aiverson commented 5 years ago

Some other places that it might be useful to put an annotation or attribute are on a variable declaration in a block of code, on a function or method, and on individual arguments to a function. Additionally, these attributes could be used to allow specifying llvm variable attributes for things like volatility. It might be somewhat hard to read a heavily attributed function. Indentation should make parameter attributes with the parenthetical syntax work, but a second set of parentheses after the parameter list for an attribute list may not be the most legible syntax.

capr commented 5 years ago

I'm all for adding annotations to the language everywhere (struct members, var decls, methods, etc.) though I am a bit concerned about legibility like you said. Have you considered adding the annotations after the declaration? That way they could be spaced out to the right the way side comments are placed.

aiverson commented 5 years ago

It occurs to me that volatile may be more naturally represented by the type system than annotations.

terra example(a (nonnull, default({0})): array(int), b: int) (inline, constexpr): int
 --body
end

Is an example of the natural extension of the current syntax I was talking about. Can you provide an example of what you were talking about?

If you are suggesting placing the parenthesized annotations to the right after the type, that creates an ambiguity between calling a function in the type expression and annotating a place.

capr commented 5 years ago

Yes, something like:

struct example (serializeable, deserializable, Object) {
    fieldA: int64;  @JSONserialized("string")
    fieldB: int8;   @terralib.annotate{jsonbehavior = "char", csvbehavior = "char"}
    fieldC: int;    @boundschecked(-10, 10), lazy(`fieldA / fieldB / fieldB)
}

Adding a new symbol like @ can also help syntax highlighters tone down the color of the annotations to reduce the noise even more. Python decorators are like that IIRC.

aiverson commented 5 years ago

Python decorators go before the thing they are decorating and require a separate '@' for each annotation with no comma between them, but are otherwise quite similar to that.

I think that the comma in the last one will be troublesome to parse because of an ambiguity.

I don't think indenting attributes way off to the side like comments is the best approach, because they are semantically part of the code.

OvermindDL1 commented 5 years ago

> I don't think indenting attributes way off to the side like comments is the best approach, because they are semantically part of the code.

That's how OCaml does attributes. As a short overview: OCaml has 3 styles of attributes, which differ in what they bind to. The basic syntax is:

[@name]
[@name arg1]
[@name arg1 arg2 argN...]

The 3 types are:

[@inner-expression]
[@@outer-expression]
[@@@global]

They are used like:

type t =
  | A [@id 1]
  | B [@id 2]
  | C [@id 3]
[@@id_of]

Where each of the ids above references the specific head it occurs 'after', so [@id 1] is part of head A and so forth, and id_of applies to the outer expression, so in this case to the entire t type. You can apply multiple attributes as you wish as well:

(* You would *not* normally write it all on a single line like this, the
  formatter would make it far more readable *)
let a = 42 [@blah] [@blorp 1] [@bleep "egreek"] [@@fwoop] [@@fweep]

So [@blah], [@blorp 1], and [@bleep "egreek"] all apply to the 42 and [@@fwoop] and [@@fweep] apply to the whole let expression.

This:

let a = 42 [@@blah]

Is the same as:

(let a = 42) [@blah]

Global attributes just occur at the top level, not attached to any expression:

[@@@wreep "zoom"]

In OCaml there are only a few attributes that actually do anything, the most popular one is ocaml.doc, like:

[@ocaml.doc "some docs about what I'm attached to"]

And yes, doc comments do convert to it, so this:

let a = 42 (** I'm a doc!  Though you'd usually use a doc syntax for pretty printing *)

Becomes this:

let a = 42 [@ocaml.doc "I'm a doc!  Though you'd usually use a doc syntax for pretty printing"]

Unknown attributes are stored in the AST (attached to the appropriate AST node or as a global AST node) but are otherwise ignored.

The primary use of attributes in OCaml is to pass information to back-end code generators (OCaml can compile to machine code, JavaScript, WebAssembly, C, and a few dozen other things), or to pass information to PPXs. PPs in OCaml are source-to-source transformers: they transform strings to strings before passing the result to the compiler, which is how languages like ReasonML are implemented on top of OCaml. PPXs are similar, but they transform AST to AST at a variety of levels. A popular PPX is the 'derive' PPX, which takes a type t = Vwoop [@@derive Blah] attribute and generates module and/or function definitions for the plugin module Blah over the type t, to do things like automatic JSON (de)serialization or whatever.

In addition, OCaml has some special attribute forms for ease of use, so something like:

let a = 42 [@@blah]

Can instead be written like:

let%blah a = 42

And there are a few other short forms that use a %name type on something like let/if/etc... for ease of common use.

aiverson commented 5 years ago

That seems really useful. We could implement a lot of useful things like that. Thank you for the information.

Docstrings would be really nice to have as a part of the language available to introspection.

slaughterj commented 5 years ago

Hello all. I have been following this discussion and would like to second the suggestion to add llvm attributes to terra variables, etc. I think this would make terra a very attractive language to use as a code generation target, an alternative to straight llvm ir or the C++ llvm api. I had been considering it for this very purpose for a language I had been working on. I'd like to say thank you to all who have been working on the project. Keep up the good work!

aiverson commented 5 years ago

Annotations could potentially also be used to allow language extensions to easily attach debug info to fragments of code as they are being generated, for example to show file and line offsets of the source code of the child language, not just the terra code.

aiverson commented 5 years ago

Although perhaps attaching line and character offsets should be a method on a quote, so that such position info can be attached as the code is being generated in a language extension?

elliottslaughter commented 5 years ago

It's not documented (I don't think) but there is an API to generate debug info via terralib.debuginfo, which we use e.g. in Regent:

https://github.com/StanfordLegion/legion/blob/master/language/src/regent/codegen.t#L2100

The advantage of this is that you can just dump it anywhere, so if you have a 10 line quote where each line came from a different source location in the extension language, each line in the quote can get its own debug info.

aiverson commented 5 years ago

Doc comments or documentation as part of annotations that are available to introspection at runtime seem generally useful to Lua and not just Terra. Of course, that kind of requires making Lua code AST-inspectable and manipulable. Maybe it would be nice to support splices in Lua code and quotes of Lua code as well? Create the ability to do convenient generation of Lua code as well?

slaughterj commented 5 years ago

I always thought it would be nice to have macros in Lua. Of course, with this along with optional static structural/nominal typing, it seems like there would be almost a unification of Terra and Lua, where Lua stages and metaprograms itself, no?

aiverson commented 5 years ago

There wouldn't be unification just yet, or not just from that change. Terra and Lua would still have different semantics in quite a few cases, but having metaprogramming in Lua code on par with Terra code, so that metaprogramming can be used to write metaprogramming code that in turn creates native code, would be a compelling feature. Lisp's ability to make macros that make macros is a large part of its power and flexibility. Optional static typing, possibly using node annotations, would be a big win for dev tools and intelligent autocomplete.

capr commented 5 years ago

> I don't think indenting attributes way off to the side like comments is the best approach, because they are semantically part of the code.

They are part of the code, true, but some of them can be optimizations that you might want to turn on or off, or serialization flags etc. again which you might want to comment out or not to try things out, and it's easier to comment them out if they're on the side.

It is more of an ergonomic issue than anything else, since you can always build structs in Lua if you need more control (Lua is a good configuration language after all and struct definitions are basically configuration).

aiverson commented 5 years ago

Good point. I'm convinced. Inline annotations should be after the things they modify. I think block annotations should be allowed before the thing they modify, on a separate line, so that they can conveniently be commented out individually.

aiverson commented 5 years ago

What about using the ! character for annotations that modify the following blocklike thing, and [: ...] for expression and subsequent annotations? Since no valid Lua expression can start with :, this should be unambiguous. Or maybe preceding annotations can be [! ...]?

elliottslaughter commented 5 years ago

Just FYI, labels for gotos start with :: http://terralang.org/getting-started.html#gotos

I'm not aware of any use of ! but we should probably double check with the parser.

aiverson commented 5 years ago

Maybe +! and -! could be used for something? Is there any conceivable use for annotating an application of an annotation rather than just composing annotations or annotating a definition of an annotation? How would we distinguish annotating a function being assigned to a variable and the assignment expression being assigned to the variable?

capr commented 5 years ago

What I'm currently thinking:

- struct def annotations, with struct S (f, g, h) { ... } and/or struct S !f !g !h { ... }
- struct field annotations, with struct S { x: int !(f, g, h) } and/or struct S { x: int !f !g !h }
- function def and decl annotations, with terra f():int !f !g !h ... end and/or terra f():int !(f, g, h) ... end

Sometimes I also need to treat a sub-section of a struct as a whole. For that use case I generally turn that sub-section into a separate struct and put that back into the original struct as a field, and then make that field unnamed using metamethods (a good use case for struct S { _dummy: S1 !unnamed; } perhaps although it looks like a hack to me -- I'd rather have built-in unnamed fields for this).

aiverson commented 5 years ago

By parallelism, I'd suggest we should support adding annotations to individual parameters in a function and variable declarations. Then we could add things like the built in llvm attributes to them as well as custom data with a single consistent syntax.

local terra foo(bar: int !f, baz: int !g): int !h
  var quux: int = bar + baz !i
  [body]
end

I'm not sure exactly what those things are useful for. The only things I can think of offhand are things like marking a variable volatile or final, but I think those cases might be better served with having a val keyword for an immutable variable as well as having types for immutable and volatile versions of other types.

It seems like these capabilities would be useful for the same reason as the other annotation sites, but I don't see great examples off the top of my head.

capr commented 5 years ago

Posting here a few use cases from my projects in order to constrain the design space a little:

per-struct annotations, i.e. struct (f, g, h) { ... } needs least justifying since use cases are many (reordering the struct for better packing, adding a mechanism for getters and setters, packing bool fields into bitmasks, etc. etc).
per-field annotations:
- make struct field unnamed, i.e. forward self.x to self.sub.x
- specify decoder/encoder/validator (eg. for utf16-to-utf8 conversion, fixed-decimal-to-float conversion)
- specify that the field is "owned" by the struct so that a call to field's free() is auto-appended to struct's free().
- specify an initial value (which generates code in struct's init())
- specify a "size field" -- eg. winapi has a usually-named cbSize field that needs to be set when initializing the struct.

per-function annotations:
- mark as inlined
- mark as overload (although bring back auto-overloads please!)
- mark as virtual, final, etc. method for an OOP system (although the OOP possibilities are limited in terra due to eager type-checking, see #345)

PS1: some annotations need to be parametrized like !init(123), so init(123) would have to return a function to be called as f(T).

PS2: Some more use-cases for struct and field annotations can be found at https://github.com/luapower/winapi/tree/master/winapi, just grep for struct{.
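The point in PS1 about parametrized annotations can be sketched in plain Lua: `init(123)` is an ordinary call that returns the function the annotation machinery would later invoke. The names `init` and `owned`, the entry table, and the left-to-right application order are all assumptions for illustration, mirroring the `x: int !f !g` form suggested earlier in the thread.

```lua
-- Hypothetical parametrized annotation: init(value) returns the function that
-- the annotation machinery would later call with the annotated field entry.
local function init(value)
  return function(entry)
    entry.init = value
    return entry
  end
end

-- Hypothetical unparametrized annotation, usable directly as `!owned`.
local function owned(entry)
  entry.owned = true
  return entry
end

-- Model of `x: int !init(123) !owned`: apply the annotations left to right.
local x = { field = "x", type = "int" }
for _, annotation in ipairs({ init(123), owned }) do
  annotation(x)
end
```

Because both forms end up as plain functions, the syntax does not need to distinguish parametrized annotations from unparametrized ones; it only evaluates a Lua expression and calls the result.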

aiverson commented 5 years ago

I agree with these cases. Implementation is WIP. Perhaps the object orientation could be improved by allowing terra methods to be put inside structs alongside entries, or maybe the terra simultaneous declaration system needs more smarts so that it doesn't try to simultaneously define the method with the type it is on before the type completes. Actually, @capr, you had had some trouble with attaching methods to structs; could you try slipping an empty do end block between the struct and the methods in the old code that was giving you errors and see if that fixes it?

Other languages have trained me to put the annotations before the things they annotate, but I think this way is easier to parse.

capr commented 5 years ago

> doesn't try to simultaneously define the method with the type it is on before the type completes

So basically lazy type-checking for any function that has uncompleted types in its prototype. Then why not extend this to any function that has uncompleted types in its code too? That's basically lazy typechecking. Personally that's what I would like, so that I can arrange the code in my modules freely. Currently I can't put struct defs in the modules where they topically belong; instead I have to separate all types into an uber-types module, or else forward-declare everything like C headers (we're back to C headers??)

aiverson commented 5 years ago

Why can't you put struct defs in the modules they logically belong in? I've never run into that problem, but maybe we have just been writing different code and I haven't stumbled across a problem that would make that happen. Can you give an example of what you would like to do that doesn't work?

aiverson commented 5 years ago

Not lazy typechecking, but loop resolution. Structs that are defined together are already allowed to reference each other (IIRC); this would extend that so that functions and structs that are defined together are allowed to be co-referential and mutually recursive, but are still eagerly typechecked together as they are being defined.

capr commented 5 years ago

There's an example https://github.com/luapower/trlib. Different sub-modules need to access the same structs. There's also dependencies between methods from different sub-modules. With eager typechecking, I have to carefully arrange both struct and method definitions in dependency order. Why is this a job for a human to do? That's not how I'm thinking about the code, so this arrangement is not friendly to me, it's only friendly to the compiler.

capr commented 5 years ago

In C people write headers so at least they don't have to arrange functions in dep-tree order, but they still have to define the structs in the headers instead of near the code that uses them. Like, I want to have my free() and init() methods directly underneath the struct definition so I can make sure I'm not missing anything.

ErikMcClure commented 5 years ago

Using only ! for annotations seems like it could be confusing, especially for very large annotations:

struct example (serializeable, deserializable, Object) {
    fieldA: int64;  !JSONserialized("string")
    fieldB: int8;   !terralib.annotate{jsonbehavior = "char", csvbehavior = "char"}
    fieldC: int;    !boundschecked(-10, 10), lazy(`fieldA / fieldB / fieldB)
}

For annotations that are not single statements, maybe require wrapping it in {}:

struct example (serializeable, deserializable, Object) {
    fieldA: int64;  !JSONserialized("string")
    fieldB: int8;   !terralib.annotate{jsonbehavior = "char", csvbehavior = "char"}
    fieldC: int;    !{boundschecked(-10, 10), lazy(`fieldA / fieldB / fieldB)}
}

This prevents the annotation syntax from "spilling out" by giving it a clearly defined endpoint - either as a single statement, or as a {} block.

As for lazy typechecking, I believe that moving to eager typechecking was a mistake. Terra should move back to lazy typechecking like most other languages and instead focus on improving deep error messages. Forcing programmers to order types in specific ways is archaic and absurd.

terralang / terra

Syntax for adding annotations and metadata to fields of structs #324