Specify the type grammar more tightly

Jamesernator commented 2 years ago

This a followup to my comment as was requested.

Just to summarize that discussion, basically if we were to expose types to runtime as some form of metadata having only strings would be a bit painful as essentially one has to ship a tokenizer and parser along with any such usage.

Now while we do want a good amount of flexibility for type systems to be able to specify syntax, because of parsing these constructs will have to work in order to support the feature at all, we should already have a significant amount of the neccessary AST which we could expose to runtime.

Note that even if we don't choose to expose this information directly to runtime, having a canonical grammar for types would be significantly helpful for tooling and things that need to be able to perform tasks on the actual AST (e.g. Babel, code editors, etc).

So basically the rest of this is just giving a rough overview of what grammar flexibility is neccessary, what is impossible, and what would be a good set to expose as a canonical AST format.

Unneogiatable points

These aspects of syntax cannot be negotiated because they fundamentally contradict the existing JS grammar. In particular we require the following:

The = symbol must be well-contained to distinguish it from the beginning of a variable assignment, this will most likely be accomplished by requiring a "well-formed bracketing" that also applies to generics
- For example in TypeScript we can have "default types" for generics similar to parameters, these use a =, carrying this syntax over would be preferable
Generic delimitation must be explictly defined as >/< are already perfectly valid operators in JS, this means grammar conflicts with foo < Bar > (value) would exist if this was changed
- Alternative syntax suggestions exists in other issues
Tokens must be fully specified, in particular "stringTokens" especially as they may contain characters that would otherwise cause ambiguity
- e.g. With the = earlier, if = were not well-bracketed it could indicate a variable assignment, the same rules need to apply to string tokens and other such similar tokens (e.g. template string tokens it that is supported)
Top level constructs require explicit support by the language to be implemented, this includes things like interface, import type, type Name = and so on.
- Note that such constructs still need a well defined ending point, like they can't just be an arbitrary sequence of tokens

Prior art

NOTE: Popular type systems here refers to "TypeScript, Flow and Hegel"

Now a lot of syntax in existing type solutions is based on both existing languages and literature. A non-comprehensive list of features that are supported by all popular type systems:

All of the popular type systems agree on simple token types, all of them support parsing numbers and strings as a kind've token type|
- Flow and Hegel do not support bigint types, however they still seem to parse them
All of the popular type systems agree that curly brackets {} denote some form of object type, the actual contents of these object types are quite flexible however
All of the popular type systems use <> to denote generics
All of the popular type systems support type Name = and import type
All of the popular type systems are "well-bracketed", i.e. { / < / ( / [ are always paired with a corresponding closing token whenever present
All of the popular type systems support some binary token operators such as |
All of the popular type systems support function types syntax using (ARGS_TYPES) => RETURN_TYPE
TypeScript and Flow both support various keywords, some binary operators, some unary operators, and some for other purposes
All of the popular type systems agree that , delimits a list of items (some also allow ; for delimiting items in certain situations)

Basis for the following proposal

With the above information, I would like to propose a specific grammar for types that captures a large amount of the above while also giving a large amount of flexibility to type systems to extend with new constructs.

In the design of this, I looked at where type systems tend to experiment most with syntax, and the primary area they develop new syntax is in unary, binary, and special type operators.

Take for example TypeScript's mapped types as a concrete example:

type AllStrings<O extends object={}> = {
    [Key in keyof O as Exclude<Key, "foo">]: O[Key]
}

In this example many of the above points mentioned still hold:

Token types are well delimited (e.g. "foo", O, Key, as)
{ ... } still denotes an object type
<...> still denotes generics
[ ] are still well matched

However where the real flexibility comes in is in the bunch of unary/binary/special operators that are present:

For example extends is a special binary "operator"-like thing on generic parameters
keyof is a unary operator on a value
as is a special joiner
= is used as an operator within a generic
O[key] is used as a special syntax

Now the following grammar proposal tries to capture these points by in particular allowing sequences of tokens to be fairly unrestrictive, but constraining "well-matched" constructs.

Proposed grammar

I would like to propose something similar to the following grammar that would produce canonical ASTs for type constructs. This captures the well-formedness of many bracketed and generic structures, but keeps large freedom in the simple idea of "token sequences", which is essentially just a sequence of parsed AST nodes.

The grammar is annotated with comments explaining each part of the grammar, why it is chosen, why it is restricted in such ways, and so on.

So without further ado, a grammar proposal:

// All type systems agree on simple token types 
stringToken = "Use existing JS grammar"
numberToken = "Use existing JS grammar"
// Note that common types such as "undefined", "null" and such would be covered here
// this token type ALSO COVERS keyword operators such as "typeof" or "extends" 
identifierOrKeywordToken = "Use existing JS grammar"
// Not all popular type systems support bigint, but because of existing support in TypeScript
// and good understandability it should be included
bigintToken = "Use existing JS grammar"

// As pointed out, all popular type systems agree on the concept on "object types", however
// these is a lot of variance in what is actually supported
objectTypeIsh = "{"
    // We could have a more restricted grammar here to allow for distinguishing computed
    // keys or "callable" or such, however I feel the following design gives the biggest flexibilit

    // Each object entry is a sequence of typeIshTokens followed by a COLON followed by another
    // sequence of typeish tokens
    // All popular type systems allow delimiting between entries with `,` or `;` so we follow that convention here
     // NOTE: Optional trailing commas not handled in the grammar here, this is just because
     // it is annoying to specify in PEG syntax
     // ALSO NOTE: We should exclude ":" as a valid entry in the token sequence for the
     // typeNodeSequence of the key, but this isn't shown here
    ((typeNodeSequence ":" typeNodeSequence) (";" / ","))*
"}"

// Array type-ish covers all things like [array,or,tuple,type], [computedKey]
// arbitrary sequenced tokens like TypeScript's mapped types and such
arrayTypeIsh = "["
    // We allow ";" just for flexibility and consistency with object types
    // This point is fairly unimportant
    // NOTE: Again we haven't dealt with the optional trailing comma because it's annoying
    // to write into the grammar
    (typeNodeSequence ("," / ";"))*
"]"

// This is just generic grouping syntax, all popular type systems support 
// parenthesized type sequences
parenthesizedTypeIshSequence = "("
    // Nothing much to say here, we just allow whatever
    (typeNodeSequence / "," / ";")*
")"

// This covers any usage of applying a generic to some other typeNode
applyGenericTypeIsh =
    // The <...> syntax is used by all popular type systems and is overwhelming common
    // even in other languages, so we should parse it as such
    // We allow ";" in addition to "," just for flexibility
    // NOTE: Again trailing commas not dealt with
    typeNode "<" (typeNodeSequence / "," / ";")* ">"

// This allows type systems to add arbitrary type operators, because of the grammar
// we restrict all bracket-style tokens like "(", "<", "["
unknownOperator = (
   / "~" / "!" / "@" 
   / "#" / "%" / "^"
   / "&" / "*" / "-" 
   / "+" / "=" / "|"
   / "?" / "/" / "."
   / ":"
)*

typeNode =
    / stringToken
    / numberToken
    / bigintToken
    / identifierOrKeywordToken
    / objectTypeIsh
    / arrayTypeIsh
    / parenthesizedTypeIshSequence
    / unknownOperator

typeNodeSequence = typeNode*

We would also have the special additions to the JS grammar parse in these ways:

// Note this CAN'T be a typeNodeSequence as we need to know where to stop
// we also have an optional generic on the identifier name
typeDeclaration = "type" identifierName ("<" (typeNodeSequence (";" | ","))* ">")? "=" typeNode

// We support interface SomeInterface { ... }
interfaceDeclaration = "interface" identifierName objectTypeIsh

// Variable annotations can be sequences, as long as they don't include "="
variableDeclaration = "const" identifierName ":" (! "=" typeNode)* "=" EXPRESSION

// And so on for the other constructs...

AST Format

Now you might look at the above and question whether or not it really adds anything over just a "well-matched" covering grammar. I would say yes, in particular consider the AST node for an example similar to the mapped type from above:

{
    [Key in keyof O as Exclude<Key, "foo">]: O[Key] extends "baz" | "bar" ? [3, 4] : never
}

We would get out the following AST:

{
    kind: "objectType",
    entries: [
        { 
            key: {
                kind: "arrayType",
                entries: [
                    {
                        kind: "typeNodeSequence",
                        // Note that while we don't know what tokens here are unary or binary
                        // or special operators, we at least have a token list in the right
                        // scope, i.e. we don't need to look outside of square brackets
                        // to interpret this type, a simple recursive descent on "typeNodeSequence"
                        // nodes could be used to give these more precise ASTs for specific
                        // type languages
                        entries: [
                            { kind: "identifierOrKeyword", name: "Key" },
                            { kind: "identifierOrKeyword", name: "in" },
                            { kind: "identifierOrKeyword", name: "keyof" },
                            { kind: "identifierOrKeyword", name: "O" },
                            { kind: "identifierOrKeyword", name: "as" },
                            { 
                                kind: "genericApplication",
                                target: { kind: "identifierOrKeyword", name: "Exclude" },
                                typeParameters: [
                                    { kind: "identifierOrKeyword", name: "Key" },
                                    { kind: "stringLiteral", value: "foo" },
                                ],
                            },
                        ],
                    },
                ],
            },
            value: {
                kind: "typeNodeSequence",
                entries: [
                    { kind: "identifierOrKeyword", name: "O" },
                    { kind: "arrayType", entries: [
                        { kind: "identifierOrKeyword", name: "Key" }
                    ] },
                    { kind: "identifierOrKeyword", name: "extends" },
                    { kind: "stringLiteral", value: "baz" },
                    { kind: "operator", operator: "|" },
                    { kind: "stringLiteral", value: "bar" },
                    { kind: "operator", operator: "?" },
                    { kind: "arrayType", entries: [
                        { kind: "numberLiteral", value: 3 },
                        { kind: "numberLiteral", value: 4 },
                    ] },
                    { kind: "operator", operator: ":" },
                    { kind: "identifierOrKeyword", name: "never" },
                ],
            },
        },
    ],
}

One of the key takeaways here is that while the AST format doesn't understand what a "typeNodeSequence" might actually mean, it is sufficiently tokenized and even has some other recursive nodes (corresponding to the well-bracketed constructs) that allow you to reconstruct most of what you would need for any type system just given this AST.

This would be far far easier to use at runtime than a plain uninterpreted string which requires a full parse. However even beyond runtime, tooling that needs to inspect JS AST's can agree on common tokens and such so processing such type grammars is simpler (for example token colorizing, custom babel transforms, etc).

theScottyJam commented 2 years ago

May I throw a couple of edge cases in here?

Would an as operator be included? If so, how do you deal with order of operations, considering you're allowing any operator to work within the type space.

return x as y | z // Is this a bitwise-or, or a union type?

When annotating a function's return type, how do you know when the type annotation stops and the function body starts? A type annotation is allowed to contain curly brackets, but that's also how we deliminate the start of a function body.

function fn(): { x: number } | { y: number } { /* function body */ }

How should ASI be dealt with? Consider this example:

let x: { a: whatever, b: whatever, c: whatever }
       | { a: whatever, d: whatever } // This continues the type definition from the previous line
property = 2 // This line is normal JavaScript

How do we know when the type space ends and normal JavaScript starts again?

acutmore commented 2 years ago

When annotating a function's return type, how do you know when the type annotation stops and the function body starts? A type annotation is allowed to contain curly brackets, but that's also how we deliminate the start of a function body.
function fn(): { x: number } | { y: number } { /* function body */ }

Hey @theScottyJam!

In the current tentative grammar there is a specific rule for UnionType (and IntersectionType) that is essentially type <operator> type or type. Because there isn't a valid type operator (currently | or &) after the { y: number } the parser can no longer continuing consuming tokens as a type and will then have a parse goal of a function body.

theScottyJam commented 2 years ago

@acutmore

I did notice that from the current grammar. What I'm mostly referring to is the grammar being proposed in this thread. Specifically, the fact that, according to the grammar from this thread, a "type" is just a bunch of different things that can sit next to each other with no separator (where a "thing" is, e.g., a numeric literal, a keyword, type syntax for an object literal, etc). And, I'm not really sure it's possible to define a type that broadly because of the issues I mentioned above.

acutmore commented 2 years ago

Ah, I didn't read your post with the proper context loaded into my brain. Thanks for clearing things up for me 😀 I've hidden my post as 'off topic' for this thread.

ahejlsberg commented 2 years ago

@Jamesernator It would be great if you could contrast what you're proposing to what's already in the current grammar proposal here. The current proposal already makes provisions for "well-bracketed" constructs and specifically avoids specifying further grammatical rules for their contents in order to preserve the ability for type syntax to evolve independently (i.e. any new type construct can simply be put in parentheses in order to ensure it is ignored).

Jamesernator commented 2 years ago

Would an as operator be included? If so, how do you deal with order of operations, considering you're allowing any operator to work within the type space.

Yes, I just didn't add all of the special places that the proposal has for types to appear to the grammar. The main idea was that these trees would be decently specified but also flexible enough for multiple type systems to use.

How should ASI be dealt with? Consider this example:

ASI is a difficult beast that I'm not sure, presumably similar to however existing constructs do it.

When annotating a function's return type, how do you know when the type annotation stops and the function body starts? A type annotation is allowed to contain curly brackets, but that's also how we deliminate the start of a function body.

This is problematic, although it does seem to contradict the general goals in the README that type systems should be able to experiment and evolve syntax without changes to the JS language. See also below my response to @ahejlsberg's question.

@Jamesernator It would be great if you could contrast what you're proposing to what's already in the current grammar proposal here. The current proposal already makes provisions for "well-bracketed" constructs and specifically avoids specifying further grammatical rules for their contents in order to preserve the ability for type syntax to evolve independently (i.e. any new type construct can simply be put in parentheses in order to ensure it is ignored).

I would be fully supportive of a strongly specified grammar. If parenthesized token lists are the limit that type systems are happy with for their extension point, then yes the above proposal doesn't really add anything over what is specified. It just seems surprising that the grammar is so specified, given that one of the goals of the README is to "without prohibiting existing type systems from innovating in this space".

Certainly such as AST would be considerably more useful for use cases like exposing at runtime or even just to tooling. (As now the "typeNodeSequence" things I called above are even simpler to detect and use a simple recursive descent parser for things like binary operators and such).

Jamesernator commented 2 years ago

I'm going to close this as there is a similar issue about the grammar.

tc39 / proposal-type-annotations