qwertie / ecsharp

Home of LoycCore, the LES language of Loyc trees, the Enhanced C# parser, the LeMP macro preprocessor, and the LLLPG parser generator.
http://ecsharp.net

Proposal: modularize the standard macro library #44

Closed: jonathanvdc closed this issue 7 years ago

jonathanvdc commented 7 years ago

Hi there. As you may have inferred from the title of this issue, I would like to modularize the standard macro library. I'll start off with some background information on how ecsc handles things, and then try to justify why I think the standard macro library should be split up. So please bear with me.

Background information

As ecsc has evolved, it has increasingly come to rely on a combination of macros and magic node types to lower existing C# constructs to lower-level EC# constructs. For example, this innocent-looking foreach loop

foreach (var item in col)
    Console.WriteLine(item);

gets macro-expanded to

#builtin_static_if(#builtin_static_is_array(#builtin_decltype(col), 1), #builtin_stash_locals(col, colLen, i, {
    var col = #builtin_restore_locals(col, colLen, i, col);
    var colLen = col.Length;
    for (#var(#builtin_decltype(colLen), i = 0); i < colLen; i++) {
        var item = col[i];
        #builtin_restore_locals(col, colLen, i, Console.WriteLine(item));
    }
}), #builtin_stash_locals(enumerator, {
    var enumerator = #builtin_restore_locals(enumerator, col).GetEnumerator();
    try {
        while (enumerator.MoveNext()) {
            var item = enumerator.Current;
            #builtin_restore_locals(enumerator, Console.WriteLine(item));
        }
    } finally {
        if (enumerator is System.IDisposable)
            ((System.IDisposable) enumerator).Dispose();
    }
}));

The lowered code is admittedly pretty cryptic, but it actually Does The Right Thing™. And you don't even have to take my word for it; if you stare at the expanded version long enough, you'll see that first, a static if (#builtin_static_if) determines whether col is a one-dimensional array (#builtin_static_is_array, #builtin_decltype). If this happens to be the case, then a simple for loop is used to iterate over col's values. Otherwise, it will use MoveNext/Current to iterate over col.

The calls to #builtin_stash_locals and #builtin_restore_locals are used to implement hygienic macros. #builtin_stash_locals hides local variable names – if the specified names are already in use by local variables, then they are removed from the local scope, and pushed onto a stack – which allows the inner expression to re-use those names safely. #builtin_restore_locals performs the opposite operations: it discards any locals that intersect with the names it specifies, and restores the old variables that were mapped to these names (if any).
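To make those mechanics concrete, here is a small hypothetical illustration of the pattern (the string value and GetCollection are invented for the example):

// Suppose the user's code already declares a local named `col`:
var col = "the user's variable";
// A macro that wants its own `col` hides the user's binding first:
#builtin_stash_locals(col, {
    var col = GetCollection(); // the macro's private `col`
    // User-written subexpressions are wrapped in #builtin_restore_locals so
    // that, inside them, `col` refers to the user's original variable again:
    #builtin_restore_locals(col, Console.WriteLine(col)); // prints "the user's variable"
});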

A remarkable property of #builtin_static_if, #builtin_static_is_array, #builtin_decltype, #builtin_stash_locals and #builtin_restore_locals is that, from ecsc's perspective, these node types aren't special at all. Like any other node type, they simply get analyzed during the IRGen/semantic analysis phase, and that's all there is to it.

Problem

One thing that annoys me about this, though, is that – by introducing #builtin_static_if – both static_if and #builtin_static_if are now legal, and, more importantly, they behave differently. static_if is defined as a macro in LeMP.StdMacros, and it is far less powerful than #builtin_static_if because the latter construct can take full advantage of the fact that it is only evaluated during the IRGen/semantic analysis phase.

Had I used static_if in the macro expansion above, LeMP would have told me: "'static if' is incredibly limited right now. Currently it only supports a literal boolean or (x 'tree==' y)". That's unfortunate, especially since #builtin_static_if doesn't share those limitations.

Ideally, I'd like to "fix" this confusing situation by defining static_if as a macro that expands to #builtin_static_if. Unfortunately, I can't do that right now, because the standard macro library is structured as a single, monolithic binary; I can't just go ahead and define static_if in ecsc's macro library, as that re-definition would then conflict with the existing static_if definition from LeMP.StdMacros.
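Roughly, the definition I have in mind would be a one-line forwarding macro, something like the sketch below (which assumes LeMP's usual LexicalMacro attribute and macro signature; as things stand, registering it would clash with the static_if in LeMP.StdMacros):

[LexicalMacro("static_if(cond, then, otherwise)",
    "Forwards static_if to the #builtin_static_if builtin, which is evaluated " +
    "during semantic analysis and is therefore not limited to literal conditions.",
    "static_if")]
public static LNode static_if(LNode node, IMacroContext context)
{
    // Retarget the call, keeping its argument list intact.
    return node.WithName((Symbol)"#builtin_static_if");
}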

Proposal

So, here's my proposal: divide the standard macro library into (roughly) two layers, a low-level library of primitive operations ("primops") and a high-level library containing the rest of the standard macros.

Separate implementations of the low-level commands can then be created for LeMP and ecsc. The former would probably just be a re-packaging of the current macro implementations in LeMP.StdMacros, while the latter would expand nodes such as static_if to ecsc builtins.

The higher-level standard macro library can then depend on the lower-level macros without concerning itself with the specific implementation of the primitive operations.

The move to a low-level/high-level split in the standard macros can be gradual. Initially, we could just try moving static_if and #useSequenceExpressions into a low-level "primops" library. Later on, it might prove useful to implement certain operations as heuristics in the LeMP "primops" and as exact builtins in the ecsc "primops." When I say heuristics, I'm mostly referring to things like this.

static bool IsQuickBindLhs(LNode value)
{
    // Heuristic: a non-identifier is certainly a value; for a bare
    // identifier, guess from capitalization (an initial uppercase letter
    // suggests a value rather than a variable to bind).
    if (!value.IsId)
        return true;
    return char.IsUpper(value.Name.Name.TryGet(0, '\0'));
}

LeMP can't do much better than this, but ecsc can tell with absolute certainty if the left-hand side is a value, and I think that it's a shame that we're not taking advantage of that information when it's readily available.

Anyway, no part of this proposal is mission-critical, but I do think that it'd be pretty neat if standard macros were able to take advantage of ecsc builtins under the hood. Thanks for considering my proposal.

qwertie commented 7 years ago

Before I read your message in its entirety, have you tried the MacroMode.PriorityOverride option to override static_if?

qwertie commented 7 years ago

Whew! I just committed support for preserving newlines and comments in EC#, a feature that was much harder than I thought it would be, particularly on the printer side. And I fixed a bug in LLLPG that broke the parser and had me baffled for a while. Hopefully trivia preservation will be much easier in LES, but sadly there are two different printers to deal with (I thought I could deal with all three printers and parsers in one day, hahahaha.. er.. nope, not by a long shot).

The problem I have with "modularizing" StdMacros is in deciding where to draw the lines. It is not obvious to me what counts as "high" or "low" level. I mean, for most of the macros in StdMacros you can point to one or more programming languages where the feature implemented by that macro is built into the language, and one or more other languages where it is an "add-on". But I guess I can agree with your examples: static_if feels like a low-level feature while alt class feels like, and operates as, an add-on, even though ADTs are built into many languages as a low-level feature.

Hopefully for now you can solve the problem with MacroMode.PriorityOverride - but actually you should use PriorityInternalOverride as PriorityOverride is meant for end-users.

qwertie commented 7 years ago

Doesn't implementing IsQuickBindLhs correctly [in a macro] require an enhanced macro system like Nemerle has, with a second macro-processor pass that provides access to more type & member information? I haven't thought about that feature for a long time. (I'd love to have Nemerle-like macro hygiene, btw, but it involves concepts like "imports" and "colored identifiers" or "private symbols" or whatever the kids are calling them these days, and it seems to require closer integration with a compiler, which LeMP is not currently designed to provide.)

qwertie commented 7 years ago

Say... is there a reason you're not using macros (quote {}) to write RequiredMacros.cs?

jonathanvdc commented 7 years ago

Thanks for your replies! I'll try to respond to each of them in some arbitrary order below.

jonathanvdc commented 7 years ago

Update: I have defined overrides for the static_if and @[#static] #if macros in RedefinedMacros.cs, and the override mechanism seems to work beautifully. Thanks for helping me out!

(I'll close this issue now, given that my problem has been solved.)

qwertie commented 7 years ago

I guess I have waited long enough to comment...

It's pretty interesting how you've stretched the lexical macro system to execute what would otherwise be conventional compiler passes, but something about this approach makes me uncomfortable. Let's see...

So... I think ultimately we should move away from using lexical macros to do semantic tasks, though I'm not prepared at the moment to suggest what to do instead.

Symbols and hygiene

I'd like to share an idea I originally had about how a Loyc compiler would work. As you know, LNode.Name is a Symbol, and while currently Symbol is basically just a string (it has an Id too, which maybe I should delete), it is deliberately not sealed, and in the back of my mind my plan was always to use Symbol for resolved references in source code.

Imagine if your compiler built its symbol-tables as usual, but the symbols in the symbol tables were actually Symbols, i.e. derived from class Symbol. Now, if a method contains this code:

var x = 23;
Foo(x);

The EC# parser, of course, produces a Loyc tree for Foo(x) which refers to a pair of ordinary global symbols. My idea was that a "symbol resolution" pass would scan over the source code, replacing global symbols with resolved symbols. The output would still be a Loyc tree and it would still print out as Foo(x), but the symbol with Name = "x" would actually "be" the local variable x and include type information so that you could look up the type of x directly from the LNode, something like ((LocalVariable)x.Name).VarType, if that makes sense. Similarly, the Symbol with Name = "Foo" would actually represent the method Foo(int).
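In code, the payoff would look something like this sketch (LocalVariable and VarType are hypothetical names, as above):

// Given `fooCall`, the resolved Loyc tree for Foo(x):
LNode arg = fooCall.Args[0];             // the identifier `x`
if (arg.Name is LocalVariable local)     // hypothetical Symbol subclass
    Console.WriteLine(local.VarType);    // the type of x, e.g. `int`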

Hygiene could also be achieved by a macro simply producing Symbols that are in a "local" SymbolPool (although I wonder if Symbol Pools are even a useful concept - non-global Symbols could be created without a pool, that's how it works in JavaScript). This doesn't work in the usual LeMP workflow (that I use), since EcsNodePrinter just prints the Symbol.Name without regard for whether it's a global symbol or not, so two different Symbols called x end up with the same name in the output. But inside a compiler that problem need not occur, and when outputting plain C# I could fix the problem by adding a unique-naming pass, before printing, whose job would be to ensure that all non-global symbols get a unique name string.

So... does such a design sound like a good idea to you?

jonathanvdc commented 7 years ago

Looks like the discussion is sort of diverging into two separate subjects. I've split my response into two sections accordingly.

Semantic macros

Truth be told, semantic macros as a replacement for a lexical macro-based design are a tough sell for me, so I'll argue against them now. (Sorry!)

Let's start off with a point of agreement. I think you're absolutely right that writing a #foreach macro is pretty clunky. It's just that the alternative so far – baking it into the compiler – is so much worse. Perhaps it's worth explaining my two reasons for writing #foreach as a macro:

  1. foreach is inherently platform-dependent. For example, the .NET runtime has interface IEnumerable and interface IEnumerable<T>, with foreach being little more than syntactic sugar for a loop that calls methods on these interfaces. But ecsc is not as biased toward the .NET runtime as csc and mcs, and can (at least theoretically) generate code for other platforms. By implementing foreach as a macro, a different platform-specific definition can be written for foreach. This wouldn't have been possible if I had baked foreach into the ecsc source code.
  2. Writing foreach as a lexical macro is, for all its clunkiness, still significantly less clunky than writing it as a "normal" expression/statement that the compiler can analyze. ecsc's lock implementation is significantly more complex than the foreach macro, and that complexity buys ecsc very little. Admittedly, error diagnostics for lock are much clearer than the error diagnostics for the foreach macro, but the foreach macro can be platform-specific, whereas the lock implementation is forever dependent on the existence of a type named System.Threading.Monitor, regardless of the platform for which code is generated.

These two points, especially the first, favor neither lexical nor semantic macros. Both can be used to create platform-specific definitions for (essentially platform-agnostic) constructs such as lock, using and foreach. I don't think a semantic macro will be more concise than a lexical macro with builtins (especially if that lexical macro uses quote { }), but I suppose that a semantic macro might be able to provide better diagnostics than the current lexical macros with builtins.

But I'm not convinced by your other criticisms.

My main argument against semantic macros, though, is their complexity. They add yet another major pass to the compilation process. That means that anybody who seeks to fully understand how EC# works must also learn how semantic macros work: which constructs they can affect, the order in which they are expanded, what the structure and rules of the IR they operate on are, and so on.

Also, semantic macros complicate macro development. Right now, an EC# macro is always a lexical macro. But if semantic macros are introduced as well, then programmers will suddenly have to ask themselves which type of macro they should create. If semantic macros are strictly more powerful than lexical macros, then they might have to re-write entire lexical macros as semantic macros when they realize they need a feature that only semantic macros offer.

Plus, semantic macros are arguably harder for the compiler to verify than lexical macros. Right now, the (Flame) IR generated by ecsc is correct by construction. If the input LNodes contain errors that must be diagnosed at compile-time, then ecsc will do just that, and will subsequently terminate the compilation process; ecsc either reports an error or produces a correct IR tree. But a semantic macro could sneakily insert a semantically invalid construct, without the compiler taking notice. This would result in very hard-to-detect bugs.

As a final point in favor of lexical macros with builtins: this type of thing has precedent in other languages. The D programming language, for example, defines std.traits, a module that defines function templates which answer questions about the source code. That is eerily similar to ecsc's builtins. For example, isArray is the D equivalent of #builtin_is_array_type. If I recall correctly, C++ has similar metaprogramming libraries.

Going forward, I think that it would be better to wrap ecsc's builtins into a standardized macro library, just like D has done. (Apparently, Phobos uses dmd builtins under the hood, as well: isNested seems like a fair example).

Symbols and hygiene

I very much like the idea of using separate SymbolPools to guarantee that symbols don't overlap. It certainly is a lot cleaner than the builtin-based design. I don't have a lot of spare time right now, but I think it's definitely worth implementing. I'll try to implement it in ecsc and its macro library once I get the opportunity to do so.

Or do you think that it makes more sense to just use LeMP's (future) renaming pass in ecsc? That would have the extra advantage of producing less confusing output when ecsc is instructed to print macro-expanded source code (via the -E switch).

qwertie commented 7 years ago

Semantic macros

Let me start here because I suspect you misunderstand what I was suggesting:

If the input LNodes contain errors that must be diagnosed at compile-time, then ecsc will do just that, and will subsequently terminate the compilation process; ecsc either reports an error or produces a correct IR tree. But a semantic macro could sneakily insert a semantically invalid construct, without the compiler taking notice. This would result in very hard-to-detect bugs.

I'm suggesting the same kind of semantic macros as Nemerle has. The macros would run after the list of types and methods has been built, but before method bodies have been converted to IR. You can't sneak anything invalid in if macros run before semantic analysis. (I think in Nemerle you might be able to do semantic analysis on local variables too, and create new class members.... I don't know how that works and I haven't easily found details online; if we do this we should install Nemerle and do some experiments to understand the details.)

And the second-stage macros should operate on Loyc trees so that users can easily switch which kind of macro they are writing, and so that the same MacroProcessor can still be used.

As a final point in favor of lexical macros with builtins: this type of thing has precedent in other languages. The D programming language, for example, defines std.traits, a module that defines function templates which answer questions about the source code.

Isn't that quite different? I didn't try any serious metaprogramming in D but IIRC, you can write code that calls those trait (builtin) functions directly and immediately acts on the results. That's more powerful and general than generating a syntax tree that eventually calls trait functions, as a lexical macro must do.

They add yet another major pass to the compilation process.

Since some of the compiler's own work could be implemented with them, it's not necessarily an additional pass beyond what you'd do anyway, is it? How many passes do you use already?

On chaining macros. RequiredMacros.cs defines all dependencies for the #foreach macro in the same namespace, in the same file, in the same assembly. It's all pretty cohesive. There's just no way that #foreach will get imported without its dependencies.

I'll tell you a secret... I originally planned to let people write fully qualified macro names like Namespace.MacroName(...). I don't remember actually implementing that, but I'm seeing some code to support it in MacroProcessorTask.GetApplicableMacros, so, yeah, maybe you can already invoke a macro without its namespace being imported.

Symbols and hygiene

Or do you think that it makes more sense to just use LeMP's (future) renaming pass in ecsc?

I'm confused because this is not an either/or question. If multiple symbol pools are used then a renaming pass is required to avoid name collisions in the text output.

Another thought, perhaps one should be able to write quote(pool) {...} to use a specified symbol pool for all identifiers in a quotation.
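For instance (hypothetical syntax; `expr` stands for some LNode already in scope):

var pool = new SymbolPool();
// Identifiers created inside the quotation (here `tmp`) would be interned in
// `pool` rather than the global pool, so `tmp` could not collide with a
// user-defined `tmp` in the output:
return quote(pool) { var tmp = $expr; };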

One more thought. For hygienic namespace lookup, to accomplish the same thing that Nemerle does, macros could instruct LeMP to look up macros from specific namespaces by wrapping their output in a certain command, say, #macroNamespaceContext().

As an example, let's say macro1 is in NamespaceA and it wants to return a call to macro2 in NamespaceB - without using a fully-qualified name - and there is also a macro2 in NamespaceC. macro1 is called from user code like this:

#importMacros(NamespaceA);
#importMacros(NamespaceC);
class Foo { ... }
macro1(otherMacro(Foo));

The user is also trying to call otherMacro from NamespaceC.

Let's say the macro originally returns something nonhygienic like this:

return quote { macro2($(node[0])) };

That is, macro1 is a useless macro that forwards the call to macro2. But the point is it wants macro2 from NamespaceB, while otherMacro should still be resolved from NamespaceC. The returned syntax tree in this case is

macro2(otherMacro(Foo));

Whereas the hygienic return would look more like this:

SymbolPool pool = new SymbolPool();
return F.Call("#macroNamespaceContext", 
    F.Literal(pool), 
    F.Literal(new DList<Symbol> {(Symbol)"NamespaceC"}),
    F.Call(pool["macro2"], node[0]));

so the first argument is a Literal containing a SymbolPool, the second argument is a Literal containing an IReadOnlyCollection<Symbol> of namespaces*, and the third argument is the actual output from the macro. Finally, the macro processor would use each Symbol's SymbolPool to decide in which list of namespaces to search for that Symbol.

(* the macro processor currently doesn't support qualified names stored as LNode - the qualified name Foo.Bar is just a Symbol with a literal dot in its name - and maybe it should stay that way since Symbols are dramatically faster than LNode as a dictionary key, or for comparisons.)

Obtaining hygiene is a bit clunky this way, but a helper function and/or macro could help.
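For instance, a helper along these lines (hypothetical, and mirroring the sketch above) could package the pattern:

static LNode MacroNamespaceContext(LNodeFactory F, SymbolPool pool,
    IReadOnlyCollection<Symbol> namespaces, LNode output)
{
    // Wrap `output` so the macro processor can use each Symbol's pool to
    // decide in which list of namespaces to look it up, as described above.
    return F.Call("#macroNamespaceContext",
        F.Literal(pool), F.Literal(namespaces), output);
}

// The earlier example would then shrink to:
// return MacroNamespaceContext(F, pool,
//     new DList<Symbol> { (Symbol)"NamespaceC" },
//     F.Call(pool["macro2"], node[0]));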

jonathanvdc commented 7 years ago

My reply turned out to be significantly longer than I hoped it would be. Sorry about that.

Semantic macros

The macros would run after the list of types and methods has been built, but before method bodies have been converted to IR. You can't sneak anything invalid in if macros run before semantic analysis.

Right, but isn't that a little contradictory? I mean, suppose that a type Foo has been analyzed, and we now encounter an expression Foo.Bar. Foo in Foo.Bar could be a type, a namespace or a value, and there's really only one way to ascertain what Foo is in the context wherein it appears: by analyzing the expression Foo.Bar. And semantic macros need to be able to answer questions like: "is Foo a type?" You probably also want semantic macros to be able to discover what Foo's type is, if it is a value rather than a type, which requires all preceding statements and expressions to have been analyzed.

So, really, semantic macros would have to be evaluated in the middle of the semantic analysis process.

Which brings up the problem of what exactly a semantic macro returns. For example, a #foreach macro has to know what the type of its collection expression is, so it'll want to evaluate its collection argument node first, check what that collection's type is, build a loop with one or more induction variables, and only then evaluate the loop body node in the context of that loop.

At the moment, that logic is captured by the #foreach lexical macro as a Loyc tree, and the compiler then analyzes the resulting tree.

Now suppose that a semantic macro has to do the same job. There are basically two options here:

  1. It could produce an IR tree, but then the macro is given the opportunity to sneak in illegal IR.
  2. It could return a Loyc tree which is then fed to the remainder of the semantic analysis process. But then the collection expression is evaluated twice: once by the semantic macro to figure out which construct it should use, and once more by semantic analysis, as the semantic macro will have to embed the collection expression in the Loyc tree it produces. This isn't just bad for performance; it also implies that any (warning/error) diagnostic related to the collection expression is now printed twice.

Alternatively, the semantic macro could produce some kind of wishy-washy Loyc tree that contains both unanalyzed Loyc trees and IR trees. But that's just option one in disguise, as the macro could easily insert invalid IR trees in that mixed Loyc/IR tree.

I also don't feel comfortable with imposing Flame IR on the macro writer, because that implies that any breaking change to Flame is a breaking change to EC#. But if Flame IR is not used in semantic macros, then they'd require (at the very minimum) an entire type system to insulate the semantic macros from Flame's type system. That'd be a lot of work, would definitely hurt performance, make it harder for (other) people to write an EC# compiler, and would add little value to the language.

Anyway, I just like the (apparent) simplicity of lexical macros. They're fairly easy to write and understand, and I don't feel the same way about semantic macros. Maybe I'm wrong in thinking that, but the only way to know for sure what EC# semantic macros would be like is to build a prototype, and then take things from there. That'd be a massive undertaking, though, and I'd rather just keep on extending ecsc up to the point where it starts becoming a reasonable alternative to csc and mcs for some projects. Can we put semantic macros on hold until we get to that point? There's still a lot of work to be done before ecsc is mature enough to compile itself.

Isn't that quite different? I didn't try any serious metaprogramming in D but IIRC, you can write code that calls those trait (builtin) functions directly and immediately acts on the results. That's more powerful and general than generating a syntax tree that eventually calls trait functions, as a lexical macro must do.

I don't completely understand what you mean by that. Can you give me an example of "code that calls those trait (builtin) functions directly and immediately acts on the results?"

Since some of the compiler's own work could be implemented with them, it's not necessarily an additional pass beyond what you'd do anyway, is it? How many passes do you use already?

I was referring to the passes specified by the language itself: preprocessing, lexing, parsing, lexical macro expansion and semantic analysis. In hindsight, I was wrong to say that semantic macros necessitate another pass. They probably just extend the semantic analysis pass, but in an awkward way.

I'll tell you a secret... I originally planned to let people write fully qualified macro names like Namespace.MacroName(...). I don't remember actually implementing that, but I'm seeing some code to support it in MacroProcessorTask.GetApplicableMacros, so, yeah, maybe you can already invoke a macro without its namespace being imported.

Well, I sure didn't see that one coming. I'm glad you're considering hygienic macro imports, though. I wouldn't mind switching to those once they become available.

Symbols and hygiene

I'm confused because this is not an either/or question. If multiple symbol pools are used then a renaming pass is required to avoid name collisions in the text output.

What I meant was that storing Symbol instances in ecsc's symbol tables would solve a problem that has already been solved if LeMP is going to rename all locals anyway.

Another thought, perhaps one should be able to write quote(pool) {...} to use a specified symbol pool for all identifiers in a quotation.

Sounds like a great idea.

Obtaining hygiene is a bit clunky this way, but a helper function and/or macro could help.

Yeah, that syntax is sort of clunky. But I suppose that's okay if we can find an appropriate macro to build on top of that. I think a variant of quote (pool) { ... } that implicitly creates a pool and then imports a set of namespaces could work nicely. Here's an example of what I imagine that might look like (I know #quoteWithMacroImports is a bit silly; I haven't put much thought into what to call this macro).

return #quoteWithMacroImports (NamespaceC, OtherMacros)
{
    macro2($(node[0]))
};