hsutter / cppfront

A personal experimental C++ Syntax 2 -> Syntax 1 compiler

Design for program-defined metafunctions for cppfront #909

Open JohelEGP opened 6 months ago

JohelEGP commented 6 months ago


Introduction

This write-up presents a design to extend cppfront to evaluate program-defined metafunctions.

Conception

Support for metafunctions was first added by commit d8c1a50f22c1b171a50e87ccdb609fb05f41c021, "First checkin of partial meta function support, with interface meta type function". Its commit message also included the following sentence.

  • There is not yet a general Cpp2 interpreter that runs inside the cppfront compiler and lets users write meta functions like interface as external code outside the compiler.

After a lot of thinking, the idea of a "Cpp2 interpreter" seemed backwards to what cppfront is. Cppfront takes Cpp2 and lowers it to Cpp1, just like Cfront takes Cpp1 and lowers it to C. Interpreting Cpp2 could then be taken to mean one of two things:

  1. Building an interpreter that is a superset of the C++ abstract machine. This way, interpreted Cpp2 (i.e., metafunctions) is just as capable as normal Cpp2 code.
  2. Building an interpreter that is a very constrained subset of Cpp2. This would be like constexpr in C++11, and would probably evolve similarly.

Interpretation 1 means changing what cppfront fundamentally is. Interpretation 2 feels unsatisfactory. It is very constrained and without the power of the whole language at your disposal.

I thus realized that there is an alternative to interpreted Cpp2. That alternative is loading a metafunction compiled in a library during the execution of cppfront. This model doesn't change what cppfront is. Additionally, a metafunction is normal Cpp2 code, just like the implementations of built-in metafunctions.

Counterpoints

In this design, a metafunction is "normal Cpp2 code". In the Circle model of meta-programming, "normal Cpp1 code" can be executed at compile-time. This has raised concerns, quoted below, that are relevant to the present design. In our case, rather than compile-time, it's during metafunction evaluation.

However, we do not believe [the Circle] metaprogramming model is the right direction for C++’s future. We raise the following concerns:

  • The ability (and potential need) to call into shared libraries from the compiler raises the kinds of security concerns that led SG7 to discard std::embed (P1040).
  • … -- P2062 The Circle Meta-model

Circle is a fork of C++ that enables arbitrary compile-time execution (e.g. a compile-time std::cout), coupled with reflection to allow powerful meta-programming. SG7 was interested in it and considered copying parts of it. However, concerns were raised about security and usability problems, so the ability to execute arbitrary code at compile-time was rejected. -- 2020-02 Prague ISO C++ Committee Trip Report — 🎉 C++20 is Done! 🎉 : cpp

Also, the committee already reviewed a paper describing the Circle evaluation model and expressed some concerns with issues related to trust and implementability, but was generally interested in being able to do more at compile-time, within reason. I didn't mention that because that's already the trajectory for constant expression evaluation.

For example, I don't see the point of adding compile-time specific I/O APIs that won't be compatible with any library; the whole idea of Circle is that you just take your existing C++ code and use it at compile time.

The ability to open a file at compile-time and the ability to execute existing code have largely orthogonal concerns. I think we should be able to execute more code at compile without having to explicitly label it constexpr, but I draw the line at allowing the compiler opening arbitrary files on the whim of some 3rd party library on my behalf. -- Part of a reply from the thread starting at https://www.reddit.com/r/cpp/comments/jf4wsw/comment/g9mxpqc/?utm_source=share&utm_medium=web2x&context=3

Alternatives

Any alternative that requires recompiling cppfront or hard-coding metafunctions isn't viable at scale.

I also considered whether we could use Cpp1's constexpr and consteval. These don't serve us if we are to use an existing cppfront program. Consider the counterpoints. Given Cpp1's if consteval, a constexpr function can't be guaranteed to not use IO.

That said, it could be possible to require a metafunction to be constexpr and to actually evaluate it during constant evaluation to produce the updated type. The technique to implement that would be similar to the one presented in Interactive C++ in a Jupyter Notebook Using Modules for Incremental Compilation - Steven R. Brandt. But that is not this design (and I haven't explored such a design).

Counter-counterpoints

Maybe a metafunction can be required to be @pure (https://github.com/hsutter/cppfront/discussions/797#discussioncomment-7860363). Then, even though a metafunction is still normal Cpp2 code, it isn't as problematic. That said, @pure still seems too restrictive.

Design

This is based on what I learned from studying the documentation of Boost.DLL.

We need to emit a metafunction as an extern "C" symbol. The mangling of a Cpp1 symbol is experimental and not as portable (https://www.boost.org/doc/libs/master/doc/html/boost_dll/mangled_import.html). When loading the symbol of a metafunction, we need to use the same emitted name. This means that we need a protocol for the symbol name and to "C namespace" it.

In its simplest form, we just need a function that, given the Cpp2 name of a metafunction (as @-used), returns a function object that evaluates the metafunction.

There is an implementation of this design at #907. Details on how this design was applied, as well as other implementation details, can be found there.
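
As a rough illustration of the lookup function described in this section, the following C++ sketch maps an @-used name to an emitted symbol name and wraps the resolved address in a function object. The cpp2_metafunction_ prefix, the symbol_resolver callback (which would wrap something like dlsym, GetProcAddress, or Boost.DLL), and the type_declaration stand-in are assumptions made for this example. It is not taken from #907.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Stand-in for cpp2::meta::type_declaration (assumption for this sketch).
struct type_declaration { std::string name; };

// Signature every emitted extern "C" metafunction symbol is assumed to have.
using raw_metafunction = void (*)(type_declaration*);
// What the lookup hands back to the compiler.
using metafunction = std::function<void(type_declaration&)>;
// Hypothetical resolver: returns the address of a symbol, or nullptr.
// A real one would wrap dlsym, GetProcAddress, or Boost.DLL.
using symbol_resolver = std::function<void*(const std::string&)>;

// The naming protocol: "C namespace" the Cpp2 name with a fixed prefix.
std::string metafunction_symbol(const std::string& cpp2_name) {
    return "cpp2_metafunction_" + cpp2_name;
}

// Given the name as @-used, return a function object that evaluates it.
// An empty std::function means the metafunction was not found.
metafunction lookup_metafunction(const symbol_resolver& resolve,
                                 const std::string& cpp2_name) {
    assert(resolve && "a symbol resolver is required");
    void* address = resolve(metafunction_symbol(cpp2_name));
    if (address == nullptr) {
        return {};
    }
    auto raw = reinterpret_cast<raw_metafunction>(address);
    return [raw](type_declaration& t) { raw(&t); };
}
```

Keeping the resolver behind a callback means the same lookup code can sit on top of whichever dynamic loading facility the platform offers.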

Evolution

Name lookup

Up until now, cppfront has been able to rely on the name lookup of lowered Cpp1 code. But this design introduces an evaluation point that happens outside the C++ abstract machine. It wants to look up a name that has already been compiled in Cpp1 and use it as named in Cpp2 code before the Cpp2 code has been lowered to Cpp1.

The current design doesn't consider name lookup. It expects a metafunction name to be @-used unqualified and to follow C "namespacing" conventions.

Dependency scanning

The current design only requires specifying a protocol for lowering and loading a metafunction. To author and consume a metafunction at scale, we also need dependency scanning, pretty much like Cpp1 modules.

Many of us use a build system to manage the complexity of building Cpp1 code. We would like to avoid rerunning cppfront on a Cpp2 source when neither the source nor any of the libraries that provide the metafunctions it uses has changed. Conversely, we want cppfront to rerun if one of those libraries has changed.

We can't know which metafunction a Cpp2 source uses without manually duplicating this information in the build system description. cppfront can't just emit the dependency information after the fact (like Cpp1 compilers on #included headers) because the libraries need to have been built before it starts evaluating the metafunction.
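
To make the scanning step concrete, here is a minimal sketch of what a dependency scanner might extract from a Cpp2 source before cppfront runs. The function name scan_metafunction_uses is hypothetical, and the approach is purely textual: it would also match @ inside comments and string literals, and it does not separate built-in metafunctions from program-defined ones. A real scanner would more likely reuse cppfront's lexer.

```cpp
#include <cassert>
#include <regex>
#include <set>
#include <string>

// Collect every (possibly qualified) name that follows an '@' in the source.
// The build system could map these names to the libraries that must be
// built before cppfront processes this file.
std::set<std::string> scan_metafunction_uses(const std::string& source) {
    static const std::regex at_use(
        R"(@([A-Za-z_][A-Za-z0-9_]*(?:::[A-Za-z_][A-Za-z0-9_]*)*))");
    std::set<std::string> names;
    for (std::sregex_iterator it(source.begin(), source.end(), at_use), end;
         it != end; ++it) {
        names.insert((*it)[1].str());
    }
    return names;
}
```

For the source text my_class: @my_metafunction type = { }, this returns a set containing only my_metafunction.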

It has been suggested that cppfront could have a command line argument for compiling a metafunction library. That would obviate the need for a dependency scanner, but this inversion of the build logic has drawbacks.

There was an article that I can't find, I think linked from the LLVM Discourse, about how some other language's compiler (Go or Scala?) forked itself to build a module's sources in parallel. That ended up resulting in file system races in very rare cases. They rewrote their module compilation system to not fork itself and instead rely on their build system. That fixed the issues, and even (significantly? in some cases?) reduced compile times.

I think the general issue is attempting to do what should be done at a higher level. The higher level being that of the build system. The CMake support for Cpp1 modules already went in the direction of a dependency scanner (along with a long trail of papers for proper modules support). I think it'd be unwise to go in the other direction, which doesn't even seem to have build system support.

realgdman commented 6 months ago

Is my understanding correct that by loading you mean final user's program will load some library like DLL? If that's case, I'll raise concern about read-only programs, like microcontroller ROM-mable. Actually I like to use (near-)zero cost C++ for writing low-level stuff, but with this change there can be case, where you just don't have enough RAM to load extra executable. In that case cpp2 becoming language for big platforms only.

DyXel commented 6 months ago

Is my understanding correct that by loading you mean final user's program will load some library like DLL? If that's case, I'll raise concern about read-only programs, like microcontroller ROM-mable. Actually I like to use (near-)zero cost C++ for writing low-level stuff, but with this change there can be case, where you just don't have enough RAM to load extra executable. In that case cpp2 becoming language for big platforms only.

The metaprogramming environment is just a compile-time thing, the very end result that you'd compile and run on your embedded system would be Cpp1 code, which at this stage would not involve any kind of runtime overhead (unless of course, generated by the metafunctions themselves).

DyXel commented 6 months ago

In regards to name lookup in DLLs, the secret sauce is indeed to always use extern "C". However, that doesn't stop you from creating specific mangling on top of said C-naming and/or generating type-erased wrapper functions that then call the C++-mangled ones. For example, you could potentially solve the "namespacing" problem by following these steps:

It is likely that all of this could even be done with an extra metafunction that registers somewhere outside and ensures the resulting metafunction is lowered with the correct cppfront-specific mangling (which would be then an implementation detail):

n1: namespace = {
    n2: namespace = {
        my_meta_function: @cpp2_metafunction (inout t: meta::type_declaration) = {
            // ...
        }
    }
}

Could potentially generate code similar to this:

namespace n1 {
    namespace n2 {
        auto my_meta_function(meta::type_declaration& t) -> void {
            // ...
        }
    }
}

extern "C" void __cppfront_n1_n2_my_meta_function(void* t) {
    n1::n2::my_meta_function(*static_cast<meta::type_declaration*>(t));
}

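// Outside code could then, after resolving the full name via the exposed "tree":
void* lib = dlopen("libmetafunctions.so", RTLD_NOW); // hypothetical library name
auto* my_func = reinterpret_cast<void (*)(void*)>(
    dlsym(lib, "__cppfront_n1_n2_my_meta_function")); // dlsym, not dlopen, resolves a symbol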

This of course would still be limiting, but I think that having something greppable that we could get rid of later is very helpful in any case (I'm talking about @cpp2_metafunction).

JohelEGP commented 6 months ago

I have thought about that for supporting an overload with an in parameter. My main concern is that name lookup couldn't behave as it does everywhere else. Still, there is much value in this, even if initially we require to fully qualify all @-uses of program-defined metafunctions.

Doesn't extern "C" make it redundant putting the declaration in the global namespace?

DyXel commented 6 months ago

Doesn't extern "C" make it redundant putting the declaration in the global namespace?

Oh yeah, seems you are right. I guess that would simplify generation in that case. From cppreference:

When a function or a variable is declared (in any namespace) with "C" language linkage, all declarations of functions (in any namespace) and all declarations of variables in global scope with the same unqualified name must refer to the same function or variable.

This implies what you stated.

JohelEGP commented 6 months ago

The problem with having to perform name lookup in cppfront is shared with https://github.com/hsutter/cppfront/issues/666#issuecomment-1722329609. There, a wrong answer results in a compile-time error (due to an incomplete type) or a missed optimization. Here, choosing the wrong metafunction is a no-go. We just can't know what names have been made visible to lookup in imported Cpp1 code. The following uses of my_metafunction had better always refer to the same overload set:

#include <a_cpp1_header.hpp>
my_namespace: namespace = {
    my_nested_namespace: namespace = {
        f: (t: cpp2::meta::type_declaration) = t.my_metafunction();
        my_class: @my_metafunction type = { }
    }
}

That said, it could be possible to require a metafunction to be constexpr and to actually evaluate it during constant evaluation to produce the updated type. The technique to implement that would be similar to the one presented in Interactive C++ in a Jupyter Notebook Using Modules for Incremental Compilation - Steven R. Brandt. But that is not this design (and I haven't explored such a design).

That would require parser.h to be constexpr-capable. Running cpp2::meta::type_declaration::add_member during constant evaluation would be slow. Increased build times would be a let down, so I consider this path non-viable.

MaxSagebaum commented 6 months ago

I would raise the question "Why do we need overload detection?"

The interface for calling a metafunction is already defined. It needs to have one argument which is of type meta::type_declaration. Additional arguments are currently not supported. If we extend this, then the interface for all metafunctions would change.

So I currently do not see the use case for an overload detection in the library loader.

Comments to the document:

JohelEGP commented 6 months ago

We want to support both of these function types:

(inout _: cpp2::meta::type_declaration)
(_: cpp2::meta::type_declaration)

The one with the in parameter is used by the built-in @print. It would also be used by metafunctions that don't modify the type, like

  • a metafunction that generates compile-time file output (e.g., to generate code in another language, such as a Java/Swift wrapper for a C++ object by writing my_interface: @java_interface type = { ... }).

-- https://github.com/hsutter/cppfront/discussions/650#discussioncomment-7170046

MaxSagebaum commented 6 months ago

Ok, one could argue that the in case is just a special version of the inout case. Therefore, the inout case would be enough.

If both need to be supported, then it would result in a fixed set of function definitions. These could be represented by an enum and the entry name of the enum could be included in the function mangling. E.g. cpp2_metafunction_in_namespace_test_print. During the reading of the library the metainformation can be extracted from the name. It can then be checked when the metafunction is called.

One idea would be to include the declaration kind in the enum. This would allow better error messages. The enum fields could be: type_in, type_inout, function_in, function_inout, member_function_..., enum_..., namespace_...

Metafunction declaration could then include this:

print: @meta<metakind::function_in>

Maybe make it a flag enum and have the in, out as extra elements in the enum.
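
One possible shape for this, sketched in C++ under assumptions of my own: the symbol layout (prefix, then kind, then name), the helper names encode_symbol and decode_symbol, and the reduced set of four kinds are all hypothetical. The point is only that the kind can be recovered from the symbol name when the library is read and checked when the metafunction is called.

```cpp
#include <cassert>
#include <optional>
#include <string>

enum class metakind { type_in, type_inout, function_in, function_inout };

struct kind_name { metakind kind; const char* text; };
const kind_name kind_names[] = {
    {metakind::type_in, "type_in"},
    {metakind::type_inout, "type_inout"},
    {metakind::function_in, "function_in"},
    {metakind::function_inout, "function_inout"},
};

const std::string symbol_prefix = "cpp2_metafunction_";

// Lowering side: build "<prefix><kind>_<name>".
std::string encode_symbol(metakind kind, const std::string& name) {
    for (const auto& entry : kind_names) {
        if (entry.kind == kind) {
            return symbol_prefix + entry.text + "_" + name;
        }
    }
    assert(false && "unknown metakind");
    return {};
}

struct decoded_symbol { metakind kind; std::string name; };

// Loading side: recover the kind and the name, or nothing if the symbol
// does not follow the protocol.
std::optional<decoded_symbol> decode_symbol(const std::string& symbol) {
    if (symbol.compare(0, symbol_prefix.size(), symbol_prefix) != 0) {
        return std::nullopt;
    }
    const std::string rest = symbol.substr(symbol_prefix.size());
    for (const auto& entry : kind_names) {
        // The trailing '_' keeps "type_in" from matching "type_inout".
        const std::string tag = std::string(entry.text) + "_";
        if (rest.compare(0, tag.size(), tag) == 0 && rest.size() > tag.size()) {
            return decoded_symbol{entry.kind, rest.substr(tag.size())};
        }
    }
    return std::nullopt;
}
```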

JohelEGP commented 6 months ago

cpp2_metafunction_in_namespace_test_print

This is all we need. When lowering the library symbol, include "in" somewhere in it if the metafunction has an in parameter. When loading, if the symbol without "in" isn't found, retry with "in". This simulates overload resolution. Fortunately, a metafunction is a non-templated function with a single argument of a fixed type.
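
A minimal C++ sketch of that two-step lookup, assuming the "in" marker is placed right after a cpp2_metafunction_ prefix (as in the quoted symbol) and that symbols are resolved through a hypothetical symbol_resolver callback standing in for the platform's dynamic loader:

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>

// Hypothetical resolver: returns the address of a symbol, or nullptr.
using symbol_resolver = std::function<void*(const std::string&)>;

struct found_metafunction {
    void* address;            // address of the extern "C" symbol
    bool takes_in_parameter;  // true if only the "in" variant was found
};

// Try the inout symbol first, then retry with the "in" marker.
// Caveat of this naming: an inout metafunction whose own name starts with
// "in_" would collide with the "in" variant of a shorter name.
std::optional<found_metafunction> find_metafunction(
        const symbol_resolver& resolve, const std::string& name) {
    assert(resolve && "a symbol resolver is required");
    if (void* address = resolve("cpp2_metafunction_" + name)) {
        return found_metafunction{address, false};
    }
    if (void* address = resolve("cpp2_metafunction_in_" + name)) {
        return found_metafunction{address, true};
    }
    return std::nullopt;
}
```

The caller can then cast the address to the function type that matches takes_in_parameter before invoking it.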

JohelEGP commented 6 months ago

The lack of name lookup is a big issue.

These aren't my concerns:

These cases, which surprise users, are my concerns:

Right now, Cpp2 source files are processed in isolation. As I stated in the opening comment, we have to do dependency scanning. This is a good opportunity to output extra semantic information to pass down to later invocations. This would work pretty much like how BMIs are passed down to later compilations of Cpp1.

For starters, it would serve us to keep a structure of introduced names for name lookup. This would mean that convenience Cpp1 using declarations and using directives would have to be moved to Cpp2 namespaces (in a new file or within the same one by also changing its extension).

This could also serve as a starting point for other features that aren't possible to implement right now.

In a better world, all our dependencies would be C++ modules, and we would have Cpp1 runtime reflection to query this information. Then these requirements on Cpp1 using could be limited to the current TU.

JohelEGP commented 6 months ago

These cases, which surprise users, are my concerns:

  • A type metafunction with function type (inout _: type_declaration), where lookup for type_declaration finds cpp2::meta::type_declaration, isn't recognized by cppfront as a type metafunction.

I think this is a solvable problem today if I can loop over a DLL's symbols.

JohelEGP commented 6 months ago

There were also concerns when cross-compiling with the Circle model of meta-programming.

Remember that in this design a metafunction is loaded from a library. A metafunction should be processed by a cross-compiled cppfront if it depends on platform specifics or build artifacts.

If cppfront can't be cross-compiled, then metafunctions can't be used. However, a metafunction that doesn't depend on platform specifics can be processed by cppfront on the host. But what about dependencies on build artifacts? If you don't mind duplicating work on the host, it's possible to use metafunctions.

patrickdown commented 6 months ago

The pattern that is common in many applications that load plugins as dynamic libraries is to have a common extern "C" initialization function entry point. This function will take as a parameter an interface that allows the plugin to define its functions.

For example the interface might have an add_metafunction method that defines the function name, the pointer to the implementation and the context where it would be used.
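
A sketch of how such a registration entry point could look in C++. Every name here (cppfront_plugin_init, metafunction_registrar, register_into, load_plugin) is hypothetical, type_declaration is a stand-in for cpp2::meta::type_declaration, and the "context" argument mentioned above is left out for brevity. In a real setup the init function would be found by name in each loaded library; here it is passed directly so the example stays self-contained.

```cpp
#include <cassert>
#include <map>
#include <string>

struct type_declaration { std::string name; };  // stand-in type

using metafunction_ptr = void (*)(type_declaration*);

// Interface the host hands to each plugin: plain pointers only, so it can
// cross the extern "C" boundary.
struct metafunction_registrar {
    void* host_data;
    void (*add_metafunction)(void* host_data, const char* name,
                             metafunction_ptr implementation);
};

// --- plugin side ---
static void add_suffix(type_declaration* t) { t->name += "_checked"; }

// The single well-known entry point the host looks for in a library.
extern "C" void cppfront_plugin_init(const metafunction_registrar* registrar) {
    registrar->add_metafunction(registrar->host_data,
                                "n1::n2::my_meta_function", &add_suffix);
}

// --- host side ---
using registry = std::map<std::string, metafunction_ptr>;

void register_into(void* host_data, const char* name,
                   metafunction_ptr implementation) {
    static_cast<registry*>(host_data)->emplace(name, implementation);
}

// Call a plugin's init function and collect what it registers.
registry load_plugin(void (*init)(const metafunction_registrar*)) {
    registry loaded;
    const metafunction_registrar registrar{&loaded, &register_into};
    init(&registrar);
    return loaded;
}
```

Because the plugin supplies the Cpp2-qualified name as a string, the emitted symbol names no longer need to encode namespaces at all.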

JohelEGP commented 6 months ago

IIUC, that inverts the logic so that plugins register themselves, right?

patrickdown commented 6 months ago

IIUC, that inverts the logic so that plugins register themselves, right?

Yes, mostly. The application still needs to know which libraries to load but this process is just reduced to system calls to load the library and find one "C" function with a known name.

hsutter commented 5 months ago

A note about this:

  • kinds of security concerns that led SG7 to discard std::embed (P1040).

My understanding is that std::embed is well on track for C++26 with paper P1967R12. It was design-approved for C++26 at the Feb 2023 meeting, and my understanding is that the only updates being requested are wording updates.