microvm / microvm-meta

We have moved: https://gitlab.anu.edu.au/mu/general-issue-tracker
https://gitlab.anu.edu.au/mu/general-issue-tracker
3 stars 0 forks source link

IR construction API for the convenience of verification #50

Open wks opened 8 years ago

wks commented 8 years ago

Problem

The only way for the current API to transfer an IR bundle from a client to a Mu micro VM is via the load_bundle API (for example, microvm.load_bundle(""".typedef @i32 = int<32>\n"""), or use a binary format), which passes a serialised IR format (text or binary). JVM takes this approach. Its serialisation format is "Java bytecode".

An alternative way to deliver a bundle is to let the client construct each node (type, function, basic block, instruction, ...) in the bundle by calling an API function (for example, handle = microvm.make_instruction("ADD", i32, op1, op2)) which returns a handle to the node, then passing this handle to the Mu micro VM. LLVM uses a similar approach: it provides constructors to each node class, and provides a "builder" to conveniently build a CFG.

Since the construction of the IR should be off-loaded from the micro VM, there should be a client-side library (we call it libmu) which constructs the IR for the client and provides an API to the client. libmu itself talks with the micro VM through implementation-dependent "private" APIs.

It is not clear whether serialisation or construction by calling is "better" because "better" has many definitions, but the calling-based API must exist because it is reported that it is very difficult to verify a parser.

Comparing the two approaches

serialisation

construction by calling

Depending on the two languages calling each other, the cost can be trivial (such as C and C++) or huge. Java must go through JNI to call native functions. Go performs a swap-stack operation every time it calls C in order to work around blocking system calls in its M*N threading model. Other language implementations, such as Python, relies on libffi, which builds a native call by dynamically prepare its arguments, and the high-level language such as Python needs to convert C types to Python types and vice versa.

An experiment shows calling a simplest C function from Haskell introduces 30x overhead comparing to calling the same C function from another C function.

But from verification's point of view, the cost does not matter as long as it is spent in the client.

Cost of serialisation

Serialisation is not free. In a simple experiment, serialising a CFG-like data structure to an intermediate format and then parse it in another module (both written in C) introduces a 10% overhead comparing to directly constructing the target CFG structure in the receiver while assuming the sender only holds opaque references. The major cost is memory allocation (where malloc is the bottleneck) and resolving the cross-references between nodes (where the hash map is the bottleneck).

Serialisation is not free, but is reasonably cheap. When foreign function call is expensive, serialisation can be used as an alternative.

Not calling across languages

Since calling across languages is expensive, it is desirable to implement part of the libmu in the same (or similar) language the client is written in, and let libmu construct the data structure that is native to the micro VM.

For example, if the micro VM is written in C++, the libmu should construct a tree of C++ class instances as LLVM does.

Note that LLVM is designed and implemented in C++ and serves C/C++. It is not a problem if the only official API it provides is C++. There is a C binding, too, but it is trivial. However, Mu has a specification which allows multiple implementations. In this case, the micro VM core would not always be in C/C++ or any particular language.

However, if the micro VM is written in a managed language, such as RJava or Java, then it will be interesting:

Maybe. At least the verified libmu needs to be minimal.

There may be even higher-level libraries outside above in the client's world outside libmu. Those libraries are not minimal.

How many languages should be supported?

Ideally the libmu should have both intimate knowledge of a particular Mu implementation, and intimate knowledge of the language the client is written in. In theory, there can be one (or more) libmu for each (micro VM impl, client language) pair.

Since C is so popular, we will define a C API for libmu. In theory,

  1. there can be more than one C APIs
  2. there can be APIs for other languages and they may or may not look like the C API. (preferably not, to avoid paradigm impedance)

    Mu CFG and the client CFG (or AST)

How much should the client use the Mu IR CFG? i.e. should the client construct Mu IR nodes and do transformations on it as LLVM does? Probably not.

LLVM is designed to be maximum: it attempts to be maximum and its CFG contains much information for optimisation, such as the "nsw" and "nuw" flag on the "add" and the "sub" instructions.

But the Mu IR is designed to be minimal and is only designed for the micro VM to consume. It does not contain much information that benefits the client.

It is possible that a client-side library performs IR transformations, but it is doubtful whether that IR is the same Mu IR. Many optimisations, such as whether x+1 > x is always true, depends on extra information (such as the "nsw" flag in LLVM) which the Mu IR do not provide.

I @wks believe the Mu IR is only generated as the last step of the client-side transformation, i.e. the next step is to deliver it into the micro VM.

Towards the new API

Micro-VM-to-client API

The existing controlling API does not need to be changed.

The bundle loading API can be removed. i.e. "how to load bundles into the micro VM" is implementation-dependent, which actually mean libmu-dependent.

libmu-to-client API

This API needs to be carefully designed because it is part of the formal verification.

There should be a model of the internal states of libmu which includes:

During construction, each node can hold incomplete information at a given time. The Mu IR nodes may circularly refer to each other (currently they refer to each other via IDs), so it is desirable to allocate several nodes and then link them with each other.

If multiple threads can use the same libmu, the transition in its internal state must be properly handled.

Cares must be taken to select the minimum set of API functions, because the current Mu IR has 18 types and 37 instructions. The number of API functions may easily bloat to 100+ if too many CRUD commands are added.

Micro VM-to-libmu interaction

This interface does not need to be public, but the proper handling of data structures in the micro VM is important. This interface is part of the verification.

In all cases, the choice of languages does matter. Properly chosen languages for the client and the micro VM will result in high performance and verifiability.

Future languages

This section contains my @wks personal opinions. These affects my opinions on this API, too.

In the future, popular system programming languages will generally be higher-level than current languages (such as C). We already showed that a high-level language, such as RJava and Java, can produce high-quality language runtimes, such as JikesRVM.

Instead of relying on C, high-level system programming languages will gain direct control over low-level operations, such as raw memory access (pointers). These low-level stuff can be done even though the language itself still has very high-level features, such as garbage collection and object-oriented programming. It is possible (in my opinion at least) that eventually such high-level language can replace libc and directly interface with the kernel, thus eliminates the necessity of C except for some very rare cases.

This trend is already visible in several languages. C# already has unsafe pointers and unsafe native interface (P/Invoke). The Java API is mostly implemented in Java, unlike the old days when the standard Java library is mostly implemented in C++. There is also a JEP to add unsafe native interface in Java, which still has not become mainstream yet. However, OpenJDK already exposes some low-level stuff though the sun.misc.Unsafe class. RJava obviously used magic to gain low-level support. In Ruby, ruby-ffi is recommended over directly writing C modules.

With C being redundant, the high-level language (or runtime) may be optimised for internal interoperation (e.g. Java-to-Java calls will be faster and faster) at the expense of the interoperability with C (e.g. even unsafe native call may be costly. Object pinning or opaque handles are required for native programs, which would only run briefly.).

Since raw memory access will be faster while foreign function call will be slower, serialisation may have an advantage over calling (my experiment already shows this for Java and C). But this does not rule out the calling-based API because it may not be foreign calls, in which case it is still faster than serialisation. Ideally, in the meta-circular setting, if both Mu, libmu and the client are in Mu, then Mu-to-Mu function calls are virtually free.

wks commented 8 years ago

The calling-based API should not only satisfy the language the client is written in, but also the language the micro VM is written in. This affects some design choices.

For example, if the micro VM is written in C/C++/Java/RJava... (friendly to mutation, allows cyclic references), the C API could be designed as:

// defining a linked list:
// .typedef @i32 = int<32>
// .typedef @Node = struct<@i32 @NodeRef>
// .typedef @NodeRef = ref<@Node>

// Create each node first, because the type is recursive.
IntegerType *i32 = libmu->IntegerTypeCreate(0x1234 /* id */, "@i32" /* name*/, 32 /* length */);
StructType *node = libmu->StructTypeCreate(0x2345, "@Node");
RefType *node_ref = libmu->RefTypeCreate(0x3456, "@NodeRef");

// Populate their members.
Type *node_fields[] = {i32, node_ref};
libmu->StructTypeSet(node, node_fields);
libmu->RefTypeSet(node_ref, node);

If the micro VM is written in Rust (allows mutation, dislike cyclic references and we don't want to force the micro VM to use reference counting because Rifat knows it is slow, or unsafe pointers because we know it is unsafe), then the C API would look like:

// defining a linked list:
// .typedef @i32 = int<32>
// .typedef @Node = struct<@i32 @NodeRef>
// .typedef @NodeRef = ref<@Node>

// Create each node first, because the type is recursive.
IntegerType *i32 = libmu->IntegerTypeCreate(0x1234 /* id */, "@i32" /* name*/, 32 /* length */);
StructType *node = libmu->StructTypeCreate(0x2345, "@Node");
RefType *node_ref = libmu->RefTypeCreate(0x3456, "@NodeRef");

// Populate their members. Since rust does not like objects refering to things they don't own,
// the types may instead refer to each other by IDs.
ID node_fields[] = {0x1234, 0x3456};
libmu->StructTypeSet(node, node_fields);
libmu->RefTypeSet(node_ref, 0x2345);

If the micro VM is written in Haskell/ML/... (dislike mutation, dislike cyclic references), then the C API would look like:

// defining a linked list:
// .typedef @i32 = int<32>
// .typedef @Node = struct<@i32 @NodeRef>
// .typedef @NodeRef = ref<@Node>

// Create each node. The types are recursive, but we want to get rid of mutations. 
IntegerType *i32 = libmu->IntegerTypeCreate(0x1234 /* id */, "@i32" /* name*/, 32 /* length */);
ID node_fields[] = {0x1234, 0x3456};
StructType *node = libmu->StructTypeCreate(0x2345, "@Node", node_fields);

// Note that node_ref, which will have ID 0x3456, has not been created yet.
// The internal state of @Node is incomplete at this moment.

RefType *node_ref = libmu->RefTypeCreate(0x3456, "@NodeRef", 0x2345);

// Now the state is complete.
// As long as Mu or the client don't ask "What fields does @Node have?" before completion,
// there will be no problem.

Both pointer-based API and ID-based API should work for all languages. Pointers can be resolved from a HashMap<ID, Obj>, and IDs can be a common field in all objects. Both eager API and lazy API work for all languages. We can require libmu to resolve undefined IDs later. But one kind of API is always better (or "more native") for one implementation than another.