Micro VM interface changes precipitated by having C as a client

eliotmoss commented 9 years ago

While there will be a bunch of carefully worked out details that need to follow, I thought I would get the ball rolling with a quick summary / highlights of the Modula-3 approach. (See https://www.cs.purdue.edu/homes/hosking/m3/reference/m3.html for a Modula-3 reference.)

First, M3 has ordinary traced references and also untraced ones. REF T is the type of a traced reference to a T, while UNTRACED REF T is the type of an untraced one. Traced vs. untraced is a concept distinct from safe vs. unsafe. Safe uses of traced and untraced references keep them segregated. Untraced objects come from an explicitly managed storage area. You can use NEW on an untraced reference type to allocate in such an area.

Certain regions of code (interfaces or modules -- probably a bundle is the corresponding thing for the Micro VM) can be marked UNSAFE. Unsafe code can do additional things. These are ones that I remember:

Cast an expression to an arbitrary type (of the same size in bits) - this clearly allows interconverting between traced and untraced pointers, but also other things; this is called LOOPHOLE(e, T).
Address arithmetic (Digression: In M3 REFANY is the type of a traced reference to any object, which in safe code requires a dynamically type-checked downcast to turn into a REF T and do more interesting things. The corresponding untraced type is ADDRESS, and it has arithmetic operators.)
Up/down cast without checking
Free an untraced reference's referent (the function is called DISPOSE)

Safe code must use traced/untraced references only in specific ways, described in the language reference.
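
For readers who think in C rather than Modula-3, here is a rough sketch of what two of the unsafe operations above look like. It is an analogy, not M3 semantics: float_bits and bits_float stand in for a same-size LOOPHOLE, and address_add stands in for arithmetic on ADDRESS. All three function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a float as a 32-bit integer. Like LOOPHOLE(e, T),
   this converts nothing; it only works because the two sizes match. */
uint32_t float_bits(float f) {
    uint32_t u;
    assert(sizeof u == sizeof f);
    memcpy(&u, &f, sizeof u);
    return u;
}

/* The reverse reinterpretation. */
float bits_float(uint32_t u) {
    float f;
    assert(sizeof f == sizeof u);
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Byte-granularity arithmetic on an untraced address. */
void *address_add(void *a, size_t n) {
    return (unsigned char *)a + n;
}
```

On a machine with IEEE 754 floats, float_bits(1.0f) yields 0x3F800000. Nothing here checks that the result is meaningful at the target type, which is why M3 confines such operations to UNSAFE modules.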

We may need to add to this sort of design to allow short unsafe code regions to manipulate ordinary (traced) objects safely in the presence of (say) concurrent GC. Also, an untraced version of iref would make sense for the Micro VM.

mn200 commented 9 years ago

We have been discussing this and agree that we want to have pointers manipulated by microvm programs but which point outside of the uvm's "world". It seemed to us that such pointers would look the same as our existing internal reference type, and that dealing with tracedness or not would be entirely a matter for the uvm implementation (i.e. not visible to the client).

eliotmoss commented 9 years ago

Hmm. I'm trying to think this through. It seems to me that the existing notion of irefs is intended to deal with so-called interior pointers - pointers that refer to some part of a heap allocated object but that do not necessarily point at its canonical reference point. In the case of a moving collector, an iref could therefore move, and the system would have to track the "base" object from which it was derived. The untraced pointers of C are different in that they have no connection with the GCed world at all. Further, it is possible to distinguish between safe operations and unsafe operations on untraced pointers. For example, an untraced pointer can refer to a structured object and we can safely create an (untraced) iref to a field of that object, etc. It is true that C also allows unsafe computations on pointers, and we desired to restrict those to happening only on untraced pointers, and only in modules explicitly marked unsafe. To help support one style of writing allocators and collectors, unsafe modules would be allowed to inter-convert traced and untraced pointers. But I guess what deeply distinguishes untraced from iref for me is that an untraced ref is a valid argument to free() (an operation allowed in an unsafe module) while an iref is not. I understand the yearning toward parsimony, but I think there is real utility in the distinctions I laid out.
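
To make the base-tracking point concrete, here is a small C simulation. It is only a sketch under loud assumptions: struct iref, iref_resolve and move_object are hypothetical names, malloc stands in for the collector's heap, and a real moving collector would of course not work through malloc/free.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct point { int x; int y; };

/* Hypothetical model of an interior reference: the base object is kept
   alongside the offset so the address can be re-derived after a move. */
struct iref {
    void  *base;    /* canonical reference to the whole object */
    size_t offset;  /* byte offset of the referenced part */
};

/* Resolve an iref to a raw address, valid only until the object next moves. */
void *iref_resolve(struct iref r) {
    return (unsigned char *)r.base + r.offset;
}

/* Simulate a moving collector: copy the object, release the old copy, and
   return the iref re-derived from the new base. */
struct iref move_object(struct iref r, size_t size) {
    void *to = malloc(size);
    assert(to != NULL);
    memcpy(to, r.base, size);
    free(r.base);   /* the base is a valid argument to free(); the interior address is not */
    r.base = to;
    return r;
}
```

An iref to the y field of a struct point still resolves to the same value after move_object, because only the base changed. A raw C pointer to y would have been left dangling, and it was never a legal argument to free() in the first place.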

eliotmoss commented 9 years ago

Meanwhile, I have been wondering a little about packed data structures, such as C's bit fields. For systems programming such packed data structures are convenient. However, given how most machines work, we cannot really create pointers (irefs) to fields of packed structures, at least not in the ordinary sense. I am not sure about the cleanest way to handle this. It would be nice to have a compiler / run-time handle the shifting/masking for us when reading / storing packed bit fields or bits in packed arrays, but it may require explicit new operators for supporting reading / writing packed structure fields and array elements, if we want to support this. The idea of bit sizes is there; it's just that it is not reasonable to support general pointers down to the bit level. I would be interested in other people's thoughts on this.
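
As a point of reference, the shifting/masking that such explicit operators would hide looks roughly like this in C. This is a sketch for a field held within one 32-bit word; load_bits, store_bits and low_mask are illustrative names, not proposed Micro VM operations.

```c
#include <assert.h>
#include <stdint.h>

/* Mask with the low `width` bits set; width must be in 1..32. */
uint32_t low_mask(unsigned width) {
    assert(width >= 1 && width <= 32);
    return width == 32 ? UINT32_MAX : ((UINT32_C(1) << width) - 1);
}

/* Read the `width`-bit field whose lowest bit is at position `shift`. */
uint32_t load_bits(uint32_t word, unsigned shift, unsigned width) {
    assert(width >= 1 && shift + width <= 32);
    return (word >> shift) & low_mask(width);
}

/* Return `word` with that field replaced by the low bits of `value`. */
uint32_t store_bits(uint32_t word, unsigned shift, unsigned width, uint32_t value) {
    assert(width >= 1 && shift + width <= 32);
    uint32_t mask = low_mask(width) << shift;
    return (word & ~mask) | ((value << shift) & mask);
}
```

Note that the field is named by a (word, shift, width) triple rather than by an address, which is the reason an ordinary pointer cannot refer to it.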

mn200 commented 9 years ago

We certainly agreed (trying to remember the content of the meeting) that the implementation would need to distinguish these different sorts of pointers.

As for the use of free being the user-visible difference in the usages, I think we were confused by how/why uVM code would be allocating (in) untraced regions of space. We could certainly imagine clients handing untraced refs over to the uVM to be acted on there, but if the uVM was itself to be handing memory off to other worlds, then we thought that it would just be able to allocate memory in its own heap, and to simply hand over irefs, trusting that the callee would do the right thing with these.

And wrt safe vs unsafe, I think we were of the attitude that we would happily support generally unsafe operations everywhere. So, for example, casts between different object ref types would be freely available and unchecked.

eliotmoss commented 9 years ago

Why would uVM code allocate in untraced space? Well, if C is a client, C does malloc/free, and these will have to be in untraced space. But more than that, sometimes explicit management is something a library wants to do, and can do more efficiently than a GC can. Also, in the context of a possibly moving GC, untraced is good for communication with the OS via non-moving buffers (and even mmap-ed data). Beyond that, if we choose to implement some of the system internals with uVM code using C (say), then those internals might need to work in terms of a simple untraced allocation system. Put another way, how else would the collector's own data structures get managed? Concerning the notion of unsafe modules, I am surprised that you're not on board with trying to isolate and confine the places where some of the worst things can happen. I can imagine, for example, that some clients could be written to produce entirely type-safe uVM code, and I have a sense that it would be helpful to know that. But I suppose I am even more persuaded by the improved debuggability and provability of a system where we build parts of uVM itself from uVM code and we are able to "sand box" the unsafe parts that require more careful scrutiny. While JikesRVM's magics allowed us to do bad things, the general type-safeness of the universe made it much more possible to write complex system components, debug them, and build confidence in their correct operation ...

mn200 commented 9 years ago

IRefs to bitfields is an interesting one! It perhaps deserves an issue of its own. (We have repeatedly told ourselves that irefs would be "fat" pointers, and if this is the case, then references to arbitrary fields would probably be easy enough to represent. There'd still be lots of work for the compiler to dereference such pointers of course.)
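
One possible shape for such a fat reference, sketched in C. Everything here is an assumption for illustration (struct bitref, its layout, and the bit numbering, which counts from the least significant bit of each byte); it says nothing about how irefs are actually represented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fat reference to a bit field inside packed storage. */
struct bitref {
    uint8_t *base;     /* start of the packed storage */
    size_t   bit_off;  /* position of the field's lowest bit, counted from base */
    unsigned width;    /* field width in bits, 1..32 */
};

/* Dereference for reading: gather the field one bit at a time. */
uint32_t bitref_load(struct bitref r) {
    assert(r.width >= 1 && r.width <= 32);
    uint32_t v = 0;
    for (unsigned i = 0; i < r.width; i++) {
        size_t bit = r.bit_off + i;
        uint32_t b = (r.base[bit / 8] >> (bit % 8)) & 1u;
        v |= b << i;
    }
    return v;
}

/* Dereference for writing: set or clear each bit of the field. */
void bitref_store(struct bitref r, uint32_t value) {
    assert(r.width >= 1 && r.width <= 32);
    for (unsigned i = 0; i < r.width; i++) {
        size_t bit = r.bit_off + i;
        uint8_t m = (uint8_t)(1u << (bit % 8));
        if ((value >> i) & 1u)
            r.base[bit / 8] |= m;
        else
            r.base[bit / 8] &= (uint8_t)~m;
    }
}

/* Reference to element `index` of a packed array of `width`-bit elements. */
struct bitref packed_elem(uint8_t *base, size_t index, unsigned width) {
    struct bitref r = { base, index * width, width };
    return r;
}
```

The bit-at-a-time loops keep the sketch short and handle fields that straddle byte boundaries; a compiler would instead emit a few wide loads with shifts and masks, which is the "lots of work" part.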

wks commented 9 years ago

Currently I am pushing Mu to converge towards providing these unsafe things we mentioned here. I want Mu to support something like the "unsafe" part of the C# programming language. Ideally there could be a subset of Mu IR that provides everything C needs for native memory access.

A draft of the native interface and the AMD64 Unix-specific part is available in the Mu spec.

In summary, we have:

A ptr<T> type which points to data, and a funcptr<sig> type which points to a native function. They are defined as addresses, and are not traced by the GC.
The pinning and unpinning mechanism.
The PTRCAST instruction which casts between ptr<T>, funcptr<sig> and int<n>.
All memory addressing instructions (GETFIELDIREF, GETELEMIREF, ...) and memory accessing instructions (LOAD, STORE, ...) support pointers, though the name "...IREF" may be misleading.
A CCALL instruction which calls a native function.

In this system, ptr<T> plays the part of untraced references. iref<T> currently only refers to the traced world, but I am not sure if it should be allowed to point to the raw memory.

The PTRCAST instruction casts between types by preserving the address without any runtime checking. So Mu does not really know what is an upcast or a downcast. It even allows directly casting an integer into a pointer. So address computation can be done by either using the GETFIELDIREF or similar instructions, or using arithmetic operations on integer addresses.
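
The two routes to a field address can be pictured in C as follows. This is an analogy only: struct pair and both function names are invented, and the C code stands in for the Mu instructions rather than reproducing them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct pair { int32_t first; double second; };

/* Route 1: typed field addressing, analogous to GETFIELDIREF on a pointer. */
double *second_by_field(struct pair *p) {
    assert(p != NULL);
    return &p->second;
}

/* Route 2: cast the pointer to an integer, add the field offset, cast back. */
double *second_by_arith(struct pair *p) {
    assert(p != NULL);
    uintptr_t addr = (uintptr_t)p;
    addr += offsetof(struct pair, second);
    return (double *)addr;
}
```

On mainstream platforms both functions return the same address. The first keeps the type information; the second relies entirely on the programmer getting the offset right.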

For traced types, the REFCAST instruction casts between two refs (object references), two irefs (internal references) or two funcs (Mu function references) without any runtime checking.

For non-traced types, currently Mu has two instructions that cast by preserving bits: BITCAST (between integers and FP) and PTRCAST (between ptr, funcptr and int). The latter is only bit-preserving if the integer type has the same length as the pointer. Otherwise it truncates/zero-extends the pointer/integer.
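
A small C model of the width rule, assuming 64-bit pointers as on AMD64. Addresses are modelled as uint64_t values and the function names are hypothetical; this illustrates the described behaviour and is not taken from the spec.

```c
#include <assert.h>
#include <stdint.h>

/* Pointer to a narrower integer: the high 32 bits are dropped. */
uint32_t ptr_to_int32(uint64_t addr) {
    return (uint32_t)addr;
}

/* Narrower integer to pointer: the value is zero-extended to 64 bits. */
uint64_t int32_to_ptr(uint32_t v) {
    return (uint64_t)v;
}

/* Nonzero if the address survives a round trip through a 32-bit integer. */
int survives_int32(uint64_t addr) {
    return int32_to_ptr(ptr_to_int32(addr)) == addr;
}

/* Same-width case on the host: pointer -> uintptr_t -> pointer keeps the address. */
void *same_width_round_trip(void *p) {
    uintptr_t bits = (uintptr_t)p;
    void *q = (void *)bits;
    assert(q == p);
    return q;
}
```

Only addresses below 4 GiB survive the 32-bit round trip, so a client that stores pointers in integers needs an integer type as wide as the pointer.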

ref and iref can be converted to ptr by pinning. ptr cannot be converted back to ref or iref, but if a crazy metacircular Mu implementation attempts to implement its own GC in Mu IR, it may allow this conversion in an implementation-specific way (maybe via some extended instructions) and also worry about the movement of objects.

There are currently no "safe regions", and pointers can be used anywhere (i.e. everywhere is unsafe).

Manual memory management can be implemented outside or above Mu. For example,

Calling the malloc and free functions using the CCALL instruction is one possible approach. Other C allocators (such as tcmalloc or jemalloc) can be used, too.
The client can implement its own malloc/free counterparts as Mu functions. The algorithm involves address computation, but the user simply uses whatever addresses the algorithm returns. It should, however, use the system interface (such as mmap or malloc) for bulk allocation.
The client can also implement a simple untraced bump-pointer allocator by inlining the pointer increment code into every allocation site in Mu IR.
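
For the last option, the pointer-increment logic is small enough to show. The following C sketch is one way it might look; struct bump_region, bump_init and bump_alloc are illustrative names, and the backing memory is assumed to come from a bulk allocation such as malloc or mmap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical untraced region: addresses in [cursor, limit) are still free. */
struct bump_region {
    uintptr_t cursor;
    uintptr_t limit;
};

/* Start allocating from `size` bytes of memory beginning at `start`. */
void bump_init(struct bump_region *r, void *start, size_t size) {
    r->cursor = (uintptr_t)start;
    r->limit  = r->cursor + size;
}

/* Allocate `size` bytes aligned to `align` (a power of two).
   Returns NULL when the region cannot satisfy the request. */
void *bump_alloc(struct bump_region *r, size_t size, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    uintptr_t start = (r->cursor + (align - 1)) & ~(uintptr_t)(align - 1);
    if (start < r->cursor || start > r->limit || size > r->limit - start)
        return NULL;
    r->cursor = start + size;
    return (void *)start;
}
```

The fast path is an align, a compare and an add, which is what makes inlining it at each allocation site attractive. There is no per-object free; the whole region is released at once by whoever obtained the backing memory.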

Pointers may be obtained from native programs, such as the malloc and free functions mentioned before, as well as memory mapping. Memory mapping works directly with the address space, so it is convenient to use integers directly as addresses. Memory mapping also provides a way for inter-process communication. Graphics applications can directly fill the buffers provided by some system software (such as X Server or Wayland).
