tristanls opened this issue 11 years ago
NOTE: this is documentation of some offline discussion
One of my ongoing concerns is the ability to run untrusted code. This spans the range from untrusted source code, to untrusted binaries, to interacting with memory written to via Direct Memory Access (DMA). It is important to me to understand that the chain of trust is unbroken and that the mechanisms make sense. This concern comes up again and again when I think about actor configurations interacting. Therefore, before accepting powerful abstractions I need to convince myself that the chain of trust can be maintained and untrusted things can be sandboxed.
In my mind, for the purposes of multiple configurations, untrusted execution has two parts: never giving up control, and accessing restricted memory. More commonly, these are referred to as fault isolation and resource isolation.
Some terminology: I will use host configuration to refer to what people typically think of as the kernel. That is, it is the software that has all the access to the machine and is tasked with providing fault and resource isolation. The term guest configuration refers to anything that is not a host configuration, that is, the things that actually get things done for us.
In the current implementation of Tart, the `config_dispatch(Config cfg)` procedure is responsible for making things happen. A mechanism I am comfortable with, and that should address some computation taking too much time, is a watchdog timer. When the timer runs out, an interrupt is generated for which the host configuration has registered a handler; from then on the host configuration should be able to make the appropriate decisions with regard to the execution of the event that timed out.
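To make this concrete, here is a minimal sketch of the watchdog idea using POSIX `setitimer` and `SIGALRM`; the names `guest_dispatch_one` and `WATCHDOG_USEC` are hypothetical stand-ins, not part of the current Tart API:

```c
/* Watchdog timer sketch: bound the time one event dispatch may take.
   Assumes POSIX setitimer/SIGALRM; guest_dispatch_one and WATCHDOG_USEC
   are hypothetical stand-ins for Tart internals. */
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

#define WATCHDOG_USEC 10000  /* 10 ms budget per event dispatch */

static sigjmp_buf watchdog_env;

static void watchdog_handler(int sig) {
    (void)sig;
    siglongjmp(watchdog_env, 1);  /* abort the in-flight dispatch */
}

/* stand-in for dispatching one event in a guest configuration */
static void guest_dispatch_one(void) {
    for (;;) { /* pathological: never gives up control */ }
}

int main(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = watchdog_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGALRM, &sa, NULL);

    if (sigsetjmp(watchdog_env, 1) == 0) {
        struct itimerval budget;
        memset(&budget, 0, sizeof budget);
        budget.it_value.tv_usec = WATCHDOG_USEC;  /* arm the watchdog */
        setitimer(ITIMER_REAL, &budget, NULL);
        guest_dispatch_one();                     /* never yields on its own */
    } else {
        /* the host configuration regains control and can decide what
           to do about the guest that exceeded its budget */
        printf("watchdog fired: event dispatch exceeded its budget\n");
    }
    return 0;
}
```

In a real host configuration the handler would presumably suspend or kill just the offending guest rather than the whole process; the sketch only shows that control reliably returns to the host.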
Accessing restricted memory is the other part; preventing it is a big part of resource isolation and a source of security in the system. There are two approaches that, combined, should provide protection against this pathology.
First, the assumption is that if a trusted compiler is used to compile an actor language into a binary that is compatible with Tart, then accessing restricted memory should never happen. This is a correctness-of-the-compiler problem and can be addressed by having a correct compiler.
With that out of the way, in the case of an untrusted binary attempting to access restricted memory, hardware support in the form of a Memory Management Unit (MMU) should suffice. Of course it requires some elaboration, but there is a path forward there.
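As an illustration of what the MMU buys us, here is a minimal sketch assuming POSIX `mmap` and signal handling; in a real host configuration the fault handler would isolate the offending guest rather than exit the whole process:

```c
/* MMU sketch: a page mapped PROT_NONE cannot be touched; the hardware
   traps the access and the host regains control via SIGSEGV. */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

static void segv_handler(int sig) {
    (void)sig;
    /* only async-signal-safe calls are allowed here */
    static const char msg[] = "MMU trapped an access to restricted memory\n";
    write(STDERR_FILENO, msg, sizeof msg - 1);
    _exit(1);
}

int main(void) {
    signal(SIGSEGV, segv_handler);

    /* a page the guest is not allowed to touch */
    char *restricted = mmap(NULL, 4096, PROT_NONE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (restricted == MAP_FAILED) { perror("mmap"); return 1; }

    restricted[0] = 42;  /* untrusted write: the MMU raises SIGSEGV */
    return 0;
}
```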
Additionally, another complication arises from accepting legitimate communications in the form of [marshalled](http://en.wikipedia.org/wiki/Marshalling_%28computer_science%29) actor configurations (i.e. messages between configurations across a network), or while processing memory written to as a result of DMA. Much as in the correctness-of-the-compiler case, this is a correctness-of-the-unmarshaller problem. Given that the compiler is trusted and correct, and that there exists a trusted and correct way to unmarshall any data that is inherently not trusted, that data should turn into executable code that does not access resources it should not.
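The flavor of the unmarshaller problem can be sketched as defensive parsing, where nothing in the input is trusted until it has been bounds-checked. The wire format here (a little-endian u32 length followed by payload bytes) is a hypothetical illustration, not Tart's actual format:

```c
/* Defensive unmarshaller sketch: every field is bounds-checked before
   it is read, so malformed input is rejected rather than trusted. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_MSG_LEN 1024

/* Returns bytes consumed, or -1 if the input is malformed or oversized. */
static int unmarshall_message(const uint8_t *buf, size_t buf_len,
                              uint8_t *out, size_t out_cap) {
    if (buf_len < 4) return -1;                 /* too short for the header */
    uint32_t len = (uint32_t)buf[0]
                 | ((uint32_t)buf[1] << 8)
                 | ((uint32_t)buf[2] << 16)
                 | ((uint32_t)buf[3] << 24);
    if (len > MAX_MSG_LEN || len > out_cap) return -1;  /* reject, don't trust */
    if (buf_len - 4 < len) return -1;           /* truncated payload */
    memcpy(out, buf + 4, len);
    return (int)(4 + len);
}

int main(void) {
    const uint8_t wire[] = { 5, 0, 0, 0, 'h', 'e', 'l', 'l', 'o' };
    uint8_t msg[MAX_MSG_LEN];
    int n = unmarshall_message(wire, sizeof wire, msg, sizeof msg);
    if (n < 0) printf("rejected malformed input\n");
    else       printf("accepted %d bytes\n", n);
    return 0;
}
```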
In summary, fault and resource isolation between a host configuration and any combination of guest configurations (including the possibility of guest binaries) should be enforceable using a combination of hardware timer-driven interrupts (watchdog timers) and the MMU, along with correct implementations of a compiler and an unmarshaller.
While working on tartunit, I came across what I believe to be a valid and useful use case for multiple configurations running on the same CPU. We have already predicted that this need will arise and now it seems we have a tangible example that I think is worth solving.
In order to get tartunit to work initially, I hacked around and ended up creating what I like to call an inception bug. That is, I have a configuration that dispatches events. While dispatching one of these events, before returning, I create a new configuration and dispatch multiple events within that configuration, before finally returning control to the original dispatcher, which then exits thinking it dispatched only one event. This is clearly a problem, as in a pathological case this recursion can go on indefinitely.
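A minimal sketch of the shape of the bug (the `Config` structure and these function signatures are hypothetical simplifications, not Tart's actual API):

```c
/* The "inception bug": a new configuration is created and fully
   dispatched *inside* the dispatch of a single event, so dispatch
   loops nest and can recurse indefinitely. */
#include <stdio.h>

typedef struct { int pending; int depth; } Config;

static Config *config_new(int depth) {
    static Config cfgs[8];              /* toy allocator for the sketch */
    cfgs[depth] = (Config){ .pending = 1, .depth = depth };
    return &cfgs[depth];
}

static void config_dispatch(Config *cfg) {
    while (cfg->pending > 0) {
        cfg->pending--;
        printf("dispatching one event at depth %d\n", cfg->depth);
        if (cfg->depth < 3) {           /* in the pathological case: unbounded */
            config_dispatch(config_new(cfg->depth + 1));  /* nested dispatch! */
        }
    }
}

int main(void) {
    /* the outer dispatcher believes it dispatched exactly one event */
    config_dispatch(config_new(0));
    return 0;
}
```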
What we would like to see is some sort of budget approach to dispatching, in which configurations cannot nest within each other but must be scheduled such that a dispatch of one event never occurs within a dispatch of another event.
This seems to give rise to a few considerations/observations that should guide design.
No nested event dispatch
It seems that there should be no nested event dispatch.
Additionally, this seems to create what we commonly refer to as the ground-out problem, for which we have already found coping mechanisms.
I'd like to note here that (as illustrated by the `act_serial` behavior in actor.c) calling another procedure to "handle" the event is not the same thing as having a nested configuration that dispatches events. `act_serial` is still part of a single event dispatch, and the configuration/sponsor does not change. A nested configuration would not have those properties.

Access and allocate resources through sponsors
We have discussed using sponsors for all resource allocation and usage, but so far have been able to hand wave that away. I think that in order to implement multiple configurations we now have to figure out the details of how to do everything through sponsors.
Part of the reason for this is that when I think about how a new configuration is created, I arrive at the conclusion that it cannot be created without the host configuration's knowledge. If a new configuration is created without the host configuration's knowledge, then we end up with an inception bug and in violation of the "no nested event dispatch" guideline.
If we do create a new configuration through the host configuration, then we can have a touchpoint where the newly created configuration can be given "first class" (as opposed to nested/virtual) status.
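A minimal sketch of that touchpoint, assuming a hypothetical host API in which new configurations are registered with the host's run queue and scheduled flat, never dispatched inline:

```c
/* Creating configurations through the host: the host records the new
   configuration as a first-class peer and schedules it alongside the
   others, so no dispatch ever nests inside another. All names here
   are hypothetical illustrations. */
#include <stdio.h>

#define MAX_CONFIGS 8

typedef struct { int id; int pending; } Config;

typedef struct {
    Config configs[MAX_CONFIGS];
    int count;
} Host;

/* the touchpoint: guests must ask the host for a new configuration;
   the host registers it but does NOT dispatch it inside the current event */
static Config *host_create_config(Host *host) {
    if (host->count == MAX_CONFIGS) return NULL;  /* resources denied */
    Config *cfg = &host->configs[host->count];
    cfg->id = host->count++;
    cfg->pending = 1;
    return cfg;
}

static void host_run(Host *host) {
    for (int i = 0; i < host->count; i++) {       /* flat schedule, no nesting */
        Config *cfg = &host->configs[i];
        while (cfg->pending > 0) {
            cfg->pending--;
            printf("config %d: dispatch one event\n", cfg->id);
        }
    }
}

int main(void) {
    Host host = { 0 };
    host_create_config(&host);  /* initial guest */
    host_create_config(&host);  /* created via the host: a peer, not nested */
    host_run(&host);
    return 0;
}
```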
Configuration hierarchy
There seems to be a hierarchy of sorts that correlates configurations to the hardware layout. Every CPU seems to want one "host" configuration which is responsible for dispatching all "guest" configurations that are multiplexed over that one CPU. Communications between "guest" configurations that reside on the same CPU appear to be simpler than communications between "guest" configurations that reside on different CPUs.
Should the "guest" configurations use their "host" configuration as a router?
Further down the road, there is inter-machine communication, which involves marshaling messages and references across the wire.
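If the answer to the routing question turns out to be yes, a minimal sketch of host-as-router might look like the following; the structures, ids, and forwarding behavior are all hypothetical:

```c
/* Host-as-router sketch: guests on the same CPU hand messages to their
   host, which delivers locally when it can and would otherwise marshal
   the message to a remote host. All names here are hypothetical. */
#include <stdio.h>

#define MAX_GUESTS 4

typedef struct { int id; } Guest;

typedef struct {
    Guest guests[MAX_GUESTS];
    int count;
} Host;

static void host_route(Host *host, int from, int to, const char *msg) {
    for (int i = 0; i < host->count; i++) {
        if (host->guests[i].id == to) {  /* destination is on this CPU */
            printf("guest %d -> guest %d (local): %s\n", from, to, msg);
            return;
        }
    }
    /* destination is not local: marshal and forward to another host */
    printf("guest %d -> guest %d: marshalled across the wire\n", from, to);
}

int main(void) {
    Host host = { .guests = { { .id = 1 }, { .id = 2 } }, .count = 2 };
    host_route(&host, 1, 2, "hello");  /* simple same-CPU delivery */
    host_route(&host, 1, 7, "hello");  /* guest 7 lives on another machine */
    return 0;
}
```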
Datacenter as a computer
Another consideration for sponsors and configurations I want to take into account is the datacenter as a computer approach taken by Google Borg/Omega and frameworks like Mesos. I will use Mesos as an example because that's the one that's open source.
Mesos consists of a master, numerous slaves, and numerous applications/frameworks. The Mesos master manages the Mesos slaves that run on each machine in the cluster, and Mesos applications run tasks on the slaves.
A key concept to this type of computational approach is the resource offer.
The Mesos master enables fine-grained resource sharing (CPU, memory, disk, etc.) across Mesos applications by making them resource offers. The Mesos master decides how many resources to offer to each application depending on policy.
Mesos applications consist of two parts: a scheduler, which registers with the Mesos master to receive resource offers, and an executor process, which is launched on a Mesos slave to run the application's tasks. While the Mesos master decides how many resources to offer, the Mesos application's scheduler decides which of the offered resources to use.
See Mesos Architecture for more details.
I map these ideas onto Organix and actor configurations as follows.
First, I don't like the single Mesos master. I am not advocating that we have such a centralized organization. However, the concept of a level of indirection between resources (hardware) and frameworks (actor configurations) that want to run on those resources is beneficial. I think there is a consideration here for the design of sponsorship and inter-configuration communications. A decentralized "master" configuration would be analogous to DNS anycast, in that a level of local knowledge is distributed amongst peers, so that "master" duties are performed in a distributed fashion using heuristics that result in good global behavior. An application/framework would ask any nearby peers for resources, and whichever one responded first would be the one chosen (hence the DNS anycast analogy). There exist protocols for peer-to-peer distribution of data, like Scuttlebutt (gossip) and DHTs.
I can imagine a design where "host" configurations (those pinned one-to-one directly to CPUs) are analogous to Mesos slaves that have resources (dispatch cycles, RAM, disk, etc.) available for computation.
I can also imagine that there are "guest" configurations, which would be analogous to Mesos applications/frameworks that want to "get stuff done". These "guest" configurations would receive resource offers from "host" configurations, and if the resources were sufficient for the "guest" configuration's requirements, then that "guest" configuration would run on the CPU of the "host" configuration. The matching/pairing of resource offers and "guest" configurations would happen in the distributed peer-to-peer fashion I outlined above.
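To make the offer/acceptance handshake concrete, here is a minimal sketch; the `Offer` and `Requirements` structures and the acceptance rule are hypothetical illustrations, not a proposed wire format:

```c
/* Resource-offer sketch, modeled loosely on Mesos: hosts advertise
   offers, and the guest-side scheduler accepts the first offer that
   satisfies its requirements. All names here are hypothetical. */
#include <stdbool.h>
#include <stdio.h>

typedef struct { int dispatch_cycles; int ram_kb; } Offer;        /* from a host */
typedef struct { int min_cycles; int min_ram_kb; } Requirements;  /* of a guest */

/* the guest decides which resources to use; the host only decides
   how much to offer */
static bool guest_accepts(Requirements req, Offer offer) {
    return offer.dispatch_cycles >= req.min_cycles
        && offer.ram_kb >= req.min_ram_kb;
}

int main(void) {
    Requirements guest = { .min_cycles = 1000, .min_ram_kb = 512 };
    Offer offers[] = {                 /* offers arriving from nearby hosts */
        { .dispatch_cycles = 500,  .ram_kb = 2048 },
        { .dispatch_cycles = 4000, .ram_kb = 1024 },
    };
    for (int i = 0; i < 2; i++) {
        if (guest_accepts(guest, offers[i])) {
            printf("guest runs on host %d (first sufficient offer)\n", i);
            break;                     /* first responder wins, per the anycast analogy */
        }
    }
    return 0;
}
```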
I believe that a benefit of this approach would be the inherently distributed and adaptive nature of Organix. Organix should be able to distribute any number of "guest" configurations with various computational needs (there are different types of resource offers and resource acceptance criteria in the Mesos documentation; for example, data locality, or warm caches) amongst available "host" configurations. Furthermore, adding another hardware resource with a new "host" configuration should require no developer effort to distribute the computation further, as the new "host" configuration should join the peer-to-peer ("host"-to-"host" configuration) arrangement and start hosting "guest" configurations as needed. A killer demo of this would be to start with a single device in a cluster and a resource-intensive computation. Then, as the cluster is grown, watch that computation distribute itself and speed up linearly (preferably) with the number of nodes joining the cluster.