sholsapp / gallocy

A distributed shared memory infrastructure.

Consider using a separate daemon process. #30

Open sholsapp opened 8 years ago

sholsapp commented 8 years ago

As you know, gallocy is trying to provide a "transparent" or "implicit" interface to the application. This necessarily means that we transplant replacement memory and threading interfaces into the process. This is extremely difficult to do correctly, and in some cases we've found it impossible (e.g., in libc's case). The root problem is that we need to use the system allocator and threads at the same time we're replacing them! This turns into a never-ending war against the standard library, where gallocy is constantly trying to hide its existence from the actual running process, but periodically corrupts heaps in doing so.

One way I think we can get around this is by moving as much code as possible out of the runtime API and into a separate process. This separate process is a standard C++ application: it can use libraries, use the system allocator, etc., and it holds state that is truly local to its node. After all, it's a heavyweight process. This daemon would be responsible for maintaining the distributed VMM, consensus, networking, etc., and would participate in no function interposition black magic. It would also expose an explicit interface as a library: think gallocy_malloc, gallocy_free, gallocy_pthread_create, gallocy_pthread_join, etc. We would gain some serious freedom and rule out entire classes of potential errors.
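For concreteness, a rough sketch of what that explicit interface might look like. The names come straight from the paragraph above; the signatures are my assumption, chosen to mirror the libc and pthreads calls they replace:

/* gallocy.h -- sketch of the daemon's explicit client interface.
 * Names come from the proposal above; signatures are assumptions
 * that mirror the libc/pthreads calls they would replace. */
#include <stddef.h>
#include <pthread.h>

void *gallocy_malloc(size_t size);
void gallocy_free(void *ptr);
int gallocy_pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                           void *(*start_routine)(void *), void *arg);
int gallocy_pthread_join(pthread_t thread, void **retval);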

At this point the runtime API is just a library that does little more than signal handling and function interposition. It still conducts black magic, but it does so without the worry that it's going to corrupt its internal state (the primary reason we currently maintain two allocators, custom types, custom threading symbols, and more). It would talk to the daemon's explicit interface over a custom IPC protocol that we would need to develop.

This decision would make a few show-stopping problems, like the libc issue, tractable: this design lets us maintain a single system allocator, so no memory allocation mismatch is possible. As long as we can implement the runtime interface such that it doesn't itself use the allocator (infinite loop), we simply add an IPC sync step to every allocation or fault. That sounds possible.
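To make the IPC sync step concrete, here is a minimal sketch of the runtime-side malloc. Both helpers are hypothetical, not existing gallocy APIs: gallocy_ipc_sync round-trips the request to the daemon, and region_alloc carves memory out of a region mmap'd at startup so the runtime never re-enters the system allocator:

/* Sketch only: gallocy_ipc_sync and region_alloc are hypothetical
 * helpers, not existing gallocy APIs. */
#include <stddef.h>

/* Bump-pointer allocation from a region mmap'd at startup, so the
 * runtime never calls back into the system allocator. */
extern void *region_alloc(size_t size);
/* Block until the daemon has recorded this allocation in the
 * distributed VMM. */
extern int gallocy_ipc_sync(void *addr, size_t size);

void *malloc(size_t sz) {
    void *addr = region_alloc(sz);
    if (addr == NULL)
        return NULL;
    if (gallocy_ipc_sync(addr, sz) != 0)
        return NULL;  /* daemon unreachable or rejected the allocation */
    return addr;
}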

This design would add ~5 microseconds to any transaction against the daemon if we chose a fast IPC mechanism like domain sockets or shared memory. I got these numbers by running a few tests using https://github.com/rigtorp/ipc-bench.

Am I missing anything super major, or would this design work and make implementing the system substantially easier (and a lot, lot, lot safer)? First order of business would be an investigation into whether IPC uses the allocator, right? I can put together a few experiments.

Let's think of potential problems with this design and record them here.

rverdon commented 8 years ago

I have been thinking about this design for the past couple of days and I haven't been able to come up with any major problems besides what you mentioned at the end of your post.

First order of business would be an investigation into whether IPC uses the allocator, right?

My gut reaction is that IPC uses an allocator.

Also, I would imagine some extra work would need to go into the daemon to be able to handle multiple applications talking to it.

sholsapp commented 8 years ago

First order of business would be an investigation into whether IPC uses the allocator, right?

My gut reaction is that IPC uses an allocator.

Probably, yes, but because it's a low-level system call I think that all of the memory it may use is in the kernel. Here's to hoping. I'll find out when I run the experiment.

sholsapp commented 8 years ago

I wrote this up in blog form the other day when I conducted the experiment for myself... here it is... TL;DR: we can write IPC code such that it doesn't implicitly use memory, as long as we avoid libc usage.


This experiment tests whether one can implement a simple application that uses Unix domain sockets for interprocess communication without implicitly using malloc. We say that something implicitly uses memory if it, or any dependent code it uses to achieve its goal, internally uses malloc, mmap, sbrk, or equivalents that request memory from the operating system.

The primary motivation behind this experiment is to answer questions raised in gallocy/issues/30, which discusses a way to cleanly design a complex memory allocator.

about

There are various ways to achieve interprocess communication: files, signals, sockets, pipes, shared memory, and more. The first question we'd like to answer is: why Unix domain sockets?

The API for Unix domain sockets is similar to that of an Internet socket, but rather than using an underlying network protocol, all communication occurs entirely within the operating system kernel.

We chose to conduct the experiment with Unix domain sockets because they are efficient, simple to reason about, and fit nicely into the client/server paradigm. The same code that we use to implement a simple HTTP server on a TCP/IP stack can be used with a Unix domain socket.

Although we've chosen to move forward with Unix domain sockets in this experiment, we can implement a similar API using shared memory and memory barriers if we find this is a performance bottleneck in the future.

control

We started with a simple socket example that implements a socket server and a socket client. The two programs are hard-wired to send and receive three strings. The two programs specify their desire to use Unix domain sockets in their call to socket by specifying the AF_UNIX domain (other domains you may already recognize are PF_INET for IPv4 protocols and PF_INET6 for IPv6 protocols).
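A condensed sketch of the client side follows. The socket path is made up for illustration and the real socket_client.c differs, but the AF_UNIX usage is the relevant part:

/* Condensed sketch of a Unix domain socket client; the path
 * "/tmp/experiment.sock" is illustrative only. */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int main(void) {
    struct sockaddr_un addr;
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);  /* domain socket */
    if (fd == -1) {
        perror("socket");
        return 1;
    }
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, "/tmp/experiment.sock", sizeof(addr.sun_path) - 1);
    if (connect(fd, (struct sockaddr *) &addr, sizeof(addr)) == -1) {
        perror("connect");
        return 1;
    }
    write(fd, "hello", 5);  /* plain system calls only */
    close(fd);
    return 0;
}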

We compile the socket server and client as stand-alone applications and run them in separate terminals. Note, you will not see output from either application until both applications are started.

First, compile and run the server:

gcc socket_server.c -o server
./server
This is the first string from the client.
This is the second string from the client.
This is the third string from the client.

Second, compile and run the client:

gcc socket_client.c -o client
./client
This is the first string from the server.
This is the second string from the server.
This is the third string from the server.

Both applications compile and execute as expected, establishing a reasonable baseline for behavior.

experiment

We include a custom malloc definition using function interposition so that we can notify ourselves via standard output if memory is implicitly allocated by these programs.

When a program that uses dynamic libraries is compiled, a list of undefined symbols is included in the binary, along with a list of libraries the program is linked with. At runtime, each symbol is resolved using the first library that provides it.

This custom definition simply prints to standard output before calling the standard library's original version of malloc, which is achieved using dlsym to find the next definition of malloc after our own.

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>
typedef void *(*malloc_function)(size_t);
void *malloc(size_t sz) {
    /* Report the allocation, then forward to the next definition of
     * malloc in the lookup order, i.e., libc's. */
    printf("custom malloc(%zu)\n", sz);
    malloc_function __libc_malloc = (malloc_function) dlsym(RTLD_NEXT, "malloc");
    return __libc_malloc(sz);
}

We compile our custom malloc code into a shared object.

gcc -fPIC -shared my_malloc.c -o libmymalloc.so -ldl

Using the LD_PRELOAD environment variable we intercept calls to malloc within an application of our choice. Note, we could also choose to link our shared library into the application, but we chose LD_PRELOAD because it doesn't require relinking the application. We chose the client arbitrarily.

LD_PRELOAD=`pwd`/libmymalloc.so ./client
custom malloc(568)
This is the first string from the server.
This is the second string from the server.
This is the third string from the server.

We see that the client is implicitly using memory, since our custom malloc implementation reports on standard output that 568 bytes have been allocated. We can use tools like gdb with a breakpoint on malloc to learn that the caller is the standard library (i.e., libc), in particular its fdopen.
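For reference, the gdb session amounts to breaking on malloc and asking for a backtrace (output omitted here; what you see depends on your libc):

gdb ./client
(gdb) break malloc
(gdb) run
(gdb) backtrace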

We verified this was the case by rewriting socket_client.c so that it does not use the standard library. This involved replacing the libc functions fdopen and fgetc with equivalent code that only uses system calls; in the examples discussed in this experiment, those system calls were socket, read, and write.
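The replacement boils down to reading from the raw file descriptor instead of wrapping it in a stdio stream. Roughly (read_char is an illustrative helper, not the actual code from socket_client.c):

/* Instead of: FILE *f = fdopen(fd, "r"); int c = fgetc(f);
 * read one byte at a time straight from the descriptor, so no stdio
 * buffer is ever malloc'd behind our back. */
#include <unistd.h>

int read_char(int fd) {
    char c;
    ssize_t n = read(fd, &c, 1);  /* raw system call, no buffering */
    return n == 1 ? (unsigned char) c : -1;
}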

After adjusting the client application so that it no longer used standard library routines that implicitly allocate memory, we saw the output revert to the control output. This indicates that malloc is not implicitly used in the application, which implies that the Unix domain sockets implementation also does not implicitly use memory.

conclusion

We can implement a simple IPC system using Unix domain sockets that does not implicitly use malloc. This is important for the discussion in gallocy/issues/30 since it implies that a multiprocess design can be written in a way that makes the problems discussed in gallocy/issues/20 and gallocy/issues/25 unnecessary to solve.