Open sholsapp opened 8 years ago
I have been thinking about this design for the past couple of days and I couldn't think of any major problems besides what you mentioned at the end of your post.
First order of business would be an investigation into if IPC uses the allocator, right?
My gut reaction is that an IPC uses an allocator.
Also, I would imagine some extra work would need to go into the daemon to be able to handle multiple applications talking to it.
> First order of business would be an investigation into if IPC uses the allocator, right? My gut reaction is that an IPC uses an allocator.
Probably, yes, but because it's a low level system call I think that all of the memory it may use is in the kernel. Here's to hoping. I'll find out when I run the experiment.
I wrote this up in blog form the other day when I conducted an experiment for myself... here it is... TL;DR: we can write IPC code such that it doesn't internally allocate memory, as long as we avoid libc.
This experiment tests whether one can implement a simple application that uses Unix domain sockets for interprocess communication without implicitly using `malloc`. We say that something implicitly uses memory if it, or any dependent code it uses to achieve its goal, internally uses `malloc`, `mmap`, `sbrk`, or an equivalent that requests memory from the operating system.
The primary motivation behind this experiment is to answer questions raised in gallocy/issues/30, which discusses a way to cleanly design a complex memory allocator.
There are various ways to achieve interprocess communication: files, signals, sockets, pipes, shared memory, and more. The first question we'd like to answer is: why Unix domain sockets?
The API for Unix domain sockets is similar to that of an Internet socket, but rather than using an underlying network protocol, all communication occurs entirely within the operating system kernel.
We chose to conduct the experiment with Unix domain sockets because they are efficient, simple to reason about, and fit nicely into the client/server paradigm. The same code that we use to implement a simple HTTP server on a TCP/IP stack can be used with a Unix domain socket.
Although we've chosen to move forward with Unix domain sockets in this experiment, we can implement a similar API using shared memory and memory barriers if we find this is a performance bottleneck in the future.
We started with a simple socket example that implements a socket server and a socket client. The two programs are hard-wired to send and receive three strings. Each program requests a Unix domain socket in its call to `socket` by specifying the `AF_UNIX` domain (other domains you may already recognize are `PF_INET` for IPv4 protocols and `PF_INET6` for IPv6 protocols).
We compile the socket server and client as standalone applications and run them in separate terminals. Note that you will not see output from either application until both applications have started.
First, compile and run the server:
gcc socket_server.c -o server
./server
This is the first string from the client.
This is the second string from the client.
This is the third string from the client.
Second, compile and run the client:
gcc socket_client.c -o client
./client
This is the first string from the server.
This is the second string from the server.
This is the third string from the server.
We know that both applications compile and execute as expected and therefore establish a reasonable baseline for behavior.
We include a custom `malloc` definition using function interposition so that we can notify ourselves via standard output if memory is implicitly allocated by these programs.
When a program that uses dynamic libraries is compiled, a list of undefined symbols is included in the binary, along with a list of libraries the program is linked with. At runtime, each symbol is resolved using the first library that provides it.
This custom definition simply prints to standard output before calling the standard library's original version of `malloc`, which is achieved by using `dlsym` to find the next definition of `malloc` after our own.
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>
typedef void *(*malloc_function)(size_t);

void *malloc(size_t sz) {
  printf("custom malloc(%zu)\n", sz);  /* %zu: sz is a size_t */
  malloc_function __libc_malloc = (malloc_function) dlsym(RTLD_NEXT, "malloc");
  return __libc_malloc(sz);
}
We compile our custom `malloc` code into a shared object.
gcc -fPIC -shared my_malloc.c -o libmymalloc.so
Using the `LD_PRELOAD` environment variable, we intercept calls to `malloc` within an application of our choice. Note that we could also link our shared library into the application, but we chose `LD_PRELOAD` because it doesn't require relinking the application. We chose the client arbitrarily.
LD_PRELOAD=`pwd`/libmymalloc.so ./client
custom malloc(568)
This is the first string from the server.
This is the second string from the server.
This is the third string from the server.
We see that the client is implicitly using memory, since our custom `malloc` implementation reports on standard output that 568 bytes have been allocated. Using tools like gdb with a breakpoint on `malloc`, we can learn that the caller is the standard library (i.e., libc), in particular its calls from `fdopen`.
We verified this was the case by rewriting socket_client.c so that it does not use the standard library. This involved replacing the libc functions `fdopen` and `fgetc` with equivalent code that uses only system calls; in the examples discussed in this experiment, those system calls were `socket`, `read`, and `write`.
After adjusting the client application so that it no longer used standard library routines that implicitly allocate memory, we saw the output revert to the control output. This indicates that `malloc` is not implicitly used in the application, which implies that the Unix domain sockets implementation also does not implicitly use memory.
We can implement a simple IPC system using Unix domain sockets that does not implicitly use `malloc`. This is important for the discussion in gallocy/issues/30, since it implies that a multiprocess design can be written in a way that makes the problems discussed in gallocy/issues/20 and gallocy/issues/25 unnecessary to solve.
As you know, gallocy is trying to provide a "transparent" or "implicit" interface to the application. This necessarily means that we interpose replacement memory and threading interfaces into the process. This is extremely difficult to do correctly, and in some cases we've found it impossible (e.g., in libc's case). The root problem is that we need to use the system allocator/threads at the same time we're replacing them! This turns into a never-ending war against the standard library, where gallocy is constantly trying to hide its existence from the actual running process, but periodically corrupting heaps in doing so.
One way I think we can get around this is by moving as much code as possible out of the runtime API and into a separate process. This separate process is a standard C++ application: it can use libraries, use the system allocator, etc., and its state is truly local to that node. After all, it's a heavy process. This daemon would be responsible for maintaining the distributed vmm, consensus, networking, etc., and would participate in no function interposition black magic. It would also expose an explicit interface as a library: think gallocy_malloc, gallocy_free, gallocy_pthread_create, gallocy_pthread_join, etc. We would have some serious freedom and rule out entire classes of potential errors.
At this point the runtime API is just a library that does very little more than signal handling and function interposition. It still conducts black magic, but it does so without the worry that it is going to screw up its internal state (the primary reason why we're maintaining two allocators, custom types, custom threading symbols, and more). It would use the explicit interface by implementing a custom IPC protocol that we would need to develop.
This decision would make a few show-stopping problems, like the libc issue, tractable: this design allows us to maintain a single system allocator, so no memory allocation mismatch is possible. As long as we can implement the runtime interface such that it doesn't use the allocator (avoiding the infinite loop), we simply add an IPC sync step to every allocation or fault. That sounds possible.
This design would add roughly 5 microseconds to any transaction against the daemon if we chose a fast IPC mechanism like domain sockets or shared memory. I got these numbers by running a few tests using https://github.com/rigtorp/ipc-bench.
Am I missing anything super major, or would this design work and make implementing the system substantially easier (and a lot, lot, lot safer)? First order of business would be an investigation into if IPC uses the allocator, right? I can put together a few experiments.
Let's think of potential problems with this design and record them here.