Improving Fork Performance with Zombie Pools

vahldiek commented 4 years ago

Description of the Problem

In the Linux-SGX PAL, fork is implemented by forking a new host process, creating a new SGX enclave, and restoring a memory checkpoint from the parent process/SGX enclave. As a result, applications that use the fork system call pay a much higher process-creation cost than their non-SGX counterparts. The main overhead stems from creating an SGX enclave, possibly with gigabytes of enclave memory, for every fork. Frequent forking is common among server applications such as Apache HTTP Server, nginx, or Redis.

Proposed Solution: Zombie Pools

We suggest amortizing the cost of process creation over several fork invocations. We recognize that a forked process could be reused, after it calls exit, by another fork in a different process. This would allow instantiating a new Graphene process without reinitializing the SGX enclave or creating a new host process; it only requires cleaning up and restoring a new checkpoint.

While this idea can be implemented in a general way with a global zombie pool, we think that initially it should be implemented as a per-process zombie pool. A per-process pool simplifies the implementation, requires no global coordination, and still covers the most heavily impacted workloads, such as server applications.

When Graphene starts, it creates an enclave as usual. The first time this process forks, it works as it does today: it creates a new process, creates a new enclave, and restores a checkpoint. Once the child has finished and called exit(), it does not exit the process; instead, it notifies the parent about the exit and waits for a response. On the parent side, the exit message from the child results in the zombie child being stored in a free list. This free list is consulted when a new fork occurs in the parent. Graphene then reuses the zombie child by sending it a new checkpoint, which skips both the creation of a new process via fork and the creation of a new SGX enclave.
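The parent-side free list described above could be sketched roughly as follows. This is only an illustration: `struct zombie_child`, `struct zombie_pool`, `zombie_pool_put`, and `zombie_pool_get` are hypothetical names, and a plain singly linked list is used instead of whatever list helpers the LibOS would actually use.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record for a child that has exited and is parked for reuse. */
struct zombie_child {
    int vmid;                  /* illustrative child identifier */
    struct zombie_child* next; /* next entry in the free list */
};

/* Hypothetical per-process pool: a singly linked LIFO free list. */
struct zombie_pool {
    struct zombie_child* head;
    size_t count;
};

/* Parent side: called when the child's exit notification arrives. */
static void zombie_pool_put(struct zombie_pool* pool, struct zombie_child* child) {
    assert(pool && child);
    child->next = pool->head;
    pool->head = child;
    pool->count++;
}

/* Parent side: called on fork. Returns a parked child to reuse, or NULL if
 * the pool is empty and a new process and enclave must be created. */
static struct zombie_child* zombie_pool_get(struct zombie_pool* pool) {
    assert(pool);
    struct zombie_child* child = pool->head;
    if (!child)
        return NULL;
    pool->head = child->next;
    child->next = NULL;
    pool->count--;
    return child;
}
```

The sketch leaves out locking; a real pool would presumably need to be protected against concurrent forks and exit notifications.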

We assume that the child exited with a successful exit code. In addition, this only works for fork; we do not consider exec, since exec loads a different manifest with a different layout and MRENCLAVE. Using it for exec is possible but requires additional considerations, such as one zombie pool per manifest. Also, once a parent exits, it tells all of its children to exit. This limits the length of zombie pool chains to a single child. While this limits the applicability, we think it is important not to leave excessive amounts of resources unused. We therefore suggest the following lifecycle for processes:

- Normal mode: Process started as before
  - Transition to zombie mode on exit
- Zombie mode: Process exited and waits for a message from the parent
  - Transition to die on exit, if parent doesn't exist
  - Transition to normal mode, if parent sends new checkpoint
  - Transition to die, if parent sends exit message
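The lifecycle can be read as a small state machine. Below is a minimal sketch under one assumption that the list does not state explicitly: a process without a parent dies on exit instead of parking itself as a zombie. All names (`proc_state`, `proc_event`, `next_state`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum proc_state { PROC_NORMAL, PROC_ZOMBIE, PROC_DEAD };

enum proc_event {
    EV_EXIT,              /* the process called exit() */
    EV_PARENT_CHECKPOINT, /* the parent sent a new checkpoint */
    EV_PARENT_EXIT_MSG,   /* the parent sent an exit message */
};

/* Returns the state that follows `cur` after `ev`. `has_parent` is only
 * consulted for EV_EXIT (assumption: no parent means nobody can reuse us). */
static enum proc_state next_state(enum proc_state cur, enum proc_event ev, bool has_parent) {
    switch (cur) {
        case PROC_NORMAL:
            if (ev == EV_EXIT)
                return has_parent ? PROC_ZOMBIE : PROC_DEAD;
            return PROC_NORMAL;
        case PROC_ZOMBIE:
            if (ev == EV_PARENT_CHECKPOINT)
                return PROC_NORMAL; /* reused for a new fork */
            if (ev == EV_PARENT_EXIT_MSG)
                return PROC_DEAD;
            return PROC_ZOMBIE;     /* keep waiting for the parent */
        case PROC_DEAD:
            break;
    }
    assert(cur == PROC_DEAD); /* dead is terminal; anything else is a bug */
    return PROC_DEAD;
}
```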

Implementation Details

We suggest implementing this in the library OS layer, so that the optimization is available to all PALs and can improve their fork performance. We structure the work into four main tasks and briefly describe a possible implementation of each.

Keep list of Zombies
- Define list of Zombies in shim_process.h in struct shim_process
- Intercept exit message of child (shim_ipc_child.c in function ipc_cld_exit_callback)
  - Add this child to the zombie list
  - May consider changing the message to say that the child is going into zombie mode, instead of reusing the exit callback (to differentiate between error and normal exit)
  - Currently the message includes the exit code and the term signal (may not be necessary)
Don't exit, go to zombie mode
- Intercept exit of a process
  - shim_exit.c (libos_exit and libos_clean_and_exit)
- Kill all children (if exist)
  - Send term message to zombie children
  - New message
- Keep IPC to parent
  - Split del_all_ipc_ports implementation into parent and all other IPC
- May need PAL cleanup of state
  - PAL objects may require cleanup
- Wait on IPC to parent
  - In libos_clean_and_exit wait for parent message to either terminate or restart process
Create child from zombie
- Intercept fork and checkpoint restore (shim_checkpoint.c in create_process_and_send_checkpoint)
- Check that call is for fork and not exec (argument exec is not set)
- Find zombie from zombie list
- If zombie is available, checkpoint and restore
- Otherwise create new process and go through the normal creation
Manifest option for fork pooling
- Define libos.fork_pooling = 0/1
- All implementation should only be enabled when libos.fork_pooling = 1
- Define global variable in shim_init.c and set it in shim_init.c (~ line 500)
  - Use toml_int_in to extract the integer value of libos.fork_pooling
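The decision in the "Create child from zombie" task, gated by the proposed option, might look roughly like the sketch below. `g_fork_pooling` stands in for the global that would be set from `libos.fork_pooling`; the struct, enum, and function names are hypothetical, and the actual manifest parsing is not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum spawn_path { SPAWN_NEW_ENCLAVE, SPAWN_REUSE_ZOMBIE };

/* Stand-in for the global set from the proposed libos.fork_pooling option. */
static bool g_fork_pooling = false;

/* Hypothetical summary of what the fork/exec path knows at this point. */
struct spawn_request {
    bool is_exec;             /* true if the call is for exec, not fork */
    size_t zombies_available; /* entries currently in the zombie list */
};

/* Picks between reusing a zombie and the normal creation path. */
static enum spawn_path choose_spawn_path(const struct spawn_request* req) {
    assert(req);
    if (!g_fork_pooling)
        return SPAWN_NEW_ENCLAVE; /* feature disabled in the manifest */
    if (req->is_exec)
        return SPAWN_NEW_ENCLAVE; /* exec is out of scope for the pool */
    if (req->zombies_available == 0)
        return SPAWN_NEW_ENCLAVE; /* nothing to reuse */
    return SPAWN_REUSE_ZOMBIE;    /* checkpoint into a parked child */
}
```

Keeping the check in one place means the normal creation path stays the default, and the pool is only used when every condition holds.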

What does this not solve?

The described approach and its suggested implementation have two limitations. First, it does not support exec, which is common in applications that rely on the libc system function to spawn new shell executions. Second, it does not allow chains of zombie pools. As a result, the particular case where an application executes sh -c ldconfig in a new process is not sped up (only the first invocation of sh may use a zombie from a pool; the subsequent fork into ldconfig has no zombie). While we think these cases are common, they typically occur a few times at the beginning of an application, whereas forking can occur throughout the application's lifetime. In addition, the approach could eventually be extended to cover these cases and further improve the performance of more use cases.

We would like to solicit your feedback on the proposal.

dimakuv commented 4 years ago

Thanks, Anjo.

This helps greatly for applications that frequently fork children at runtime, e.g., web/database applications that fork a child for every client connection (PostgreSQL). On one such application, we observe that a typical run (with ~500 forks) takes 3 hours instead of 5 minutes (36x runtime overhead) due to enclave creation on every fork.

yamahata commented 4 years ago

When recycling a zombie, its state needs to be re-initialized before it receives a checkpoint, i.e. it must be brought into a known (initial) state. There are several ways to do this.

For memory, one simple way is to stash the original image of PAL and LibOS (and the app binary image) in a reserved area as read-only and copy it into the actual area. If we can trust the PAL and LibOS files (e.g. by checking a hash value), re-reading them into memory is another option. This implies that some small executable is needed, in addition to PAL and LibOS, to handle it. Another approach is to make LibOS release all the unused memory in shim_do_exit() (or on reinitialization). I'm not sure how hard that would be without auditing the code with this in mind.
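The first option (copying a stashed read-only image back over the live one) could be sketched like this. `struct image_region` and `reset_region` are hypothetical names, and the sketch only shows the copy itself, not how the pristine copy is reserved or protected.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor for one region that must be reset before reuse. */
struct image_region {
    const unsigned char* pristine; /* stashed, read-only original image */
    unsigned char* live;           /* region the process actually runs from */
    size_t size;                   /* size of both buffers in bytes */
};

/* Brings the live region back to its initial contents. */
static void reset_region(const struct image_region* r) {
    assert(r && r->pristine && r->live);
    memcpy(r->live, r->pristine, r->size);
}
```

Memory that was allocated at run time (for example, heap regions) is not covered by such a copy and would still need to be released or cleared separately.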

For other resources, e.g. open files, they need to be released correctly on exit. LibOS already tracks them to some extent anyway.

Once re-initialization is implemented and the hash value of the executable is known, the zombie approach could also be applied to the exec case.

Some random thoughts:

- In the Pal/Linux-SGX case, shared libraries that are known to be loaded (or any files known to be read) can also be loaded into memory up front when the enclave is built. The ocalls to read those shared libraries can then be eliminated. This also helps to shorten normal startup time. In any case, how much ocalls slow down fork/startup should be measured.
- Do we want to control the total number of zombies? Given the above example, that would be premature optimization. A simple timeout to kill unused zombies would be enough at first.
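A simple idle timeout could be as small as the sketch below, which the parent might run periodically. Everything here is hypothetical (`struct parked_zombie`, `mark_expired_zombies`, microsecond timestamps from some monotonic clock); sending the actual exit message to marked zombies is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping entry for one zombie in the parent's pool. */
struct parked_zombie {
    uint64_t parked_at_us; /* time the child entered zombie mode */
    bool terminate;        /* set once the parent should tell it to exit */
};

/* Marks every zombie that has been idle for at least timeout_us and
 * returns how many were newly marked. */
static size_t mark_expired_zombies(struct parked_zombie* zombies, size_t n,
                                   uint64_t now_us, uint64_t timeout_us) {
    assert(zombies || n == 0);
    size_t marked = 0;
    for (size_t i = 0; i < n; i++) {
        if (zombies[i].terminate)
            continue; /* already scheduled for termination */
        /* Guard against now_us < parked_at_us to avoid unsigned wraparound. */
        if (now_us >= zombies[i].parked_at_us &&
            now_us - zombies[i].parked_at_us >= timeout_us) {
            zombies[i].terminate = true;
            marked++;
        }
    }
    return marked;
}
```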

mkow commented 3 years ago

I think we'll have to hold off on implementing this until we rewrite IPC (#2107).
