shadow / shadow

Shadow is a discrete-event network simulator that directly executes real application code, enabling you to simulate distributed systems with thousands of network-connected processes in realistic and scalable private network experiments using your laptop, desktop, or server running Linux.
https://shadow.github.io

Support subprocess creation and management (was "fork and exec") #1987

Closed sporksmith closed 12 months ago

sporksmith commented 2 years ago

Supporting fork and exec would make it easier to delegate complexity to wrapper shell or python scripts instead of adding more features to Shadow itself.

e.g. rather than Shadow natively supporting compression (#1554), a user could use a config like:

- path: /bin/bash
  args: -c "tor | gzip -"
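
For reference, the shell in this config has to fork two children and connect them with a pipe, which is the behavior Shadow would need to support. A minimal Python sketch of the same plumbing follows. The Python interpreter stands in for both `tor` and `gzip`, and all names here are illustrative assumptions, not part of Shadow.

```python
import gzip
import subprocess
import sys


def run_pipeline(producer_argv, consumer_argv):
    """Run `producer | consumer` and return the consumer's stdout as bytes."""
    producer = subprocess.Popen(producer_argv, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(
        consumer_argv, stdin=producer.stdout, stdout=subprocess.PIPE
    )
    # Close the parent's copy of the read end so only the consumer holds it.
    producer.stdout.close()
    output, _ = consumer.communicate()
    producer.wait()
    return output


# Stand-ins (assumptions for illustration): a one-line "log" producer and a
# gzip-like filter that compresses stdin to stdout.
FAKE_TOR = [sys.executable, "-c", "print('log line')"]
FAKE_GZIP = [
    sys.executable,
    "-c",
    "import gzip, sys; "
    "sys.stdout.buffer.write(gzip.compress(sys.stdin.buffer.read()))",
]


def demo():
    """Compress the producer's output through the pipe, then decompress it."""
    return gzip.decompress(run_pipeline(FAKE_TOR, FAKE_GZIP))
```

With real binaries the call would look like `run_pipeline(["tor"], ["gzip", "-"])`.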

This could also be used to address clean shutdown of processes (#1491) with something like:

hostname:
  processes:
  - path: /bin/bash
    args: -c "tor -f torrc & PID=$! ; sleep 100 && kill $PID"
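
The same "run for a while, then shut down cleanly" pattern could also live in a Python wrapper instead of a bash one-liner. This is a sketch under assumptions: `run_then_terminate` and `STAND_IN` are hypothetical names, and a sleeping Python process stands in for `tor`. Unlike the shell version, it stops waiting early if the child exits on its own.

```python
import signal
import subprocess
import sys


def run_then_terminate(argv, run_seconds):
    """Start argv, let it run for run_seconds, then send SIGTERM.

    Returns the child's return code (negative signal number if it was
    terminated by a signal).
    """
    child = subprocess.Popen(argv)
    try:
        # Returns early if the child exits before the deadline.
        child.wait(timeout=run_seconds)
    except subprocess.TimeoutExpired:
        # SIGTERM is also what the shell's `kill $PID` sends by default.
        child.send_signal(signal.SIGTERM)
        child.wait()
    return child.returncode


# Stand-in for a long-running process such as tor (assumption for illustration).
STAND_IN = [sys.executable, "-c", "import time; time.sleep(30)"]
```

For the config above, the equivalent call would be roughly `run_then_terminate(["tor", "-f", "torrc"], 100)`.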

Using killall (#1986) might be a better alternative for this particular case, but a shell script would allow greater flexibility.

The biggest potential blocker right now is that not all file descriptors support duplication yet. However, unix pipes and regular files are probably sufficient to cover a lot of use cases. In the meantime we can log a warning for any file descriptors we can't duplicate into the child.

stevenengler commented 2 years ago

> The biggest potential blocker right now is that not all file descriptors support duplication yet. However, unix pipes and regular files are probably sufficient to cover a lot of use cases. In the meantime we can log a warning for any file descriptors we can't duplicate into the child.

We now support duplicating all descriptor types. There is still an existing issue with TCP sockets that would prevent accept() calls on duplicated listening sockets from working correctly in a forked process.
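
From the application's side, the case at risk looks like this: a listening socket is created before `fork()`, and the child calls `accept()` on its inherited copy. The sketch below shows the pattern on native Linux. It does not say how Shadow behaves, and the function name is an assumption for illustration.

```python
import os
import socket


def accept_in_forked_child(message=b"hello from child"):
    """Fork after listen(); the child accepts on the inherited listening socket."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    pid = os.fork()
    if pid == 0:
        # Child: uses the duplicated descriptor it inherited across fork().
        status = 1
        try:
            conn, _ = listener.accept()
            conn.sendall(message)
            conn.close()
            status = 0
        finally:
            os._exit(status)

    # Parent: drop its own copy; the child's copy keeps the socket listening.
    listener.close()
    received = b""
    with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
        while True:
            chunk = client.recv(1024)
            if not chunk:
                break
            received += chunk
    _, wait_status = os.waitpid(pid, 0)
    return received, os.WEXITSTATUS(wait_status)
```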

robgjansen commented 2 years ago

FYI: Someone at USENIX ATC'22 told me that support for fork would unlock a lot of value for their use cases.

stevenengler commented 2 years ago

We may also need to consider futexes shared between processes:

https://github.com/shadow/shadow/blob/ee157d00e39c2c7b1c6a7a3467d73f3bb1b2a49a/src/main/host/syscall/futex.c#L143-L147
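
The linked lines are not reproduced here. As background, this is what a process-shared futex looks like from the application's side: a 32-bit word in a `MAP_SHARED` mapping, with one process waiting and another waking it. It is a sketch for native Linux that assumes the raw syscall numbers for x86_64 and aarch64 only. The helper names are hypothetical.

```python
import ctypes
import mmap
import os
import platform

# Assumption: only these two architectures are covered by this sketch.
SYS_FUTEX = {"x86_64": 202, "aarch64": 98}[platform.machine()]
FUTEX_WAIT, FUTEX_WAKE = 0, 1  # no FUTEX_PRIVATE_FLAG: usable across processes

_libc = ctypes.CDLL(None, use_errno=True)
_libc.syscall.restype = ctypes.c_long


class Timespec(ctypes.Structure):
    _fields_ = [("tv_sec", ctypes.c_long), ("tv_nsec", ctypes.c_long)]


def futex(addr, op, val, timeout=None):
    """Thin wrapper around the raw futex syscall."""
    ts = ctypes.byref(timeout) if timeout is not None else None
    return _libc.syscall(
        ctypes.c_long(SYS_FUTEX), ctypes.c_void_p(addr),
        ctypes.c_int(op), ctypes.c_uint32(val), ts, None, ctypes.c_int(0),
    )


def cross_process_wake():
    """Parent waits on a futex word that a forked child sets and wakes."""
    page = mmap.mmap(-1, mmap.PAGESIZE)  # anonymous shared mapping
    word = ctypes.c_uint32.from_buffer(page)
    addr = ctypes.addressof(word)

    pid = os.fork()
    if pid == 0:
        try:
            word.value = 1
            futex(addr, FUTEX_WAKE, 1)
        finally:
            os._exit(0)

    # Bounded retries so a failed child cannot hang the parent forever.
    for _ in range(3):
        if word.value != 0:
            break
        # Returns immediately if the word is no longer 0, else sleeps up to 5 s.
        futex(addr, FUTEX_WAIT, 0, Timespec(5, 0))
    os.waitpid(pid, 0)
    return int(word.value)
```

Because both processes refer to the same underlying memory through their own mappings, an emulated futex table has to match waiters and wakers across process boundaries, not only within one process.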

sporksmith commented 1 year ago

Another motivating example: arti's way of using the obfs4 pluggable transport currently involves forking and execing an obfs4 process.

Investigating whether it makes sense to add an alternative mechanism to arti that would allow it to use an independently started process...

trinity-1686a commented 1 year ago

fwiw, using fork/exec to start a pluggable transport such as obfs4 is not an arti thing, it also concerns tor, and is actually a requirement from the PT spec.

The parent (arti/tor/...) also wants access to PT stdout, so such a solution based on independently started process would probably require some form of UDS/named pipes
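
To illustrate why the parent wants the child's stdout: in the managed-transport convention, as I understand the PT spec, the child reports where it is listening by printing lines such as `CMETHOD <name> <protocol> <addr:port>` followed by `CMETHODS DONE`. The sketch below uses a fake PT written as a Python one-liner. The port, the function name, and the reduced environment are assumptions for illustration, and a real PT expects more environment variables than shown.

```python
import os
import subprocess
import sys

# Fake pluggable transport (assumption): prints a fixed handshake and exits.
FAKE_PT = [
    sys.executable,
    "-c",
    "print('VERSION 1'); "
    "print('CMETHOD obfs4 socks5 127.0.0.1:40000'); "
    "print('CMETHODS DONE')",
]


def launch_pt_and_read_methods(argv):
    """Spawn a PT-like child and parse the client methods it reports on stdout.

    Returns (child, methods) where methods maps transport name to
    (protocol, address). The caller is responsible for the child afterwards.
    """
    env = dict(
        os.environ,
        TOR_PT_MANAGED_TRANSPORT_VER="1",
        TOR_PT_CLIENT_TRANSPORTS="obfs4",
    )
    child = subprocess.Popen(argv, stdout=subprocess.PIPE, text=True, env=env)
    methods = {}
    for line in child.stdout:
        fields = line.split()
        if len(fields) >= 4 and fields[0] == "CMETHOD":
            methods[fields[1]] = (fields[2], fields[3])
        elif fields == ["CMETHODS", "DONE"]:
            break
    return child, methods
```

An independently started PT would need some other channel, such as a named pipe or a fixed port in the config, to convey the same information.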

sporksmith commented 1 year ago

> fwiw, using fork/exec to start a pluggable transport such as obfs4 is not an arti thing, it also concerns tor, and is actually a requirement from the PT spec.
>
> The parent (arti/tor/...) also wants access to PT stdout, so such a solution based on independently started process would probably require some form of UDS/named pipes

@trinity-1686a In practice tor definitely supports a separately started proxy. e.g. the ClientTransportPlugin config param optionally accepts a socks IP and port instead of a binary to exec.

From my quick, possibly incorrect, read, I think the spec is intended to allow this, and is just made a little confusing here by trying to be both general and concise. I think the "parent process" that initially forks the PT process could be a shell rather than tor or arti themselves. The rest of the spec seems to indicate that all communication between tor/arti and the PT would be over the socks connection, with some configuration passed around in environment variables. (e.g. nothing about the client needing to capture stdin/stdout or use a named pipe etc)

sporksmith commented 1 year ago

Starting to look at this now. My plan is roughly:

A couple other thoughts:

sporksmith commented 1 year ago

Thinking a bit about how seccomp filters are going to work after exec.

Currently the filter is created and loaded in the LD_PRELOAD'd shim during initialization. It assumes that the SIGSYS signal handler has already been installed (earlier in shim initialization), which is where we route syscalls we want to emulate. We allow native syscalls from the shim itself by inspecting the instruction pointer of the call site and seeing if it's in one of the shim's functions.

Both of these will go wrong if we allow the managed process itself to exec. A syscall could be made before we've had a chance to reinstall the SIGSYS handler, which would result in a crash. The shim may also be loaded at a different address, so its native syscall functions would no longer be correctly allow-listed (and whatever code now occupies the old address range would be allow-listed instead).
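
The stale allow-list problem can be shown with a toy model. This is not a real seccomp filter: it only mimics the decision "allow if the instruction pointer falls inside the shim's address range, otherwise trap". The addresses and the shim size are made-up values.

```python
SHIM_SIZE = 0x2000  # hypothetical size of the shim's code region


def make_filter(shim_base):
    """Model of a filter whose allowed address range is fixed at install time."""
    low, high = shim_base, shim_base + SHIM_SIZE

    def verdict(instruction_pointer):
        # "TRAP" models delivering SIGSYS so the syscall can be emulated.
        return "ALLOW" if low <= instruction_pointer < high else "TRAP"

    return verdict


def demo():
    old_base = 0x7F0000000000  # where the shim was loaded before exec
    new_base = 0x7F5500000000  # where it happens to be loaded after exec
    verdict = make_filter(old_base)  # installed once; survives exec unchanged
    return {
        # The shim's own syscalls after exec are no longer recognized.
        "shim_after_exec": verdict(new_base + 0x10),
        # Whatever code now sits at the old address is allowed instead.
        "code_at_old_address": verdict(old_base + 0x10),
    }
```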

We can't uninstall the filter before doing the exec - that's impossible by design.

I don't think there's a feasible way to have some sort of "shadow cookie" that the seccomp filter recognizes. The filter is BPF (not eBPF), and only gets read-only access to the instruction pointer and syscall number and args. If we stored a cookie in, for example, the high bits of the syscall number, we wouldn't be able to clear the cookie before allowing the syscall, so Linux would reject it as an invalid syscall number.

We can't access the program's other registers or memory, so e.g. storing a cookie on the stack also won't work.
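
The syscall-number variant can be checked directly on native Linux: a syscall number with extra high bits set is simply an unknown syscall to the kernel. The sketch below assumes a made-up cookie value and a hypothetical helper name. The failure is typically reported as `ENOSYS`, although a sandbox may report a different error.

```python
import ctypes

_libc = ctypes.CDLL(None, use_errno=True)
_libc.syscall.restype = ctypes.c_long

# Hypothetical cookie stored in high bits of the syscall number.
COOKIE = 0x0ABC << 16


def syscall_with_cookie(number):
    """Issue a raw syscall whose number carries the cookie; return (ret, errno)."""
    ctypes.set_errno(0)
    ret = _libc.syscall(ctypes.c_long(number | COOKIE))
    return ret, ctypes.get_errno()
```

Since the filter can only return a verdict and cannot rewrite the number, an "allowed" call would still reach the kernel with the cookie attached and fail this way.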

It might be possible to replace our seccomp usage with eBPF, which has richer functionality, but using eBPF typically requires root access.

The best solution I can think of is to handle exec by killing the native process and spawning a fresh process from shadow's process. We'll need to be careful to migrate over simulated state that is preserved across exec, such as file descriptors.
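
As a rough model of that idea (not Shadow's implementation; every class and field name here is hypothetical), the simulator keeps the simulated process identity and descriptor table, replaces only the native process, and drops descriptors marked close-on-exec, which is what exec does natively:

```python
import subprocess
import sys
from dataclasses import dataclass, field


@dataclass
class SimDescriptor:
    name: str
    cloexec: bool = False  # models FD_CLOEXEC


@dataclass
class SimProcess:
    sim_pid: int                 # simulated pid, unchanged across exec
    native: subprocess.Popen     # native process backing the simulated one
    fds: dict = field(default_factory=dict)


def emulate_exec(proc, argv):
    """Replace the native process while keeping the simulated state."""
    proc.native.kill()
    proc.native.wait()
    proc.native = subprocess.Popen(argv)
    # Descriptors without close-on-exec survive; the rest are dropped.
    proc.fds = {n: d for n, d in proc.fds.items() if not d.cloexec}
    return proc


def demo():
    idle = [sys.executable, "-c", "import time; time.sleep(30)"]
    proc = SimProcess(
        sim_pid=1000,
        native=subprocess.Popen(idle),
        fds={
            0: SimDescriptor("stdin"),
            3: SimDescriptor("pipe"),
            4: SimDescriptor("log", cloexec=True),
        },
    )
    emulate_exec(proc, idle)
    result = (proc.sim_pid, sorted(proc.fds))
    proc.native.kill()
    proc.native.wait()
    return result
```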

sporksmith commented 12 months ago

Closing this, since the MVP is done. Leaving open the execveat and vfork issues, and the "support subprocess creation" milestone, to track further enhancements and fixes.