metalbear-co / mirrord

Connect your local process and your cloud environment, and run local code in cloud conditions.
https://mirrord.dev

Possible issues with stack #2624

Closed: netguino closed this issue 1 month ago

netguino commented 1 month ago

Bug Description

It seems that whenever we invoke mirrord exec, we run out of stack and mirrord silently disappears.

In this case, we are running Ruby 3.2.1 on Linux.

Can confirm this doesn't happen on macOS.

Can confirm 3.111.0 does not exhibit this issue.

Steps to Reproduce

  1. Update to 3.112.0
  2. Run a process that spawns lots of children with mirrord exec
  3. Observe mirrord disappear into the ether...

Backtrace

SocketError: Failed to open TCP connection to risk.default.svc.cluster.local:80 (getaddrinfo: Temporary failure in name resolution)
            test/helpers/exemplar.rb:78:in `get_request'

Relevant Logs

...
2024-07-31T18:27:51.426372Z TRACE ThreadId(01) mirrord_layer::socket::hooks: hooked "dup2"
2024-07-31T18:27:51.426381Z TRACE ThreadId(01) mirrord_layer::socket::hooks: hooked "getpeername"
2024-07-31T18:27:51.426390Z TRACE ThreadId(01) mirrord_layer::socket::hooks: hooked "getsockname"
2024-07-31T18:27:51.426398Z TRACE ThreadId(01) mirrord_layer::socket::hooks: hooked "gethostname"
2024-07-31T18:27:51.426409Z TRACE ThreadId(01) mirrord_layer::socket::hooks: hooked "accept4"
2024-07-31T18:27:51.426418Z TRACE ThreadId(01) mirrord_layer::socket::hooks: hooked "dup3"
2024-07-31T18:27:51.426425Z TRACE ThreadId(01) mirrord_layer::socket::hooks: hooked "accept"
2024-07-31T18:27:51.426437Z TRACE ThreadId(01) mirrord_layer::socket::hooks: hooked "gethostbyname"
2024-07-31T18:27:51.426445Z TRACE ThreadId(01) mirrord_layer::socket::hooks: hooked "getaddrinfo"
2024-07-31T18:27:51.426459Z TRACE ThreadId(01) mirrord_layer::socket::hooks: hooked "freeaddrinfo"
2024-07-31T18:27:51.426468Z TRACE ThreadId(01) mirrord_layer::exec_hooks::hooks: hooked "execv"
2024-07-31T18:27:51.426477Z TRACE ThreadId(01) mirrord_layer::exec_hooks::hooks: hooked "execve"
2024-07-31T18:27:51.426695Z  INFO ThreadId(01) mirrord_layer: Initializing mirrord-layer!
2024-07-31T18:27:51.426726Z DEBUG ThreadId(01) mirrord_layer: Loaded into executable executable=Some("/usr/bin/diff") args=Some(ExecuteArgs { exec_name: "diff", invoked_as: "diff", args: ["diff", "-u", "/tmp/expect20240731-1113580-sp8nfx", "/tmp/butwas20240731-1113580-fa2ke0"] }) pid=1113859 parent_pid=1113580 env_vars=Vars { 
...
("RAILS_ENV", "test"), ("RACK_ENV", "development")] }

Your operating system and version

Linux 6.8.5 kernel

Local process

bundle

Local process version

Ruby 3.2.1, Bundler 2.4.6

Additional Info

Unfortunately, there is no bypass in the socket logs:

...
2024-07-31T18:53:07.029355Z TRACE ThreadId(01) mirrord_layer::socket::ops: getaddrinfo -> result 0x00000000043c2ec0
2024-07-31T18:53:07.029449Z TRACE ThreadId(01) mirrord_layer::socket::ops: in connect LazyLock(
2024-07-31T18:53:07.029908Z TRACE ThreadId(01) mirrord_layer::socket::ops: we are connected Connected {
2024-07-31T18:53:07.029964Z TRACE ThreadId(01) mirrord_layer::socket::ops: getsockname -> local_address Ip(
2024-07-31T18:53:07.138523Z TRACE ThreadId(01) mirrord_layer::socket::ops: getaddrinfo -> result 0x0000000004458b00
2024-07-31T18:53:07.138609Z TRACE ThreadId(01) mirrord_layer::socket::ops: in connect LazyLock(
2024-07-31T18:53:07.139432Z TRACE ThreadId(01) mirrord_layer::socket::ops: we are connected Connected {
2024-07-31T18:53:07.139506Z TRACE ThreadId(01) mirrord_layer::socket::ops: getsockname -> local_address Ip(
...
netguino commented 1 month ago

A simple way to reproduce:

https://gist.github.com/netguino/f1b76a5256637379f37bbdddc4b74f45

Use these files, and run mirrord exec -- bundle exec rake test

It seems that the system call is the one causing the issue.
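To make the spawn pattern concrete: assuming the gist's test spawns its children the way Ruby's system does (a fork followed by an exec), every short-lived child re-enters the layer's execve hook. A minimal Rust sketch of the same "many short-lived children" pattern, purely illustrative and not part of the actual repro:

use std::process::Command;

fn main() {
    // Spawn many short-lived children; on Unix each spawn ends in an exec
    // in the child, which is where a preloaded layer's execve hook runs.
    for i in 0..100 {
        let status = Command::new("true")
            .status()
            .expect("failed to spawn child");
        println!("child {i} exited with {status}");
    }
}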

meowjesty commented 1 month ago

Looks like commenting out the DetourGuard stops this issue from being triggered. Using hook_guard_fn also breaks stuff (tested it for clarity of mind).

I'm not sure why this happens, though. Maybe it blows up the stack because we keep creating these guards and they stay alive forever (since a successful execve never returns)?

#[mirrord_layer_macro::hook_fn]
pub(crate) unsafe extern "C" fn execve_detour(
    path: *const c_char,
    argv: *const *const c_char,
    envp: *const *const c_char,
) -> c_int {
    use crate::{common::CheckedInto, detour::DetourGuard};

    // let _guard = DetourGuard::new();

    // Hopefully `envp` is a properly null-terminated list.
    if let Detour::Success(envp) = prepare_execve_envp(envp.checked_into()) {
        FN_EXECVE(path, argv, envp.leak())
    } else {
        FN_EXECVE(path, argv, envp)
    }
}
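
For context, DetourGuard is an RAII guard around a thread-local bypass flag. The sketch below shows the general shape of such a guard; the names and details are illustrative, not mirrord's exact internals:

use std::cell::Cell;

thread_local! {
    // While set, hooks on this thread bypass mirrord and call the libc original.
    static DETOUR_BYPASS: Cell<bool> = Cell::new(false);
}

struct DetourGuard;

impl DetourGuard {
    fn new() -> Option<Self> {
        DETOUR_BYPASS.with(|bypass| {
            if bypass.get() {
                None // already bypassing on this thread
            } else {
                bypass.set(true);
                Some(Self)
            }
        })
    }
}

impl Drop for DetourGuard {
    fn drop(&mut self) {
        // Clears the flag on the normal return path. If execve succeeds, this
        // drop never runs; whether the thread-local's own cleanup runs around
        // exec is exactly the question raised in the comments below.
        DETOUR_BYPASS.with(|bypass| bypass.set(false));
    }
}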

@Razz4780 any ideas?

aviramha commented 1 month ago

From the thread_local documentation:

/// # Platform-specific behavior
///
/// Note that a "best effort" is made to ensure that destructors for types
/// stored in thread local storage are run, but not all platforms can guarantee
/// that destructors will be run for all types in thread local storage. For
/// example, there are a number of known caveats where destructors are not run:
///
/// 1. On Unix systems when pthread-based TLS is being used, destructors will
///    not be run for TLS values on the main thread when it exits. Note that the
///    application will exit immediately after the main thread exits as well.
/// 2. On all platforms it's possible for TLS to re-initialize other TLS slots
///    during destruction. Some platforms ensure that this cannot happen
///    infinitely by preventing re-initialization of any slot that has been
///    destroyed, but not all platforms have this guard. Those platforms that do
///    not guard typically have a synthetic limit after which point no more
///    destructors are run.
/// 3. When the process exits on Windows systems, TLS destructors may only be
///    run on the thread that causes the process to exit. This is because the
///    other threads may be forcibly terminated.

I think exec might trigger calling the destructors, which clean it up. Perhaps we need to leak the value before exec?
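
For reference, a rough sketch of what leaking the guard could look like inside execve_detour, using std::mem::forget right after creating it (this matches the "guard that is immediately leaked" variant tested in the next comment; illustrative only, not a confirmed fix):

    // Create the guard as before, but leak it instead of letting it drop.
    let guard = DetourGuard::new();
    std::mem::forget(guard);

    if let Detour::Success(envp) = prepare_execve_envp(envp.checked_into()) {
        FN_EXECVE(path, argv, envp.leak())
    } else {
        FN_EXECVE(path, argv, envp)
    }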

Razz4780 commented 1 month ago

Tried to reproduce on:

  1. Linux 6.5.0
  2. Ruby 3.2.1
  3. Bundler 2.4.6

Ran tests with:

  1. No guard in execve hook - intproxy-no-guard.log
  2. Normal guard in execve hook - intproxy-normal-guard.log
  3. Guard that is immediately leaked with std::mem::forget in execve hook - intproxy-guard-forget.log

Looks like there's no difference between cases 2 and 3. DNS resolution is correctly hooked in only one of the tests (you can see one GetAddrInfoRequest for py-serv). In case 1, the process hangs, and stopping it with ctrl+c does not trigger intproxy exit (intproxy lingers until it is manually killed).