tikv / grpc-rs

The gRPC library for Rust built on C Core library and futures
Apache License 2.0
1.81k stars 253 forks source link

rare crash in pthread_detach() on dropping grpcio::client::Client #653

Open pdmorrow opened 2 months ago

pdmorrow commented 2 months ago

Describe the bug

When a grpcio::client::Client object went out of scope and each of the objects elements where dropped, I observed a crash when attempting to drop an "Environment" object:

(gdb) bt
#0  ___pthread_detach (th=139936879605440) at ./nptl/pthread_detach.c:48
#1  0x000055d69512f660 in core::ptr::drop_in_place<std::sys::pal::unix::thread::Thread> () at /usr/src/rustc-1.78.0/library/core/src/ptr/mod.rs:515
#2  core::ptr::drop_in_place<std::thread::JoinInner<()>> ()
    at /usr/src/rustc-1.78.0/library/core/src/ptr/mod.rs:515
#3  core::ptr::drop_in_place<std::thread::JoinHandle<()>> ()
    at /usr/src/rustc-1.78.0/library/core/src/ptr/mod.rs:515
#4  core::ptr::drop_in_place<[std::thread::JoinHandle<()>]> ()
    at /usr/src/rustc-1.78.0/library/core/src/ptr/mod.rs:515
#5  alloc::vec::{impl#24}::drop<std::thread::JoinHandle<()>, alloc::alloc::Global> (
    self=0x55d696f1e8a8) at /usr/src/rustc-1.78.0/library/alloc/src/vec/mod.rs:3205
#6  core::ptr::drop_in_place<alloc::vec::Vec<std::thread::JoinHandle<()>, alloc::alloc::Global>> () at /usr/src/rustc-1.78.0/library/core/src/ptr/mod.rs:515                  
#7  core::ptr::drop_in_place<grpcio::env::Environment> ()
    at /usr/src/rustc-1.78.0/library/core/src/ptr/mod.rs:515
#8  alloc::sync::Arc<grpcio::env::Environment, alloc::alloc::Global>::drop_slow<grpcio::env::Environment, alloc::alloc::Global> (self=<optimized out>)                        
    at /usr/src/rustc-1.78.0/library/alloc/src/sync.rs:1804
#9  0x000055d69512f726 in alloc::sync::{impl#33}::drop<grpcio::env::Environment, alloc::alloc::Global> (self=0x55d696f20130)                                                  
    at /usr/src/rustc-1.78.0/library/alloc/src/sync.rs:2459
#10 core::ptr::drop_in_place<alloc::sync::Arc<grpcio::env::Environment, alloc::alloc::Global>> () at /usr/src/rustc-1.78.0/library/core/src/ptr/mod.rs:515                    
#11 core::ptr::drop_in_place<grpcio::channel::ChannelInner> ()
    at /usr/src/rustc-1.78.0/library/core/src/ptr/mod.rs:515
#12 alloc::sync::Arc<grpcio::channel::ChannelInner, alloc::alloc::Global>::drop_slow<grpcio::channel::ChannelInner, alloc::alloc::Global> (self=<optimized out>)              
    at /usr/src/rustc-1.78.0/library/alloc/src/sync.rs:1804
#13 0x000055d695129a21 in alloc::sync::{impl#33}::drop<grpcio::channel::ChannelInner, alloc::alloc::Global> (self=0x7ffd4eb24cc8)                                             
    at /usr/src/rustc-1.78.0/library/alloc/src/sync.rs:2459
#14 core::ptr::drop_in_place<alloc::sync::Arc<grpcio::channel::ChannelInner, alloc::alloc::Global>> () at /usr/src/rustc-1.78.0/library/core/src/ptr/mod.rs:515               
#15 core::ptr::drop_in_place<grpcio::channel::Channel> ()
    at /usr/src/rustc-1.78.0/library/core/src/ptr/mod.rs:515
#16 core::ptr::drop_in_place<grpcio::client::Client> ()
    at /usr/src/rustc-1.78.0/library/core/src/ptr/mod.rs:515
#17 core::ptr::drop_in_place<onboarding::proto::onboarding_gw_service_grpc::OnboardingGwClient> () at /usr/src/rustc-1.78.0/library/core/src/ptr/mod.rs:515                   
#18 0x000055d69512a929 in onboarding_geolocation_client::send_geolocation_ping (
    interface_name=..., address=..., hostname=..., cert_chain_bytes=..., 
    client_key=0x7ffd4eb24c30, client_cert=<optimized out>)
    at src/bin/onboarding-geolocation-client.rs:104
#19 onboarding_geolocation_client::main ()
    at src/bin/onboarding-geolocation-client.rs:164
(gdb) 

In our application we have:

/// Construct a gRPC channel and send a geolocation message
fn send_geolocation_ping(
    interface_name: &str,
    address: SocketAddr,
    hostname: &str,
    cert_chain_bytes: &[u8],
    client_cert: &openssl::x509::X509,
    client_key: &onboarding::PrivateKey,
) -> Result<(), Error> {
    // 1. Make a gRPC channel (securely, if we have a root cert)

    let env = std::sync::Arc::new(grpcio::EnvBuilder::new().build());

    // We are connecting by IP address, but we must ensure the SSL certificate
    // contains the right hostname
    let mut builder = grpcio::ChannelBuilder::new(env).override_ssl_target(hostname);

    if !cert_chain_bytes.is_empty() {
        // Only set up an encrypted connection if we have the server-side certificates
        let creds = grpcio::ChannelCredentialsBuilder::new()
            .root_cert(cert_chain_bytes.to_owned())
            .cert(client_cert.to_pem()?, client_key.to_grpc_cert_bytes())
            .build();
        builder = builder.set_credentials(creds);
    }
    let ch = builder.connect(&address.to_string());

    // 2. Make an Onboarding Gateway Client that uses that gRPC channel

    let gw: onboarding::proto::onboarding_gw_service_grpc::OnboardingGwClient = onboarding::proto::onboarding_gw_service_grpc::OnboardingGwClient::new(ch);

    // 3. Build the request

    let mut req = onboarding::proto::onboarding_gw_messages::DiscoverGeoIPLocationRequest::new();
    req.set_interface_name(interface_name.to_owned());

    // 4. Send it (synchronously) and wait for the response

    let opt = grpcio::CallOption::default().timeout(ONBOARDING_GW_GRPC_TIMEOUT);
    let _rsp = gw
        .discover_geo_ip_location_opt(&req, opt)
        .map_err(onboarding::error::GrpcError::from)?;

    Ok(())
}

The routine is called from the processes main thread and usually the process exits having been able to perform the RPC. The Environment struct looks like:

pub struct Environment {
    cqs: Vec<CompletionQueue>,
    idx: AtomicUsize,
    _handles: Vec<JoinHandle<()>>,
}

So it seems on the "_handles" field being dropped we've ended up in this corrupted state. The crashing frame is:

(gdb) f 0
#0  ___pthread_detach (th=139936879605440) at ./nptl/pthread_detach.c:48
48      ./nptl/pthread_detach.c: No such file or directory.
(gdb) p *((struct pthread *) 139936879605440)
Cannot access memory at address 0x7f4597fff6c0
(gdb) 

To Reproduce

I've been unable to reproduce this at all sadly so understand if nothing much can be done.

Expected behavior

No crash.

System information

Additional context

Seems very rare, we run this process every 30s on 1000s of machines and have never seen this crash other than once. Thus I don't expect much to happen, perhaps this is some horrendous memory corruption.

I did wonder if it's worth joining all the thread handles in the impl of the drop trait for Environment? C.F.

https://github.com/tikv/grpc-rs/blob/246308142af9bc61dd637de088607df3856c4991/src/env.rs#L163

BusyJay commented 2 months ago

Seems more like a Rust bug to me. 🤔

How about reusing Environment like putting it to some global states? Repeatedly creating threads is not good for performance.