When a grpcio::client::Client object went out of scope and each of the objects elements where dropped, I observed a crash when attempting to drop an "Environment" object:
(gdb) bt
#0 ___pthread_detach (th=139936879605440) at ./nptl/pthread_detach.c:48
#1 0x000055d69512f660 in core::ptr::drop_in_place<std::sys::pal::unix::thread::Thread> () at /usr/src/rustc-1.78.0/library/core/src/ptr/mod.rs:515
#2 core::ptr::drop_in_place<std::thread::JoinInner<()>> ()
at /usr/src/rustc-1.78.0/library/core/src/ptr/mod.rs:515
#3 core::ptr::drop_in_place<std::thread::JoinHandle<()>> ()
at /usr/src/rustc-1.78.0/library/core/src/ptr/mod.rs:515
#4 core::ptr::drop_in_place<[std::thread::JoinHandle<()>]> ()
at /usr/src/rustc-1.78.0/library/core/src/ptr/mod.rs:515
#5 alloc::vec::{impl#24}::drop<std::thread::JoinHandle<()>, alloc::alloc::Global> (
self=0x55d696f1e8a8) at /usr/src/rustc-1.78.0/library/alloc/src/vec/mod.rs:3205
#6 core::ptr::drop_in_place<alloc::vec::Vec<std::thread::JoinHandle<()>, alloc::alloc::Global>> () at /usr/src/rustc-1.78.0/library/core/src/ptr/mod.rs:515
#7 core::ptr::drop_in_place<grpcio::env::Environment> ()
at /usr/src/rustc-1.78.0/library/core/src/ptr/mod.rs:515
#8 alloc::sync::Arc<grpcio::env::Environment, alloc::alloc::Global>::drop_slow<grpcio::env::Environment, alloc::alloc::Global> (self=<optimized out>)
at /usr/src/rustc-1.78.0/library/alloc/src/sync.rs:1804
#9 0x000055d69512f726 in alloc::sync::{impl#33}::drop<grpcio::env::Environment, alloc::alloc::Global> (self=0x55d696f20130)
at /usr/src/rustc-1.78.0/library/alloc/src/sync.rs:2459
#10 core::ptr::drop_in_place<alloc::sync::Arc<grpcio::env::Environment, alloc::alloc::Global>> () at /usr/src/rustc-1.78.0/library/core/src/ptr/mod.rs:515
#11 core::ptr::drop_in_place<grpcio::channel::ChannelInner> ()
at /usr/src/rustc-1.78.0/library/core/src/ptr/mod.rs:515
#12 alloc::sync::Arc<grpcio::channel::ChannelInner, alloc::alloc::Global>::drop_slow<grpcio::channel::ChannelInner, alloc::alloc::Global> (self=<optimized out>)
at /usr/src/rustc-1.78.0/library/alloc/src/sync.rs:1804
#13 0x000055d695129a21 in alloc::sync::{impl#33}::drop<grpcio::channel::ChannelInner, alloc::alloc::Global> (self=0x7ffd4eb24cc8)
at /usr/src/rustc-1.78.0/library/alloc/src/sync.rs:2459
#14 core::ptr::drop_in_place<alloc::sync::Arc<grpcio::channel::ChannelInner, alloc::alloc::Global>> () at /usr/src/rustc-1.78.0/library/core/src/ptr/mod.rs:515
#15 core::ptr::drop_in_place<grpcio::channel::Channel> ()
at /usr/src/rustc-1.78.0/library/core/src/ptr/mod.rs:515
#16 core::ptr::drop_in_place<grpcio::client::Client> ()
at /usr/src/rustc-1.78.0/library/core/src/ptr/mod.rs:515
#17 core::ptr::drop_in_place<onboarding::proto::onboarding_gw_service_grpc::OnboardingGwClient> () at /usr/src/rustc-1.78.0/library/core/src/ptr/mod.rs:515
#18 0x000055d69512a929 in onboarding_geolocation_client::send_geolocation_ping (
interface_name=..., address=..., hostname=..., cert_chain_bytes=...,
client_key=0x7ffd4eb24c30, client_cert=<optimized out>)
at src/bin/onboarding-geolocation-client.rs:104
#19 onboarding_geolocation_client::main ()
at src/bin/onboarding-geolocation-client.rs:164
(gdb)
In our application we have:
/// Construct a gRPC channel and send a geolocation message
fn send_geolocation_ping(
interface_name: &str,
address: SocketAddr,
hostname: &str,
cert_chain_bytes: &[u8],
client_cert: &openssl::x509::X509,
client_key: &onboarding::PrivateKey,
) -> Result<(), Error> {
// 1. Make a gRPC channel (securely, if we have a root cert)
let env = std::sync::Arc::new(grpcio::EnvBuilder::new().build());
// We are connecting by IP address, but we must ensure the SSL certificate
// contains the right hostname
let mut builder = grpcio::ChannelBuilder::new(env).override_ssl_target(hostname);
if !cert_chain_bytes.is_empty() {
// Only set up an encrypted connection if we have the server-side certificates
let creds = grpcio::ChannelCredentialsBuilder::new()
.root_cert(cert_chain_bytes.to_owned())
.cert(client_cert.to_pem()?, client_key.to_grpc_cert_bytes())
.build();
builder = builder.set_credentials(creds);
}
let ch = builder.connect(&address.to_string());
// 2. Make an Onboarding Gateway Client that uses that gRPC channel
let gw: onboarding::proto::onboarding_gw_service_grpc::OnboardingGwClient = onboarding::proto::onboarding_gw_service_grpc::OnboardingGwClient::new(ch);
// 3. Build the request
let mut req = onboarding::proto::onboarding_gw_messages::DiscoverGeoIPLocationRequest::new();
req.set_interface_name(interface_name.to_owned());
// 4. Send it (synchronously) and wait for the response
let opt = grpcio::CallOption::default().timeout(ONBOARDING_GW_GRPC_TIMEOUT);
let _rsp = gw
.discover_geo_ip_location_opt(&req, opt)
.map_err(onboarding::error::GrpcError::from)?;
Ok(())
}
The routine is called from the processes main thread and usually the process exits having been able to perform the RPC. The Environment struct looks like:
So it seems on the "_handles" field being dropped we've ended up in this corrupted state. The crashing frame is:
(gdb) f 0
#0 ___pthread_detach (th=139936879605440) at ./nptl/pthread_detach.c:48
48 ./nptl/pthread_detach.c: No such file or directory.
(gdb) p *((struct pthread *) 139936879605440)
Cannot access memory at address 0x7f4597fff6c0
(gdb)
To Reproduce
I've been unable to reproduce this at all sadly so understand if nothing much can be done.
Expected behavior
No crash.
System information
CPU architecture: x64_64
Distribution and kernel version: Debian 12, 6.1.0-24-amd64.
SELinux on?: No SELinux
Any other system details we should know?: The system has a total of 72 CPU cores, though only cores 0-11 are available to the process in question (via cpuaffinity).
Additional context
Seems very rare, we run this process every 30s on 1000s of machines and have never seen this crash other than once. Thus I don't expect much to happen, perhaps this is some horrendous memory corruption.
I did wonder if it's worth joining all the thread handles in the impl of the drop trait for Environment? C.F.
Describe the bug
When a grpcio::client::Client object went out of scope and each of the objects elements where dropped, I observed a crash when attempting to drop an "Environment" object:
In our application we have:
The routine is called from the processes main thread and usually the process exits having been able to perform the RPC. The Environment struct looks like:
So it seems on the "_handles" field being dropped we've ended up in this corrupted state. The crashing frame is:
To Reproduce
I've been unable to reproduce this at all sadly so understand if nothing much can be done.
Expected behavior
No crash.
System information
Additional context
Seems very rare, we run this process every 30s on 1000s of machines and have never seen this crash other than once. Thus I don't expect much to happen, perhaps this is some horrendous memory corruption.
I did wonder if it's worth joining all the thread handles in the impl of the drop trait for Environment? C.F.
https://github.com/tikv/grpc-rs/blob/246308142af9bc61dd637de088607df3856c4991/src/env.rs#L163