grpc / grpc

The C based gRPC (C++, Python, Ruby, Objective-C, PHP, C#)
https://grpc.io
Apache License 2.0

[Ruby] SEGV in `grpc_cq_pollset` #35310

Open casperisfine opened 10 months ago

casperisfine commented 10 months ago

What version of gRPC and what language are you using?

What operating system (Linux, Windows,...) and version?

Ubuntu 20.04 LTS

What runtime / compiler are you using (e.g. python version or version of gcc)

Ruby 3.2.2

gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0

What did you do?

I don't know much about the application; I just see it crashing ~300 times a week. All I know is that it uses the grpc gem to respond to gRPC calls.

What did you expect to see?

Not crash.

What did you see instead?

Many crashes.

The backtrace of the crashing thread:

(gdb) bt
#0  __GI_abort () at abort.c:107
#1  0x000055759e25feca in die () at error.c:817
#2  rb_bug_for_fatal_signal (default_sighandler=0x0, sig=sig@entry=11, ctx=ctx@entry=0x7fd333dfc3c0, fmt=fmt@entry=0x55759e67d76a "Segmentation fault at %p") at error.c:817
#3  0x000055759e3c1efd in sigsegv (sig=11, info=0x7fd333dfc4f0, ctx=0x7fd333dfc3c0) at signal.c:964
#4  <signal handler called>
#5  0x00007fd34016b579 in grpc_cq_pollset (cq=cq@entry=0x7fd33ed46200) at src/core/lib/surface/completion_queue.cc:1411
#6  0x00007fd3401439aa in grpc_core::FilterStackCall::SetCompletionQueue (this=0x7fd32f726070, cq=0x7fd33ed46200) at src/core/lib/surface/call.cc:928
#7  0x00007fd34017215e in grpc_core::Server::CallData::Publish (this=0x7fd32f726fa0, cq_idx=0, rc=0x7fd33ff80260) at src/core/lib/surface/server.cc:1683
#8  0x00007fd340174ee0 in grpc_core::Server::CallData::StartNewRpc (this=0x7fd32f726fa0, elem=0x7fd32f726ee0) at third_party/abseil-cpp/absl/status/status.h:882
#9  0x00007fd34014f1ab in grpc_core::Closure::Run (location=..., error=..., closure=<optimized out>) at third_party/abseil-cpp/absl/status/status.h:871
#10 grpc_core::FilterStackCall::BatchControl::PostCompletion (this=0x7fd32f7e4a20) at src/core/lib/surface/call.cc:1356
#11 0x00007fd34014f629 in grpc_core::FilterStackCall::BatchControl::FinishStep (this=<optimized out>, op=op@entry=grpc_core::FilterStackCall::PendingOp::kRecvInitialMetadata) at src/core/lib/surface/call.cc:1375
#12 0x00007fd3401511bc in grpc_core::FilterStackCall::BatchControl::ReceivingInitialMetadataReady (this=<optimized out>, error=...) at src/core/lib/surface/call.cc:1483
#13 0x00007fd34015130a in operator() (error=..., error=..., __closure=0x0, bctl=<optimized out>) at third_party/abseil-cpp/absl/status/status.h:871
#14 _FUN () at src/core/lib/surface/call.cc:1750
#15 0x00007fd34016fc8a in grpc_core::Closure::Run (location=..., error=..., closure=0x7fd32f726e08) at third_party/abseil-cpp/absl/status/status.h:871
#16 grpc_core::Server::CallData::RecvInitialMetadataReady (arg=<optimized out>, error=...) at src/core/lib/surface/server.cc:1850
#17 0x00007fd34010bf75 in exec_ctx_run (closure=0x7fd32f727088) at third_party/abseil-cpp/absl/status/status.h:853
#18 grpc_core::ExecCtx::Flush (this=this@entry=0x7fd333dfd340) at src/core/lib/iomgr/exec_ctx.cc:84
#19 0x00007fd340401de8 in grpc_core::ExecCtx::~ExecCtx (this=0x7fd333dfd340, __in_chrg=<optimized out>) at ./src/core/lib/iomgr/exec_ctx.h:130
#20 grpc_event_engine::experimental::(anonymous namespace)::EventEngineEndpointWrapper::FinishPendingRead (this=0x7fd3388fda80, status=...) at src/core/lib/iomgr/event_engine_shims/endpoint.cc:132
#21 0x00007fd340402357 in operator() (__closure=<optimized out>, status=..., __closure=<optimized out>, status=...) at third_party/abseil-cpp/absl/status/status.h:871
#22 absl::lts_20230802::base_internal::Callable::Invoke<grpc_event_engine::experimental::(anonymous namespace)::EventEngineEndpointWrapper::Read(grpc_closure*, grpc_slice_buffer*, const grpc_event_engine::experimental::EventEngine::Endpoint::ReadArgs*)::<lambda(absl::lts_20230802::Status)>&, absl::lts_20230802::Status> (f=...) at third_party/abseil-cpp/absl/base/internal/invoke.h:185
#23 absl::lts_20230802::base_internal::invoke<grpc_event_engine::experimental::(anonymous namespace)::EventEngineEndpointWrapper::Read(grpc_closure*, grpc_slice_buffer*, const grpc_event_engine::experimental::EventEngine::Endpoint::ReadArgs*)::<lambda(absl::lts_20230802::Status)>&, absl::lts_20230802::Status> (
    f=...) at third_party/abseil-cpp/absl/base/internal/invoke.h:212
#24 absl::lts_20230802::internal_any_invocable::InvokeR<void, grpc_event_engine::experimental::(anonymous namespace)::EventEngineEndpointWrapper::Read(grpc_closure*, grpc_slice_buffer*, const grpc_event_engine::experimental::EventEngine::Endpoint::ReadArgs*)::<lambda(absl::lts_20230802::Status)>&, absl::lts_20230802::Status> (f=...) at third_party/abseil-cpp/absl/functional/internal/any_invocable.h:132
#25 absl::lts_20230802::internal_any_invocable::LocalInvoker<false, void, grpc_event_engine::experimental::(anonymous namespace)::EventEngineEndpointWrapper::Read(grpc_closure*, grpc_slice_buffer*, const grpc_event_engine::experimental::EventEngine::Endpoint::ReadArgs*)::<lambda(absl::lts_20230802::Status)>&, absl::lts_20230802::Status>(absl::lts_20230802::internal_any_invocable::TypeErasedState *) (state=<optimized out>) at third_party/abseil-cpp/absl/functional/internal/any_invocable.h:310
#26 0x00007fd34051d3e4 in absl::lts_20230802::internal_any_invocable::Impl<void (absl::lts_20230802::Status)>::operator()(absl::lts_20230802::Status) (args#0=..., this=0x7fd333dfd430) at third_party/abseil-cpp/absl/functional/internal/any_invocable.h:868
#27 grpc_event_engine::experimental::PosixEndpointImpl::HandleRead (this=0x7fd338984400, status=...) at src/core/lib/event_engine/posix_engine/posix_endpoint.cc:588
#28 0x00007fd34051d870 in operator() (__closure=<optimized out>, __closure=<optimized out>, status=...) at third_party/abseil-cpp/absl/status/status.h:853
#29 absl::lts_20230802::base_internal::Callable::Invoke<grpc_event_engine::experimental::PosixEndpointImpl::PosixEndpointImpl(grpc_event_engine::experimental::EventHandle*, grpc_event_engine::experimental::PosixEngineClosure*, std::shared_ptr<grpc_event_engine::experimental::EventEngine>, grpc_event_engine::experimental::MemoryAllocator&&, const grpc_event_engine::experimental::PosixTcpOptions&)::<lambda(absl::lts_20230802::Status)>&, absl::lts_20230802::Status> (f=...) at third_party/abseil-cpp/absl/base/internal/invoke.h:185
#30 absl::lts_20230802::base_internal::invoke<grpc_event_engine::experimental::PosixEndpointImpl::PosixEndpointImpl(grpc_event_engine::experimental::EventHandle*, grpc_event_engine::experimental::PosixEngineClosure*, std::shared_ptr<grpc_event_engine::experimental::EventEngine>, grpc_event_engine::experimental::MemoryAllocator&&, const grpc_event_engine::experimental::PosixTcpOptions&)::<lambda(absl::lts_20230802::Status)>&, absl::lts_20230802::Status> (f=...) at third_party/abseil-cpp/absl/base/internal/invoke.h:212
#31 absl::lts_20230802::internal_any_invocable::InvokeR<void, grpc_event_engine::experimental::PosixEndpointImpl::PosixEndpointImpl(grpc_event_engine::experimental::EventHandle*, grpc_event_engine::experimental::PosixEngineClosure*, std::shared_ptr<grpc_event_engine::experimental::EventEngine>, grpc_event_engine::experimental::MemoryAllocator&&, const grpc_event_engine::experimental::PosixTcpOptions&)::<lambda(absl::lts_20230802::Status)>&, absl::lts_20230802::Status> (f=...) at third_party/abseil-cpp/absl/functional/internal/any_invocable.h:132
#32 absl::lts_20230802::internal_any_invocable::LocalInvoker<false, void, grpc_event_engine::experimental::PosixEndpointImpl::PosixEndpointImpl(grpc_event_engine::experimental::EventHandle*, grpc_event_engine::experimental::PosixEngineClosure*, std::shared_ptr<grpc_event_engine::experimental::EventEngine>, grpc_event_engine::experimental::MemoryAllocator&&, const grpc_event_engine::experimental::PosixTcpOptions&)::<lambda(absl::lts_20230802::Status)>&, absl::lts_20230802::Status>(absl::lts_20230802::internal_any_invocable::TypeErasedState *) (state=<optimized out>)
    at third_party/abseil-cpp/absl/functional/internal/any_invocable.h:310
#33 0x00007fd3403e85d2 in absl::lts_20230802::internal_any_invocable::Impl<void (absl::lts_20230802::Status)>::operator()(absl::lts_20230802::Status) (args#0=..., this=<optimized out>) at third_party/abseil-cpp/absl/status/status.h:774
#34 grpc_event_engine::experimental::PosixEngineClosure::Run (this=0x7fd338965780) at ./src/core/lib/event_engine/posix_engine/posix_engine_closure.h:53
#35 0x00007fd3403f6253 in grpc_event_engine::experimental::WorkStealingThreadPool::ThreadState::Step (this=<optimized out>) at src/core/lib/event_engine/thread_pool/work_stealing_thread_pool.cc:428
#36 grpc_event_engine::experimental::WorkStealingThreadPool::ThreadState::Step (this=<optimized out>) at src/core/lib/event_engine/thread_pool/work_stealing_thread_pool.cc:421
#37 0x00007fd3403f6650 in grpc_event_engine::experimental::WorkStealingThreadPool::ThreadState::ThreadBody (this=this@entry=0x7fd33fcd0f80) at src/core/lib/event_engine/thread_pool/work_stealing_thread_pool.cc:394
#38 0x00007fd3403f676c in operator() (__closure=0x0, arg=0x7fd33fcd0f80) at src/core/lib/event_engine/thread_pool/work_stealing_thread_pool.cc:207
#39 _FUN () at src/core/lib/event_engine/thread_pool/work_stealing_thread_pool.cc:209
#40 0x00007fd340196bd7 in operator() (__closure=0x0, v=<optimized out>) at src/core/lib/gprpp/posix/thd.cc:145
#41 _FUN () at src/core/lib/gprpp/posix/thd.cc:150
#42 0x00007fd347d31609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#43 0x00007fd347c3b133 in __GI___libc_allocate_once_slow (place=0x7fd33ed46200, allocate=0x7fd33ed46200, deallocate=0x0, closure=0x0) at allocate_once.c:27

The last function arguments:

(gdb) p *cq
$2 = {owning_refs = {value_ = {<std::__atomic_base<long>> = {static _S_alignment = 8, _M_i = 3}, <No data fields>}}, 
  padding_1 = "PK\320@\323\177\000\000\250,\020A\323\177\000\000Q\336\000\000\000\000\000\000\001\000\000\000\000\000\000\000PK\320@\323\177\000\000\250,\020A\323\177\000\000\000\000\000\000\000\000\000\000\001\000\000\000\000\000\000", mu = 0x7fd340d05078, 
  padding_2 = "\020l\337?\323\177\000\000\000\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\b\223\337@\323\177\000\000\270\354\330?\323\177\000\000\000\000\000\000\000\000\000\000\002\000\000\000\000\000\000\000\360\226\337@\323\177\000", vtable = 0x7fd33fe4d580, 
  padding_3 = "\001g\000\000\000\000\000\000\001\000\000\000\000\000\000\000\360\226\337@\323\177\000\000\200\325\344?\323\177\000\000\000\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000 \267\337@\323\177\000\000\020\t\343?\323\177\000", poller_vtable = 0x0, pollset_shutdown_done = {next_data = {
      next = 0x1, mpscq_node = {space_ = "\001\000\000\000\000\000\000"}, scratch = 1}, cb = 0x7fd340dfac30, cb_arg = 0x7fd33fda5b88, error_data = {error = 0, scratch = 0}}, num_polls = 1}

All the other threads: https://gist.github.com/casperisfine/19cb9dddb11c743f868f09133f636a1f

casperisfine commented 10 months ago

One extra piece of information: while we had a few SEGVs before, they became much more frequent right after the upgrade from 1.59.2 to 1.60.0.

sbfaulkner commented 10 months ago

Reverting to 1.59.2 did not seem to resolve this. We are now reverting google-protobuf to 3.25.0 (from 3.25.1) as well.

chadlwilson commented 9 months ago

I'm not a maintainer, just curious: do you build the native gems from source for grpc and google-protobuf, or install the pre-compiled native gems via bundler/Gemfile.lock?

sbfaulkner commented 9 months ago

I've been OOO for a couple of weeks, but before we left we discovered that this was likely NOT due to any change here; it seems to occur on older versions as well.

We did, however, discover that the crashes seem to occur when shutting down (e.g. when the HPA scales down replicas of the service).

chadlwilson commented 9 months ago

It's probably still relevant to maintainers whether you use pre-built extension binaries or build from source when creating your container images.

casperisfine commented 9 months ago

> do you build the native gems from source for grpc and google-protobuf or install the pre-compiled native gems via bundler/Gemfile.lock ?

The latter.
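
For anyone else who needs to answer the same question about their own deployment, here is a minimal sketch of one way to check it from the lockfile. It relies on the Bundler convention that a precompiled gem is locked with a platform suffix (for example `1.60.0-x86_64-linux`), while a gem built from source is locked with a bare version. The helper name `lockfile_variants` and the sample lockfile contents are made up for illustration; they are not taken from the application in this report.

```ruby
# Sketch: report whether gems in a Gemfile.lock are locked as precompiled
# (platform-specific) or as source ("ruby" platform) gems.
# Assumption: top-level spec entries are indented by exactly four spaces and
# look like "name (version)" or "name (version-platform)".

LOCK_ENTRY = /^ {4}(?<name>\S+) \((?<version>[^)\-]+)(?:-(?<platform>[^)]+))?\)$/

# Returns { "gem name" => [{ version:, platform: }, ...] } for the given names.
# A gem can appear several times when multiple platforms are locked.
def lockfile_variants(lockfile_text, names)
  result = {}
  lockfile_text.each_line do |line|
    m = LOCK_ENTRY.match(line.chomp)
    next unless m && names.include?(m[:name])

    (result[m[:name]] ||= []) << {
      version: m[:version],
      platform: m[:platform] || "ruby" # no suffix means built from source
    }
  end
  result
end

# Hypothetical lockfile excerpt, for demonstration only.
SAMPLE_LOCK = <<~LOCK
  GEM
    remote: https://rubygems.org/
    specs:
      google-protobuf (3.25.1)
      grpc (1.60.0-x86_64-linux)
        google-protobuf (~> 3.25)
LOCK

lockfile_variants(SAMPLE_LOCK, %w[grpc google-protobuf]).each do |name, variants|
  variants.each do |v|
    kind = v[:platform] == "ruby" ? "built from source" : "precompiled"
    puts "#{name} #{v[:version]} (#{v[:platform]}): #{kind}"
  end
end
```

Against a real project, the same helper can be called as `lockfile_variants(File.read("Gemfile.lock"), %w[grpc google-protobuf])`. Note that the lockfile only shows what Bundler resolved; a container image built with different Bundler platform settings could still differ.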