grpc / grpc

The C based gRPC (C++, Python, Ruby, Objective-C, PHP, C#)
https://grpc.io
Apache License 2.0
41.62k stars 10.5k forks source link

[Ruby] [1.61.0] crash: terminate called after throwing an instance of 'std::logic_error' what(): basic_string::_S_construct null not valid #35904

Open cswilliams opened 7 months ago

cswilliams commented 7 months ago

What version of gRPC and what language are you using?

1.61.0 version of the grpc ruby gem

What operating system (Linux, Windows,...) and version?

Alpine 3.19.1

What runtime / compiler are you using (e.g. python version or version of gcc)

ruby 3.1.4

What did you do?

Running the following code appears to intermittently crash our sidekiq process:

client = ::Google::Cloud::AIPlatform::V1::PredictionService::Client.new do |config|
  config.endpoint = CONFIG_ENDPOINT
end

instance = Google::Protobuf::Value.new(
  struct_value: Google::Protobuf::Struct.new(
    fields: {
      "mimeType" => Google::Protobuf::Value.new(string_value: "text/plain"),
      "content" => Google::Protobuf::Value.new(string_value: description),
    },
  ),
)
instances = [instance]

client.predict(endpoint: ENDPOINT, instances:)

with the following error:

I0213 17:31:53.052550845 4590 rb_grpc.c:331] GRPC_RUBY: grpc_ruby_init_threads g_bg_thread_init_done=1
D0213 17:31:53.052578563 4590 rb_grpc.c:351] GRPC_RUBY: grpc_ruby_init - g_enable_fork_support=0 prev g_grpc_ruby_init_count:240
I0213 17:31:53.052592997 4590 rb_grpc.c:331] GRPC_RUBY: grpc_ruby_init_threads g_bg_thread_init_done=1
D0213 17:31:53.052602984 4590 rb_grpc.c:351] GRPC_RUBY: grpc_ruby_init - g_enable_fork_support=0 prev g_grpc_ruby_init_count:241
I0213 17:31:53.052799685 4590 rb_grpc.c:331] GRPC_RUBY: grpc_ruby_init_threads g_bg_thread_init_done=1
D0213 17:31:53.052814241 4590 rb_grpc.c:351] GRPC_RUBY: grpc_ruby_init - g_enable_fork_support=0 prev g_grpc_ruby_init_count:242
I0213 17:31:53.052857734 4590 rb_grpc.c:331] GRPC_RUBY: grpc_ruby_init_threads g_bg_thread_init_done=1
D0213 17:31:53.052868997 4590 rb_grpc.c:351] GRPC_RUBY: grpc_ruby_init - g_enable_fork_support=0 prev g_grpc_ruby_init_count:243
I0213 17:31:53.053055788 4590 rb_grpc.c:331] GRPC_RUBY: grpc_ruby_init_threads g_bg_thread_init_done=1
D0213 17:31:53.053069631 4590 rb_grpc.c:351] GRPC_RUBY: grpc_ruby_init - g_enable_fork_support=0 prev g_grpc_ruby_init_count:244
I0213 17:31:53.053083193 4590 rb_grpc.c:331] GRPC_RUBY: grpc_ruby_init_threads g_bg_thread_init_done=1
D0213 17:31:53.053092618 4590 rb_grpc.c:351] GRPC_RUBY: grpc_ruby_init - g_enable_fork_support=0 prev g_grpc_ruby_init_count:245
I0213 17:31:53.053246729 4590 rb_grpc.c:331] GRPC_RUBY: grpc_ruby_init_threads g_bg_thread_init_done=1
D0213 17:31:53.053260180 4590 rb_grpc.c:351] GRPC_RUBY: grpc_ruby_init - g_enable_fork_support=0 prev g_grpc_ruby_init_count:246
I0213 17:31:53.053289163 4590 rb_grpc.c:331] GRPC_RUBY: grpc_ruby_init_threads g_bg_thread_init_done=1
D0213 17:31:53.053299572 4590 rb_grpc.c:351] GRPC_RUBY: grpc_ruby_init - g_enable_fork_support=0 prev g_grpc_ruby_init_count:247
I0213 17:31:53.053419464 4590 rb_grpc.c:331] GRPC_RUBY: grpc_ruby_init_threads g_bg_thread_init_done=1
D0213 17:31:53.053434202 4590 rb_grpc.c:351] GRPC_RUBY: grpc_ruby_init - g_enable_fork_support=0 prev g_grpc_ruby_init_count:248
I0213 17:31:53.053447354 4590 rb_grpc.c:331] GRPC_RUBY: grpc_ruby_init_threads g_bg_thread_init_done=1
D0213 17:31:53.053457543 4590 rb_grpc.c:351] GRPC_RUBY: grpc_ruby_init - g_enable_fork_support=0 prev g_grpc_ruby_init_count:249
I0213 17:31:53.053574807 4590 rb_grpc.c:331] GRPC_RUBY: grpc_ruby_init_threads g_bg_thread_init_done=1
D0213 17:31:53.053588937 4590 rb_grpc.c:351] GRPC_RUBY: grpc_ruby_init - g_enable_fork_support=0 prev g_grpc_ruby_init_count:250
I0213 17:31:53.053614835 4590 rb_grpc.c:331] GRPC_RUBY: grpc_ruby_init_threads g_bg_thread_init_done=1
D0213 17:31:53.053625243 4590 rb_grpc.c:351] GRPC_RUBY: grpc_ruby_init - g_enable_fork_support=0 prev g_grpc_ruby_init_count:251
terminate called after throwing an instance of 'std::logic_error'
what(): basic_string::_S_construct null not valid

We run the above ruby code 100-150 times a day and have seen this crash 6 times over the last month and a half. The crash seems to have started when we merged in a PR that migrated from Google AutoML to Google AI Platform. This particular PR updated the grpc gem from 1.53.1 to 1.60.0 and introduced the google-cloud-ai_platform gem. We've since tried upgrading grpc to 1.61.0, but the crash remains.

Any help would be appreciated. Thanks!

apolcyn commented 7 months ago

Unfortunately, we lack test coverage on alpine linux, and it's difficult to say what's gone wrong from the info here.

My best guess would be to compile from source, in debug mode, and try to see if you can examine the stack trace. That could be done by fetching a source gem and building it under GRPC_CONFIG=dbg environment variable.

Alternatively, running under GRPC_VERBOSITY=DEBUG and GRPC_TRACE=all might shed some light if we're lucky enough to capture something suspect in those logs.

alemr-roo commented 7 months ago

In case it helps: we've also been experiencing similar crashes in Sidekiq processes. From my tests, it happens with all versions of the gem from 1.51.0 (included). I haven't seen any crashes on 1.50.0.

Specifically, we're seeing crashes in 2 different apps. One makes a gRPC call every 5 minutes and the error pops up at random times. The other app has a more nuanced and interesting setup: the Sidekiq worker runs every 5 minutes, but it only makes gRPC calls within specific time windows, which correlate with the crashes. So for example, the worker will make gRPC calls between 13:00 and 15:00, and then we'll get a crash around 16:00. Or it will make calls between 19:00 and 22:00, and we'll get a crash at 23:00. The coincidence seems rather uncanny and it makes us think of something to do with connection lifetime, timeouts or similar.

I'm adding all these details in case they can spark some idea in the mind of contributors, as I have a very limited understanding of this code. In the meantime, we'll see if the debug flags can give us any more data to analyse.

alemr-roo commented 7 months ago

Here's a snippet of a log for one of the crashed services, once we set GRPC_VERBOSITY=DEBUG and GRPC_TRACE=all:

2024-02-17T15:55:02.567+00:00   I0217 15:55:02.567038869 63 timer_manager.cc:191] sleep for a 284 milliseconds
2024-02-17T15:55:02.567+00:00   D0217 15:55:02.567061520 43 timer_generic.cc:341] TIMER 0x7fe7b7de4560: SET 82506369 now 82501369 call 0x7fe7b7de4598[0x7fe7b9315e90]
2024-02-17T15:55:02.567+00:00   D0217 15:55:02.567078140 43 timer_generic.cc:377] .. add to shard 13 with queue_deadline_cap=82501756 => is_first_timer=false
2024-02-17T15:55:02.567+00:00   D0217 15:55:02.567088220 43 timer_generic.cc:696] TIMER CHECK BEGIN: now=82501369 next=9223372036854775807 tls_min=82501369 glob_min=82501653
2024-02-17T15:55:02.567+00:00   D0217 15:55:02.567099331 43 timer_generic.cc:722] TIMER CHECK END: r=1; next=82501653
2024-02-17T15:55:02.567+00:00   I0217 15:55:02.567105951 43 timer_manager.cc:201] sleep until kicked
2024-02-17T15:55:02.738+00:00   I0217 15:55:02.738145847 45 tcp_posix.cc:1104] TCP:0x7fe7b666afc0 got_read: OK
2024-02-17T15:55:02.738+00:00   I0217 15:55:02.738186228 45 tcp_posix.cc:894] TCP:0x7fe7b666afc0 do_read
2024-02-17T15:55:02.747+00:00   terminate called after throwing an instance of 'std::logic_error'
2024-02-17T15:55:02.747+00:00   what(): basic_string::_S_construct null not valid

I can provide previous lines too, but they all seem to be more or less the same combination of timers, sleeps, shard operations and so on.

cswilliams commented 7 months ago

Interesting. Hopefully that helps them narrow it down. A couple hours after posting this issue, I ended up ripping out our code that was using google-cloud-ai_platform (and indirectly grpc) and just rewrote it to hit their REST based API using ruby's net/http. I'm not sure if every google service has a corresponding rest based api (in place of a grpc based one), but that might be an option for anyone affected by this. It also allowed us to remove the google-protobuf gem from our project which had caused us some installation headaches on Alpine, so that was another nice benefit.

apolcyn commented 7 months ago

cc @alto-ruby

awbuana commented 6 months ago

We had same error on Ruby 3.2.2, grpc 1.56.2 on alpine:3.16. Our google pubsub subscriber stopped pulling any message after that error.

karunk commented 4 months ago

Was this resolved? I am facing this issue with 'grpc', '= 1.53' and ruby '3.0.5' as well..

karunk commented 4 months ago

We seemed to have solved the errors we got by downgrading our service to version 1.50

The errors we got were : terminate called after throwing an instance of 'std::logic_error what(): basic_string::_S_construct null not valid

We get errors even if we use version 1.51.

Are there any plans to prioritise this issue / bug happening form version 1.51 ?

alto-ruby commented 3 months ago

Will take a look but changes between 1.50 and 1.51 is only #31739

ilyasgaraev commented 1 month ago

We have similar crashes only in Sidekiq, reproducible on grpc 1.64.0 and 1.65.2 (Alpine 3.20, Ruby 3.3.4). Only downgrade to 1.60.2 solved the issue.