Open cswilliams opened 7 months ago
Unfortunately, we lack test coverage on alpine linux, and it's difficult to say what's gone wrong from the info here.
My best guess would be to compile from source, in debug mode, and try to see if you can examine the stack trace. That could be done by fetching a source gem and building it under GRPC_CONFIG=dbg
environment variable.
Alternatively, running under GRPC_VERBOSITY=DEBUG
and GRPC_TRACE=all
might shed some light if we're lucky enough to capture something suspect in those logs.
In case it helps: we've also been experiencing similar crashes in Sidekiq processes. From my tests, it happens with all versions of the gem from 1.51.0 (included). I haven't seen any crashes on 1.50.0.
Specifically, we're seeing crashes in 2 different apps. One makes a gRPC call every 5 minutes and the error pops up at random times. The other app has a more nuanced and interesting setup: the Sidekiq worker runs every 5 minutes, but it only makes gRPC calls within specific time windows, which correlate with the crashes. So for example, the worker will make gRPC calls between 13:00 and 15:00, and then we'll get a crash around 16:00. Or it will make calls between 19:00 and 22:00, and we'll get a crash at 23:00. The coincidence seems rather uncanny and it makes us think of something to do with connection lifetime, timeouts or similar.
I'm adding all these details in case they can spark some idea in the mind of contributors, as I have a very limited understanding of this code. In the meantime, we'll see if the debug flags can give us any more data to analyse.
Here's a snippet of a log for one of the crashed services, once we set GRPC_VERBOSITY=DEBUG
and GRPC_TRACE=all
:
2024-02-17T15:55:02.567+00:00 I0217 15:55:02.567038869 63 timer_manager.cc:191] sleep for a 284 milliseconds
2024-02-17T15:55:02.567+00:00 D0217 15:55:02.567061520 43 timer_generic.cc:341] TIMER 0x7fe7b7de4560: SET 82506369 now 82501369 call 0x7fe7b7de4598[0x7fe7b9315e90]
2024-02-17T15:55:02.567+00:00 D0217 15:55:02.567078140 43 timer_generic.cc:377] .. add to shard 13 with queue_deadline_cap=82501756 => is_first_timer=false
2024-02-17T15:55:02.567+00:00 D0217 15:55:02.567088220 43 timer_generic.cc:696] TIMER CHECK BEGIN: now=82501369 next=9223372036854775807 tls_min=82501369 glob_min=82501653
2024-02-17T15:55:02.567+00:00 D0217 15:55:02.567099331 43 timer_generic.cc:722] TIMER CHECK END: r=1; next=82501653
2024-02-17T15:55:02.567+00:00 I0217 15:55:02.567105951 43 timer_manager.cc:201] sleep until kicked
2024-02-17T15:55:02.738+00:00 I0217 15:55:02.738145847 45 tcp_posix.cc:1104] TCP:0x7fe7b666afc0 got_read: OK
2024-02-17T15:55:02.738+00:00 I0217 15:55:02.738186228 45 tcp_posix.cc:894] TCP:0x7fe7b666afc0 do_read
2024-02-17T15:55:02.747+00:00 terminate called after throwing an instance of 'std::logic_error'
2024-02-17T15:55:02.747+00:00 what(): basic_string::_S_construct null not valid
I can provide previous lines too, but they all seem to be more or less the same combination of timers, sleeps, shard operations and so on.
Interesting. Hopefully that helps them narrow it down. A couple hours after posting this issue, I ended up ripping out our code that was using google-cloud-ai_platform
(and indirectly grpc
) and just rewrote it to hit their REST based API using ruby's net/http
. I'm not sure if every google service has a corresponding rest based api (in place of a grpc based one), but that might be an option for anyone affected by this. It also allowed us to remove the google-protobuf
gem from our project which had caused us some installation headaches on Alpine, so that was another nice benefit.
cc @alto-ruby
We had same error on Ruby 3.2.2, grpc 1.56.2 on alpine:3.16. Our google pubsub subscriber stopped pulling any message after that error.
Was this resolved?
I am facing this issue with 'grpc', '= 1.53'
and ruby '3.0.5'
as well..
We seemed to have solved the errors we got by downgrading our service to version 1.50
The errors we got were : terminate called after throwing an instance of 'std::logic_error what(): basic_string::_S_construct null not valid
We get errors even if we use version 1.51
.
Are there any plans to prioritise this issue / bug happening form version 1.51
?
Will take a look but changes between 1.50 and 1.51 is only #31739
We have similar crashes only in Sidekiq, reproducible on grpc 1.64.0
and 1.65.2
(Alpine 3.20, Ruby 3.3.4). Only downgrade to 1.60.2
solved the issue.
What version of gRPC and what language are you using?
1.61.0 version of the grpc ruby gem
What operating system (Linux, Windows,...) and version?
Alpine 3.19.1
What runtime / compiler are you using (e.g. python version or version of gcc)
ruby 3.1.4
What did you do?
Running the following code appears to intermittently crash our sidekiq process:
with the following error:
We run the above ruby code 100-150 times a day and have seen this crash 6 times over the last month and a half. The crash seems to have started when we merged in a PR that migrated from Google AutoML to Google AI Platform. This particular PR updated the
grpc
gem from1.53.1
to1.60.0
and introduced thegoogle-cloud-ai_platform
gem. We've since tried upgrading grpc to1.61.0
, but the crash remains.Any help would be appreciated. Thanks!