grpc / grpc

The C based gRPC (C++, Python, Ruby, Objective-C, PHP, C#)
https://grpc.io
Apache License 2.0

gRPC server thread gets stuck after creating an Alarm object #31738

Closed · ashu-ciena closed this issue 1 year ago

ashu-ciena commented 1 year ago

What version of gRPC and what language are you using?

grpc 1.42, C++

What operating system (Linux, Windows,...) and version?

Linux 5168 5.10.147-oneos-standard #1 SMP PREEMPT Mon Nov 21 04:26:32 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux

What runtime / compiler are you using (e.g. python version or version of gcc)

g++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44.0.3)

What did you do?

Using the Go-based gnmi client gnmi_cli, I subscribed to a path on a C++-based gRPC/gnmi server. After the user authentication, I created an Alarm object as below to return the tag of a C++ object from the completion queue and proceed further.

m_alarm = new Alarm();
if (m_alarm) {
  m_alarm->Set(m_cq.get(), gpr_now(GPR_CLOCK_MONOTONIC), this);
}

What did you expect to see?

I expected a tag to be returned immediately from the completion queue, which is what usually happens in the working case. Working_Log.txt

What did you see instead?

The gRPC server gets stuck, and the logs suggest it has either cancelled the timer or is stuck and not proceeding. Problem_Log.txt

The good and problematic logs are attached. Restarting the gRPC process resolves the issue. There is no segfault, so there is no backtrace.

Anything else we should know about your project / environment?

Not yet.

ashu-ciena commented 1 year ago

@yashykt Would you please take a look?

yijiem commented 1 year ago

Can you provide a complete minimal example that reproduces this issue? gRPC internally sets and cancels timers frequently, so it's hard to distinguish the alarm timer from the rest just from the log. For your example, the timer log should show the SET time < now, since the SET time is captured before the now time of the logging, e.g.:

I1123 15:38:45.253072109  769743 timer_generic.cc:342]       TIMER 0x55555658b368: SET 5668 now 5679 call 0x55555658b3b0[0x5555555f6b47]

But I couldn't find any timer with this property in your logs.

ashu-ciena commented 1 year ago

Hello @yijiem, thanks for looking into this. What we see in our logs is exactly the opposite of what you are expecting, and this has been the case all along in working scenarios. I checked the code and found our logs are printing the correct information: SET is the deadline and now is the current time, so SET > NOW.

if (GRPC_TRACE_FLAG_ENABLED(grpc_timer_trace)) {
  gpr_log(GPR_INFO, "TIMER %p: SET %" PRId64 " now %" PRId64 " call %p[%p]",
          timer, deadline, grpc_core::ExecCtx::Get()->Now(), closure, closure->cb);
}

Can you check the logs again and see whether any other issue stands out?

yijiem commented 1 year ago

Sorry if I wasn't clear before. I was referring to your example code which sets an alarm with immediate deadline to kick the completion queue:

m_alarm = new Alarm();
if (m_alarm) {
  m_alarm->Set(m_cq.get(), gpr_now(GPR_CLOCK_MONOTONIC), this);
}

This should hit the following code path (note the deadline <= now) and an operation should be queued onto the completion queue on its way out (for you to consume): https://github.com/grpc/grpc/blob/3c96517fc09d3c8ce1616f33f425e0f6d537e9d5/src/core/lib/iomgr/timer_generic.cc#L359-L365
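
Paraphrasing that path (a sketch of the behavior, not the verbatim source): an already-expired deadline never enters the timer heap, and the closure that completes the alarm is scheduled right away:

// Sketch of the timer_init fast path (paraphrased, not verbatim): a deadline
// at or before "now" skips the timer heap; the completion closure is run on
// the current ExecCtx, which queues the op onto the completion queue.
if (deadline <= now) {
  timer->pending = false;
  grpc_core::ExecCtx::Run(DEBUG_LOCATION, timer->closure, GRPC_ERROR_NONE);
  return;
}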

ashu-ciena commented 1 year ago

Hello @yijiem, I looked into my logs again and have one more observation to share. In the success case, the difference in milliseconds between the NOW and SET instants is significant, as below:

TIMER 0x7fa36007f150: SET 368636458 now 361436458 call 0x7fa36007f030[0x3d155a4470]
TIMER 0x7fa36001ea00: SET 361556451 now 361436451 call 0x7fa36001ea38[0x3d1559eb90]

But in the failure case it is very close; here it is just 1 millisecond:

TIMER 0x7f44001178: SET 5205049 now 5205048 call 0x7f440011b8[0x7f8165f2c8]
TIMER 0x7f6c08fe88: SET 5204846 now 5204746 call 0x7f6c08fcf8[0x7f813dd6e8]

Our objective is to insert a class-object tag from a helper thread after user authentication succeeds, so that the main thread reading the completion queue in a while loop receives it and calls the corresponding object's tag->next() function to process the subscription request further.

Ideally we don't want to delay the processing, so we use gpr_now to pass the current time as the deadline. We expected this to work even if the deadline expiry is checked a few milliseconds later by the gRPC timer handlers. Why would it not schedule the timer just because the check happened after the deadline was already past due? Do you have any recommendation for setting the timer?

We are using a grpc::ServerAsyncReaderWriter-based server. Do you know any way other than Alarm to insert a tag into the CQ from a helper thread and receive it on the main thread?
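
For clarity, here is a minimal sketch of our pattern (Handler and Next are placeholder names, not our real classes):

#include <grpc/support/time.h>
#include <grpcpp/alarm.h>
#include <grpcpp/completion_queue.h>

struct Handler {
  virtual ~Handler() = default;
  virtual void Next(bool ok) = 0;  // continues the subscription state machine
};

// Helper thread: schedule an immediate alarm carrying the handler as its tag.
void KickCq(grpc::CompletionQueue* cq, Handler* handler, grpc::Alarm* alarm) {
  alarm->Set(cq, gpr_now(GPR_CLOCK_MONOTONIC), handler);
}

// Main thread: drain the completion queue and dispatch tags.
void DrainLoop(grpc::CompletionQueue* cq) {
  void* tag;
  bool ok;
  while (cq->Next(&tag, &ok)) {            // blocks until a tag is delivered
    static_cast<Handler*>(tag)->Next(ok);  // resume the tagged handler
  }
}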

ashu-ciena commented 1 year ago

@yijiem We have been hitting this issue more often than ever; we would appreciate your response.

yijiem commented 1 year ago

Can you provide a complete minimal example that repros this issue? It will help us to investigate further. We have tests exercising the Alarm API and we haven't seen any hanging issue.

Also v1.42 is an old version (more than a year old) and we only support the current release (v1.51 atm) and the release prior to that according to: https://grpc.io/docs/what-is-grpc/faq/#how-long-are-grpc-releases-supported-for. So it might be good to try to upgrade to the latest version as well.

ashu-ciena commented 1 year ago

@yijiem, Do you see any problem in using GPR_CLOCK_MONOTONIC instead of GPR_CLOCK_REALTIME?

ashu-ciena commented 1 year ago

@yijiem @drfloob The issue reproduces intermittently. As shown in the code I posted, we use the Alarm API to insert a tag into the completion queue. We observe that the tag is not inserted and hence never comes back from the completion queue, so execution halts. We ask the alarm to expire immediately so that the tag is available immediately.

Is it not recommended to use gpr_now() with the monotonic clock? If there is a system clock change due to NTP synchronization, can this API still work and give a predictable result? Is there a minimum delta that can be added to the current time when creating an alarm so that the output is always predictable?

Is there any alternative way to insert a tag?

yijiem commented 1 year ago

I'm not aware of any issue with using GPR_CLOCK_MONOTONIC, but I won't rule out the possibility of an external event affecting the clock (the clock should not go backward, though, AFAIK, so it shouldn't delay the alarm). If you observe a correlation between the occurrence of the issue and an external event, it would be good to know.

I tried this example: https://github.com/yijiem/grpc/commit/7d13ce94a90818032f3fb349e65f97d21e8bc842 and didn't see any issue running on my system (note that I had to keep all Alarm objects alive, since destructing one cancels its timer, and there seems to be some issue with setting consecutive timers on one Alarm). If it is close to your use case, you could try it too. Again, it would be really helpful to see a complete minimal example that reproduces your issue.
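
For reference, the shape of that test is roughly the following (a sketch, not the linked commit verbatim):

#include <memory>
#include <vector>
#include <grpc/support/time.h>
#include <grpcpp/alarm.h>
#include <grpcpp/completion_queue.h>

int main() {
  grpc::CompletionQueue cq;
  std::vector<std::unique_ptr<grpc::Alarm>> alarms;  // keep every alarm alive
  void* tag;
  bool ok;
  for (int i = 0; i < 1000; ++i) {
    alarms.push_back(std::make_unique<grpc::Alarm>());
    // Deadline == now, so the tag should come back on the next poll.
    alarms.back()->Set(&cq, gpr_now(GPR_CLOCK_MONOTONIC), alarms.back().get());
    cq.Next(&tag, &ok);  // hangs here if the alarm never fires
  }
  cq.Shutdown();
  while (cq.Next(&tag, &ok)) {}  // drain any remaining events
  return 0;
}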

ashu-ciena commented 1 year ago

Thanks @yijiem for sharing the test code. I think it mostly resembles what we do, except we create only one alarm at a time, on each gnmi subscription request. I want to highlight one more gRPC behavior in this context which could be important for you to know: once this problem sets in, any subsequent attempt to create an Alarm results in the same state, i.e. it does not get triggered. So once the Alarm API fails to trigger, it continues to fail on subsequent attempts as well. The gRPC server can still accept new connection requests and reach this part of the code, but it does not work from then on. I assume we are hitting some problem in the timer module which stops it from working until we restart the gRPC server. Restarting as a workaround has always worked for us.

One more thing I want to highlight: we fetch the time using the monotonic clock with the gpr_now API. But after it is passed on to gRPC, what does gRPC do? How does it know the request refers to the monotonic clock and not the realtime clock? I checked the timer library code, and the timers seem to be scheduled relative to the process uptime. Can this guarantee monotonic-clock behavior?

The gRPC server component is a containerized application in a Docker framework. I want to repeat one of my earlier questions: what if we schedule the timer for 100 ms after the current time? NTP is enabled on the system and can today make corrections down to microseconds. Can a microsecond/millisecond amount of drift cause this failure?

TIMER 0x7f44001178: SET 5205049 now 5205048 call 0x7f440011b8[0x7f8165f2c8]

Here the difference was 1 ms and the API failed for us. Meanwhile, we are also checking whether there have been any recent hardware FPGA changes w.r.t. the RTC clock.

ashu-ciena commented 1 year ago

Hello @yijiem @drfloob, we finally have a recipe to reproduce this issue. If we move the clock backward in time (manually or via NTP sync), the Alarm API stops working and we land in this problem state. The recovery is to move the clock back to the original time, after which the Alarm API starts working again.

As reported earlier, we use the monotonic clock to derive our time before passing it to the Alarm API, but does gRPC also use the monotonic clock to derive the deadline and current time so that timers are not affected by a discontinuous clock change? If it doesn't, there is no point in passing gpr_now(GPR_CLOCK_MONOTONIC) to the Alarm API.

This issue has become very critical for us, affecting our deliverables to customers. I request that you please share your observations on priority.

yijiem commented 1 year ago

gRPC timers should use CLOCK_MONOTONIC in their internal bookkeeping; see the usage of grpc_core::Timestamp::Now in timer_generic.cc, e.g.: https://github.com/grpc/grpc/blob/3c96517fc09d3c8ce1616f33f425e0f6d537e9d5/src/core/lib/iomgr/timer_generic.cc#L662, which essentially calls: https://github.com/grpc/grpc/blob/0356ff3a7215d4a85e352d7dde4445a246343d89/src/core/lib/gprpp/time.cc#L43

It's strange that the monotonic clock provided by your OS is affected by an RTC clock change; it shouldn't be. AFAIK the POSIX monotonic clock might be affected by clock-speed adjustment (e.g. how many microseconds pass between ticks/timer interrupts), but it shouldn't have sudden time jumps or go backwards.

yijiem commented 1 year ago

It seems like gpr_cv_wait is using CLOCK_REALTIME for the timeout deadline since the underlying synchronization primitive (i.e. absl) is using CLOCK_REALTIME: https://github.com/grpc/grpc/blob/f99b8b5bc4a89d37f86bf063999ec22e44312663/src/core/lib/gpr/sync_abseil.cc#L77

I tried this example: https://github.com/yijiem/grpc/commit/b950859499333fb1b78066f72a0d0ff1f87ff56b in which timers are scheduled repeatedly, together with a separate binary that sets the system clock back one day in a loop (to prevent the OS from changing it back). The test gets stuck if the clock change happens during its run. I added a stack trace of the stuck grpc_global_timer thread in the commit message. @ashu-ciena can you verify that the thread is stuck at the same place as in this repro?

If that is the case, then this issue may affect all condition variables that gRPC uses on Linux platforms. The issue is a race condition: the system clock jumps backward while WaitWithDeadline is being called on an absl::CondVar with a CLOCK_REALTIME timestamp that is based on the system clock from before the jump.
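
To illustrate the mechanism (a minimal sketch of the general hazard, not gRPC's actual code):

// The deadline below is an absolute wall-clock time. If the system clock is
// then set back n seconds, the wait blocks roughly n seconds longer, because
// the wakeup is compared against CLOCK_REALTIME, not a monotonic clock.
#include "absl/synchronization/mutex.h"
#include "absl/time/clock.h"
#include "absl/time/time.h"

void WaitBriefly() {
  absl::Mutex mu;
  absl::CondVar cv;
  absl::MutexLock lock(&mu);
  // Intended: wake up after at most 100 ms. If the realtime clock jumps
  // backward during the wait, this deadline is suddenly far in the future.
  cv.WaitWithDeadline(&mu, absl::Now() + absl::Milliseconds(100));
}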

yijiem commented 1 year ago

The stack trace in the commit message of https://github.com/yijiem/grpc/commit/b950859499333fb1b78066f72a0d0ff1f87ff56b is not the one we should be looking at. This should be the canonical one: https://gist.githubusercontent.com/yijiem/844ab252199d6d033b29a05c84c33b0f/raw/0c74bfb39da320afbc080cdf439d6321d5c3c1f7/gistfile1.txt

ashu-ciena commented 1 year ago

@yijiem Sure, I will reproduce and let you know on Monday 2nd of January. Happy New Year!

yijiem commented 1 year ago

@ashu-ciena in your repro, can you also note down how far the system clock jumped back (say n seconds), and whether, if you then wait for n seconds, the Alarm API starts working again? Thanks!

ashu-ciena commented 1 year ago

@yijiem, I collected the information you asked for.

  1. Yes, if I wait for n seconds, the Alarm API starts working again.
  2. The stack trace of the timer thread is exactly the same as the one you pointed out. See below:

#0  syscall () at /usr/src/debug/glibc/2.35-r0/git/misc/syscall.S:38
#1  0x00007f4ca7ee383c in absl::lts_20210324::synchronization_internal::FutexImpl::WaitUntil (t=..., val=0, v=0x7f4ca53d0340)
    at /usr/src/debug/abseil-cpp/20210324.2+gitAUTOINC+278e0a0718-r0/build/kernel_timeout.h:97
#2  absl::lts_20210324::synchronization_internal::Waiter::Wait (this=this@entry=0x7f4ca53d0340, t=t@entry=...)
    at /usr/src/debug/abseil-cpp/20210324.2+gitAUTOINC+278e0a0718-r0/build/waiter.cc:95
#3  0x00007f4ca7ee36da in AbslInternalPerThreadSemWait_lts_20210324 (t=...)
    at /usr/src/debug/abseil-cpp/20210324.2+gitAUTOINC+278e0a0718-r0/build/create_thread_identity.h:95
#4  0x00007f4ca7ee73a0 in absl::lts_20210324::CondVar::WaitCommon (this=this@entry=0x7f4ca7ac4700,
    mutex=mutex@entry=0x7f4ca7ac4710 <_ZL4g_mu>, t=...)
    at /usr/src/debug/abseil-cpp/20210324.2+gitAUTOINC+278e0a0718-r0/build/mutex.cc:2559
#5  0x00007f4ca7ee74e4 in absl::lts_20210324::CondVar::WaitWithDeadline (this=this@entry=0x7f4ca7ac4700,
    mu=mu@entry=0x7f4ca7ac4710 <_ZL4g_mu>,
    deadline=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>)
    at /usr/src/debug/abseil-cpp/20210324.2+gitAUTOINC+278e0a0718-r0/git/absl/synchronization/per_thread_sem.h:86
#6  0x00007f4ca8c1fdd7 in gpr_cv_wait (cv=0x7f4ca7ac4700, mu=0x7f4ca7ac4710 <_ZL4g_mu>, abs_deadline=...)
    at /usr/src/debug/grpc/1.42.0-r0/build/sync_abseil.cc:91
#7  0x00007f4ca79259b9 in wait_until (next=) at /usr/src/debug/grpc/1.42.0-r0/build/exec_ctx.h:200
#8  timer_main_loop () at /usr/src/debug/grpc/1.42.0-r0/build/exec_ctx.h:255
#9  timer_thread(void*) () at /usr/src/debug/grpc/1.42.0-r0/build/exec_ctx.h:284
#10 0x00007f4ca8c25869 in operator() (__closure=0x0,
    v=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>)
    at /usr/src/debug/grpc/1.42.0-r0/build/thd_posix.cc:140
#11 _FUN () at /usr/src/debug/grpc/1.42.0-r0/build/thd_posix.cc:145
#12 0x00007f4ca8873a42 in start_thread (arg=) at pthread_create.c:442
#13 0x00007f4ca88f48e4 in clone () at /usr/src/debug/glibc/2.35-r0/git/misc/clone.S:100

  3. If the Alarm API gets stuck once, any subsequent usage of the API also gets stuck until the first instance is released. Ideally, I expected only the first instance to get stuck due to the backward clock jump; any subsequent connection using the Alarm API should work fine, since it would refer to the changed clock. But because the first instance is stuck, the subsequent instances also get stuck, in serial fashion.

Is there any code change that can fix this issue in the gRPC C++ library?

ashu-ciena commented 1 year ago

@yijiem Please let me know if there is any update from your side. Is there anything else you would like us to collect from the device?

tudor commented 1 year ago

@ashu-ciena I'm curious -- we at Rockset are using a similar pattern, but we're inserting the alarm intentionally in the past (std::system_clock::time_point() -- the Unix epoch) instead of the current time. Would this work for you?

yijiem commented 1 year ago

Another thing you could try is to turn off GPR_ABSEIL_SYNC defined here: https://github.com/grpc/grpc/blob/72c296a3c2db882e164036fd83746315985acf06/include/grpc/support/port_platform.h#L37-L47

This should then fall back to pthread-based synchronization, which may not have the same issue you are seeing: https://github.com/grpc/grpc/blob/f99b8b5bc4a89d37f86bf063999ec22e44312663/src/core/lib/gpr/sync_posix.cc#L21-L22

But note that we have moved away from using pthread-based synchronization in most places (except macOS) and may eventually remove it in the future.

ashu-ciena commented 1 year ago

> @ashu-ciena I'm curious -- we at Rockset are using a similar pattern, but we're inserting the alarm intentionally in the past (std::system_clock::time_point() -- the Unix epoch) instead of the current time. Would this work for you?

Thanks a lot @tudor for suggesting this workaround. I went ahead and used gpr_inf_past() with the realtime clock instead of gpr_now(), and my problem was resolved. I am hoping some more tests pass with this workaround.

ashu-ciena commented 1 year ago

@yijiem Is it fine to use gpr_inf_past() instead of gpr_now() to set a deadline all the way back to the epoch? Will this cause any issue in the timer library?

yijiem commented 1 year ago

gpr_inf_past() is not considered a stable, released API (the supported C++ surface is include/grpcpp, for example), so it may be changed or removed in future releases. We think using std::chrono::system_clock::time_point() is fine for this case.
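
For example (a sketch of the suggested change, reusing the m_alarm/m_cq members from the snippet earlier in this thread):

#include <chrono>
#include <grpcpp/alarm.h>

// A default-constructed system_clock::time_point is the Unix epoch: a
// deadline that is always in the past no matter how the system clock moves,
// so the tag is delivered on the next completion-queue poll.
m_alarm = new Alarm();
if (m_alarm) {
  m_alarm->Set(m_cq.get(), std::chrono::system_clock::time_point(), this);
}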

ashu-ciena commented 1 year ago

@yijiem It seems the documentation for gpr_inf_past() is not correct: https://grpc.github.io/grpc/core/time_8h.html#a57c9693ef253d0951363c5dafb6586df

It says that gpr_inf_past is in the future, but by its name and behavior it is obviously in the past.

yijiem commented 1 year ago

Thanks! Sent https://github.com/grpc/grpc/pull/32114 for that.

yashykt commented 1 year ago

Closing since there seems to be nothing else actionable here.