ivmai / bdwgc

The Boehm-Demers-Weiser conservative C/C++ Garbage Collector (bdwgc, also known as bdw-gc, boehm-gc, libgc)
https://www.hboehm.info/gc/
Other
2.81k stars 392 forks source link

rare case of infinite loop on Android #630

Open delguoqing opened 1 month ago

delguoqing commented 1 month ago

Hi, maintainer, I have encountered a problem of GC stuck in resend_lost_signals. Because it is a bit hard to reproduce this problem. I will describe my findings as following:

1) assume that all threads are suspended in the GC_suspend_handler and haven't wake up from sigsuspend call.
2) Then GC then finished and want to restart all threads. It will begin by updating GC_stop_count to GC_stop_count + 1. Before it has a chance to raise signals to other threads, however, other threads might be wake up by other signals than the restart signal and find that GC_stop_count != my_stop_count, and then get out of the sigsuspend while loop. It will also set the last_stop_count of its thread to be GC_stop_count | 1.
3) After that, the GC thread will execute GC_restart_all, and decide that the thread metioned in 2) has already started, so it will not raise signal to it. And that thread is not counted into n_live_threads.
4) n_live_threads will be used in resend_lost_signals_retry, so the calling thread will call sem_timedwait "Nthread -1" times while GC_suspend_ack_sem will have a value of Nthread. That is when things start going wrong.
5) So if everything goes well, GCs after this point, can start before other threads have completely stopped. ( because it has 1 extra ack from previous rounds) That already causes trouble. The infinite loop case is just one of them. If, some signal comes in which takes a lot of time to execute, sem_timedwait will timeout, and go into the resend_lost_signal function. At that case, we might have ack_count > n_live_threads. I know we have handled the case of newly_sent < n_live_threads - ack_count due to lost of some threads. But if the opposite happens, the thread will be trapped infinitely, even if we have been successfully restart all threads, we still have ack_count greater than n_live_threads, and the loop will continue.

When Android is creating a bug report. It will send a SIGQUIT signal to "Signal Catcher" thread, the signal catcher thread will in turn send SIGRT_1 to all threads, to stop them and grab the ucontext to do stack unwind. That signal might wake up threads in sigsuspend, thus causing what I have described above.

Hope you can take some time, and give some comments. I'll try to provide any more information you need.
ivmai commented 1 month ago

Might be related: #396

delguoqing commented 1 month ago

Similar to #396 , If GC_ASSERTIONS is enabled in my project, the process will abort at that

  sem_getvalue(&GC_suspend_ack_sem, &i);
  GC_ASSERT(0 == i);

396 seems to be related to fork, this is not the case in my project though.

I have a fix for this problem and want to contribute, How I can do that?

ivmai commented 1 month ago

I have a fix for this problem and want to contribute

Cool. Just make a PR and reference it here.

ivmai commented 1 month ago

Probably the fix could be (not sure, I've not looked at it deeply): } while (ao_load_acquire_async(&GC_stop_count) == my_stop_count -> } while ((ao_load_acquire_async(&GC_stop_count) & ~(AO_t)THREAD_RESTARTED) == my_stop_count

Probably some other fix exist, let's see your version.

delguoqing commented 1 month ago

Hello, I have created a pull request. #631 I have tried to make my best and described this problem clearer. Can you take a look and make some comments?

ivmai commented 1 month ago

Good, I will review it within a week. Could you please use your real name in commit (--author)? (Re-pushing the PR should also run the tests on Travis CI.)

delguoqing commented 1 month ago

Hi, I have changed user.name and re-open a PR #633 Although I have commited using my real name, Github displays my github account name anyway.

delguoqing commented 5 days ago

@ivmai anything I can do to move forward?

ivmai commented 5 days ago

No-no, sorry about delay(( (I'm doing big refactoring of the code base these days, it takes me longer than planned, and I just don't want to switch to deeper review of your patch. I hope I will complete it within May.)