Closed: wks closed this 7 months ago
This problem is responsible for some of the CI test timeouts for the Ruby binding. The hanging test case is TestAutoload#test_autoload_fork. It forks, and usually passes. But if a GC happens to be triggered in a child process, the child will wait forever, and the test will hang until it is killed after 5 hours.
An alternative design is to make initialize_collection re-entrant and to allow MMTk to check with the binding whether any GC threads exist. When initialize_collection is called and MMTk does not find any GC threads, it will spawn new ones. So after fork, the new process would just call initialize_collection again.
Yes, that's exactly what I was going to suggest. Making it reentrant would be easier, and/or we could expose thread creation explicitly in the API so that a runtime can call it after calling fork.
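The re-entrant design discussed above can be sketched roughly as follows. This is only an illustration, not MMTk's actual API: the `Binding` trait, its two methods, `ToyMmtk`, and `ToyBinding` are all hypothetical names, and the fork is simulated by resetting a counter rather than by calling fork.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical binding-side interface (not part of MMTk's real API).
trait Binding {
    /// How many GC threads the binding currently knows about.
    fn gc_thread_count(&self) -> usize;
    /// Ask the binding to create one GC thread.
    fn spawn_gc_thread(&self, ordinal: usize);
}

/// Hypothetical stand-in for the MMTk instance.
struct ToyMmtk<B: Binding> {
    binding: B,
    num_workers: usize,
}

impl<B: Binding> ToyMmtk<B> {
    /// Re-entrant: spawns GC threads only if the binding reports none.
    /// Returns the number of threads spawned by this call.
    fn initialize_collection(&self) -> usize {
        if self.binding.gc_thread_count() > 0 {
            return 0; // GC threads already exist; nothing to do.
        }
        for ordinal in 0..self.num_workers {
            self.binding.spawn_gc_thread(ordinal);
        }
        self.num_workers
    }
}

/// Toy binding that only counts threads instead of creating them.
struct ToyBinding {
    live: AtomicUsize,
}

impl Binding for ToyBinding {
    fn gc_thread_count(&self) -> usize {
        self.live.load(Ordering::SeqCst)
    }
    fn spawn_gc_thread(&self, _ordinal: usize) {
        self.live.fetch_add(1, Ordering::SeqCst);
    }
}

impl ToyBinding {
    /// Simulates the child side of fork: no GC thread survives.
    fn simulate_fork_child(&self) {
        self.live.store(0, Ordering::SeqCst);
    }
}

fn main() {
    let mmtk = ToyMmtk {
        binding: ToyBinding { live: AtomicUsize::new(0) },
        num_workers: 4,
    };
    assert_eq!(mmtk.initialize_collection(), 4); // first call spawns
    assert_eq!(mmtk.initialize_collection(), 0); // second call is a no-op
    mmtk.binding.simulate_fork_child();
    assert_eq!(mmtk.initialize_collection(), 4); // child spawns again
}
```

In this sketch the existence check goes through the binding rather than through MMTk's own thread handles, since handles copied into a forked child would refer to threads that no longer exist.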
TL;DR: Some VMs (CRuby, ART, etc.) support forking, but fork() doesn't duplicate any threads other than the one that calls fork(). Currently, if a VM calls fork(), MMTk GC threads will not exist in the child process. We need to have the necessary mechanisms to support fork().
Requirement
CRuby
Ruby has the method Kernel#fork. It does for Ruby what the fork() system call does, i.e. it duplicates the current process, but only the current Ruby thread, not other threads.
Shopify's use case involves forking the VM to handle different requests. The Ruby process performs a compacting GC before forking so that the heap is less fragmented for the children. This is not a problem with CRuby's own GC, because it performs GC on the mutator thread itself. In other words, it doesn't have dedicated GC threads.
When using MMTk, after forking, the child process will not have any GC threads. If a mutator thread in the child process triggers a GC, it will block forever waiting for the GC to finish. But the GC will never happen because there is no GC thread.
Android ART
The "Zygote" process runs an ART VM, and forks into different application processes. This is intended for accelerating class loading.
We will face the same problem if the Zygote process forks.
What should happen when forking?
We first need to let GC threads come to a graceful stop. We can only fork() when no GC thread is running.
We also need to make sure all mutators are at safepoints, and all contexts are flushed. After fork(), only one thread will remain, and that's likely a mutator thread. This means that all GC threads must stop right before fork(), and that we should restart the GC threads after fork().
We can ignore the coordinator thread for now because we plan to remove it (we'll discuss that in https://github.com/mmtk/mmtk-core/issues/1053). The state of a GC worker is encapsulated in the GCWorker struct, so it should be easy to restart GC threads by reusing the GCWorker structs.
What needs to be done?
Everything will be easier if we remove the coordinator first. See https://github.com/mmtk/mmtk-core/issues/1053
We need to add an API to stop all GC threads for forking. It is basically the reverse of initialize_collection.
We need another API to restart GC threads. It should be similar to initialize_collection, but it should reuse the existing GCWorker structs rather than creating new instances.
We also need to make sure that GC worker threads save all of their state in the GCWorker struct before exiting.
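One possible shape for the stop and restart APIs is sketched below. It is only an illustration under assumptions: the `GCWorker` here is a toy stand-in for MMTk's struct of the same name, and `WorkerGroup`, `spawn_workers`, `stop_workers`, and the `times_started` field are hypothetical. Each worker thread owns its `GCWorker` while it runs and hands it back when it exits, so a restart reuses the same structs.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Toy stand-in for MMTk's `GCWorker`: all per-worker state lives here.
struct GCWorker {
    ordinal: usize,
    times_started: u32, // hypothetical field, used to show state is kept
}

impl GCWorker {
    /// Worker loop: runs until asked to stop, then returns the struct so
    /// that its state is preserved for the next restart.
    fn run(mut self, stop: Arc<AtomicBool>) -> GCWorker {
        self.times_started += 1;
        while !stop.load(Ordering::Acquire) {
            thread::sleep(Duration::from_millis(1)); // placeholder for GC work
        }
        self
    }
}

/// Hypothetical owner of all GC workers.
struct WorkerGroup {
    stop: Arc<AtomicBool>,
    running: Vec<JoinHandle<GCWorker>>,
    parked: Vec<GCWorker>,
}

impl WorkerGroup {
    fn new(n: usize) -> Self {
        WorkerGroup {
            stop: Arc::new(AtomicBool::new(false)),
            running: Vec::new(),
            parked: (0..n)
                .map(|ordinal| GCWorker { ordinal, times_started: 0 })
                .collect(),
        }
    }

    /// Start (or restart) one thread per parked `GCWorker`, reusing the structs.
    fn spawn_workers(&mut self) {
        self.stop.store(false, Ordering::Release);
        for worker in self.parked.drain(..) {
            let stop = Arc::clone(&self.stop);
            self.running.push(thread::spawn(move || worker.run(stop)));
        }
    }

    /// Stop all GC threads and take their `GCWorker` structs back.
    fn stop_workers(&mut self) {
        self.stop.store(true, Ordering::Release);
        for handle in self.running.drain(..) {
            self.parked.push(handle.join().expect("GC worker panicked"));
        }
    }
}

fn main() {
    let mut group = WorkerGroup::new(2);
    group.spawn_workers(); // initial start
    group.stop_workers(); // right before fork()
    // ... the VM would call fork() here ...
    group.spawn_workers(); // after fork(), restart with the same structs
    group.stop_workers();
    for w in &group.parked {
        println!("worker {} was started {} times", w.ordinal, w.times_started);
    }
    assert!(group.parked.iter().all(|w| w.times_started == 2));
}
```

Returning the struct from the thread's closure is one way to guarantee that nothing a worker needs is left on the exiting thread's stack; other designs, such as keeping the structs behind shared pointers, would also satisfy the requirement.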