google / ml-compiler-opt

Infrastructure for Machine Learning Guided Optimization (MLGO) in LLVM.
Apache License 2.0
607 stars 91 forks source link

cancellation_manager.pause_all_work() is insufficient to pause work #116

Open Northbadge opened 2 years ago

Northbadge commented 2 years ago

It is possible for work to continue running if no processes are registered on the cancellation manager when pause_all_work() is called.

Possible solutions: Also send SIGSTOP to the worker python process- however this has a race condition: pause_all_work -> new work scheduled -> pause python -> work is still being done. pausing python first would prevent pause_all_work from running so that's not an option.

Alternatively, have cancellation manager send SIGSTOP upon process registration when paused. This should work, I think, and is probably the easiest solution.

to verify: add a time_sleep right under .pause_children() in train_locally and look at the cpu graph

mtrofin commented 1 year ago

does this happen if we always make sure pause_all_work is followed by resume_all_work?

Northbadge commented 1 year ago

Yes- say the worker python thread is not waiting on a sub process clang, and at that moment pause all work is executed, no clangs are paused, then the worker is free to start sub process clangs going forward. Resume all work would either be a nop or send a SIGCONT for an already running process

Northbadge commented 1 year ago

randomly thought of something less race-y, so before I forget: we could have pause_all_work SIGSTOP itself while holding a process-creation-lock right before calling resume_all_work, so the parent process can "call" resume_all_work by sending a SIGCONT