Upon debugging, I found this is due to the use of `set()` for saving various states in `state.py`. Because a Python `set` does not guarantee iteration order, different processes tried to synchronize different states (e.g. rank 0: `covariance_state`, rank 1: `covariance_counter`), which silently caused a deadlock. I was able to find this bug with `TORCH_DISTRIBUTED_DEBUG=DETAIL`.
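To illustrate the root cause and one possible remedy, here is a minimal, framework-free sketch (the `sorted()` fix shown here is illustrative; the actual patch may order the states differently):

```python
# Collective ops (e.g. all_reduce) deadlock unless every rank issues
# them in the same order. Iterating a Python set gives no such
# guarantee: string hash randomization differs per process, so two
# ranks can traverse the same set in different orders.
states = {"covariance_state", "covariance_counter"}

# Unsafe: iteration order of `states` may differ across ranks.
unsafe_order = list(states)

# Safe: a deterministic, rank-independent order for synchronization.
safe_order = sorted(states)
print(safe_order)
# -> ['covariance_counter', 'covariance_state']
```

An alternative is to replace the `set` with a `list` or `dict` (both preserve insertion order), which is safe as long as insertion order is identical on every rank.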
In addition, I improved `__init__` and confirmed that users can now do:

```python
import analog

analog.init(project="test")
analog.watch(model)
analog.setup({"log": "grad", "statistic": "kfac"})
```
For reference, the symptom in the previous code was execution getting stuck randomly at `torch.cuda.synchronize()` in `state.py`; the deadlock reproduces in multi-process runs, though only intermittently.