dhgrs opened this issue 4 years ago
I think that maybe just removing the barrier should work.
In a distributed environment the reduction step is an implicit barrier, and it gets executed every few iterations so the snapshots will be correctly synchronized.
The multinode snapshot in ChainerMN didn't have a barrier, probably because it relied on the reduction to enforce this synchronization.
https://github.com/chainer/chainer/blob/master/chainermn/extensions/_multi_node_snapshot.py
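To illustrate why the reduction can replace the explicit barrier, here is a minimal, hypothetical sketch using plain torch.distributed (not the extension's actual code): the all_reduce is itself a collective, so every rank blocks on it, and a snapshot written right after it is already aligned across ranks without a dist.barrier().

```python
import torch
import torch.distributed as dist

def maybe_snapshot(model, metric: torch.Tensor, iteration: int, interval: int = 100):
    """Hypothetical hook: reduce the metric and snapshot every `interval` iterations."""
    if iteration % interval != 0:
        return
    # The all_reduce is a collective, so every rank synchronizes here;
    # no explicit dist.barrier() is needed before writing the snapshot.
    dist.all_reduce(metric, op=dist.ReduceOp.SUM)
    metric /= dist.get_world_size()
    torch.save(model.state_dict(),
               f"snapshot_iter_{iteration}_rank_{dist.get_rank()}.pt")
```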
As you suggested, removing the barrier would fix `_DistributedSnapshot` with `BestValueTrigger`. But I think the core problem is `BestValueTrigger` itself: it causes some processes to be triggered while others are not. Is this expected behavior?
Apparently the trigger doesn't fire equally in all the workers; I think this is expected, since every process may see a different loss value each time, so we can't guarantee when the trigger is going to fire. The reproduction code was very useful, thanks a lot. I am printing here the rank first, then the best value and the actual value:
```
0 None 1.0 fired
1 None 1.1 fired
0 1.0 2.0
1 1.1 0.1 fired
0 1.0 1.0
1 0.1 1.1
0 1.0 1.0
1 0.1 1.1
```
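To make the divergence concrete, here is a minimal standalone sketch (not the library's actual implementation) of the per-rank best-value comparison that a minimizing `BestValueTrigger` performs; the loss sequences are the ones from the log above:

```python
# Hypothetical per-rank simulation: each rank keeps its own best value,
# so the firing pattern diverges as soon as the ranks observe different losses.
def simulate(rank, losses):
    best = None
    for loss in losses:
        fired = best is None or loss < best
        print(rank, best, loss, "fired" if fired else "")
        if fired:
            best = loss

# Loss sequences taken from the reproduction log above.
simulate(0, [1.0, 2.0, 1.0, 1.0])
simulate(1, [1.1, 0.1, 1.1, 1.1])
```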
I think that removing the barrier is correct.
What happened
`_DistributedSnapshot` with `BestValueTrigger` gets stuck.
code
https://gist.github.com/dhgrs/56424106e00bafee9617b0a15a028c2c
command
CUDA_VISIBLE_DEVICES=0,1 mpiexec -N 2 python3 mnist.py
Why it happens
The Reporter works in every MPI process without an all-reduce operation, so `BestValueTrigger` checks a different value in each process. As a result, some processes are triggered while others are not. `_DistributedSnapshot` waits for all MPI processes, but some of them never finish because they were not triggered.
Workaround
Apply an all-reduce operation manually before reporting. But is this the best way? Will ppe support automatic all-reduce?
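For reference, a minimal sketch of that manual workaround, assuming torch.distributed is already initialized and that the metric goes through pytorch_pfn_extras reporting (the helper name `report_synced_loss` and the `val/loss` key are just placeholders):

```python
import torch
import torch.distributed as dist
import pytorch_pfn_extras as ppe

def report_synced_loss(local_loss: torch.Tensor) -> None:
    """Average the per-rank loss before reporting it (hypothetical helper)."""
    synced = local_loss.detach().clone()
    dist.all_reduce(synced, op=dist.ReduceOp.SUM)
    synced /= dist.get_world_size()
    # Every rank now reports an identical value, so BestValueTrigger makes the
    # same decision everywhere and _DistributedSnapshot fires on all ranks or none.
    ppe.reporting.report({'val/loss': synced.item()})
```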