pfnet / pytorch-pfn-extras

Supplementary components to accelerate research and development in PyTorch
https://medium.com/pytorch/migration-from-chainer-to-pytorch-8ed92c12c8
MIT License

`BestValueTrigger` does not work well in MPI runs such as multi-node settings #29

Open dhgrs opened 4 years ago

dhgrs commented 4 years ago

What happened

_DistributedSnapshot with BestValueTrigger gets stuck.

code

https://gist.github.com/dhgrs/56424106e00bafee9617b0a15a028c2c

command

CUDA_VISIBLE_DEVICES=0,1 mpiexec -N 2 python3 mnist.py

Why it happens

The reporter runs independently in every MPI process, with no all-reduce operation, so BestValueTrigger checks a different value in each process. As a result, the trigger fires in some processes but not in others. _DistributedSnapshot waits for all MPI processes, but that wait never completes because the processes whose trigger did not fire never reach it.
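The hang described above can be imitated without MPI. The following is a minimal sketch, not the ppe implementation: each "rank" is a thread, `threading.Barrier` stands in for the MPI barrier, and the function name `snapshot_with_barrier` is hypothetical. A real MPI barrier has no timeout; the timeout is added here only so that the demo terminates and can report which ranks would have been stuck.

```python
import threading


def snapshot_with_barrier(fired_per_rank, timeout=0.5):
    """Simulate a snapshot guarded by a barrier across all ranks.

    fired_per_rank[i] says whether the trigger fired on rank i.
    Only ranks whose trigger fired reach the barrier.
    """
    barrier = threading.Barrier(len(fired_per_rank))
    outcome = {}

    def worker(rank, fired):
        if not fired:
            # The trigger did not fire, so this rank never calls the barrier.
            outcome[rank] = "skipped"
            return
        try:
            # Stand-in for the MPI barrier; the timeout exists only for the demo.
            barrier.wait(timeout=timeout)
            outcome[rank] = "saved"
        except threading.BrokenBarrierError:
            # A real MPI barrier would block here forever.
            outcome[rank] = "stuck"

    threads = [
        threading.Thread(target=worker, args=(rank, fired))
        for rank, fired in enumerate(fired_per_rank)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcome


if __name__ == "__main__":
    print(snapshot_with_barrier([True, True]))   # both ranks reach the barrier
    print(snapshot_with_barrier([False, True]))  # rank 1 waits for rank 0 in vain
```

When the trigger fires on every rank, all of them pass the barrier. When it fires on only one rank, that rank waits for a peer that never arrives.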

Workaround

Apply an all-reduce operation manually before reporting. But is this the best way? Will ppe support automatic all-reduce?
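To illustrate why the workaround helps, here is a small pure-Python sketch under stated assumptions. `all_reduce_mean` is a hypothetical stand-in for a real collective (in an actual run this would typically be `torch.distributed.all_reduce` on a tensor, followed by a division by the world size), and `fire_pattern` assumes the trigger fires on a strict improvement of the minimum. The per-rank values are made up for illustration.

```python
def all_reduce_mean(values_per_rank):
    """Stand-in for an all-reduce: every rank receives the same mean."""
    mean = sum(values_per_rank) / len(values_per_rank)
    return [mean] * len(values_per_rank)


def fire_pattern(values):
    """Return, per step, whether a min-value trigger would fire (assumed strict <)."""
    best = None
    fired = []
    for value in values:
        improved = best is None or value < best
        fired.append(improved)
        if improved:
            best = value
    return fired


# Illustrative per-rank losses, one entry per evaluation step.
rank0_losses = [1.0, 2.0, 0.5, 0.7]
rank1_losses = [1.1, 0.1, 0.9, 0.2]

# Without reduction, each rank feeds its own value to the trigger.
raw_patterns = [fire_pattern(rank0_losses), fire_pattern(rank1_losses)]

# With reduction, every rank reports the same value at every step.
reduced_steps = [all_reduce_mean(list(step)) for step in zip(rank0_losses, rank1_losses)]
reduced_per_rank = [list(seq) for seq in zip(*reduced_steps)]
reduced_patterns = [fire_pattern(seq) for seq in reduced_per_rank]

if __name__ == "__main__":
    print("raw:    ", raw_patterns)
    print("reduced:", reduced_patterns)
```

Because every rank sees an identical reduced value at each step, the trigger decisions cannot diverge, whatever the individual losses are.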

emcastillo commented 4 years ago

I think that maybe just removing the barrier should work.

In a distributed environment the reduction step is an implicit barrier, and it gets executed every few iterations so the snapshots will be correctly synchronized.

The multi-node snapshot in ChainerMN didn't have a barrier, probably because this behavior was enforced through the reduction.

https://github.com/chainer/chainer/blob/master/chainermn/extensions/_multi_node_snapshot.py

dhgrs commented 4 years ago

As you suggested, removing the barrier would fix _DistributedSnapshot with BestValueTrigger. But I think the core problem is BestValueTrigger itself.

Some processes are triggered but the others are not.

Is this expected behavior?

emcastillo commented 4 years ago

Apparently the trigger doesn't fire equally in all the workers. I think this is expected, as every process might see a different loss value each time and we can't guarantee when the trigger is going to fire. The reproduction code was very useful, thanks a lot. Here I print the rank first, then the best value, then the actual value:

0 None 1.0 fired
1 None 1.1 fired
0 1.0 2.0
1 1.1 0.1 fired
0 1.0 1.0
1 0.1 1.1
0 1.0 1.0
1 0.1 1.1
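The pattern in the log above can be replayed with a few lines of plain Python. This is a sketch, not the ppe code: `run_best_value_trigger` is a hypothetical helper that assumes the trigger keeps the minimum and fires only on a strict improvement, which is consistent with rank 0 not firing when it sees 1.0 again. The per-rank values are the ones printed in the log.

```python
def run_best_value_trigger(values):
    """Replay a min-value trigger; return (best_before, value, fired) per step."""
    best = None
    log = []
    for value in values:
        # Assumed rule: fire on the first value, then on strict improvement only.
        fired = best is None or value < best
        log.append((best, value, fired))
        if fired:
            best = value
    return log


# Values taken from the log above, one list per rank.
per_rank_values = {
    0: [1.0, 2.0, 1.0, 1.0],
    1: [1.1, 0.1, 1.1, 1.1],
}
logs = {rank: run_best_value_trigger(vals) for rank, vals in per_rank_values.items()}

if __name__ == "__main__":
    for step in range(4):
        for rank in (0, 1):
            best, value, fired = logs[rank][step]
            fields = [rank, best, value] + (["fired"] if fired else [])
            print(*fields)
```

At the second step rank 1 fires while rank 0 does not, which is exactly the situation in which a barrier inside the snapshot would leave rank 1 waiting.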

I think that removing the barrier is correct.