ray-project / ray_lightning

Pytorch Lightning Distributed Accelerators using Ray
Apache License 2.0

trainer is not consistent during the `ray_ddp` #160

Closed JiahaoYao closed 2 years ago

JiahaoYao commented 2 years ago

With Ray DDP, the model's weights (its `state_dict`) are not updated after training when I run the example at https://github.com/ray-project/ray_lightning/blob/main/ray_lightning/examples/ray_ddp_example.py
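The symptom can be checked with a tiny framework-free helper: snapshot the state dict before `trainer.fit()` and compare it afterwards. Everything below is illustrative (plain lists stand in for tensors; `state_dicts_equal` is a hypothetical helper, not part of ray_lightning or PyTorch Lightning).

```python
from collections import OrderedDict

def state_dicts_equal(a, b, tol=1e-8):
    """Return True if two state-dict-like mappings hold the same values."""
    if a.keys() != b.keys():
        return False

    def flat(x):
        # Flatten nested lists (stand-in for tensor.tolist()).
        if isinstance(x, list):
            for item in x:
                yield from flat(item)
        else:
            yield x

    return all(
        abs(u - v) <= tol
        for key in a
        for u, v in zip(flat(a[key]), flat(b[key]))
    )

# Snapshot taken before fit():
before = OrderedDict(weight=[[0.1, -0.2]], bias=[0.05])
# With ray_ddp the driver-side weights come back unchanged (the bug):
after_ray = OrderedDict(weight=[[0.1, -0.2]], bias=[0.05])
# With plain ddp_spawn they are updated, as expected:
after_spawn = OrderedDict(weight=[[0.9, 0.3]], bias=[1.1])

print(state_dicts_equal(before, after_ray))    # unchanged -> bug reproduced
print(state_dicts_equal(before, after_spawn))  # changed -> training took effect
```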

The output is:

(RayExecutor pid=17635) All distributed processes registered. Starting with 1 processes
(RayExecutor pid=17635) ----------------------------------------------------------------------------------------------------
(RayExecutor pid=17635) 
(RayExecutor pid=17635) ic| trainer.model.state_dict(): OrderedDict([('layer.weight',
(RayExecutor pid=17635)                                               tensor([[-0.1284,  0.1310, -0.1107,  0.0651,  0.1190, -0.0267, -0.1436,  0.0471,
(RayExecutor pid=17635)                                          -0.0838,  0.0397, -0.0399, -0.1324,  0.0104,  0.0143, -0.0267, -0.0136,
(RayExecutor pid=17635)                                          -0.0615,  0.1493, -0.0103, -0.0674,  0.1604, -0.1438, -0.0297,  0.1336,
(RayExecutor pid=17635)                                           0.0048, -0.0601, -0.0883,  0.0963, -0.0772,  0.0758,  0.0504,  0.1446],
(RayExecutor pid=17635)                                         [-0.1589, -0.0598,  0.0007, -0.0334,  0.0677, -0.0225, -0.0384, -0.0950,
(RayExecutor pid=17635)                                           0.0687,  0.0472, -0.0018, -0.0929,  0.1125, -0.0880, -0.1418,  0.0386,
(RayExecutor pid=17635)                                          -0.1549,  0.0443, -0.0322, -0.1158, -0.1620, -0.1335, -0.0090, -0.1261,
(RayExecutor pid=17635)                                          -0.0398,  0.0462,  0.1015, -0.1090, -0.1676,  0.1510, -0.0376, -0.0029]])),
(RayExecutor pid=17635)                                              ('layer.bias', tensor([ 0.0714, -0.0563]))])
(RayExecutor pid=17635) ic| trainer.state.finished: False

(RayExecutor pid=17635) ic| self: <pytorch_lightning.trainer.trainer.Trainer object at 0x7fa1a6d7c6d0>
(RayExecutor pid=17635)     self.state.status: <TrainerStatus.FINISHED: 'finished'>
(RayExecutor pid=17635) ic| trainer.model.state_dict(): OrderedDict([('layer.weight',
(RayExecutor pid=17635)                                               tensor([[-0.1284,  0.1310, -0.1107,  0.0651,  0.1190, -0.0267, -0.1436,  0.0471,
(RayExecutor pid=17635)                                          -0.0838,  0.0397, -0.0399, -0.1324,  0.0104,  0.0143, -0.0267, -0.0136,
(RayExecutor pid=17635)                                          -0.0615,  0.1493, -0.0103, -0.0674,  0.1604, -0.1438, -0.0297,  0.1336,
(RayExecutor pid=17635)                                           0.0048, -0.0601, -0.0883,  0.0963, -0.0772,  0.0758,  0.0504,  0.1446],
(RayExecutor pid=17635)                                         [-0.1589, -0.0598,  0.0007, -0.0334,  0.0677, -0.0225, -0.0384, -0.0950,
(RayExecutor pid=17635)                                           0.0687,  0.0472, -0.0018, -0.0929,  0.1125, -0.0880, -0.1418,  0.0386,
(RayExecutor pid=17635)                                          -0.1549,  0.0443, -0.0322, -0.1158, -0.1620, -0.1335, -0.0090, -0.1261,
(RayExecutor pid=17635)                                          -0.0398,  0.0462,  0.1015, -0.1090, -0.1676,  0.1510, -0.0376, -0.0029]])),
(RayExecutor pid=17635)                                              ('layer.bias', tensor([ 0.0714, -0.0563]))])

For `ddp_spawn`, the output is:


ic| trainer.model.state_dict(): OrderedDict([('layer.weight',
                                              tensor([[ 0.1695,  0.0004,  0.1094,  0.0932, -0.0273, -0.1312, -0.0479,  0.1731,
                                         -0.0678,  0.1470,  0.1115, -0.0990, -0.0006, -0.1644,  0.0459,  0.0714,
                                          0.1490, -0.1328, -0.0901, -0.1616,  0.1092, -0.0762, -0.1534,  0.0229,
                                         -0.1467, -0.1757,  0.1172, -0.1396,  0.1102, -0.1518, -0.0181,  0.0065],
                                        [ 0.1613,  0.0589,  0.0125,  0.1643, -0.1730,  0.0023, -0.1190, -0.0283,
                                         -0.1743, -0.1035, -0.1147, -0.1426,  0.0919,  0.0543, -0.1295,  0.1022,
                                         -0.0693, -0.0091, -0.0271, -0.1216,  0.0130,  0.1703, -0.1073,  0.1255,
                                          0.1765, -0.1154, -0.1081, -0.0703,  0.0072,  0.0530,  0.0361, -0.0136]])),
                                             ('layer.bias', tensor([-0.0366,  0.1662]))])
ic| trainer.state.finished: False
ic| function: <bound method Trainer._fit_impl of <pytorch_lightning.trainer.trainer.Trainer object at 0x7f9cbdca7ca0>>
    args: (BoringModel(
            (layer): Linear(in_features=32, out_features=2, bias=True)
          ),
           None,
           None,
           None,
           None)
    kwargs: {}
ic| args[0].state_dict(): OrderedDict([('layer.weight',
                                        tensor([[ 0.1695,  0.0004,  0.1094,  0.0932, -0.0273, -0.1312, -0.0479,  0.1731,
                                   -0.0678,  0.1470,  0.1115, -0.0990, -0.0006, -0.1644,  0.0459,  0.0714,
                                    0.1490, -0.1328, -0.0901, -0.1616,  0.1092, -0.0762, -0.1534,  0.0229,
                                   -0.1467, -0.1757,  0.1172, -0.1396,  0.1102, -0.1518, -0.0181,  0.0065],
                                  [ 0.1613,  0.0589,  0.0125,  0.1643, -0.1730,  0.0023, -0.1190, -0.0283,
                                   -0.1743, -0.1035, -0.1147, -0.1426,  0.0919,  0.0543, -0.1295,  0.1022,
                                   -0.0693, -0.0091, -0.0271, -0.1216,  0.0130,  0.1703, -0.1073,  0.1255,
                                    0.1765, -0.1154, -0.1081, -0.0703,  0.0072,  0.0530,  0.0361, -0.0136]])),
                                       ('layer.bias', tensor([-0.0366,  0.1662]))])
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type   | Params
---------------------------------
0 | layer | Linear | 66    
---------------------------------
66        Trainable params
0         Non-trainable params
66        Total params
0.000     Total estimated model params size (MB)
/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:233: UserWarning: strategy=ddp_spawn and num_workers=0 may result in data loading bottlenecks. Consider setting num_workers>0 and persistent_workers=True
  rank_zero_warn(
/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:1935: PossibleUserWarning: The number of training batches (10) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
  rank_zero_warn(
[W reducer.cpp:1289] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
ic| self.state.status: <TrainerStatus.FINISHED: 'finished'>
ic| args[0].state_dict(): OrderedDict([('layer.weight',
                                        tensor([[ 0.2083, -0.1084, -0.1722,  0.0165,  0.3996, -0.0749, -0.9431, -0.0063,
                                    0.1011,  0.0379, -0.2080, -0.5125, -0.2195, -0.2585,  0.5671, -0.2192,
                                   -0.4280, -0.4997, -0.0467, -0.0351, -0.1094,  0.4505,  0.0201,  0.7633,
                                   -0.0565, -0.2368, -0.2257,  0.1355,  0.3509,  0.3909, -0.2490, -0.2567],
                                  [ 0.1874, -0.0077,  0.1549,  0.1495,  0.2601, -0.1129, -0.3417, -0.0051,
                                   -0.1441,  0.0613, -0.1535, -0.3989, -0.1201, -0.3688,  0.4670,  0.1692,
                                   -0.3195, -0.4685,  0.3749, -0.4972, -0.1101,  0.5094,  0.4260,  0.5927,
                                   -0.1578, -0.1089,  0.0262,  0.1388,  0.3880, -0.1715, -0.3933, -0.0491]])),
                                       ('layer.bias', tensor([1.1195, 0.7150]))])
ic| trainer.model.state_dict(): OrderedDict([('layer.weight',
                                              tensor([[ 0.2083, -0.1084, -0.1722,  0.0165,  0.3996, -0.0749, -0.9431, -0.0063,
                                          0.1011,  0.0379, -0.2080, -0.5125, -0.2195, -0.2585,  0.5671, -0.2192,
                                         -0.4280, -0.4997, -0.0467, -0.0351, -0.1094,  0.4505,  0.0201,  0.7633,
                                         -0.0565, -0.2368, -0.2257,  0.1355,  0.3509,  0.3909, -0.2490, -0.2567],
                                        [ 0.1874, -0.0077,  0.1549,  0.1495,  0.2601, -0.1129, -0.3417, -0.0051,
                                         -0.1441,  0.0613, -0.1535, -0.3989, -0.1201, -0.3688,  0.4670,  0.1692,
                                         -0.3195, -0.4685,  0.3749, -0.4972, -0.1101,  0.5094,  0.4260,  0.5927,
                                         -0.1578, -0.1089,  0.0262,  0.1388,  0.3880, -0.1715, -0.3933, -0.0491]])),
                                             ('layer.bias', tensor([1.1195, 0.7150]))])
ic| trainer.state.finished: True
None

The reason is that the two trainer objects are different: printing their ids shows they do not match.

(RayExecutor pid=7356) ic| self: <pytorch_lightning.trainer.trainer.Trainer object at 0x7fcdaad8a790>
(RayExecutor pid=7356)     self.state.status: <TrainerStatus.FINISHED: 'finished'>
(RayExecutor pid=7356) ic| trainer.state.finished: False
(RayExecutor pid=7356)     trainer: <pytorch_lightning.trainer.trainer.Trainer object at 0x7fcdb54a9fa0>

For the original `ddp_spawn`, however, the trainer is the same object:

ic| self: <pytorch_lightning.trainer.trainer.Trainer object at 0x7fcae29bbbe0>
    self.state.status: <TrainerStatus.FINISHED: 'finished'>
ic| trainer.state.finished: True
    trainer: <pytorch_lightning.trainer.trainer.Trainer object at 0x7fcae29bbbe0>
JiahaoYao commented 2 years ago

tag: #143