Closed JiahaoYao closed 2 years ago
When using Ray DDP, the weights (the model's `state_dict`) are not updated after training when I run the example at https://github.com/ray-project/ray_lightning/blob/main/ray_lightning/examples/ray_ddp_example.py.

The output is:
```
(RayExecutor pid=17635) All distributed processes registered. Starting with 1 processes
(RayExecutor pid=17635) ----------------------------------------------------------------------------------------------------
(RayExecutor pid=17635)
(RayExecutor pid=17635) ic| trainer.model.state_dict(): OrderedDict([('layer.weight',
(RayExecutor pid=17635)     tensor([[-0.1284,  0.1310, -0.1107,  0.0651,  0.1190, -0.0267, -0.1436,  0.0471,
(RayExecutor pid=17635)             -0.0838,  0.0397, -0.0399, -0.1324,  0.0104,  0.0143, -0.0267, -0.0136,
(RayExecutor pid=17635)             -0.0615,  0.1493, -0.0103, -0.0674,  0.1604, -0.1438, -0.0297,  0.1336,
(RayExecutor pid=17635)              0.0048, -0.0601, -0.0883,  0.0963, -0.0772,  0.0758,  0.0504,  0.1446],
(RayExecutor pid=17635)             [-0.1589, -0.0598,  0.0007, -0.0334,  0.0677, -0.0225, -0.0384, -0.0950,
(RayExecutor pid=17635)              0.0687,  0.0472, -0.0018, -0.0929,  0.1125, -0.0880, -0.1418,  0.0386,
(RayExecutor pid=17635)             -0.1549,  0.0443, -0.0322, -0.1158, -0.1620, -0.1335, -0.0090, -0.1261,
(RayExecutor pid=17635)             -0.0398,  0.0462,  0.1015, -0.1090, -0.1676,  0.1510, -0.0376, -0.0029]])),
(RayExecutor pid=17635)     ('layer.bias', tensor([ 0.0714, -0.0563]))])
(RayExecutor pid=17635) ic| trainer.state.finished: False
(RayExecutor pid=17635) ic| self: <pytorch_lightning.trainer.trainer.Trainer object at 0x7fa1a6d7c6d0>
(RayExecutor pid=17635)     self.state.status: <TrainerStatus.FINISHED: 'finished'>
(RayExecutor pid=17635) ic| trainer.model.state_dict(): OrderedDict([('layer.weight',
(RayExecutor pid=17635)     tensor([[-0.1284,  0.1310, -0.1107,  0.0651,  0.1190, -0.0267, -0.1436,  0.0471,
(RayExecutor pid=17635)             -0.0838,  0.0397, -0.0399, -0.1324,  0.0104,  0.0143, -0.0267, -0.0136,
(RayExecutor pid=17635)             -0.0615,  0.1493, -0.0103, -0.0674,  0.1604, -0.1438, -0.0297,  0.1336,
(RayExecutor pid=17635)              0.0048, -0.0601, -0.0883,  0.0963, -0.0772,  0.0758,  0.0504,  0.1446],
(RayExecutor pid=17635)             [-0.1589, -0.0598,  0.0007, -0.0334,  0.0677, -0.0225, -0.0384, -0.0950,
(RayExecutor pid=17635)              0.0687,  0.0472, -0.0018, -0.0929,  0.1125, -0.0880, -0.1418,  0.0386,
(RayExecutor pid=17635)             -0.1549,  0.0443, -0.0322, -0.1158, -0.1620, -0.1335, -0.0090, -0.1261,
(RayExecutor pid=17635)             -0.0398,  0.0462,  0.1015, -0.1090, -0.1676,  0.1510, -0.0376, -0.0029]])),
(RayExecutor pid=17635)     ('layer.bias', tensor([ 0.0714, -0.0563]))])
```

Note that the `state_dict` printed after `trainer.fit` is identical to the one printed before it.
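A quick way to confirm that the weights really come back unchanged is to snapshot the `state_dict` before `trainer.fit` and compare element-wise afterwards. Here is a minimal sketch of the comparison, using plain lists in place of tensors so it runs without torch; `state_dicts_equal` is a hypothetical helper, not part of ray_lightning:

```python
import copy

def state_dicts_equal(a, b, tol=1e-8):
    """Return True if two flattened state_dicts hold the same values."""
    if a.keys() != b.keys():
        return False
    for key in a:
        if len(a[key]) != len(b[key]):
            return False
        if any(abs(x - y) > tol for x, y in zip(a[key], b[key])):
            return False
    return True

# Snapshot before training, then compare after trainer.fit(...) returns.
before = {"layer.bias": [0.0714, -0.0563]}
after = copy.deepcopy(before)  # with Ray DDP the values come back identical
print(state_dicts_equal(before, after))  # True -> training had no visible effect
```

With the Ray plugin the comparison returns `True` (bug), whereas with vanilla `ddp_spawn` the post-fit weights differ.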
For `mp_spawn`, the output is:
```
ic| trainer.model.state_dict(): OrderedDict([('layer.weight',
    tensor([[ 0.1695,  0.0004,  0.1094,  0.0932, -0.0273, -0.1312, -0.0479,  0.1731,
             -0.0678,  0.1470,  0.1115, -0.0990, -0.0006, -0.1644,  0.0459,  0.0714,
              0.1490, -0.1328, -0.0901, -0.1616,  0.1092, -0.0762, -0.1534,  0.0229,
             -0.1467, -0.1757,  0.1172, -0.1396,  0.1102, -0.1518, -0.0181,  0.0065],
            [ 0.1613,  0.0589,  0.0125,  0.1643, -0.1730,  0.0023, -0.1190, -0.0283,
             -0.1743, -0.1035, -0.1147, -0.1426,  0.0919,  0.0543, -0.1295,  0.1022,
             -0.0693, -0.0091, -0.0271, -0.1216,  0.0130,  0.1703, -0.1073,  0.1255,
              0.1765, -0.1154, -0.1081, -0.0703,  0.0072,  0.0530,  0.0361, -0.0136]])),
    ('layer.bias', tensor([-0.0366, 0.1662]))])
ic| trainer.state.finished: False
ic| function: <bound method Trainer._fit_impl of <pytorch_lightning.trainer.trainer.Trainer object at 0x7f9cbdca7ca0>>
    args: (BoringModel(
             (layer): Linear(in_features=32, out_features=2, bias=True)
           ), None, None, None, None)
    kwargs: {}
ic| args[0].state_dict(): OrderedDict([('layer.weight',
    tensor([[ 0.1695,  0.0004,  0.1094,  0.0932, -0.0273, -0.1312, -0.0479,  0.1731,
             -0.0678,  0.1470,  0.1115, -0.0990, -0.0006, -0.1644,  0.0459,  0.0714,
              0.1490, -0.1328, -0.0901, -0.1616,  0.1092, -0.0762, -0.1534,  0.0229,
             -0.1467, -0.1757,  0.1172, -0.1396,  0.1102, -0.1518, -0.0181,  0.0065],
            [ 0.1613,  0.0589,  0.0125,  0.1643, -0.1730,  0.0023, -0.1190, -0.0283,
             -0.1743, -0.1035, -0.1147, -0.1426,  0.0919,  0.0543, -0.1295,  0.1022,
             -0.0693, -0.0091, -0.0271, -0.1216,  0.0130,  0.1703, -0.1073,  0.1255,
              0.1765, -0.1154, -0.1081, -0.0703,  0.0072,  0.0530,  0.0361, -0.0136]])),
    ('layer.bias', tensor([-0.0366, 0.1662]))])
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type   | Params
---------------------------------
0 | layer | Linear | 66
---------------------------------
66        Trainable params
0         Non-trainable params
66        Total params
0.000     Total estimated model params size (MB)

/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:233: UserWarning: strategy=ddp_spawn and num_workers=0 may result in data loading bottlenecks. Consider setting num_workers>0 and persistent_workers=True
  rank_zero_warn(
/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:1935: PossibleUserWarning: The number of training batches (10) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
  rank_zero_warn(
[W reducer.cpp:1289] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
ic| self.state.status: <TrainerStatus.FINISHED: 'finished'>
ic| args[0].state_dict(): OrderedDict([('layer.weight',
    tensor([[ 0.2083, -0.1084, -0.1722,  0.0165,  0.3996, -0.0749, -0.9431, -0.0063,
              0.1011,  0.0379, -0.2080, -0.5125, -0.2195, -0.2585,  0.5671, -0.2192,
             -0.4280, -0.4997, -0.0467, -0.0351, -0.1094,  0.4505,  0.0201,  0.7633,
             -0.0565, -0.2368, -0.2257,  0.1355,  0.3509,  0.3909, -0.2490, -0.2567],
            [ 0.1874, -0.0077,  0.1549,  0.1495,  0.2601, -0.1129, -0.3417, -0.0051,
             -0.1441,  0.0613, -0.1535, -0.3989, -0.1201, -0.3688,  0.4670,  0.1692,
             -0.3195, -0.4685,  0.3749, -0.4972, -0.1101,  0.5094,  0.4260,  0.5927,
             -0.1578, -0.1089,  0.0262,  0.1388,  0.3880, -0.1715, -0.3933, -0.0491]])),
    ('layer.bias', tensor([1.1195, 0.7150]))])
ic| trainer.model.state_dict(): OrderedDict([('layer.weight',
    tensor([[ 0.2083, -0.1084, -0.1722,  0.0165,  0.3996, -0.0749, -0.9431, -0.0063,
              0.1011,  0.0379, -0.2080, -0.5125, -0.2195, -0.2585,  0.5671, -0.2192,
             -0.4280, -0.4997, -0.0467, -0.0351, -0.1094,  0.4505,  0.0201,  0.7633,
             -0.0565, -0.2368, -0.2257,  0.1355,  0.3509,  0.3909, -0.2490, -0.2567],
            [ 0.1874, -0.0077,  0.1549,  0.1495,  0.2601, -0.1129, -0.3417, -0.0051,
             -0.1441,  0.0613, -0.1535, -0.3989, -0.1201, -0.3688,  0.4670,  0.1692,
             -0.3195, -0.4685,  0.3749, -0.4972, -0.1101,  0.5094,  0.4260,  0.5927,
             -0.1578, -0.1089,  0.0262,  0.1388,  0.3880, -0.1715, -0.3933, -0.0491]])),
    ('layer.bias', tensor([1.1195, 0.7150]))])
ic| trainer.state.finished: True
None
```

Here the weights printed after training differ from the initial ones, as expected.
This is because, if we print out the trainer's object id, we find that the `trainer` on the driver and the `self` used inside the Ray worker are different objects:
```
(RayExecutor pid=7356) ic| self: <pytorch_lightning.trainer.trainer.Trainer object at 0x7fcdaad8a790>
(RayExecutor pid=7356)     self.state.status: <TrainerStatus.FINISHED: 'finished'>
(RayExecutor pid=7356) ic| trainer.state.finished: False
(RayExecutor pid=7356)     trainer: <pytorch_lightning.trainer.trainer.Trainer object at 0x7fcdb54a9fa0>
```
However, for the original `ddp_spawn` strategy, the `trainer` is the same object:
```
ic| self: <pytorch_lightning.trainer.trainer.Trainer object at 0x7fcae29bbbe0>
    self.state.status: <TrainerStatus.FINISHED: 'finished'>
ic| trainer.state.finished: True
    trainer: <pytorch_lightning.trainer.trainer.Trainer object at 0x7fcae29bbbe0>
```
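The mechanism behind the differing ids can be illustrated without Ray or Lightning at all: Ray ships objects to worker processes by serializing them (effectively a pickle round-trip), so the worker operates on a copy, and mutations made in the worker never reach the driver unless they are explicitly sent back. A minimal sketch, assuming this serialization behavior; the `Trainer` class here is a dummy stand-in, not the Lightning one:

```python
import pickle

class Trainer:
    """Dummy stand-in for pytorch_lightning.Trainer (illustration only)."""
    def __init__(self):
        self.state_dict = {"layer.weight": [0.1, -0.2]}

driver_trainer = Trainer()

# Shipping the trainer to a worker is roughly a pickle round-trip:
# the worker receives a distinct copy of the object.
worker_trainer = pickle.loads(pickle.dumps(driver_trainer))

# Two different objects -- matching the two different ids in the logs.
print(driver_trainer is worker_trainer)  # False

# Training mutates only the worker's copy...
worker_trainer.state_dict["layer.weight"] = [0.9, 0.7]

# ...so the driver-side trainer still holds the initial weights.
print(driver_trainer.state_dict["layer.weight"])  # [0.1, -0.2]
```

This is consistent with the logs above: with the Ray plugin the weights are updated on the worker's copy of the trainer, while the driver keeps the untrained one, whereas vanilla `ddp_spawn` hands the updated state back to the original trainer object.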
tag: #143