ray-project / ray_lightning

PyTorch Lightning Distributed Accelerators using Ray
Apache License 2.0

`ray_ddp`: warning `Leaking Caffe2 thread-pool after fork. (function pthreadpool)` #180

Open JiahaoYao opened 2 years ago

JiahaoYao commented 2 years ago
```
Epoch 0:  81%|████████  | 759/937 [00:05<00:01, 147.85it/s, loss=2.32, v_num=0]
Epoch 0:  82%|████████▏ | 772/937 [00:05<00:01, 147.47it/s, loss=2.31, v_num=0]
Epoch 0:  82%|████████▏ | 773/937 [00:05<00:01, 147.42it/s, loss=2.31, v_num=0]
Epoch 0:  84%|████████▍ | 789/937 [00:05<00:01, 147.51it/s, loss=2.32, v_num=0]
Epoch 0:  84%|████████▍ | 789/937 [00:05<00:01, 147.50it/s, loss=2.32, v_num=0]
Epoch 0:  84%|████████▍ | 790/937 [00:05<00:00, 147.51it/s, loss=2.32, v_num=0]
Epoch 0:  84%|████████▍ | 790/937 [00:05<00:00, 147.50it/s, loss=2.32, v_num=0]
Epoch 0:  86%|████████▌ | 806/937 [00:05<00:00, 147.53it/s, loss=2.33, v_num=0]
Epoch 0:  87%|████████▋ | 819/937 [00:05<00:00, 147.20it/s, loss=2.32, v_num=0]
Epoch 0:  88%|████████▊ | 820/937 [00:05<00:00, 147.15it/s, loss=2.31, v_num=0]
Epoch 0:  89%|████████▉ | 835/937 [00:05<00:00, 147.09it/s, loss=2.32, v_num=0]
Epoch 0:  91%|█████████ | 849/937 [00:05<00:00, 146.90it/s, loss=2.31, v_num=0]
Epoch 0:  91%|█████████ | 850/937 [00:05<00:00, 146.89it/s, loss=2.32, v_num=0]
Epoch 0:  92%|█████████▏| 859/937 [00:05<00:00, 146.94it/s, loss=2.31, v_num=0]
(BaseHorovodWorker pid=38267) 
(BaseHorovodWorker pid=38268) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
(BaseHorovodWorker pid=38267) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
Epoch 0:  93%|█████████▎| 876/937 [00:05<00:00, 146.00it/s, loss=2.31, v_num=0]
Epoch 0:  94%|█████████▎| 877/937 [00:06<00:00, 146.09it/s, loss=2.31, v_num=0]
Epoch 0:  94%|█████████▎| 878/937 [00:06<00:00, 146.17it/s, loss=2.31, v_num=0]
Epoch 0:  94%|█████████▍| 879/937 [00:06<00:00, 146.26it/s, loss=2.31, v_num=0]
Epoch 0:  94%|█████████▍| 880/937 [00:06<00:00, 146.35it/s, loss=2.31, v_num=0]
Epoch 0:  94%|█████████▍| 881/937 [00:06<00:00, 146.42it/s, loss=2.31, v_num=0]
Epoch 0:  94%|█████████▍| 883/937 [00:06<00:00, 146.59it/s, loss=2.31, v_num=0]
Epoch 0:  94%|█████████▍| 884/937 [00:06<00:00, 146.67it/s, loss=2.31, v_num=0]
```

See the related upstream PyTorch issue: https://github.com/pytorch/pytorch/issues/57273