FudanEMWLab opened this issue 4 years ago
@FudanEMWLab Hi, firstly thanks for giving MindSpore a try! Before answering your question, please note that it is not recommended to test the MindSpore examples with GPU (especially with NCCL) directly in the devel environment. Could you transfer the whl package to the mindspore/mindspore-gpu:runtime docker image and retry your code? If the error still occurs, we will check what's going on.
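For reference, a minimal sketch of that transfer-and-retry workflow; the mount path, whl filename, and GPU flag are illustrative, not from this thread:

docker run -it --runtime=nvidia -v /path/to/whl:/pkg mindspore/mindspore-gpu:runtime /bin/bash
# inside the container, install the transferred package (filename pattern is hypothetical)
pip install /pkg/mindspore_gpu-*.whl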
If you want to try some test code for the multi-GPU scenario, please try https://github.com/mindspore-ai/mindspore/tree/master/tests/st/nccl, e.g.:
mpirun -n 8 pytest -s test_nccl_reduce_scatter_op.py
@FudanEMWLab PaddlePaddle supports multi-GPU training pretty well; you can refer to https://github.com/PaddlePaddle/Fleet/ for more details.
Hi @nizhaoqiao, even though your GitHub account is oddly empty, I guess you are a developer participating in the Paddle community, so you are more than welcome to join the conversation here in MindSpore! Open source is all about camaraderie and friendship :)
I'm interested in what you referred to as "pretty well" in:
PaddlePaddle Supports Multi-GPU Training pretty well.
I've looked up some of the benchmarks I could find. For example, Fleet's number looks like about 2000 for 8 cards, which, for the sake of argument under a linear-scaling assumption, works out to around 250 (2000 / 8) for single-GPU performance.
Another benchmark suggested that for Paddle 1.5 it is around 168 for single-GPU performance and around 840 for 8 cards in a single process (the comparison was made against PyTorch v1.1.0, which is not a very new version).
There is an article by a developer who independently ran benchmarks on MindSpore and PyTorch 1.5, on Ascend 910 and 2080 Ti/Tesla respectively. It shows that MindSpore, without any targeted optimization, reaches around 230 on a single GPU. It would be great if you or other developers could run a multi-GPU benchmark; I would guess the numbers would be pretty good.
I think that for MindSpore, a newly open-sourced framework, to be on par with PaddlePaddle, a great four-year-old open source framework, on the type of hardware that is not the primary focus of MindSpore's support, we could probably agree that:
MindSpore does really well
Just some thoughts :) Welcome to participate in our community more often :)
@hannibalhuang Hahaha, don't be so nervous, bro, I mean no malice~ I do agree MindSpore does really well on Ascend 910! Let's work hard to build more competitive solutions for the community and developers~ :)
Cannot agree more with the last sentence :) Just a quick response: I don't know where you picked up "nervous" from my reply, which was just a standard open source community exchange, and I didn't imply in any way that you acted with a "malevolent" attitude. Malice is too strong a word for open source discussions :)
Anyway, you're welcome to provide your own benchmarks running MindSpore on multiple GPUs as I suggested; I think it'll run pretty well :P
👍 Keep it up!~
I built the source in a docker environment based on the dockerfile (docker/mindspore-gpu/devel/Dockerfile) and tried some tests under mindspore/tests/ut/python/parallel. I modified the tests by adding two lines (sketched below).
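The two lines themselves were not captured in this thread, so the sketch below is a guess at the standard GPU/NCCL setup in MindSpore's API of that era; treat both lines as assumptions:

from mindspore import context
from mindspore.communication.management import init

context.set_context(device_target="GPU")  # assumed addition: select the GPU backend
init("nccl")  # assumed addition: initializes NCCL and reaches InitCollective from the traceback below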
I used the command below to build the source.
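The command itself was not captured either; a typical source build for the GPU backend with MindSpore's build.sh looked roughly like the line below, where -M on (enabling MPI, which distributed GPU support reportedly required) and the -j8 thread count are my assumptions:

bash build.sh -e gpu -M on -j8  # -e gpu selects the GPU backend; -M on and -j8 are assumed flags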
I checked the folder where the package was installed; the libgpu_collective.so that failed to load during NCCL initialization is there.
The tests failed with the error messages below. Is there a guide for running MindSpore with multiple GPUs and different parallel modes?
Thanks
################## Error Message ####################
        elif backend_name == "nccl":
../../../../mindspore/communication/management.py:69: RuntimeError
-------------------- Captured stderr call --------------------
[ERROR] ME(102,python):2020-04-19-14:19:19.404.520 [mindspore/ccsrc/device/gpu/distribution/collective_init.cc:35] InitCollective] Loading libgpu_collective.so failed. Many reasons could cause this:
1. libgpu_collective.so is not installed.
2. nccl is not installed or found.
3. mpi is not installed or found
==================== short test summary info ====================
FAILED test_matmul_tensor.py::test_two_matmul - RuntimeError: mindspore/ccsrc/device/gpu/distribution/collective_init.cc:35 InitCollective] Loading libgpu_collective.so failed. Many reasons c...
==================== 1 failed in 0.96s ====================
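Since the error names three candidate causes (missing libgpu_collective.so, NCCL, or MPI), one quick way to narrow them down is to inspect the library's dynamic dependencies; a sketch assuming a Linux container, with an illustrative install path:

# locate the installed library (the path below is hypothetical)
find /usr/local -name libgpu_collective.so 2>/dev/null
# any "not found" lines here identify the missing dependency
ldd /usr/local/lib/python3.7/dist-packages/mindspore/lib/libgpu_collective.so
# confirm the NCCL and MPI shared libraries are visible to the loader
ldconfig -p | grep -E "libnccl|libmpi"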