Have you ever monitored your machine's memory usage? This could potentially be caused by running out of memory.
Hi, @chenshi3
It does not seem to be an out-of-memory problem. My machine has 256 GB of memory in total, and although I have not monitored it closely, the maximum memory usage when running with `torch.multiprocessing` is around 133 GB, which is only about half of the total.
And each GPU has about 24 GB of memory, of which about 10 GB is occupied per GPU.
@chenshi3 And I have a question: is it necessary to use `torch.multiprocessing` to start multiprocessing?
I see that in many DDP best-practice examples, the `mp.set_start_method()` call is optional, and not all of them use it.
And in my case, setting `mp.set_start_method('spawn')` indeed makes the program run about 2x faster.
Is there any reason for this?
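For context, the pattern I see in most DDP examples looks like the minimal sketch below (my own reconstruction, not OpenPCDet code): under `torchrun` one process per GPU already exists, so no `mp.set_start_method()` call appears; as far as I understand, the start method then mainly affects how `DataLoader` worker processes are created.

```python
# Minimal DDP sketch, assuming a torchrun launch (not OpenPCDet code):
# torchrun spawns one process per GPU itself, so no mp.set_start_method()
# call is needed here; LOCAL_RANK is set by torchrun.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    model = DDP(torch.nn.Linear(8, 8).cuda(), device_ids=[local_rank])
    # ... training loop goes here ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```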
I'm sorry, but I'm not an expert on this issue and am unable to provide valuable comments.
Hi @chenshi3 , could you please provide an example command of torchrun? I am using the pytorch launcher for the first time and it would be really helpful to have the command. Thanks again~!
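For reference, this is the shape I have pieced together so far (the config path is just a placeholder, and I am not certain every flag is right):

```bash
# Hypothetical example, assuming OpenPCDet's tools/train.py and 2 GPUs
# on one machine; substitute your own --cfg_file.
cd tools
torchrun --standalone --nproc_per_node=2 train.py \
    --launcher pytorch \
    --cfg_file cfgs/kitti_models/pv_rcnn.yaml
```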
Hi, @sshaoshuai @chenshi3 @jihanyang @yukang2017 @djiajunustc @Gus-Guo @Cedarch @acivgin1
I am stuck at an NCCL problem. If I use multiple GPUs on a single machine, it hits an NCCL communication error:
[Rank 1] Watchdog caught collective operation timeout:
Location: The place where the program gets stuck and raises the timeout error is random, but most of the time it is at `dist.all_reduce()` or some `all_gather` call. It looks like an NCCL communication problem.
Special Features: When the program runs normally, the utilization of the GPUs varies over time. But when the program is stuck on this problem, the utilization of all used GPUs stays at 100% until the timeout error is raised.
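For what it is worth, this symptom (all GPUs pinned at 100% until the watchdog fires) is exactly what a desynchronized collective looks like. A minimal standalone sketch, not from OpenPCDet, that reproduces it when launched with `torchrun --nproc_per_node=2`:

```python
# Desynchronized-collective sketch: rank 0 enters dist.all_reduce() but
# rank 1 never joins, so rank 0's NCCL kernel spins at 100% GPU until the
# watchdog raises "collective operation timeout".
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    t = torch.ones(1, device="cuda")
    if dist.get_rank() == 0:
        dist.all_reduce(t)  # blocks: the other rank skipped this collective
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```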
Some Trials: I have done some trials to figure out what is happening and to save the developers time locating the problem. The most meaningful finding is the following:
In pcdet/utils/common_utils.py, we have the code:
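```python
# The two lines in question, in init_dist_pytorch() (reconstructed from
# memory since the snippet did not survive the paste; the exact code may
# differ slightly across versions). Here `mp` is torch.multiprocessing.
if mp.get_start_method(allow_none=True) is None:
    mp.set_start_method('spawn')
```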
If I delete these two lines of code, the program runs normally without any NCCL timeout error for several hours, whereas if I use `torch.multiprocessing`, the NCCL timeout error is generally hit within 1 hour.
And if I DISABLE `torch.multiprocessing`, the ETA of the program becomes about 2x higher, similar to running with a single GPU (this may be related to the number of GPUs, as I am using 2), and only about 43 GB of memory is used, instead of the roughly 133 GB used when `torch.multiprocessing` is ENABLED.
In conclusion, the problem seems related to the mixed use of both `torchrun` and `torch.multiprocessing`. The mixed use causes higher memory usage and higher speed, but hits a random NCCL `collective operations timeout` error within a short time. Using only `torchrun` seems to get rid of the NCCL timeout error, but it runs slower and uses much less memory.
Below are some detailed logs about this issue.
Environment:
Log: The complete log: openpcdet_2gpu_mixed_torchrun_mp_log.txt
The most important parts are shown below.
Strange Behavior Capture:
The utilization of both GPUs is stuck at 100%; at the beginning the program ran normally and the utilization varied. GPU 1 seems to be sending and receiving something, since its TX and RX use large bandwidth, but GPU 0 shows almost no activity, with much lower bandwidth for both TX and RX. And both processes are using 100% CPU.