Closed: Xjmengnieer closed this issue 7 months ago
Through debugging I found that the problem occurs here in DDP: once DDP is applied on each card, the memory usage on card 0 increases significantly.
This is before DDP:
This is GPUs 0 and 1 after DDP:
And this is GPUs 0, 1, and 2 after DDP:
It seems like the model has been copied to GPU 0!
Hi, please check if all the losses are computed on GPU0.
No, I debugged step by step; the losses are on their respective GPUs:
The following is a scenario where only three GPUs are used.
As can be seen, the occupancy of card 0 is still much higher than that of the other cards, but it is proportional to the number of cards.
The occupancy of GPU 0 has been higher than the other GPUs ever since DDP was initialized, and everything seems normal during training.
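A quick way to confirm the replicas really live on separate cards is to print each rank's parameter devices. This is a minimal sketch; `report_param_devices`, the `Linear` stand-in, and the rank handling are illustrative names, not repo code:

```python
import torch

def report_param_devices(model: torch.nn.Module, rank: int) -> set:
    """Return the set of devices this rank's parameters actually live on.

    With a correct one-process-per-GPU DDP setup, rank r should report
    exactly {device(type='cuda', index=r)}. Plug in the real model and
    local rank in place of these illustrative arguments.
    """
    devices = {p.device for p in model.parameters()}
    print(f"rank {rank}: {devices}")
    return devices
```

If any rank reports a `cuda:0` entry it should not have, something in that process (model construction, checkpoint loading, loss tensors) is still touching GPU 0.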
Yes. There is a lot of discussion on the internet about unbalanced GPU usage with DDP.
Could it be because your model is built in blocks rather than as a single complete model?
I don't think so. The full name of DDP is Distributed Data Parallel, so you are actually parallelizing your data, not the model.
But now the problem seems to have nothing to do with the data.
https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel
This container provides data parallelism by synchronizing gradients across each model replica.
To use DistributedDataParallel on a host with N GPUs, you should spawn up N processes, ensuring that each process exclusively works on a single GPU from 0 to N-1.
Yes, I have followed this requirement:
If possible, could you also try implementing a distributed training script to see if the same problem occurs? I would be deeply grateful.
Unfortunately, I am currently working on other projects and do not have time to implement this feature. Maybe I will have a look at it when I have some free time. If you solve this problem, feel free to share your solutions.
OK, thanks.
I have identified such an error, but I am not sure which parameter of this stage it is. Do you know?
It should be in the depth_encoder module.
Can't layer-norm be used in distributed training?
Hi, is it the LayerNorm used in the LGFI module? Have you modified other parts of the structure?
I found another issue that might be related.
No, I haven't modified other parts of the structure! I think the layer_norm operation itself does not involve distributed communication; therefore, when doing DDP distributed training, `find_unused_parameters` must be set to True.
Yes, it's possible. See this link.
This layer uses statistics computed from input data in both training and evaluation modes.
I don't know if this is the problem.
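As a sanity check on that doc quote: LayerNorm computes its statistics per sample, so unlike BatchNorm (which needs SyncBatchNorm under DDP) there is nothing to synchronize across processes. A minimal sketch:

```python
import torch

# LayerNorm normalises each sample over its own feature dimension, so
# processes seeing different mini-batches have nothing to synchronise.
ln = torch.nn.LayerNorm(4)  # default init: weight = 1, bias = 0
x = torch.randn(3, 4)
y = ln(x)
# Each row is normalised independently: per-sample mean is ~0,
# regardless of what any other process computes.
assert torch.allclose(y.mean(dim=-1), torch.zeros(3), atol=1e-5)
```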
The layer-norm operation is indeed a problem. After I commented it out, a new problem appeared; it seems some parameters were left out of the loss computation.
Are you loading the ImageNet pre-trained weights? The fc layer seems to be the last unused layer in the pre-trained model.
It should be. And the occupancy of card 0 is still much higher than that of the other cards; we still don't know where the cause lies.
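One cause worth ruling out (an assumption, not a confirmed diagnosis for this repo): if every rank calls `torch.load` on a checkpoint saved from `cuda:0` without a `map_location`, or touches CUDA before `torch.cuda.set_device`, each process allocates memory on GPU 0. A hedged sketch, with `load_checkpoint_for_rank` as an illustrative name:

```python
import torch

def load_checkpoint_for_rank(path: str, local_rank: int):
    """Load a checkpoint onto this rank's own device.

    Without an explicit map_location, torch.load restores tensors to the
    device they were saved from (often cuda:0), so every DDP process
    would allocate memory on GPU 0.
    """
    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)  # pin the CUDA context first
        map_location = {"cuda:0": f"cuda:{local_rank}"}
    else:
        map_location = "cpu"
    return torch.load(path, map_location=map_location)
```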
Hi, I'm also trying to set up distributed training, and my code is just like the issue author's. I recently encountered a warning that I'd like to ask about. The warning message is as follows:
/home/dingyl/miniconda3/envs/litemono/lib/python3.7/site-packages/torch/autograd/__init__.py:175: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance. grad.sizes() = [48, 64, 1, 1], strides() = [64, 1, 64, 64] bucket_view.sizes() = [48, 64, 1, 1], strides() = [64, 1, 1, 1] (Triggered internally at /opt/conda/conda-bld/pytorch_1659484809535/work/torch/csrc/distributed/c10d/reducer.cpp:312.) allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
Could you kindly provide some insights into what might be causing this warning? Furthermore, I haven't encountered high GPU0 occupancy issues, but DDP seems to have a minimal impact on training speed.
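For what it's worth, that warning means some gradients arrive with different strides (here [64, 1, 64, 64], a channels-last-like layout) than the contiguous parameter buckets DDP registered at construction; it degrades performance but is not an error. One commonly suggested mitigation (an assumption, not a verified fix for Lite-Mono) is to force one consistent memory format before wrapping the model in DDP:

```python
import torch

def unify_memory_format(model: torch.nn.Module) -> torch.nn.Module:
    """Force all parameters into the default contiguous layout.

    Applying one consistent memory format *before* constructing DDP keeps
    the gradients' strides matching the bucket views, so the warning
    should not fire. (Illustrative helper, not repo code.)
    """
    for p in model.parameters():
        p.data = p.data.contiguous()
    return model.to(memory_format=torch.contiguous_format)
```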
Hello author, I have made the modifications according to your suggestions to achieve distributed training, but the following very strange phenomenon occurred:
This is my startup command:
And these are the modified code sections:
The following is the GPU usage:
It can be clearly seen that card 0 is using far too much memory!
I don't know what went wrong; I hope you can help me.