noahzn / Lite-Mono

[CVPR2023] Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation

About distributed training #43

Closed Xjmengnieer closed 7 months ago

Xjmengnieer commented 1 year ago

Hello author, I have made the modifications according to your suggestions to enable distributed training!

But the following very strange phenomenon occurred:

This is my startup command: (image)

And this is the modified code: (images)

The following shows the GPU usage: (image)

It can clearly be seen that GPU 0 is using far more memory than the other cards!

I don't know what went wrong. I hope you can help me.

Xjmengnieer commented 1 year ago

And I found through debugging that the problem occurs at the DDP wrapping step: each time a card's model is wrapped with DDP, the memory usage of GPU 0 increases significantly.

This is before DDP: (image)

This is GPUs 0 and 1 after DDP: (image)

And this is all of GPUs 0, 1, and 2 after DDP:

(image)

It seems like the model has been copied to GPU 0!

noahzn commented 1 year ago

Hi, please check if all the losses are computed on GPU0.
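
A quick per-rank check might look like this (a minimal sketch; `losses` is an assumed dict of loss tensors from the training loop):

```python
import torch.distributed as dist

# Print which device each loss tensor lives on for this rank,
# to confirm that no process accidentally puts its losses on cuda:0.
rank = dist.get_rank()
for name, value in losses.items():
    print(f"rank {rank}: loss '{name}' is on {value.device}")
```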

Xjmengnieer commented 1 year ago

> Hi, please check if all the losses are computed on GPU0.

No, I debugged step by step; they are on their respective GPUs.

The following is a scenario where only three GPUs are used.

As can be seen, the memory usage of GPU 0 is still much higher than that of the other cards, but it is proportional to the number of cards. (image)

Xjmengnieer commented 1 year ago

The memory usage of GPU 0 has been higher than that of the other GPUs ever since the DDP wrapping, but everything seems normal during the training process.

(image)

noahzn commented 1 year ago

Yes. There is a lot of discussion on the internet about unbalanced GPU usage with DDP.
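
One commonly discussed cause of this pattern is every process creating a CUDA context on GPU 0, for example by calling `.cuda()` before selecting a device, or by `torch.load` mapping a checkpoint onto `cuda:0`. A minimal sketch of guarding against both (assuming a `local_rank` value from the launcher; the checkpoint name is hypothetical):

```python
import torch

# Pin this process to its own GPU before any CUDA call, so tensors created
# with .cuda() do not silently land on cuda:0.
torch.cuda.set_device(local_rank)
device = torch.device(f"cuda:{local_rank}")

# Map loaded weights onto this process's device instead of the
# cuda:0 they may have been saved from.
state_dict = torch.load("encoder.pth", map_location=device)
```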

Xjmengnieer commented 1 year ago

> Yes. There is a lot of discussion on the internet about unbalanced GPU usage with DDP.

Could it be because your model is built as separate blocks, rather than as one complete model?

noahzn commented 1 year ago

I don't think so. The full name of DDP is Distributed Data Parallel, so you are actually parallelizing your data, not the model.
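
In other words, each process holds a full replica of the model and only the data is sharded, typically via a `DistributedSampler`. A sketch under assumed names (`train_dataset`, batch size, `num_epochs`):

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Each process sees a different shard of the same dataset;
# the model itself is fully replicated on every GPU.
sampler = DistributedSampler(train_dataset, shuffle=True)
train_loader = DataLoader(train_dataset, batch_size=12, sampler=sampler,
                          num_workers=4, pin_memory=True, drop_last=True)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # re-shuffle the shards each epoch
    for inputs in train_loader:
        ...  # forward/backward as usual
```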

Xjmengnieer commented 1 year ago

> I don't think so. The full name of DDP is Distributed Data Parallel, so you are actually parallelizing your data, not the model.

But now the problem seems to have nothing to do with the data.

noahzn commented 1 year ago

https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel

> This container provides data parallelism by synchronizing gradients across each model replica. To use DistributedDataParallel on a host with N GPUs, you should spawn up N processes, ensuring that each process exclusively works on a single GPU from 0 to N-1.
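
Concretely, a script launched with `torchrun --nproc_per_node=N` would follow that rule roughly like this (a sketch; `build_model()` is only a placeholder for however the networks are constructed):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# One process per GPU, e.g. launched with: torchrun --nproc_per_node=4 train.py
local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
dist.init_process_group(backend="nccl")
torch.cuda.set_device(local_rank)

model = build_model().cuda(local_rank)       # placeholder for the actual networks
model = DDP(model, device_ids=[local_rank], output_device=local_rank)
```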

Xjmengnieer commented 1 year ago

Yes, I have followed this requirement:

(image)

Xjmengnieer commented 1 year ago

If possible, could you also try implementing a distributed training script to see if the same problem occurs?

I would be deeply grateful.

noahzn commented 1 year ago

> If possible, could you also try implementing a distributed training script to see if the same problem occurs?
>
> I would be deeply grateful.

Unfortunately, I am currently working on other projects and do not have time to implement this feature. Maybe I will have a look at it when I have some free time. If you solve this problem, feel free to share your solutions.

Xjmengnieer commented 1 year ago

OK, thanks.

Xjmengnieer commented 1 year ago

I have identified this error, but I am not sure which parameter of this stage it refers to. Do you know? (image)

Xjmengnieer commented 1 year ago

It should be in the depth_encoder module.

Xjmengnieer commented 1 year ago

Can't layer_norm be used in distributed training?

noahzn commented 1 year ago

Hi, is it the LayerNorm used in the LGFI module? Have you modified other parts of the structure?

I found another issue that might be related.

Xjmengnieer commented 1 year ago

> Hi, is it the LayerNorm used in the LGFI module? Have you modified other parts of the structure?
>
> I found another issue that might be related.

No, I haven't modified other parts of the structure! I think the layer_norm operation itself does not involve distributed communication; therefore, when doing DDP distributed training, `find_unused_parameters` must be set to True.
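
For reference, the flag goes into the DDP wrapper itself (a sketch; `depth_encoder` and `local_rank` stand in for the actual variables):

```python
from torch.nn.parallel import DistributedDataParallel as DDP

# find_unused_parameters=True lets the reducer skip parameters that did not
# contribute to the loss in this iteration, at the cost of some extra overhead.
depth_encoder = DDP(depth_encoder, device_ids=[local_rank],
                    find_unused_parameters=True)
```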

noahzn commented 1 year ago

Yes, it's possible. See this link.

> This layer uses statistics computed from input data in both training and evaluation modes.

I don't know if this is the problem.

Xjmengnieer commented 1 year ago

The layer_norm operation is indeed a problem. After I commented it out, a new problem appeared; it seems that something was omitted in the loss computation.

(image)

noahzn commented 1 year ago

Are you loading the ImageNet pre-trained weights? The fc layer seems to be the last unused layer in the pre-trained model.
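
If that is the case, one option is to drop the classification head before loading (a minimal sketch; the checkpoint file name and the `model`/`fc.` key layout are assumptions):

```python
import torch

# Load the ImageNet checkpoint and drop the classification head, which is not
# used for depth training and would otherwise be reported as an unused parameter.
checkpoint = torch.load("lite-mono-pretrain.pth", map_location="cpu")
state_dict = checkpoint.get("model", checkpoint)
filtered = {k: v for k, v in state_dict.items() if not k.startswith("fc.")}
depth_encoder.load_state_dict(filtered, strict=False)
```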

Xjmengnieer commented 1 year ago

It should be. And the memory usage of GPU 0 is still much higher than that of the other cards, and we still don't know the reason.

owl-of-pastos commented 1 year ago

Hi, I'm also trying to achieve distributed training and my code is just like the issue's author's. I recently encountered a warning that I'd like to inquire about. The warning message is as follows:

> /home/dingyl/miniconda3/envs/litemono/lib/python3.7/site-packages/torch/autograd/__init__.py:175: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance. grad.sizes() = [48, 64, 1, 1], strides() = [64, 1, 64, 64] bucket_view.sizes() = [48, 64, 1, 1], strides() = [64, 1, 1, 1] (Triggered internally at /opt/conda/conda-bld/pytorch_1659484809535/work/torch/csrc/distributed/c10d/reducer.cpp:312.) allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass

Could you kindly provide some insights into what might be causing this warning? Furthermore, I haven't encountered high GPU0 occupancy issues, but DDP seems to have a minimal impact on training speed.

noahzn commented 1 year ago

> UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed.

Hi @owl-of-pastos, this might be helpful.
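
For completeness, one workaround that is sometimes suggested for this warning is to keep the model in the default contiguous memory format before wrapping it in DDP, so the gradient strides match the bucket views. This is only an assumption about the cause here, not a confirmed fix:

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# Force parameters back to the default (contiguous) memory format so the
# gradients produced in backward match the strides of DDP's bucket views.
model = model.to(memory_format=torch.contiguous_format)
model = DDP(model, device_ids=[local_rank])
```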