open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. #10515

Open notfacezhi opened 1 year ago

notfacezhi commented 1 year ago

After adding my own modifications to the official code, I found that training runs in single-card mode but fails in the multi-card (distributed) case. What should I do?

*(screenshot of the RuntimeError traceback)*

hhaAndroid commented 1 year ago

@notfacezhi You can add find_unused_parameters=True in the configuration.
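For reference, MMDetection configs are plain Python files, so the flag is a single top-level assignment added to your own config (the filename below is just a placeholder):

```python
# e.g. in your config file such as configs/my_config.py
_base_ = ['./faster-rcnn_r50_fpn_1x_coco.py']  # placeholder base config

# Tell DistributedDataParallel to tolerate parameters that receive no
# gradient in a given iteration (common when you add branches/heads that
# are not used on every forward pass).
find_unused_parameters = True
```

Note this makes DDP traverse the autograd graph every iteration to find unused parameters, which adds some overhead, so it is worth checking later whether all your added parameters actually contribute to the loss.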

notfacezhi commented 1 year ago

Thank you! I tried it and found it works fine. Wish you a happy life!

@hhaAndroid

James-S-choi commented 1 year ago

> @notfacezhi You can add find_unused_parameters=True in the configuration.

Hi @hhaAndroid , I met the same error: my modifications to the official code run in single-card mode but fail in the multi-card case. I added `find_unused_parameters=True` to the configuration, but then a new error appeared:

> RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the forward function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop. 2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple checkpoint functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.

Why does this happen, and how can I solve it? Thank you!

Cloud65000 commented 6 months ago

> Hi @hhaAndroid , I met the same error. […] However, a new error showed: RuntimeError: Expected to mark a variable ready only once. […] Why is this and how can I solve this?

Have you solved the problem? I am running into the same issue now.