Hi again @blizda!
So there's one line in the error you posted that deserves attention (lol). I've split it into multiple lines for readability:
```
RuntimeError: Expected to mark a variable ready only once. This error is caused
by one of the following reasons:
1) Use of a module parameter outside the `forward` function. Please make sure
model parameters are not shared across multiple concurrent forward-backward
passes
2) Reused parameters in multiple reentrant backward passes. For example, if you
use multiple `checkpoint` functions to wrap the same part of your model, it
would result in the same set of parameters been used by different reentrant
backward passes multiple times, and hence marking a variable ready multiple
times. DDP does not support such use cases yet.
3) Incorrect unused parameter detection. The return value of the `forward`
function is inspected by the distributed data parallel wrapper to figure out if
any of the module's parameters went unused. For unused parameters, DDP would
not expect gradients from then. However, if an unused parameter becomes part of
the autograd graph at a later point in time (e.g., in a reentrant backward when
using `checkpoint`), the gradient will show up unexpectedly. If all parameters
in the model participate in the backward pass, you can disable unused parameter
detection by passing the keyword argument `find_unused_parameters=False` to
`torch.nn.parallel.DistributedDataParallel`.
```
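(As an aside, in case it's useful: the `find_unused_parameters=False` workaround mentioned under case 3 is just a keyword argument on the DDP constructor. A minimal sketch, where `model` and `local_rank` are placeholders for your own module and GPU index:

```python
from torch.nn.parallel import DistributedDataParallel as DDP

# Only safe when every parameter really does get a gradient each step;
# otherwise DDP will wait forever on gradients that never arrive.
ddp_model = DDP(
    model.cuda(),                  # `model`: placeholder for your module
    device_ids=[local_rank],       # `local_rank`: this process's GPU index
    find_unused_parameters=False,  # disable unused-parameter detection
)
```
)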
Here, I think the error falls under case 2. If you look at your config, you can see that you have `checkpoint_level="C2"` set, which enables gradient checkpointing on each forward pass. You have also selected a parameter sharing option that is not `"none"`, which means the same parameters are called more than once in each forward pass. So what ends up happening is that DDP marks these same parameters ready multiple times, and, as the message says, DDP does not support this use case yet. This is not an issue with the model here per se, but with the DDP module that torch offers; a toy example of this failure mode is sketched below.
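To make case 2 concrete, here is a hypothetical minimal module (not your actual model) that hits the same failure mode: one set of weights is wrapped in `torch.utils.checkpoint` and called twice per forward, so under DDP the same parameters are marked ready in two reentrant backward passes:

```python
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class SharedCheckpointedModel(nn.Module):
    """Toy illustration of case 2: shared weights + gradient checkpointing."""

    def __init__(self, dim=16):
        super().__init__()
        # One Linear layer standing in for a parameter-shared transformer block.
        self.shared_block = nn.Linear(dim, dim)

    def forward(self, x):
        # Each checkpointed call re-runs shared_block in its own reentrant
        # backward pass, so DDP sees the same parameters marked "ready" twice.
        x = checkpoint(self.shared_block, x)
        x = checkpoint(self.shared_block, x)
        return x
```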
What you can do, if you still want to use my code with DDP while sharing parameters, is to set `checkpoint_level="C0"`. This disables checkpointing entirely; I tested it (along with `parameter_sharing="none"`) and it works just fine with the code above. Unfortunately, when using any parameter sharing with DDP, one cannot checkpoint yet, and this seems like an issue that the people of torch will have to solve on their own.
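For concreteness, here is roughly what the working combination looks like. `build_model` and `local_rank` are hypothetical stand-ins for however your script constructs the model and picks its GPU; only the two config values are taken from the discussion above:

```python
from torch.nn.parallel import DistributedDataParallel as DDP

# `build_model` is a placeholder for your actual model setup;
# the combination of config values is the point.
model = build_model(
    checkpoint_level="C0",          # no checkpointing -> no reentrant backward passes
    parameter_sharing="layerwise",  # sharing should now be safe under DDP
)
ddp_model = DDP(model.cuda(), device_ids=[local_rank])
```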
If this remains a problem for you, try opening an issue there and they might get working on it :+1: I hope that I answered all your questions!
Thanks, and sorry for the stupid questions. With C0 everything is going fine, thank you a lot.
Hi, I'm trying to run informer training with DistributedDataParallel and `parameter_sharing="layerwise"`, and I get this error.
Code for reproducing
Also, this issue reproduces with any parameter sharing option besides `"none"`.