tatp22 / linformer-pytorch

My take on a practical implementation of Linformer for Pytorch.
https://arxiv.org/pdf/2006.04768.pdf
MIT License
403 stars 36 forks

Error with DistributedDataParallel and parameter_sharing="layerwise" #23

Closed blizda closed 3 years ago

blizda commented 3 years ago

Hi, I am trying to run Linformer training with DistributedDataParallel and parameter_sharing="layerwise", and I get this error:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/jovyan/nlpdata/test_ddp_vanila_torch.py", line 95, in demo_basic
    loss_fn(output, labels).backward()
  File "/opt/conda/lib/python3.6/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/__init__.py", line 127, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases yet.3) Incorrect unused parameter detection. The return value of the `forward` function is inspected by the distributed data parallel wrapper to figure out if any of the module's parameters went unused. For unused parameters, DDP would not expect gradients from then. However, if an unused parameter becomes part of the autograd graph at a later point in time (e.g., in a reentrant backward when using `checkpoint`), the gradient will show up unexpectedly. If all parameters in the model participate in the backward pass, you can disable unused parameter detection by passing the keyword argument `find_unused_parameters=False` to `torch.nn.parallel.DistributedDataParallel`.
Exception raised from mark_variable_ready at ../torch/csrc/distributed/c10d/reducer.cpp:484 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7f62b61fd99b in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10d::Reducer::mark_variable_ready(c10d::Reducer::VariableIndex) + 0xbe7 (0x7f62ef7edac7 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #2: c10d::Reducer::autograd_hook(c10d::Reducer::VariableIndex) + 0x93 (0x7f62ef7ede23 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0xad2006 (0x7f62ef7ee006 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0xad902a (0x7f62ef7f502a in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x4f9 (0x7f62ea50b889 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #6: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x4b4 (0x7f62ea50d3f4 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #7: torch::autograd::Engine::execute_with_graph_task(std::shared_ptr<torch::autograd::GraphTask> const&, std::shared_ptr<torch::autograd::Node>) + 0x33c (0x7f62ea50aa1c in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #8: torch::autograd::python::PythonEngine::execute_with_graph_task(std::shared_ptr<torch::autograd::GraphTask> const&, std::shared_ptr<torch::autograd::Node>) + 0x4c (0x7f62ef2495bc in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #9: torch::autograd::Engine::execute(std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, bool, bool, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&) + 0x82f (0x7f62ea509d5f in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #10: torch::autograd::python::PythonEngine::execute(std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, bool, bool, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&) + 0x74 (0x7f62ef2492f4 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #11: THPEngine_run_backward(THPEngine*, _object*, _object*) + 0xa10 (0x7f62ef24a070 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #12: _PyCFunction_FastCallDict + 0x154 (0x5572c4395304 in /opt/conda/bin/python)
frame #13: _PyCFunction_FastCallKeywords + 0x50 (0x5572c43c1cd0 in /opt/conda/bin/python)
frame #14: <unknown function> + 0x199b0c (0x5572c441cb0c in /opt/conda/bin/python)
frame #15: _PyEval_EvalFrameDefault + 0x10c9 (0x5572c44405d9 in /opt/conda/bin/python)
frame #16: <unknown function> + 0x192f26 (0x5572c4415f26 in /opt/conda/bin/python)
frame #17: <unknown function> + 0x193f31 (0x5572c4416f31 in /opt/conda/bin/python)
frame #18: <unknown function> + 0x199be5 (0x5572c441cbe5 in /opt/conda/bin/python)
frame #19: _PyEval_EvalFrameDefault + 0x30a (0x5572c443f81a in /opt/conda/bin/python)
frame #20: PyEval_EvalCodeEx + 0x329 (0x5572c4417a49 in /opt/conda/bin/python)
frame #21: <unknown function> + 0x195864 (0x5572c4418864 in /opt/conda/bin/python)
frame #22: PyObject_Call + 0x3e (0x5572c439510e in /opt/conda/bin/python)
frame #23: _PyEval_EvalFrameDefault + 0x1aaf (0x5572c4440fbf in /opt/conda/bin/python)
frame #24: <unknown function> + 0x192f26 (0x5572c4415f26 in /opt/conda/bin/python)
frame #25: _PyFunction_FastCallDict + 0x1be (0x5572c441740e in /opt/conda/bin/python)
frame #26: _PyObject_FastCallDict + 0x26f (0x5572c43956cf in /opt/conda/bin/python)
frame #27: _PyObject_Call_Prepend + 0x63 (0x5572c439a143 in /opt/conda/bin/python)
frame #28: PyObject_Call + 0x3e (0x5572c439510e in /opt/conda/bin/python)
frame #29: torch::autograd::PyNode::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x193 (0x7f62ef2519f3 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #30: <unknown function> + 0x29d82c5 (0x7f62ea5112c5 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #31: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x14a8 (0x7f62ea50c838 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #32: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x4b4 (0x7f62ea50d3f4 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #33: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x99 (0x7f62ea504ec9 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #34: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x5a (0x7f62ef24905a in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #35: <unknown function> + 0xbd6df (0x7f62fb49b6df in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #36: <unknown function> + 0x76db (0x7f6318d876db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #37: clone + 0x3f (0x7f6318ab0a3f in /lib/x86_64-linux-gnu/libc.so.6)

Code for reproducing

import os
import tempfile
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
import torch.multiprocessing as mp
from linformer_pytorch import LinformerLM
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'

    # initialize the process group
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

def demo_basic(rank, world_size):
    print(f"Running basic DDP example on rank {rank}.")
    setup(rank, world_size)

    # create model and move it to GPU with id rank

    model = LinformerLM(
            num_tokens=30522,  # Number of tokens in the LM
            input_size=5120,  # Dimension 1 of the input
            channels=128,  # Dimension 2 of the input
            dim_d=None,  # Overwrites the inner dim of the attention heads. If None, sticks with the recommended channels // nhead, as in the "Attention is all you need" paper
            dim_k=128,  # The second dimension of the P_bar matrix from the paper
            dim_ff=128,  # Dimension in the feed forward network
            dropout_ff=0.15,  # Dropout for feed forward network
            nhead=16,  # Number of attention heads
            depth=12,  # How many times to run the model
            dropout=0.1,  # How much dropout to apply to P_bar after softmax
            activation="gelu",  # What activation to use. Currently, only gelu and relu supported, and only on ff network.
            checkpoint_level="C2",  # What checkpoint level to use. For more information, see below.
            parameter_sharing="layerwise",  # What level of parameter sharing to use. For more information, see below.
            k_reduce_by_layer=0,  # Going down `depth`, how much to reduce `dim_k` by, for the `E` and `F` matrices. Will have a minimum value of 1.
            full_attention=False,  # Use full attention instead, for O(n^2) time and space complexity. Included here just for comparison
            include_ff=True,  # Whether or not to include the Feed Forward layer
            w_o_intermediate_dim=None,  # If not None, have 2 w_o matrices, such that instead of `dim*nhead,channels`, you have `dim*nhead,w_o_int`, and `w_o_int,channels`
            emb_dim=128,  # If you want the embedding dimension to be different than the channels for the Linformer
        ).to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    loss_fn = nn.CrossEntropyLoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    outputs = ddp_model(torch.randint(20000, (3, 5120)))
    labels = torch.randint(20000, (3, 5120)).to(rank)
    loss_mx = labels != -100
    output = outputs[loss_mx].view(-1, 30522)
    labels = labels[loss_mx].view(-1)
    loss_fn(output, labels).backward()
    optimizer.step()

    cleanup()

def run_demo(demo_fn, world_size):
    mp.spawn(demo_fn,
             args=(world_size,),
             nprocs=world_size,
             join=True)

if __name__ == "__main__":
    run_demo(demo_basic, 2)

Also, this issue reproduces with any parameter sharing option besides "none".

tatp22 commented 3 years ago

Hi again @blizda!

So there's one line in the error you gave that garners attention (lol). I split it into multiple lines so it's more readable:

RuntimeError: Expected to mark a variable ready only once. This error is caused
by one of the following reasons:

1) Use of a module parameter outside the `forward` function. Please make sure
model parameters are not shared across multiple concurrent forward-backward
passes

2) Reused parameters in multiple reentrant backward passes. For example, if you
use multiple `checkpoint` functions to wrap the same part of your model, it
would result in the same set of parameters been used by different reentrant
backward passes multiple times, and hence marking a variable ready multiple
times. DDP does not support such use cases yet.

3) Incorrect unused parameter detection. The return value of the `forward`
function is inspected by the distributed data parallel wrapper to figure out if
any of the module's parameters went unused. For unused parameters, DDP would
not expect gradients from then. However, if an unused parameter becomes part of
the autograd graph at a later point in time (e.g., in a reentrant backward when
using `checkpoint`), the gradient will show up unexpectedly. If all parameters
in the model participate in the backward pass, you can disable unused parameter
detection by passing the keyword argument `find_unused_parameters=False` to
`torch.nn.parallel.DistributedDataParallel`.

Here, I think the error falls under case 2. If you look at the config, you can see that you have checkpoint_level="C2" set, which enables gradient checkpointing on each forward pass. You also selected a parameter sharing option that is not "none", which means the same parameters are used more than once in each forward pass.

So what ends up happening is that DDP marks these same parameters as ready multiple times, and, as the message says, DDP does not support this use case yet. This is not an issue with the model here per se, but rather with the DDP module that torch offers.
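
To make the conflict a bit more concrete, here is a rough, self-contained sketch of the same situation outside of Linformer: one weight matrix reused by two reentrant `checkpoint` segments under DDP. The toy module, the gloo/CPU setup, and the port number are arbitrary choices of mine, and on newer PyTorch versions you may need to pass use_reentrant=True to checkpoint to get the reentrant behavior the error message describes:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.checkpoint import checkpoint

class SharedCheckpointed(nn.Module):
    # A toy stand-in for "parameter sharing + checkpointing": one Linear layer
    # reused inside two checkpointed segments of the forward pass.
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(8, 8)
        self.shared = nn.Linear(8, 8)

    def forward(self, x):
        x = self.embed(x)
        # Each checkpoint call triggers its own reentrant backward pass, and both
        # of them use the parameters of self.shared -- what the error message
        # calls "reused parameters in multiple reentrant backward passes".
        x = checkpoint(self.shared, x)
        x = checkpoint(self.shared, x)
        return x

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12356"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    model = DDP(SharedCheckpointed())  # CPU + gloo, no GPU needed for the sketch
    out = model(torch.randn(4, 8))
    out.sum().backward()  # expected to raise "Expected to mark a variable ready only once"
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2, join=True)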

What you can do, if you still want to use my code with DDP as well as share parameters, is to set checkpoint_level="C0". This does not checkpoint the model; I tested it, along with parameter_sharing="none", and it works just fine with the above code. Unfortunately, when using any parameter sharing with DDP, one cannot checkpoint yet, and this seems like an issue that the PyTorch team will have to solve on their end.
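
Concretely, a sketch of the change in your reproduction script above (only the model construction inside demo_basic changes; everything else stays as it is, and I would expect parameter sharing to keep working with DDP once checkpointing is off):

    model = LinformerLM(
            num_tokens=30522,
            input_size=5120,
            channels=128,
            dim_d=None,
            dim_k=128,
            dim_ff=128,
            dropout_ff=0.15,
            nhead=16,
            depth=12,
            dropout=0.1,
            activation="gelu",
            checkpoint_level="C0",  # was "C2"; without checkpointing, each parameter is marked ready only once per backward
            parameter_sharing="layerwise",  # sharing can stay enabled once checkpointing is off
            k_reduce_by_layer=0,
            full_attention=False,
            include_ff=True,
            w_o_intermediate_dim=None,
            emb_dim=128,
        ).to(rank)
    ddp_model = DDP(model, device_ids=[rank])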

If this remains a problem for you, try opening an issue there and they might start working on it :+1: I hope that answered all your questions!

blizda commented 3 years ago

Thanks, and sorry for the stupid questions. With C0 everything works fine, thank you a lot.