hku-systems / vpipe


RuntimeError when executing `python driver.py ...` #3

Closed: Hyaloid closed this issue 1 year ago

Hyaloid commented 1 year ago

When I execute nvidia-docker run -it -v $(dirname $PWD):/workspace --net=host --ipc=host bert /bin/bash -c 'export GLOO_SOCKET_IFNAME=docker0; cp ../runtime/launch.py .; python -m launch --nnodes 1 --node_rank 0 --nproc_per_node 4 main_with_runtime.py --data_dir data/hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/books_wiki_en_corpus --master_addr localhost --module vgpus=4 --checkpoint_dir output/2023-03-30T02:49:14 --partition vgpus=4/vpipe.json --sync_mode asp --distributed_backend gloo -b 16 --lr 0.050000 --lr_policy polynomial --weight-decay 0.000000 --epochs 40 --print-freq 100 --verbose 0 --num_ranks_in_server 4 --config_path vgpus=4/mp_conf.json 2>&1 | tee output/2023-03-30T02:49:14/output.log.0; rm launch.py', I get the following error:

Traceback (most recent call last):
  File "main_with_runtime.py", line 576, in <module>
    main()
  File "main_with_runtime.py", line 324, in main
    train(train_loader, r, optimizer, epoch, lr_scheduler)
  File "main_with_runtime.py", line 455, in train
    pipelining(n, args.print_freq, weight_stash=True)
  File "main_with_runtime.py", line 421, in pipelining
    r.run_backward()
  File "../runtime.py", line 624, in run_backward
    for output_name in outputs]))
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [4096, 1024]], which is output 0 of TBackward, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

The environment is:

Do you know how to solve this problem? Any help would be greatly appreciated.

SimonZsx commented 1 year ago

Hi, this issue is caused by a behavior change in PyTorch from version 1.5.0 onward. You can either (1) downgrade to 1.4.0 or (2) refer to this issue I opened earlier:

https://github.com/msr-fiddle/pipedream/issues/52

I temporarily got PipeDream running on the latest PyTorch by removing the version check in unpack() in torch/csrc/autograd/saved_variable.cpp; the runtime errors come from that check (admittedly a dirty workaround). I have not fully understood how PipeDream manipulates the back-propagated gradients, but I suspect the error comes from an extra in-place operation on the tensors passed between stages. I think this may help you solve the problem.

Variable SavedVariable::unpack(std::shared_ptr<Node> saved_for) const {
  if (!data_.defined()) {
    if (!was_default_constructed_) {
      throw std::runtime_error(ERR_BACKWARD_TWICE);
    }
    return Variable();
  }

  auto grad_fn = is_inplace_view_ ? weak_grad_fn_.lock() : grad_fn_;
  if (has_grad_fn_ && !grad_fn) {
    if (!saved_for) {
      // If saving the grad_fn would create a circular reference, then it must
      // be passed in to the unpack function.
      throw std::runtime_error("No grad_fn for non-leaf saved variable");
    }
    grad_fn = std::move(saved_for);
  }
  // This is the version check the workaround above removes: it fires when a
  // tensor saved for backward has been modified in place since it was stashed.
  if (saved_version_ != version_counter_.current_version()) {
    std::stringstream message;
    message << "one of the variables needed for gradient computation has been "
        "modified by an inplace operation: [" << data_.toString() << " "
        << data_.sizes() << "]";
    if (grad_fn) {
        message << ", which is output " << output_nr_
            << " of " << grad_fn->name() << ",";
    }
    message << " is at version " << version_counter_.current_version()
        << "; expected version " << saved_version_ << " instead.";
    if (!AnomalyMode::is_enabled()) {
        message << " Hint: enable anomaly detection to find the operation "
            "that failed to compute its gradient, with torch.autograd."
            "set_detect_anomaly(True).";
    }
    else {
        message << " Hint: the backtrace further above shows the operation "
            "that failed to compute its gradient. The variable in question "
            "was changed in there or anywhere later. Good luck!";
    }
    throw std::runtime_error(message.str());
  }
  // (remainder of unpack() omitted)
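
For context, here is a minimal sketch of the behavior change behind this (not from vpipe; the shapes merely echo the error message). Since PyTorch 1.5.0, in-place updates made through .data share the tensor's version counter with the tensor itself, so a backward pass that is delayed past an optimizer-style in-place update trips the version check shown above; on 1.4.0 the same code runs silently:

import torch

torch.autograd.set_detect_anomaly(True)  # the hint from the error message

w = torch.randn(4096, 1024, requires_grad=True)  # a stage's weight
x = torch.randn(16, 4096, requires_grad=True)    # an incoming activation

y = x.matmul(w)    # forward: autograd saves w at version 0 (needed for x's grad)

w.data.add_(0.01)  # optimizer-style in-place update bumps w to version 1

y.sum().backward() # PyTorch >= 1.5: RuntimeError ... at version 1; expected version 0

This appears to match the schedule in the traceback: with weight_stash=True, run_backward() for an earlier micro-batch executes after the weights have already been updated in place.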