msr-fiddle / pipedream

What's the latest version of PyTorch supported? #52

Open SimonZsx opened 4 years ago

SimonZsx commented 4 years ago

Hi, what is the latest stable PyTorch release that is supported? Which version is pre_hook_pytorch_latest.patch intended for? Thanks in advance for your reply.

deepakn94 commented 4 years ago

I have been using the NVIDIA container image for PyTorch, release 19.09, which corresponds to PyTorch version 1.2.0. I have not tried this with later versions.

SimonZsx commented 4 years ago

Just an update on this issue: I checked the PyTorch releases. PipeDream works up to NVIDIA PyTorch release 20.01, which corresponds to PyTorch 1.4.0, but starting with NVIDIA release 20.02 a runtime error occurs, with the same error log as in #31. I tried to locate the cause; it seems to come from an in-place version-checking feature added in release 1.5.0. The problem shows up when the second-to-last stage starts its backward pass: if load_old_params() is called before backward(), the error appears.
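
For anyone hitting this later, here is a minimal standalone sketch (my own, not PipeDream code) of the failure mode: overwriting a tensor that autograd saved for backward, before backward() runs, is effectively what load_old_params() does when weight stashing swaps weights, and it trips PyTorch's saved-tensor version check:

import torch

# Minimal repro sketch (not PipeDream code). Overwriting a tensor that autograd
# saved for the backward pass bumps its version counter, so unpack() in
# saved_variable.cpp raises the "modified by an inplace operation" error.
w = torch.randn(4, 4, requires_grad=True)
x = torch.randn(4, 4, requires_grad=True)
y = (x @ w).sum()               # x and w are saved for the matmul's backward
with torch.no_grad():
    w.copy_(torch.randn(4, 4))  # analogous to load_old_params() swapping in stashed weights
y.backward()                    # RuntimeError: one of the variables needed for gradient computation ...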

deepakn94 commented 4 years ago

Thanks for doing this! This is helpful! I will look into this in the next couple of days!

SimonZsx commented 4 years ago

I temporarily made PipeDream run on the latest PyTorch by removing the version check in unpack() in torch/csrc/autograd/saved_variable.cpp; the runtime errors do seem to come from this version checking (a really dirty solution). I have not fully understood how PipeDream manipulates the back-propagated gradients, but my guess is that the error comes from one extra in-place operation on the tensors passed between stages. I hope this helps you solve the problem.


Variable SavedVariable::unpack(std::shared_ptr<Node> saved_for) const {
  if (!data_.defined()) {
    if (!was_default_constructed_) {
      throw std::runtime_error(ERR_BACKWARD_TWICE);
    }
    return Variable();
  }

  auto grad_fn = is_inplace_view_ ? weak_grad_fn_.lock() : grad_fn_;
  if (has_grad_fn_ && !grad_fn) {
    if (!saved_for) {
      // If saving the grad_fn would create a circular reference, then it must
      // be passed in to the unpack function.
      throw std::runtime_error("No grad_fn for non-leaf saved variable");
    }
    grad_fn = std::move(saved_for);
  }
  if (saved_version_ != version_counter_.current_version()) {
    std::stringstream message;
    message << "one of the variables needed for gradient computation has been "
        "modified by an inplace operation: [" << data_.toString() << " "
        << data_.sizes() << "]";
    if (grad_fn) {
        message << ", which is output " << output_nr_
            << " of " << grad_fn->name() << ",";
    }
    message << " is at version " << version_counter_.current_version()
        << "; expected version " << saved_version_ << " instead.";
    if (!AnomalyMode::is_enabled()) {
        message << " Hint: enable anomaly detection to find the operation "
            "that failed to compute its gradient, with torch.autograd."
            "set_detect_anomaly(True).";
    }
    else {
        message << " Hint: the backtrace further above shows the operation "
            "that failed to compute its gradient. The variable in question "
            "was changed in there or anywhere later. Good luck!";
    }
    throw std::runtime_error(message.str());
  }
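
For debugging which tensor was modified, the counter that this check compares against saved_version_ is visible from Python through the private Tensor._version attribute (private, so it may change across releases). A quick sketch:

import torch

p = torch.randn(2, 2, requires_grad=True)
print(p._version)               # 0 for a freshly created tensor (private attribute)
with torch.no_grad():
    p.copy_(torch.randn(2, 2))  # the kind of in-place swap weight stashing performs
print(p._version)               # 1: unpack() compares this against saved_version_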

BestSonny commented 4 years ago

@SimonZsx Have you tried commenting out the version-checking code in PyTorch to see whether it works?

fkh12345 commented 2 years ago

@deepakn94 @SimonZsx Sorry to bother you; I'm reproducing the PipeDream training process and hope to deploy it on torch >= 1.5.0. May I ask whether there are any solutions currently?

jglicat commented 2 years ago

@deepakn94 @SimonZsx Sorry to bother you; I'm reproducing the PipeDream training process and hope to deploy it on torch >= 1.5.0. May I ask whether there are any solutions currently?

I have the same question. Have you made any progress? Maybe we can discuss it.

SimonZsx commented 2 years ago

Hi, the comment-out-and-recompile solution works, but it is kind of dirty. The problem can also be avoided by not using weight stashing: the error comes from the gradient version checking, and weight stashing is what trips it.

One of my colleagues says the version counter can be set manually to avoid this error, but I have not checked that yet; just a small hint for you.
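
One related idea, sketched below with the caveat that I have not verified it against PipeDream and that it silently gives up the protection the check provides: doing the swap through the .data alias, which (as far as I can tell) does not share the original tensor's version counter, so the check in unpack() should not fire. If the saved tensor shares storage with the weight, backward will then quietly use the new values.

import torch

# Unverified sketch of the "avoid bumping the version counter" idea; NOT
# PipeDream code. It bypasses the check but also removes its protection:
# incorrect gradients will no longer be detected.
w = torch.randn(4, 4, requires_grad=True)
x = torch.randn(4, 4, requires_grad=True)
y = (x @ w).sum()
w.data.copy_(torch.randn(4, 4))  # .data has its own version counter, so w's saved version still matches
y.backward()                     # no version-check error; gradients silently use the new values of w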

fkh12345 commented 2 years ago

Hi, the comment-out-and-recompile solution works, but it is kind of dirty. The problem can also be avoided by not using weight stashing: the error comes from the gradient version checking, and weight stashing is what trips it.

One of my colleagues says the version counter can be set manually to avoid this error, but I have not checked that yet; just a small hint for you.

Thanks!

leiguan1210 commented 2 years ago

@deepakn94 @SimonZsx Sorry to bother you; I'm reproducing the PipeDream training process and hope to deploy it on torch >= 1.5.0. May I ask whether there are any solutions currently?

I have the same question. Have you made any progress? Maybe we can discuss it.

Same here. Have you made any progress?