SimonZsx opened this issue 4 years ago
I have been using the NVIDIA container image for PyTorch, release 19.09, which looks like it corresponds to PyTorch 1.2.0. I have not tried this with later versions.
Just an update on this issue: I checked all the PyTorch releases. Up to NVIDIA PyTorch release 20.01, which corresponds to PyTorch 1.4.0, PipeDream works, but from NVIDIA release 20.02 onward a runtime error occurs (#31), with the same error log as this issue. I tried to locate the cause, and it seems to come from an in-place version-checking feature added in PyTorch 1.5.0. The problem shows up when the second-to-last stage starts its backward pass: if load_old_params() is called before backward(), the error is raised.
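For context, here is a minimal sketch (not PipeDream code, just an assumption about the failure mode) of the class of error: a tensor that autograd saved during the forward pass is overwritten in place before backward(), the way a stashed-weight restore would do, and the saved-variable version check fires.

import torch

# Minimal repro sketch: overwrite a parameter in place between forward and
# backward; the in-place copy is a stand-in for load_old_params() restoring
# stashed weights.
w = torch.nn.Parameter(torch.randn(4))
stashed = w.detach().clone()      # "stashed" copy of the weights
loss = (w ** 2).sum()             # pow saves w for its backward pass
with torch.no_grad():
    w.copy_(stashed + 1.0)        # in-place restore bumps w's version counter
loss.backward()                   # RuntimeError: "... modified by an inplace operation"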
Thanks for doing this! This is helpful! I will look into this in the next couple of days!
I temporarily got PipeDream running on the latest PyTorch by removing the version check in unpack() in torch/csrc/autograd/saved_variable.cpp; the runtime errors seem to come from this version check (a really dirty solution). I have not fully understood how PipeDream manipulates the back-propagated gradients, but I guess the error comes from one extra in-place operation on the tensors passed between stages. I hope this helps you solve the problem.
Variable SavedVariable::unpack(std::shared_ptr<Node> saved_for) const {
  if (!data_.defined()) {
    if (!was_default_constructed_) {
      throw std::runtime_error(ERR_BACKWARD_TWICE);
    }
    return Variable();
  }

  auto grad_fn = is_inplace_view_ ? weak_grad_fn_.lock() : grad_fn_;
  if (has_grad_fn_ && !grad_fn) {
    if (!saved_for) {
      // If saving the grad_fn would create a circular reference, then it must
      // be passed in to the unpack function.
      throw std::runtime_error("No grad_fn for non-leaf saved variable");
    }
    grad_fn = std::move(saved_for);
  }

  if (saved_version_ != version_counter_.current_version()) {
    std::stringstream message;
    message << "one of the variables needed for gradient computation has been "
               "modified by an inplace operation: [" << data_.toString() << " "
            << data_.sizes() << "]";
    if (grad_fn) {
      message << ", which is output " << output_nr_
              << " of " << grad_fn->name() << ",";
    }
    message << " is at version " << version_counter_.current_version()
            << "; expected version " << saved_version_ << " instead.";
    if (!AnomalyMode::is_enabled()) {
      message << " Hint: enable anomaly detection to find the operation "
                 "that failed to compute its gradient, with torch.autograd."
                 "set_detect_anomaly(True).";
    } else {
      message << " Hint: the backtrace further above shows the operation "
                 "that failed to compute its gradient. The variable in question "
                 "was changed in there or anywhere later. Good luck!";
    }
    throw std::runtime_error(message.str());
  }
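The hint in that error message refers to torch.autograd.set_detect_anomaly(True); as a small illustration (again a generic sketch, not PipeDream code), enabling it makes PyTorch print the forward-pass traceback of the operation whose saved tensor was later modified, which can help pinpoint which stage or in-place update trips the check.

import torch

# With anomaly detection on, the RuntimeError is preceded by a warning that
# shows the forward traceback of the failing node (PowBackward0 here).
torch.autograd.set_detect_anomaly(True)
w = torch.nn.Parameter(torch.randn(4))
loss = (w ** 2).sum()
with torch.no_grad():
    w.zero_()          # in-place modification of a saved tensor
loss.backward()        # anomaly mode points at the offending forward op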
@SimonZsx Have you tried commenting out the version-checking code in PyTorch to see whether it works?
@deepakn94 @SimonZsx Sorry to bother you. I'm reproducing PipeDream's training process and hope to run it on torch >= 1.5.0. May I ask whether there are any solutions at the moment?
I have the same question. Have you made any progress? Maybe we can discuss it.
Hi, the comment-out-and-recompile solution works, but it's kind of dirty. The problem can also be avoided by not using weight stashing: the version check exists to detect in-place modification of tensors saved for the backward pass, and weight stashing breaks that check.
One of my colleagues says the version can be set manually to avoid this error, but I have not verified it yet; just a small hint for you.
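For anyone following that hint, here is a small sketch of what "the version" refers to (an illustration only, not a fix): every tensor carries a version counter that in-place operations increment, and unpack() compares the saved value against the current one. Python exposes it read-only as Tensor._version; I am not aware of a public Python setter, so actually setting it manually would presumably need a C++/extension-level change.

import torch

# Inspect the version counter that the saved-variable check compares against.
t = torch.randn(3, requires_grad=True)
print(t._version)     # 0
with torch.no_grad():
    t.add_(1.0)       # any in-place op bumps the counter
print(t._version)     # 1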
Thanks!
Same here. Have you made any progress?
Hi, what is the latest stable PyTorch release supported? Which PyTorch version is pre_hook_pytorch_latest.patch for? Thanks in advance for your reply.