Open MaryAhn opened 11 months ago
@MaryAhn hey Keonhee, actually not too sure
do you want to give 0.7.1 a quick try? if it still does not work, i'll debug it once i get access to my multi-GPU setup
Yes, I would like to try 0.7.1. My email is khahn196@gmail.com.
@MaryAhn just `pip install byol-pytorch -U` for 0.7.1
I installed 0.7.1 with the command you gave, but it still does not work. If you find the problem and a solution while debugging, please let me know.
after setting `torch.autograd.set_detect_anomaly(True)`, do you see a different error trace? could you paste that trace if so?
I enabled anomaly detection, and the error message is:
  File "/home/cvlab/keonhee/byol-pytorch/train_custom.py", line 129, in <module>
    loss = learner(images)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1519, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1355, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cvlab/keonhee/byol-pytorch/byol_pytorch/byol_pytorch.py", line 264, in forward
    online_proj_one, _ = self.online_encoder(image_one)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cvlab/keonhee/byol-pytorch/byol_pytorch/byol_pytorch.py", line 165, in forward
    representation = self.get_representation(x)
  File "/home/cvlab/keonhee/byol-pytorch/byol_pytorch/byol_pytorch.py", line 157, in get_representation
    _ = self.net(x)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1519, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1355, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torchvision/models/resnet.py", line 285, in forward
    return self._forward_impl(x)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torchvision/models/resnet.py", line 276, in _forward_impl
    x = self.layer4(x)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/container.py", line 215, in forward
    input = module(input)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torchvision/models/resnet.py", line 155, in forward
    out = self.bn3(out)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/batchnorm.py", line 171, in forward
    return F.batch_norm(
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/functional.py", line 2478, in batch_norm
    return torch.batch_norm(
 (Triggered internally at /opt/conda/conda-bld/pytorch_1695392035629/work/torch/csrc/autograd/python_anomaly_mode.cpp:114.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
  File "/home/cvlab/keonhee/byol-pytorch/train_custom.py", line 131, in <module>
    loss.backward()
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2048]] is at version 7; expected version 6 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
I think the backward pass of the online encoder or the online predictor is not working properly, but I'm not sure.
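For readers unfamiliar with this class of error, the sketch below provokes the same kind of `RuntimeError` on CPU in a few lines. It is not the code path from the trace above: the `exp`/`add_` pair and the function name `trigger_inplace_error` are illustrative assumptions, chosen only because `exp` keeps its output for the backward pass and an in-place edit of that output is then detected by autograd's version check.

```python
import torch


def trigger_inplace_error() -> str:
    """Return the autograd error message raised after an in-place edit of a saved tensor."""
    w = torch.ones(3, requires_grad=True)
    y = w.exp()   # exp() saves its output y so it can compute the gradient later
    y.add_(1)     # the in-place edit bumps y's version counter
    try:
        y.sum().backward()  # autograd notices the saved tensor changed and refuses
    except RuntimeError as err:
        return str(err)
    return ""


if __name__ == "__main__":
    print(trigger_inplace_error())
```

In the thread's case the tensor in question has shape `[2048]` and the anomaly trace ends in `F.batch_norm`, so whatever modifies it presumably happens between that forward call and `loss.backward()`.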
want to try 0.7.2? it may or may not do anything
i can try to debug this once i get back on my multi-gpu machine
Yes, I want to try. `pip install byol-pytorch==0.7.2` or `pip install byol-pytorch -U`: which one should I use?
@MaryAhn both should work
After installing it, the same error still occurs. Please let me know once you have debugged this issue. Thank you.
@MaryAhn i see you are using a custom script
are you not using pytorch lightning? there's a setting in there to replace batchnorms in your resnet with sync batchnorms
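For a custom script without Lightning, a comparable swap can be sketched with PyTorch's own `torch.nn.SyncBatchNorm.convert_sync_batchnorm`. This is a minimal sketch, not the Lightning setting mentioned above; the helper names `to_sync_batchnorm` and `count_layers` and the tiny stand-in network are assumptions for illustration, and whether the swap resolves this particular error is untested here.

```python
import torch.nn as nn


def to_sync_batchnorm(model: nn.Module) -> nn.Module:
    """Swap every BatchNorm*d layer for SyncBatchNorm (do this before wrapping in DDP)."""
    return nn.SyncBatchNorm.convert_sync_batchnorm(model)


def count_layers(model: nn.Module, layer_type: type) -> int:
    """Count submodules of the given type, to check the conversion took effect."""
    return sum(isinstance(m, layer_type) for m in model.modules())


if __name__ == "__main__":
    # Tiny stand-in network; the model in this thread is a torchvision ResNet.
    net = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())
    net = to_sync_batchnorm(net)
    print(count_layers(net, nn.SyncBatchNorm), count_layers(net, nn.BatchNorm2d))
```

The conversion itself runs anywhere, but a training-mode forward pass through `SyncBatchNorm` expects GPU tensors and an initialized process group.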
otherwise, this could also be related to an ongoing pytorch issue, and you could try setting `broadcast_buffers = False` for `DistributedDataParallel`
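A minimal sketch of that suggestion follows. `broadcast_buffers` is a real keyword argument of `DistributedDataParallel`; the helper names `wrap_ddp` and `demo_single_process`, the scratch file name, and the one-process `gloo` group are assumptions made only so the call can be shown on a machine without GPUs.

```python
import os
import tempfile

import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def wrap_ddp(module: nn.Module, device_ids=None) -> DDP:
    """Wrap a module in DDP without re-broadcasting buffers on each forward pass."""
    # broadcast_buffers=False stops DDP from syncing buffers (for example
    # BatchNorm running statistics) from rank 0 at the start of forward().
    return DDP(module, device_ids=device_ids, broadcast_buffers=False)


def demo_single_process() -> bool:
    """Create a one-process CPU group just to show the call; returns the flag value."""
    store = os.path.join(tempfile.mkdtemp(), "ddp_store")  # hypothetical scratch file
    dist.init_process_group("gloo", init_method=f"file://{store}", rank=0, world_size=1)
    try:
        net = nn.Sequential(nn.Linear(4, 4), nn.BatchNorm1d(4))
        return wrap_ddp(net).broadcast_buffers
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    print(demo_single_process())
```

In a real multi-GPU run the process group would already be initialized with `nccl`, and the call would look more like `wrap_ddp(model.cuda(local_rank), device_ids=[local_rank])`, where `local_rank` comes from the launcher.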
you shouldn't need to write the training loop as in your initial comment if you just modify this file and run trainer.fit()
try using that lightning script as is with your resnet, and if the issue persists, then that would tell me a lot
would it help if i offered a huggingface accelerate version? i find accelerate much more hackable
@MaryAhn try 0.8.0 following the instructions here
Hi, I have tried to run the code according to Usage in this repo:

```python
args = parse_args()
num_gpus = int(os.environ["WORLD_SIZE"]) if "WORLD_SIZE" in os.environ else 1
args.num_gpus = num_gpus
args.distributed = num_gpus > 1
if torch.cuda.is_available():
    cudnn.benchmark = False
    args.device = "cuda"
else:
    args.distributed = False
    args.device = "cpu"
if args.distributed:
    torch.cuda.set_device(args.local_rank)
    torch.distributed.init_process_group(backend="nccl", init_method="env://")
    synchronize()

train_loader = get_loader(args=args)

model = get_model(args)
learner = SelfSupervisedLearner(
    model,
    image_size=480,
    hidden_layer='module.avgpool',
    projection_size=256,
    projection_hidden_size=args.hidden_size,
    moving_average_decay=0.99
)

opt = torch.optim.Adam(learner.parameters(), lr=3e-4)

if not os.path.exists(args.model_dir):
    os.makedirs(args.model_dir)

for _ in range(args.epochs):
    for idx, images in enumerate(train_loader):
        if torch.cuda.is_available():
            images = images.cuda(non_blocking=True)
        loss = learner(images)
        opt.zero_grad()
        loss.backward()
        opt.step()
        learner.update_moving_average()  # update moving average of target encoder

# save your improved network
torch.save(model.state_dict(), './improved-net.pt')
```
However, after running this code with distributed training, I got this error message repeatedly during `backward()`:
Traceback (most recent call last):
  File "/home/cvlab/keonhee/byol-pytorch/train_custom.py", line 131, in <module>
    loss.backward()
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2048]] is at version 7; expected version 6 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
I tried using `detach().clone()` instead of `detach()` in byol_pytorch.py, but I got the same error. Even after setting `torch.autograd.set_detect_anomaly(True)`, I could not work out the cause. Could you let me know which part of this code triggers the problem? Thanks in advance.