lucidrains / byol-pytorch

Usable Implementation of "Bootstrap Your Own Latent" self-supervised learning, from Deepmind, in Pytorch
MIT License
1.76k stars 248 forks source link

duplicated calculation in gradient #91

Open MaryAhn opened 11 months ago

MaryAhn commented 11 months ago

Hi, I have tried to run the code according to Usage in this repo: `args = parse_args() num_gpus = int(os.environ["WORLD_SIZE"]) if "WORLD_SIZE" in os.environ else 1 args.num_gpus = num_gpus args.distributed = num_gpus > 1 if torch.cuda.is_available(): cudnn.benchmark = False args.device = "cuda" else: args.distributed = False args.device = "cpu" if args.distributed: torch.cuda.set_device(args.local_rank) torch.distributed.init_process_group(backend="nccl", init_method="env://") synchronize()

train_loader = get_loader(args=args)

model = get_model(args) learner = SelfSupervisedLearner( model, image_size=480, hidden_layer='module.avgpool', projection_size = 256, projection_hidden_size = args.hidden_size, moving_average_decay = 0.99 )

opt = torch.optim.Adam(learner.parameters(), lr=3e-4)

if not os.path.exists(args.model_dir): os.makedirs(args.model_dir)

for _ in range(args.epochs): for idx, images in enumerate(train_loader): if torch.cuda.is_available(): images = images.cuda(non_blocking=True) loss = learner(images) opt.zero_grad() loss.backward() opt.step() learner.update_moving_average() # update moving average of target encoder

save your improved network

torch.save(model.state_dict(), './improved-net.pt')`

However, After run this code with distributed learning, during backward(), I got this error message repeated:

Traceback (most recent call last): File "/home/cvlab/keonhee/byol-pytorch/train_custom.py", line 131, in <module> loss.backward() File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/_tensor.py", line 492, in backward torch.autograd.backward( File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2048]] is at version 7; expected version 6 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True). Traceback (most recent call last): File "/home/cvlab/keonhee/byol-pytorch/train_custom.py", line 131, in <module> loss.backward() File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/_tensor.py", line 492, in backward torch.autograd.backward( File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2048]] is at version 7; expected version 6 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True). Traceback (most recent call last): File "/home/cvlab/keonhee/byol-pytorch/train_custom.py", line 131, in <module> loss.backward() File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/_tensor.py", line 492, in backward torch.autograd.backward( File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2048]] is at version 7; expected version 6 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

I used detach().clone() instead of detach() in byol_pytorch.py, I got same error. Even if I set torch.autograd.set_detect_anomaly(True), I could not get what is the reason. Would you let me know what part of this code invokes this problem? Thanks in advance.

lucidrains commented 11 months ago

@MaryAhn hey Keonhee, actually not too sure

do you want to give 0.7.1 a quick try? if it still does not work, i'll debug it once i get access to my multi-GPU setup

MaryAhn commented 11 months ago

Yes, I would like to try 0.7.1. My email is khahn196@gmail.com.

lucidrains commented 11 months ago

@MaryAhn just pip install byol-pytorch -U for 0.7.1

MaryAhn commented 11 months ago

I installed 0.7.1 through the statement you gave, however, it still does not work. If you find the problem and solution via debugging, please let me know.

lucidrains commented 11 months ago

after setting torch.autograd.set_detect_anomaly(True), do you see a different error trace? could you paste that trace if so?

MaryAhn commented 11 months ago

I set detect anomaly, and the error message is: File "/home/cvlab/keonhee/byol-pytorch/train_custom.py", line 129, in <module> loss = learner(images) File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl return forward_call(*args, **kwargs) File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1519, in forward else self._run_ddp_forward(*inputs, **kwargs) File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1355, in _run_ddp_forward return self.module(*inputs, **kwargs) # type: ignore[index] File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl return forward_call(*args, **kwargs) File "/home/cvlab/keonhee/byol-pytorch/byol_pytorch/byol_pytorch.py", line 264, in forward online_proj_one, _ = self.online_encoder(image_one) File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl return forward_call(*args, **kwargs) File "/home/cvlab/keonhee/byol-pytorch/byol_pytorch/byol_pytorch.py", line 165, in forward representation = self.get_representation(x) File "/home/cvlab/keonhee/byol-pytorch/byol_pytorch/byol_pytorch.py", line 157, in get_representation _ = self.net(x) File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl return forward_call(*args, **kwargs) File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1519, in forward else self._run_ddp_forward(*inputs, **kwargs) File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1355, in _run_ddp_forward return self.module(*inputs, **kwargs) # type: ignore[index] File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl return forward_call(*args, **kwargs) File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torchvision/models/resnet.py", line 285, in forward return self._forward_impl(x) File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torchvision/models/resnet.py", line 276, in _forward_impl x = self.layer4(x) File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl return forward_call(*args, **kwargs) File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/container.py", line 215, in forward input = module(input) File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl return forward_call(*args, **kwargs) File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torchvision/models/resnet.py", line 155, in forward out = self.bn3(out) File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl return forward_call(*args, **kwargs) File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/batchnorm.py", line 171, in forward return F.batch_norm( File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/functional.py", line 2478, in batch_norm return torch.batch_norm( (Triggered internally at /opt/conda/conda-bld/pytorch_1695392035629/work/torch/csrc/autograd/python_anomaly_mode.cpp:114.) Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass

Traceback (most recent call last): File "/home/cvlab/keonhee/byol-pytorch/train_custom.py", line 131, in <module> loss.backward() File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/_tensor.py", line 492, in backward torch.autograd.backward( File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2048]] is at version 7; expected version 6 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck! Traceback (most recent call last): File "/home/cvlab/keonhee/byol-pytorch/train_custom.py", line 131, in <module> loss.backward() File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/_tensor.py", line 492, in backward torch.autograd.backward( File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2048]] is at version 7; expected version 6 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

I think the backward process of online encoder or online predictor does not work appropriately, but I'm not sure.

lucidrains commented 11 months ago

want to try 0.7.2? it may or may not do anything

i can try to debug this once i get back on my multi-gpu machine

MaryAhn commented 11 months ago

Yes I want to try. pip install byol-pytorch==0.7.2 or pip install byol-pytorch -U which one should I use?

lucidrains commented 11 months ago

@MaryAhn both should work

MaryAhn commented 11 months ago

After installation, still same error occurs. After debugging, please let me know about this issue. Thank you.

lucidrains commented 11 months ago

@MaryAhn i see you are using a custom script

are you not using pytorch lightning? there's a setting in there to replace batchnorms in your resnet with sync batchnorms

lucidrains commented 11 months ago

otherwise, this could also be related to an ongoing pytorch issue, and you could try setting broadcast_buffers = False for DistributedDataParallel

lucidrains commented 11 months ago

you shouldn't need to write the training loop as in your initial comment if you just modify this file and run trainer.fit()

try using that lightning script as is with your resnet, and if the issue persists, then that would tell me a lot

lucidrains commented 11 months ago

would it help if i offered a huggingface accelerate version? i find accelerate much more hackable

lucidrains commented 11 months ago

@MaryAhn try 0.8.0 following the instructions here