MaryAhn commented 11 months ago

Hi, I have tried to run the code according to Usage in this repo: `args = parse_args() num_gpus = int(os.environ["WORLD_SIZE"]) if "WORLD_SIZE" in os.environ else 1 args.num_gpus = num_gpus args.distributed = num_gpus > 1 if torch.cuda.is_available(): cudnn.benchmark = False args.device = "cuda" else: args.distributed = False args.device = "cpu" if args.distributed: torch.cuda.set_device(args.local_rank) torch.distributed.init_process_group(backend="nccl", init_method="env://") synchronize()

train_loader = get_loader(args=args)

model = get_model(args) learner = SelfSupervisedLearner( model, image_size=480, hidden_layer='module.avgpool', projection_size = 256, projection_hidden_size = args.hidden_size, moving_average_decay = 0.99 )

opt = torch.optim.Adam(learner.parameters(), lr=3e-4)

if not os.path.exists(args.model_dir): os.makedirs(args.model_dir)

for _ in range(args.epochs): for idx, images in enumerate(train_loader): if torch.cuda.is_available(): images = images.cuda(non_blocking=True) loss = learner(images) opt.zero_grad() loss.backward() opt.step() learner.update_moving_average() # update moving average of target encoder

save your improved network

torch.save(model.state_dict(), './improved-net.pt')`

However, After run this code with distributed learning, during backward(), I got this error message repeated:

Traceback (most recent call last): File "/home/cvlab/keonhee/byol-pytorch/train_custom.py", line 131, in <module> loss.backward() File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/_tensor.py", line 492, in backward torch.autograd.backward( File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2048]] is at version 7; expected version 6 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True). Traceback (most recent call last): File "/home/cvlab/keonhee/byol-pytorch/train_custom.py", line 131, in <module> loss.backward() File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/_tensor.py", line 492, in backward torch.autograd.backward( File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2048]] is at version 7; expected version 6 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True). Traceback (most recent call last): File "/home/cvlab/keonhee/byol-pytorch/train_custom.py", line 131, in <module> loss.backward() File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/_tensor.py", line 492, in backward torch.autograd.backward( File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2048]] is at version 7; expected version 6 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

I used detach().clone() instead of detach() in byol_pytorch.py, I got same error. Even if I set torch.autograd.set_detect_anomaly(True), I could not get what is the reason. Would you let me know what part of this code invokes this problem? Thanks in advance.