When I try to train the Relation-Net, a runtime error is raised:
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
The previous steps all ran without problems, including training the 3D-U-Net and running train_test.py to generate predictions.
The gradient problem seems to be related to the computation of the loss function. Could you help?
The full log and traceback are as follows:
[2021-08-13 22:20:07,039 INFO log.py line 40 25892] **** Start Logging ****
[2021-08-13 22:20:09,274 INFO train.py line 26 25892] Namespace(TEST_NMS_THRESH=0.3, TEST_NPOINT_THRESH=100, TEST_SCORE_THRESH=0.09, batch_size=4, bg_thresh=0.25, block_reps=2, block_residual=True, classes=20, cluster_meanActive=50, cluster_npoint_thre=50, cluster_radius=0.03, cluster_shift_meanActive=300, config='config/pointgroup_run1_scannet.yaml', data_root='/home/xxx/One-Thing-One-Click/data/scannet', dataset='train_unsup', dataset_dir='data/scannetv2_inst.py', dataset_name='scannet', epochs=1024, eval=True, exp_path='exp/train_unsup/pointgroup/pointgroup_run1_scannet', fg_thresh=0.75, filename_suffix='_inst_nostuff.pth', fix_module=[], full_scale=[128, 512], ignore_label=-100, input_channel=3, loss_weight=[1.0, 1.0, 1.0, 1.0], lr=0.01, m=32, manual_seed=123, max_npoint=250000, mode=4, model_dir='model/pointgroup/pointgroup.py', model_name='pointgroup', momentum=0.9, multiplier=0.5, optim='Adam', prepare_epochs=128, pretrain='', pretrain_module=[], pretrain_path=None, save_freq=16, save_instance=False, save_pt_offsets=False, save_semantic=False, scale=50, score_fullscale=14, score_mode=4, score_scale=50, split='val', step_epoch=768, task='train', test_epoch=384, test_seed=567, test_workers=16, train_workers=16, use_coords=True, weight_decay=0.0001)
[2021-08-13 22:20:09,290 INFO train.py line 135 25892] => creating model ...
[2021-08-13 22:20:14,844 INFO train.py line 147 25892] cuda available: True
[2021-08-13 22:20:14,938 INFO train.py line 152 25892] #classifier parameters: 30106304
[2021-08-13 22:27:57,409 INFO scannetv2_inst.py line 50 25892] Training samples: 1201
[2021-08-13 22:27:57,430 INFO scannetv2_inst.py line 63 25892] Validation samples: 0
Traceback (most recent call last):
File "/data/xxx/.miniconda3/envs/pointV/lib/python3.7/multiprocessing/queues.py", line 242, in _feed
send_bytes(obj)
File "/data/xxx/.miniconda3/envs/pointV/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/data/xxx/.miniconda3/envs/pointV/lib/python3.7/multiprocessing/connection.py", line 404, in _send_bytes
self._send(header + buf)
File "/data/xxx/.miniconda3/envs/pointV/lib/python3.7/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Traceback (most recent call last):
File "train.py", line 182, in
train_epoch(dataset.train_data_loader, model, model_fn, optimizer, epoch)
File "train.py", line 64, in train_epoch
loss.backward()
File "/data/xxx/.miniconda3/envs/pointV/lib/python3.7/site-packages/torch/tensor.py", line 150, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/data/xxx/.miniconda3/envs/pointV/lib/python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
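For reference, the same message can be reproduced outside the repository. The sketch below is plain PyTorch and makes no claim about what model_fn actually does. It only shows that `backward()` fails this way when the loss is built entirely from tensors that do not track gradients, for example when every parameter is frozen or the loss is computed under `torch.no_grad()`. The helper `describe_loss` and all tensor names are hypothetical.

```python
import torch


def describe_loss(loss):
    """Report whether a loss tensor is connected to the autograd graph."""
    return {
        "requires_grad": loss.requires_grad,
        "has_grad_fn": loss.grad_fn is not None,
    }


# Case 1: the loss depends only on tensors without requires_grad,
# so autograd has nothing to differentiate.
pred = torch.zeros(4)
target = torch.ones(4)
detached_loss = ((pred - target) ** 2).mean()

# Case 2: the loss depends on a tensor that tracks gradients.
weight = torch.ones(4, requires_grad=True)
tracked_loss = ((weight * pred - target) ** 2).mean()

# backward() on the untracked loss raises the error from the traceback.
try:
    detached_loss.backward()
    error_message = ""
except RuntimeError as exc:
    error_message = str(exc)

# backward() on the tracked loss succeeds and fills weight.grad.
tracked_loss.backward()

print(describe_loss(detached_loss))
print(describe_loss(tracked_loss))
print(error_message)
```

Printing `loss.requires_grad` and `loss.grad_fn` just before `loss.backward()` in train_epoch would show whether the loss returned for this configuration is in the same state as `detached_loss` here.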