hustvl / MapTR

[ICLR'23 Spotlight] MapTR: Structured Modeling and Learning for Online Vectorized HD Map Construction
MIT License

Training runtime error #84

Closed HardysJin closed 10 months ago

HardysJin commented 11 months ago

2023-08-14 19:57:44,356 - mmcv - INFO - Reducer buckets have been rebuilt in this iteration.
2023-08-14 20:06:42,452 - mmdet - INFO - Epoch [25][50/1759] lr: 3.191e-04, eta: 5 days, 22:04:00, time: 11.201, data_time: 0.197, memory: 10558, loss_cls: 0.1298, loss_bbox: 0.0000, loss_iou: 0.0000, loss_pts: 2.7538, loss_dir: 0.0215, d0.loss_cls: 0.1842, d0.loss_bbox: 0.0000, d0.loss_iou: 0.0000, d0.loss_pts: 3.4238, d0.loss_dir: 0.0256, d1.loss_cls: 0.1569, d1.loss_bbox: 0.0000, d1.loss_iou: 0.0000, d1.loss_pts: 2.9039, d1.loss_dir: 0.0225, d2.loss_cls: 0.1445, d2.loss_bbox: 0.0000, d2.loss_iou: 0.0000, d2.loss_pts: 2.8066, d2.loss_dir: 0.0219, d3.loss_cls: 0.1344, d3.loss_bbox: 0.0000, d3.loss_iou: 0.0000, d3.loss_pts: 2.7689, d3.loss_dir: 0.0216, d4.loss_cls: 0.1252, d4.loss_bbox: 0.0000, d4.loss_iou: 0.0000, d4.loss_pts: 2.7611, d4.loss_dir: 0.0216, loss: 18.4279, grad_norm: 46.0450
2023-08-14 20:15:45,132 - mmdet - INFO - Epoch [25][100/1759] lr: 3.191e-04, eta: 5 days, 19:42:42, time: 10.854, data_time: 0.030, memory: 10558, loss_cls: 0.1430, loss_bbox: 0.0000, loss_iou: 0.0000, loss_pts: 2.9818, loss_dir: 0.0238, d0.loss_cls: 0.2034, d0.loss_bbox: 0.0000, d0.loss_iou: 0.0000, d0.loss_pts: 3.7004, d0.loss_dir: 0.0281, d1.loss_cls: 0.1794, d1.loss_bbox: 0.0000, d1.loss_iou: 0.0000, d1.loss_pts: 3.1415, d1.loss_dir: 0.0249, d2.loss_cls: 0.1642, d2.loss_bbox: 0.0000, d2.loss_iou: 0.0000, d2.loss_pts: 3.0319, d2.loss_dir: 0.0244, d3.loss_cls: 0.1530, d3.loss_bbox: 0.0000, d3.loss_iou: 0.0000, d3.loss_pts: 2.9953, d3.loss_dir: 0.0241, d4.loss_cls: 0.1436, d4.loss_bbox: 0.0000, d4.loss_iou: 0.0000, d4.loss_pts: 2.9870, d4.loss_dir: 0.0240, loss: 19.9738, grad_norm: 46.1005
[W python_anomaly_mode.cpp:104] Warning: Error detected in FusedDropoutBackward.
Traceback of forward call that caused the error:
  File "./tools/train.py", line 260, in <module>
    main()
  File "./tools/train.py", line 249, in main
    custom_train_model(
  File "/home/derek/hardys/MapTR/projects/mmdet3d_plugin/bevformer/apis/train.py", line 27, in custom_train_model
    custom_train_detector(
  File "/home/derek/hardys/MapTR/projects/mmdet3d_plugin/bevformer/apis/mmdet_train.py", line 199, in custom_train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/derek/miniconda3/envs/maptr2/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/derek/miniconda3/envs/maptr2/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/home/derek/miniconda3/envs/maptr2/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 29, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer,
  File "/home/derek/miniconda3/envs/maptr2/lib/python3.8/site-packages/mmcv/parallel/distributed.py", line 52, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/derek/miniconda3/envs/maptr2/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 237, in train_step
    losses = self(**data)
  File "/home/derek/miniconda3/envs/maptr2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/derek/hardys/MapTR/projects/mmdet3d_plugin/maptr/detectors/maptr.py", line 162, in forward
    return self.forward_train(**kwargs)
  File "/home/derek/miniconda3/envs/maptr2/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 214, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/home/derek/hardys/MapTR/projects/mmdet3d_plugin/maptr/detectors/maptr.py", line 277, in forward_train
    losses_pts = self.forward_pts_train(img_feats, lidar_feat, gt_bboxes_3d,
  File "/home/derek/hardys/MapTR/projects/mmdet3d_plugin/maptr/detectors/maptr.py", line 141, in forward_pts_train
    outs = self.pts_bbox_head(
  File "/home/derek/miniconda3/envs/maptr2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/derek/miniconda3/envs/maptr2/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 214, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/home/derek/hardys/MapTR/projects/mmdet3d_plugin/maptr/dense_heads/maptr_head.py", line 254, in forward
    outputs = self.transformer(
  File "/home/derek/miniconda3/envs/maptr2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/derek/hardys/MapTR/projects/mmdet3d_plugin/maptr/modules/transformer.py", line 339, in forward
    inter_states, inter_references = self.decoder(
  File "/home/derek/miniconda3/envs/maptr2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/derek/hardys/MapTR/projects/mmdet3d_plugin/maptr/modules/decoder.py", line 59, in forward
    output = layer(
  File "/home/derek/miniconda3/envs/maptr2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/derek/hardys/MapTR/projects/mmdet3d_plugin/maptr/modules/decoder.py", line 377, in forward
    query = self.ffns[ffn_index](
  File "/home/derek/miniconda3/envs/maptr2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/derek/miniconda3/envs/maptr2/lib/python3.8/site-packages/mmcv/utils/misc.py", line 340, in new_func
    output = old_func(*args, **kwargs)
  File "/home/derek/miniconda3/envs/maptr2/lib/python3.8/site-packages/mmcv/cnn/bricks/transformer.py", line 274, in forward
    out = self.layers(x)
  File "/home/derek/miniconda3/envs/maptr2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/derek/miniconda3/envs/maptr2/lib/python3.8/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/derek/miniconda3/envs/maptr2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/derek/miniconda3/envs/maptr2/lib/python3.8/site-packages/torch/nn/modules/dropout.py", line 58, in forward
    return F.dropout(input, self.p, self.training, self.inplace)
  File "/home/derek/miniconda3/envs/maptr2/lib/python3.8/site-packages/torch/nn/functional.py", line 1168, in dropout
    return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training)
 (function _print_stack)
Traceback (most recent call last):
  File "./tools/train.py", line 260, in <module>
    main()
  File "./tools/train.py", line 249, in main
    custom_train_model(
  File "/home/derek/hardys/MapTR/projects/mmdet3d_plugin/bevformer/apis/train.py", line 27, in custom_train_model
    custom_train_detector(
  File "/home/derek/hardys/MapTR/projects/mmdet3d_plugin/bevformer/apis/mmdet_train.py", line 199, in custom_train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/derek/miniconda3/envs/maptr2/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/derek/miniconda3/envs/maptr2/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 51, in train
    self.call_hook('after_train_iter')
  File "/home/derek/miniconda3/envs/maptr2/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/derek/miniconda3/envs/maptr2/lib/python3.8/site-packages/mmcv/runner/hooks/optimizer.py", line 224, in after_train_iter
    self.loss_scaler.scale(runner.outputs['loss']).backward()
  File "/home/derek/miniconda3/envs/maptr2/lib/python3.8/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/derek/miniconda3/envs/maptr2/lib/python3.8/site-packages/torch/autograd/__init__.py", line 147, in backward
    Variable._execution_engine.run_backward(

RuntimeError: Trying to backward through the graph a second time (or directly access saved variables after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved variables after calling backward.

Does anyone know how to fix this?
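
For readers unfamiliar with this error, here is a minimal pure-Python sketch of the behaviour the message describes. It is a toy model, not PyTorch code: the class `MulNode` and its `saved` attribute are hypothetical names made up for illustration. It only shows why a second backward pass fails once the saved intermediate values have been freed, and why `retain_graph=True` avoids that particular failure.

```python
class MulNode:
    """Toy stand-in for one autograd node computing y = w * x (not PyTorch code)."""

    def __init__(self, w, x):
        self.saved = (w, x)      # intermediate values kept for the backward pass
        self.output = w * x

    def backward(self, grad_out=1.0, retain_graph=False):
        if self.saved is None:
            raise RuntimeError("Trying to backward through the graph a second time")
        w, x = self.saved
        grads = (grad_out * x, grad_out * w)   # (dy/dw, dy/dx)
        if not retain_graph:
            self.saved = None    # free the saved values after use
        return grads


node = MulNode(w=3.0, x=2.0)
print(node.backward())           # first pass works, then frees the saved values
try:
    node.backward()              # second pass has nothing left to read
except RuntimeError as err:
    print("second backward failed:", err)

kept = MulNode(w=3.0, x=2.0)
kept.backward(retain_graph=True)
print(kept.backward())           # works because the saved values were retained
```

Keeping the saved values alive only removes the "freed" failure. It does not explain why the same graph is being traversed twice in the training run above, which is the part that would still need to be tracked down.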

HardysJin commented 11 months ago

I tried changing the call to self.loss_scaler.scale(runner.outputs['loss']).backward(retain_graph=True) in mmcv's optimizer hook (mmcv/runner/hooks/optimizer.py, line 224), and the error becomes:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [512, 256]], which is output 0 of TBackward, is at version 224; expected version 222 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
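
The "is at version 224; expected version 222" part of this message refers to a per-tensor version counter that is bumped on every in-place change. The sketch below is a pure-Python toy model of that check, not PyTorch's implementation: `ToyTensor` and `SavedForBackward` are hypothetical names, and the in-place update used here is only one assumed way a saved value could change. The thread does not establish what actually modified the [512, 256] tensor.

```python
class ToyTensor:
    """Toy tensor with a version counter bumped by in-place ops (illustration only)."""

    def __init__(self, value):
        self.value = value
        self.version = 0

    def add_(self, delta):
        self.value += delta      # in-place modification
        self.version += 1        # every in-place change bumps the version
        return self


class SavedForBackward:
    """Records the tensor version seen during the forward pass."""

    def __init__(self, tensor):
        self.tensor = tensor
        self.expected_version = tensor.version

    def unpack(self):
        # The backward pass refuses to use a value that changed after it was saved.
        if self.tensor.version != self.expected_version:
            raise RuntimeError(
                "variable modified by an inplace operation: is at version "
                f"{self.tensor.version}; expected version {self.expected_version} instead"
            )
        return self.tensor.value


weight = ToyTensor(1.5)
saved = SavedForBackward(weight)   # forward pass saves the weight
weight.add_(-0.1)                  # assumed: something updates the weight in place
weight.add_(-0.1)                  # a second in-place change
try:
    saved.unpack()                 # backward pass now sees a newer version
except RuntimeError as err:
    print(err)
```

In this toy run the counter is two steps ahead of the saved version, the same gap as 224 versus 222 in the message above, which suggests the tensor was changed in place twice between the forward pass that saved it and the backward pass that tried to use it.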