Open 121649982 opened 10 months ago
Hello, did you manage to run pyskl on a single GPU? I am trying to do so, but I am encountering some errors!
Yes, I have trained on a single GPU. What error did you encounter?
I proceeded by removing the MMDistributedDataParallel wrapper and any distributed training hooks, such as DistSamplerSeedHook, in tools/train.py. The problem is that I am getting the following error: RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor. I don't know whether scripts other than tools/train.py also need changes.
What exactly did you change to make it work on a single GPU? Thanks in advance.
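For what it's worth, that RuntimeError usually means the input batch is still a CPU tensor while the model weights live on the GPU. My understanding is that the parallel wrapper is what moves each batch onto the GPU, so removing it without a replacement leaves the inputs on the CPU. Below is a minimal plain-PyTorch sketch of the mismatch and one way around it. It is not pyskl code: the Conv3d layer, the tensor shapes, and the `forward_on_model_device` helper are hypothetical stand-ins chosen for illustration.

```python
import torch
import torch.nn as nn

# Use the GPU when one is present; fall back to CPU so the sketch still runs.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-in for the backbone: 17 keypoint heatmap channels in.
model = nn.Conv3d(17, 32, kernel_size=3, padding=1).to(device)

# A dataloader yields CPU tensors unless something moves them.
batch = torch.randn(2, 17, 8, 56, 56)


def forward_on_model_device(model, batch):
    """Move the batch to wherever the model's weights are, then run it."""
    target = next(model.parameters()).device
    return model(batch.to(target))


# Calling model(batch) directly on a GPU machine would raise the
# "Input type ... and weight type ... should be the same" error,
# because the batch is on the CPU and the weights are on the GPU.
out = forward_on_model_device(model, batch)
print(out.shape, out.device)
```

In mmcv 1.x the usual non-distributed counterpart of MMDistributedDataParallel is MMDataParallel from mmcv.parallel, which moves the inputs for you. Whether it drops in cleanly here is an assumption I have not verified against pyskl.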
First of all, thank you for your great work. When I train with one class, I encounter a problem: the loss is always 0.
2023-09-07 14:27:17,951 - pyskl - INFO - Config:
model = dict(
    type='Recognizer3D',
    backbone=dict(
        type='C3D',
        in_channels=17,
        base_channels=32,
        num_stages=3,
        temporal_downsample=False),
    cls_head=dict(type='I3DHead', in_channels=256, num_classes=1, dropout=0.5),
    test_cfg=dict(average_clips='prob'))
dataset_type = 'PoseDataset'
ann_file = './data/nturgbd/train.pkl'
left_kp = [1, 3, 5, 7, 9, 11, 13, 15]
right_kp = [2, 4, 6, 8, 10, 12, 14, 16]
train_pipeline = [
    dict(type='UniformSampleFrames', clip_len=48),
    dict(type='PoseDecode'),
    dict(type='PoseCompact', hw_ratio=1.0, allow_imgpad=True),
    dict(type='Resize', scale=(-1, 64)),
    dict(type='RandomResizedCrop', area_range=(0.56, 1.0)),
    dict(type='Resize', scale=(56, 56), keep_ratio=False),
    dict(
        type='Flip',
        flip_ratio=0.5,
        left_kp=[1, 3, 5, 7, 9, 11, 13, 15],
        right_kp=[2, 4, 6, 8, 10, 12, 14, 16]),
    dict(type='GeneratePoseTarget', with_kp=True, with_limb=False),
    dict(type='FormatShape', input_format='NCTHW_Heatmap'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs', 'label'])
]
val_pipeline = [
    dict(type='UniformSampleFrames', clip_len=48, num_clips=1),
    dict(type='PoseDecode'),
    dict(type='PoseCompact', hw_ratio=1.0, allow_imgpad=True),
    dict(type='Resize', scale=(64, 64), keep_ratio=False),
    dict(type='GeneratePoseTarget', with_kp=True, with_limb=False),
    dict(type='FormatShape', input_format='NCTHW_Heatmap'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs'])
]
test_pipeline = [
    dict(type='UniformSampleFrames', clip_len=48, num_clips=10),
    dict(type='PoseDecode'),
    dict(type='PoseCompact', hw_ratio=1.0, allow_imgpad=True),
    dict(type='Resize', scale=(64, 64), keep_ratio=False),
    dict(
        type='GeneratePoseTarget',
        with_kp=True,
        with_limb=False,
        double=True,
        left_kp=[1, 3, 5, 7, 9, 11, 13, 15],
        right_kp=[2, 4, 6, 8, 10, 12, 14, 16]),
    dict(type='FormatShape', input_format='NCTHW_Heatmap'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs'])
]
data = dict(
    videos_per_gpu=32,
    workers_per_gpu=4,
    test_dataloader=dict(videos_per_gpu=1),
    train=dict(
        type='RepeatDataset',
        times=10,
        dataset=dict(
            type='PoseDataset',
            ann_file='./data/nturgbd/train.pkl',
            split='xsub_train',
            pipeline=[
                dict(type='UniformSampleFrames', clip_len=48),
                dict(type='PoseDecode'),
                dict(type='PoseCompact', hw_ratio=1.0, allow_imgpad=True),
                dict(type='Resize', scale=(-1, 64)),
                dict(type='RandomResizedCrop', area_range=(0.56, 1.0)),
                dict(type='Resize', scale=(56, 56), keep_ratio=False),
                dict(
                    type='Flip',
                    flip_ratio=0.5,
                    left_kp=[1, 3, 5, 7, 9, 11, 13, 15],
                    right_kp=[2, 4, 6, 8, 10, 12, 14, 16]),
                dict(type='GeneratePoseTarget', with_kp=True, with_limb=False),
                dict(type='FormatShape', input_format='NCTHW_Heatmap'),
                dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
                dict(type='ToTensor', keys=['imgs', 'label'])
            ])),
    val=dict(
        type='PoseDataset',
        ann_file='./data/nturgbd/train.pkl',
        split='xsub_val',
        pipeline=[
            dict(type='UniformSampleFrames', clip_len=48, num_clips=1),
            dict(type='PoseDecode'),
            dict(type='PoseCompact', hw_ratio=1.0, allow_imgpad=True),
            dict(type='Resize', scale=(64, 64), keep_ratio=False),
            dict(type='GeneratePoseTarget', with_kp=True, with_limb=False),
            dict(type='FormatShape', input_format='NCTHW_Heatmap'),
            dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
            dict(type='ToTensor', keys=['imgs'])
        ]),
    test=dict(
        type='PoseDataset',
        ann_file='./data/nturgbd/train.pkl',
        split='xsub_val',
        pipeline=[
            dict(type='UniformSampleFrames', clip_len=48, num_clips=10),
            dict(type='PoseDecode'),
            dict(type='PoseCompact', hw_ratio=1.0, allow_imgpad=True),
            dict(type='Resize', scale=(64, 64), keep_ratio=False),
            dict(
                type='GeneratePoseTarget',
                with_kp=True,
                with_limb=False,
                double=True,
                left_kp=[1, 3, 5, 7, 9, 11, 13, 15],
                right_kp=[2, 4, 6, 8, 10, 12, 14, 16]),
            dict(type='FormatShape', input_format='NCTHW_Heatmap'),
            dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
            dict(type='ToTensor', keys=['imgs'])
        ]))
optimizer = dict(type='SGD', lr=0.4, momentum=0.9, weight_decay=0.0003)
optimizer_config = dict(grad_clip=dict(max_norm=40, norm_type=2))
lr_config = dict(policy='CosineAnnealing', by_epoch=False, min_lr=0)
total_epochs = 24
checkpoint_config = dict(interval=1)
evaluation = dict(
    interval=1,
    metrics=['top_k_accuracy', 'mean_class_accuracy'],
    topk=(1, 5))
log_config = dict(interval=20, hooks=[dict(type='TextLoggerHook')])
log_level = 'INFO'
work_dir = './work_dirs/posec3d/c3d_light_ntu60_xsub/joint'
dist_params = dict(backend='nccl')
gpu_ids = range(0, 1)
2023-09-07 14:27:17,951 - pyskl - INFO - Set random seed to 1045533513, deterministic: False
2023-09-07 14:27:18,009 - pyskl - INFO - 704 videos remain after valid thresholding
fatal: not a git repository (or any of the parent directories): .git
2023-09-07 14:27:19,134 - pyskl - INFO - Start running, host: lhc@lhc, work_dir: /home/lhc/gc8/pyskl-main/work_dirs/posec3d/c3d_light_ntu60_xsub/joint
2023-09-07 14:27:19,134 - pyskl - INFO - Hooks will be executed in the following order:
before_run: (VERY_HIGH ) CosineAnnealingLrUpdaterHook
(NORMAL ) CheckpointHook
(VERY_LOW ) TextLoggerHook
before_train_epoch: (VERY_HIGH ) CosineAnnealingLrUpdaterHook
(NORMAL ) DistSamplerSeedHook
(LOW ) IterTimerHook
(VERY_LOW ) TextLoggerHook
before_train_iter: (VERY_HIGH ) CosineAnnealingLrUpdaterHook
(LOW ) IterTimerHook
after_train_iter: (ABOVE_NORMAL) OptimizerHook
(NORMAL ) CheckpointHook
(LOW ) IterTimerHook
(VERY_LOW ) TextLoggerHook
after_train_epoch: (NORMAL ) CheckpointHook
(VERY_LOW ) TextLoggerHook
before_val_epoch: (NORMAL ) DistSamplerSeedHook
(LOW ) IterTimerHook
(VERY_LOW ) TextLoggerHook
before_val_iter: (LOW ) IterTimerHook
after_val_iter: (LOW ) IterTimerHook
after_val_epoch: (VERY_LOW ) TextLoggerHook
after_run: (VERY_LOW ) TextLoggerHook
2023-09-07 14:27:19,134 - pyskl - INFO - workflow: [('train', 1)], max: 24 epochs
2023-09-07 14:27:19,134 - pyskl - INFO - Checkpoints will be saved to /home/lhc/gc8/pyskl-main/work_dirs/posec3d/c3d_light_ntu60_xsub/joint by HardDiskBackend.
[W reducer.cpp:1298] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
2023-09-07 14:27:34,668 - pyskl - INFO - Epoch [1][20/220] lr: 4.000e-01, eta: 1:08:05, time: 0.777, data_time: 0.446, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:27:39,113 - pyskl - INFO - Epoch [1][40/220] lr: 3.999e-01, eta: 0:43:37, time: 0.222, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:27:43,633 - pyskl - INFO - Epoch [1][60/220] lr: 3.999e-01, eta: 0:35:31, time: 0.226, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:27:48,048 - pyskl - INFO - Epoch [1][80/220] lr: 3.998e-01, eta: 0:31:19, time: 0.221, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:27:52,572 - pyskl - INFO - Epoch [1][100/220] lr: 3.997e-01, eta: 0:28:52, time: 0.226, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:27:56,953 - pyskl - INFO - Epoch [1][120/220] lr: 3.995e-01, eta: 0:27:06, time: 0.219, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:28:01,348 - pyskl - INFO - Epoch [1][140/220] lr: 3.993e-01, eta: 0:25:49, time: 0.220, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:28:05,744 - pyskl - INFO - Epoch [1][160/220] lr: 3.991e-01, eta: 0:24:51, time: 0.220, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:28:10,142 - pyskl - INFO - Epoch [1][180/220] lr: 3.989e-01, eta: 0:24:05, time: 0.220, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:28:14,541 - pyskl - INFO - Epoch [1][200/220] lr: 3.986e-01, eta: 0:23:27, time: 0.220, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:28:18,940 - pyskl - INFO - Epoch [1][220/220] lr: 3.983e-01, eta: 0:22:55, time: 0.220, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:28:19,302 - pyskl - INFO - Saving checkpoint at 1 epochs
2023-09-07 14:28:32,316 - pyskl - INFO - Epoch [2][20/220] lr: 3.980e-01, eta: 0:25:28, time: 0.649, data_time: 0.427, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:28:36,710 - pyskl - INFO - Epoch [2][40/220] lr: 3.976e-01, eta: 0:24:49, time: 0.220, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:28:41,117 - pyskl - INFO - Epoch [2][60/220] lr: 3.973e-01, eta: 0:24:16, time: 0.220, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:28:45,533 - pyskl - INFO - Epoch [2][80/220] lr: 3.968e-01, eta: 0:23:47, time: 0.221, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:28:49,956 - pyskl - INFO - Epoch [2][100/220] lr: 3.964e-01, eta: 0:23:21, time: 0.221, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:28:54,377 - pyskl - INFO - Epoch [2][120/220] lr: 3.959e-01, eta: 0:22:57, time: 0.221, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:28:58,798 - pyskl - INFO - Epoch [2][140/220] lr: 3.955e-01, eta: 0:22:36, time: 0.221, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:29:03,223 - pyskl - INFO - Epoch [2][160/220] lr: 3.949e-01, eta: 0:22:16, time: 0.221, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:29:07,646 - pyskl - INFO - Epoch [2][180/220] lr: 3.944e-01, eta: 0:21:58, time: 0.221, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:29:12,070 - pyskl - INFO - Epoch [2][200/220] lr: 3.938e-01, eta: 0:21:42, time: 0.221, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
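A possible explanation for the log above, assuming the classification head uses a softmax cross-entropy loss (which I believe is the default, but have not confirmed for this config): with num_classes=1 the softmax over a single logit is always exactly 1, so the loss is -log(1) = 0 and the gradient is 0 no matter what the network outputs. That would match top1_acc: 1.0000, loss: 0.0000 and grad_norm: 0.0000 from the very first iteration. The pure-Python sketch below only illustrates the arithmetic; the function names are my own and nothing here is pyskl code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample with an integer label."""
    return -math.log(softmax(logits)[label])


def grad_wrt_logits(logits, label):
    """Gradient of the loss w.r.t. the logits: softmax minus one-hot."""
    probs = softmax(logits)
    return [p - (1.0 if i == label else 0.0) for i, p in enumerate(probs)]


# One class: whatever the single logit is, the probability is 1,
# so both the loss and the gradient vanish.
for logit in (-5.0, 0.0, 3.7):
    print(logit, cross_entropy([logit], 0), grad_wrt_logits([logit], 0))

# Two classes: the loss is positive and the gradient is non-zero.
print(cross_entropy([0.3, -1.2], 0), grad_wrt_logits([0.3, -1.2], 0))
```

If this is indeed the cause, one common workaround is to add a second "background" or negative class with its own samples and set num_classes=2, so the network has something to discriminate against.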