kennymckormick / pyskl

A toolbox for skeleton-based action recognition.
Apache License 2.0

train error #203

Open 121649982 opened 10 months ago

121649982 commented 10 months ago

First of all, thank you for your great work. When I train with only one class, I run into a problem: the loss is always 0.

2023-09-07 14:27:17,951 - pyskl - INFO - Config:
model = dict(
    type='Recognizer3D',
    backbone=dict(type='C3D', in_channels=17, base_channels=32, num_stages=3, temporal_downsample=False),
    cls_head=dict(type='I3DHead', in_channels=256, num_classes=1, dropout=0.5),
    test_cfg=dict(average_clips='prob'))
dataset_type = 'PoseDataset'
ann_file = './data/nturgbd/train.pkl'
left_kp = [1, 3, 5, 7, 9, 11, 13, 15]
right_kp = [2, 4, 6, 8, 10, 12, 14, 16]
train_pipeline = [
    dict(type='UniformSampleFrames', clip_len=48),
    dict(type='PoseDecode'),
    dict(type='PoseCompact', hw_ratio=1.0, allow_imgpad=True),
    dict(type='Resize', scale=(-1, 64)),
    dict(type='RandomResizedCrop', area_range=(0.56, 1.0)),
    dict(type='Resize', scale=(56, 56), keep_ratio=False),
    dict(type='Flip', flip_ratio=0.5, left_kp=[1, 3, 5, 7, 9, 11, 13, 15], right_kp=[2, 4, 6, 8, 10, 12, 14, 16]),
    dict(type='GeneratePoseTarget', with_kp=True, with_limb=False),
    dict(type='FormatShape', input_format='NCTHW_Heatmap'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs', 'label'])
]
val_pipeline = [
    dict(type='UniformSampleFrames', clip_len=48, num_clips=1),
    dict(type='PoseDecode'),
    dict(type='PoseCompact', hw_ratio=1.0, allow_imgpad=True),
    dict(type='Resize', scale=(64, 64), keep_ratio=False),
    dict(type='GeneratePoseTarget', with_kp=True, with_limb=False),
    dict(type='FormatShape', input_format='NCTHW_Heatmap'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs'])
]
test_pipeline = [
    dict(type='UniformSampleFrames', clip_len=48, num_clips=10),
    dict(type='PoseDecode'),
    dict(type='PoseCompact', hw_ratio=1.0, allow_imgpad=True),
    dict(type='Resize', scale=(64, 64), keep_ratio=False),
    dict(type='GeneratePoseTarget', with_kp=True, with_limb=False, double=True, left_kp=[1, 3, 5, 7, 9, 11, 13, 15], right_kp=[2, 4, 6, 8, 10, 12, 14, 16]),
    dict(type='FormatShape', input_format='NCTHW_Heatmap'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs'])
]
data = dict(
    videos_per_gpu=32,
    workers_per_gpu=4,
    test_dataloader=dict(videos_per_gpu=1),
    train=dict(
        type='RepeatDataset',
        times=10,
        dataset=dict(type='PoseDataset', ann_file='./data/nturgbd/train.pkl', split='xsub_train', pipeline=[dict(type='UniformSampleFrames', clip_len=48), dict(type='PoseDecode'), dict(type='PoseCompact', hw_ratio=1.0, allow_imgpad=True), dict(type='Resize', scale=(-1, 64)), dict(type='RandomResizedCrop', area_range=(0.56, 1.0)), dict(type='Resize', scale=(56, 56), keep_ratio=False), dict(type='Flip', flip_ratio=0.5, left_kp=[1, 3, 5, 7, 9, 11, 13, 15], right_kp=[2, 4, 6, 8, 10, 12, 14, 16]), dict(type='GeneratePoseTarget', with_kp=True, with_limb=False), dict(type='FormatShape', input_format='NCTHW_Heatmap'), dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]), dict(type='ToTensor', keys=['imgs', 'label'])])),
    val=dict(type='PoseDataset', ann_file='./data/nturgbd/train.pkl', split='xsub_val', pipeline=[dict(type='UniformSampleFrames', clip_len=48, num_clips=1), dict(type='PoseDecode'), dict(type='PoseCompact', hw_ratio=1.0, allow_imgpad=True), dict(type='Resize', scale=(64, 64), keep_ratio=False), dict(type='GeneratePoseTarget', with_kp=True, with_limb=False), dict(type='FormatShape', input_format='NCTHW_Heatmap'), dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]), dict(type='ToTensor', keys=['imgs'])]),
    test=dict(type='PoseDataset', ann_file='./data/nturgbd/train.pkl', split='xsub_val', pipeline=[dict(type='UniformSampleFrames', clip_len=48, num_clips=10), dict(type='PoseDecode'), dict(type='PoseCompact', hw_ratio=1.0, allow_imgpad=True), dict(type='Resize', scale=(64, 64), keep_ratio=False), dict(type='GeneratePoseTarget', with_kp=True, with_limb=False, double=True, left_kp=[1, 3, 5, 7, 9, 11, 13, 15], right_kp=[2, 4, 6, 8, 10, 12, 14, 16]), dict(type='FormatShape', input_format='NCTHW_Heatmap'), dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]), dict(type='ToTensor', keys=['imgs'])]))
optimizer = dict(type='SGD', lr=0.4, momentum=0.9, weight_decay=0.0003)
optimizer_config = dict(grad_clip=dict(max_norm=40, norm_type=2))
lr_config = dict(policy='CosineAnnealing', by_epoch=False, min_lr=0)
total_epochs = 24
checkpoint_config = dict(interval=1)
evaluation = dict(interval=1, metrics=['top_k_accuracy', 'mean_class_accuracy'], topk=(1, 5))
log_config = dict(interval=20, hooks=[dict(type='TextLoggerHook')])
log_level = 'INFO'
work_dir = './work_dirs/posec3d/c3d_light_ntu60_xsub/joint'
dist_params = dict(backend='nccl')
gpu_ids = range(0, 1)
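A likely explanation for the zero loss, inferred from the config above rather than stated in the thread: with num_classes=1 the I3DHead emits a single logit per clip, softmax over a single class always yields probability 1.0, and the cross-entropy is therefore log(1) = 0. This also matches the constant top1_acc: 1.0000 and grad_norm: 0.0000 in the log below. A minimal PyTorch sketch of the effect:

```python
# Minimal sketch: cross-entropy over a single class is identically zero.
import torch
import torch.nn.functional as F

logits = torch.randn(4, 1)                 # 4 clips, num_classes=1 -> one logit each
labels = torch.zeros(4, dtype=torch.long)  # the only possible label is 0

print(F.softmax(logits, dim=1))            # all ones: the single class always gets probability 1.0
print(F.cross_entropy(logits, labels))     # tensor(0.) -> loss_cls, loss and grad_norm stay at 0
```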

2023-09-07 14:27:17,951 - pyskl - INFO - Set random seed to 1045533513, deterministic: False
2023-09-07 14:27:18,009 - pyskl - INFO - 704 videos remain after valid thresholding
fatal: not a git repository (or any of the parent directories): .git
2023-09-07 14:27:19,134 - pyskl - INFO - Start running, host: lhc@lhc, work_dir: /home/lhc/gc8/pyskl-main/work_dirs/posec3d/c3d_light_ntu60_xsub/joint
2023-09-07 14:27:19,134 - pyskl - INFO - Hooks will be executed in the following order:
before_run: (VERY_HIGH ) CosineAnnealingLrUpdaterHook
(NORMAL ) CheckpointHook
(VERY_LOW ) TextLoggerHook


before_train_epoch: (VERY_HIGH ) CosineAnnealingLrUpdaterHook
(NORMAL ) DistSamplerSeedHook
(LOW ) IterTimerHook
(VERY_LOW ) TextLoggerHook


before_train_iter: (VERY_HIGH ) CosineAnnealingLrUpdaterHook
(LOW ) IterTimerHook


after_train_iter: (ABOVE_NORMAL) OptimizerHook
(NORMAL ) CheckpointHook
(LOW ) IterTimerHook
(VERY_LOW ) TextLoggerHook


after_train_epoch: (NORMAL ) CheckpointHook
(VERY_LOW ) TextLoggerHook


before_val_epoch: (NORMAL ) DistSamplerSeedHook
(LOW ) IterTimerHook
(VERY_LOW ) TextLoggerHook


before_val_iter: (LOW ) IterTimerHook


after_val_iter: (LOW ) IterTimerHook


after_val_epoch: (VERY_LOW ) TextLoggerHook


after_run: (VERY_LOW ) TextLoggerHook


2023-09-07 14:27:19,134 - pyskl - INFO - workflow: [('train', 1)], max: 24 epochs
2023-09-07 14:27:19,134 - pyskl - INFO - Checkpoints will be saved to /home/lhc/gc8/pyskl-main/work_dirs/posec3d/c3d_light_ntu60_xsub/joint by HardDiskBackend.
[W reducer.cpp:1298] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
2023-09-07 14:27:34,668 - pyskl - INFO - Epoch [1][20/220] lr: 4.000e-01, eta: 1:08:05, time: 0.777, data_time: 0.446, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:27:39,113 - pyskl - INFO - Epoch [1][40/220] lr: 3.999e-01, eta: 0:43:37, time: 0.222, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:27:43,633 - pyskl - INFO - Epoch [1][60/220] lr: 3.999e-01, eta: 0:35:31, time: 0.226, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:27:48,048 - pyskl - INFO - Epoch [1][80/220] lr: 3.998e-01, eta: 0:31:19, time: 0.221, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:27:52,572 - pyskl - INFO - Epoch [1][100/220] lr: 3.997e-01, eta: 0:28:52, time: 0.226, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:27:56,953 - pyskl - INFO - Epoch [1][120/220] lr: 3.995e-01, eta: 0:27:06, time: 0.219, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:28:01,348 - pyskl - INFO - Epoch [1][140/220] lr: 3.993e-01, eta: 0:25:49, time: 0.220, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:28:05,744 - pyskl - INFO - Epoch [1][160/220] lr: 3.991e-01, eta: 0:24:51, time: 0.220, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:28:10,142 - pyskl - INFO - Epoch [1][180/220] lr: 3.989e-01, eta: 0:24:05, time: 0.220, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:28:14,541 - pyskl - INFO - Epoch [1][200/220] lr: 3.986e-01, eta: 0:23:27, time: 0.220, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:28:18,940 - pyskl - INFO - Epoch [1][220/220] lr: 3.983e-01, eta: 0:22:55, time: 0.220, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:28:19,302 - pyskl - INFO - Saving checkpoint at 1 epochs
2023-09-07 14:28:32,316 - pyskl - INFO - Epoch [2][20/220] lr: 3.980e-01, eta: 0:25:28, time: 0.649, data_time: 0.427, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:28:36,710 - pyskl - INFO - Epoch [2][40/220] lr: 3.976e-01, eta: 0:24:49, time: 0.220, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:28:41,117 - pyskl - INFO - Epoch [2][60/220] lr: 3.973e-01, eta: 0:24:16, time: 0.220, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:28:45,533 - pyskl - INFO - Epoch [2][80/220] lr: 3.968e-01, eta: 0:23:47, time: 0.221, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:28:49,956 - pyskl - INFO - Epoch [2][100/220] lr: 3.964e-01, eta: 0:23:21, time: 0.221, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:28:54,377 - pyskl - INFO - Epoch [2][120/220] lr: 3.959e-01, eta: 0:22:57, time: 0.221, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:28:58,798 - pyskl - INFO - Epoch [2][140/220] lr: 3.955e-01, eta: 0:22:36, time: 0.221, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:29:03,223 - pyskl - INFO - Epoch [2][160/220] lr: 3.949e-01, eta: 0:22:16, time: 0.221, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:29:07,646 - pyskl - INFO - Epoch [2][180/220] lr: 3.944e-01, eta: 0:21:58, time: 0.221, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:29:12,070 - pyskl - INFO - Epoch [2][200/220] lr: 3.938e-01, eta: 0:21:42, time: 0.221, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
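If the goal is to recognize a single action of interest, one workaround (an assumption here, not something confirmed in this thread) is to reframe it as a two-class problem: keep clips of the target action as class 0, add negative/background clips as class 1, and set num_classes=2 so the cross-entropy has something to discriminate. A sketch of the corresponding change to the head in the config above:

```python
# Hypothetical change to the config shown above: two classes instead of one.
# Assumes the annotation pickle also contains negative/background clips labeled 1.
model = dict(
    type='Recognizer3D',
    backbone=dict(
        type='C3D',
        in_channels=17,
        base_channels=32,
        num_stages=3,
        temporal_downsample=False),
    cls_head=dict(
        type='I3DHead',
        in_channels=256,
        num_classes=2,   # 0 = target action, 1 = background/negative samples
        dropout=0.5),
    test_cfg=dict(average_clips='prob'))
```

With only two classes, topk=(1, 5) in the evaluation section is no longer meaningful and could be reduced to topk=(1,).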

SafwenNaimi commented 10 months ago

Hello, did you manage to run pyskl on a single GPU? I am trying to do so, but I am encountering some errors!

121649982 commented 10 months ago

> Hello, did you manage to run pyskl on a single GPU? I am trying to do so, but I am encountering some errors!

Yes, I have trained on a single GPU. What error did you encounter?

SafwenNaimi commented 10 months ago

I proceeded by removing the MMDistributedDataParallel wrapper and removing any distributed training hooks, such as DistSamplerSeedHook, in tools/train.py. The problem is that I am now getting the following error: RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor. I don't know if there are any changes that need to be made in scripts other than tools/train.py.

What exactly did you change to make it work on a single GPU? Thanks in advance.
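Regarding the RuntimeError above: that mismatch usually means the model weights are on the GPU while the input batch is still on the CPU; in the original script it is the MMDistributedDataParallel wrapper that scatters each batch onto the device. A possible sketch, assuming mmcv 1.x (which pyskl builds on), is to swap in MMDataParallel rather than dropping the wrapper entirely:

```python
# Hypothetical tools/train.py tweak for a single-GPU, non-distributed run.
# MMDataParallel scatters each DataContainer batch onto the GPU, avoiding the
# "Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor)" mismatch.
from mmcv.parallel import MMDataParallel

# `model` is whatever the script already built from cfg.model; only the wrapping changes.
model = MMDataParallel(model.cuda(), device_ids=[0])
# The runner can then be used without the distributed hooks (e.g. DistSamplerSeedHook).
```

Alternatively, the distributed launcher itself can be run with a single process (bash tools/dist_train.sh <config> 1), which is the path the repo scripts are written for and avoids editing train.py at all.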