open-mmlab / mmaction2

OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark
https://mmaction2.readthedocs.io
Apache License 2.0

[Bug] STGCN training not going as expected #2496

Open · MABatin opened 1 year ago

MABatin commented 1 year ago

Branch

0.x branch (0.x version, such as v0.24.1)

Environment

sys.platform: linux
Python: 3.8.10 (default, Mar 13 2023, 10:26:41) [GCC 9.4.0]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0: NVIDIA GeForce GTX 1080
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.3, V11.3.109
GCC: x86_64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 1.12.1+cu113
TorchVision: 0.13.1+cu113
OpenCV: 4.5.4
MMEngine: 0.7.3
MMAction2: 1.0.0+

Describe the bug

When training an STGCN model on a custom dataset with 3 classes, I see that the loss isn't going down at all. It looks like the following:

[W&B chart, 5/25/2023 1:24:36 PM: training loss]

[W&B chart, 5/25/2023 1:24:21 PM: val/top1_accuracy]

As can be seen, the training loss just oscillates and val/top1_accuracy stays constant, which indicates the model isn't learning anything. Why is that?
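
One quick sanity check (a hedged sketch, assuming the 0.x PoseDataset pickle is a list of dicts with a 'label' field, as the standard NTU skeleton annotations are): if val/top1_accuracy is stuck at the majority class's share of the data, the model has likely collapsed to predicting a single class.

import pickle
from collections import Counter

# Path taken from the config below; adjust as needed.
with open('data/ntu-fall/ntu-fall_xsub_val.pkl', 'rb') as f:
    annos = pickle.load(f)

# Per-class sample counts: if top-1 accuracy equals the largest
# share below, the model is predicting one class for everything.
counts = Counter(sample['label'] for sample in annos)
total = sum(counts.values())
for label, n in sorted(counts.items()):
    print(f'class {label}: {n} samples ({n / total:.1%})')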

Reproduces the problem - code sample

I am using the following config:

model = dict(
    type='SkeletonGCN',
    backbone=dict(
        type='STGCN',
        in_channels=3,
        edge_importance_weighting=True,
        graph_cfg=dict(layout='coco', strategy='spatial')),
    cls_head=dict(
        type='STGCNHead',
        num_classes=3,
        in_channels=256,
        loss_cls=dict(
            type='CrossEntropyLoss',
            # per-class weights to counter class imbalance;
            # alternative noted for ntu60-fall: [46.818, 1.0, 0.999]
            class_weight=[0.632, 1.0, 2.496])),
    train_cfg=None,
    test_cfg=None)

dataset_type = 'PoseDataset'
ann_file_train = '/home/portia/portia-train/mmaction2/data/ntu-fall/ntu-fall_xsub_train.pkl'
ann_file_val = '/home/portia/portia-train/mmaction2/data/ntu-fall/ntu-fall_xsub_val.pkl'
train_pipeline = [
    dict(type='PaddingWithLoop', clip_len=6),
    dict(type='PoseDecode'),
    dict(type='FormatGCNInput', input_format='NCTVM'),
    dict(type='PoseNormalize'),  # keypoint normalization (see the double-normalization discussion below)
    dict(type='Collect', keys=['keypoint', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['keypoint'])
]
val_pipeline = [
    dict(type='PaddingWithLoop', clip_len=6),
    dict(type='PoseDecode'),
    dict(type='FormatGCNInput', input_format='NCTVM'),
    dict(type='PoseNormalize'),
    dict(type='Collect', keys=['keypoint', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['keypoint'])
]
test_pipeline = [
    dict(type='PaddingWithLoop', clip_len=6),
    dict(type='PoseDecode'),
    dict(type='FormatGCNInput', input_format='NCTVM'),
    dict(type='PoseNormalize'),
    dict(type='Collect', keys=['keypoint', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['keypoint'])
]
data = dict(
    videos_per_gpu=16,
    workers_per_gpu=2,
    test_dataloader=dict(videos_per_gpu=1),
    train=dict(
        type=dataset_type,
        ann_file=ann_file_train,
        data_prefix='',
        pipeline=train_pipeline),
    val=dict(
        type=dataset_type,
        ann_file=ann_file_val,
        data_prefix='',
        pipeline=val_pipeline),
    test=dict(
        type=dataset_type,
        ann_file=ann_file_val,
        data_prefix='',
        pipeline=test_pipeline))

# optimizer
optimizer = dict(
    type='SGD', lr=0.1, momentum=0.9, weight_decay=0.0001, nesterov=True)
optimizer_config = dict(grad_clip=None)
# learning policy
lr_config = dict(policy='step', step=[10, 50])
total_epochs = 80
checkpoint_config = dict(interval=5)
evaluation = dict(interval=5, metrics=['top_k_accuracy', 'mean_class_accuracy'], topk=(1,))
work_dir = './work_dirs/stgcn_80e_ntu60-fall_xsub_keypoint/'
log_config = dict(
    interval=1,
    hooks=[
        dict(type='TextLoggerHook', by_epoch=True),
        dict(
            type='WandbLoggerHook',
            by_epoch=True,
            init_kwargs={'entity': 'unholytsar',
                         'project': 'portialyze-carevision',
                         'name': 'stgcn_80e_ntu-fall_xsub_keypoint',
                         'dir': work_dir,
                         'resume': 'allow',
                         'id': '2wsewedceerrye'},
            interval=1)])

# runtime settings
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1), ('val', 1)]
gpu_ids = range(0, 1)

Reproduces the problem - command or script

No response

Reproduces the problem - error message

No response

Additional information

  1. I expected the training loss to go down and val accuracy to go up.
  2. Instead, the loss oscillates and val accuracy stays flat.

knifofia commented 1 year ago

Hi @MABatin, I faced the same issue as you did. For me, there were two issues:

* a double normalization, which I fixed by removing the one in the STGCN pipeline
* a learning rate that was too high, which I lowered to 0.001 (see the optimizer sketch after the pipelines below)

Here is how I changed the pipeline of STGCN:

train_pipeline = [
    # dict(type="PreNormalize2D"),
    dict(type="GenSkeFeat", dataset="coco", feats=["j"]),
    dict(type="UniformSampleFrames", clip_len=100),
    dict(type="PoseDecode"),
    dict(type="FormatGCNInput", num_person=2),
    dict(type="PackActionInputs"),
]
val_pipeline = [
    # dict(type="PreNormalize2D"),
    dict(type="GenSkeFeat", dataset="coco", feats=["j"]),
    dict(type="UniformSampleFrames", clip_len=100, num_clips=1, test_mode=True),
    dict(type="PoseDecode"),
    dict(type="FormatGCNInput", num_person=2),
    dict(type="PackActionInputs"),
]
test_pipeline = [
    # dict(type="PreNormalize2D"),
    dict(type="GenSkeFeat", dataset="coco", feats=["j"]),
    dict(type="UniformSampleFrames", clip_len=100, num_clips=10, test_mode=True),
    dict(type="PoseDecode"),
    dict(type="FormatGCNInput", num_person=2),
    dict(type="PackActionInputs"),
]
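
The learning-rate fix isn't visible in the pipelines above, so here is a minimal sketch of the matching change in your 0.x-style config (same SGD settings as the original, only lr lowered; illustrative, not a config I ran):

# optimizer: base learning rate lowered from 0.1 to 0.001
optimizer = dict(
    type='SGD', lr=0.001, momentum=0.9, weight_decay=0.0001, nesterov=True)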

Hope it helps!

MABatin commented 1 year ago

> Hi @MABatin, I faced the same issue as you did. For me, there were two issues:
>
> * a double normalization, which I fixed by removing the one in the STGCN pipeline
> * a learning rate that was too high, which I lowered to 0.001
>
> Here is how I changed the pipeline of STGCN: […]

Thank you very much for the suggestion. I also saw an improvement in training after lowering the learning rate. However, I did not change the pipeline, so I can't speak to that part. I'm on the 0.x version, so can you tell me where in my pipeline the double normalization might be happening?

train_pipeline = [
    dict(type='PaddingWithLoop', clip_len=6),
    dict(type='PoseDecode'),
    dict(type='FormatGCNInput', input_format='NCTVM'),
    dict(type='PoseNormalize'),
    dict(type='Collect', keys=['keypoint', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['keypoint'])
]
val_pipeline = [
    dict(type='PaddingWithLoop', clip_len=6),
    dict(type='PoseDecode'),
    dict(type='FormatGCNInput', input_format='NCTVM'),
    dict(type='PoseNormalize'),
    dict(type='Collect', keys=['keypoint', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['keypoint'])
]
test_pipeline = [
    dict(type='PaddingWithLoop', clip_len=6),
    dict(type='PoseDecode'),
    dict(type='FormatGCNInput', input_format='NCTVM'),
    dict(type='PoseNormalize'),
    dict(type='Collect', keys=['keypoint', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['keypoint'])
]

knifofia commented 1 year ago

I chose to use the MediaPipe skeleton extractor to get skeletons from my video dataset, then converted them to the COCO format. I went with MediaPipe because it extracts skeletons quickly and is easy to implement.

MediaPipe already normalizes the skeleton coordinates, and MMAction2 applies another normalization on top. That seems to be the issue, but I didn't dive deep into the code to find out why.
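
A quick way to check for this is to look at the raw coordinate ranges before any pipeline step runs. This is a hedged sketch, not code from this thread: it assumes the annotation pickle stores a per-sample 'keypoint' array, as the standard 0.x skeleton annotations do.

import pickle
import numpy as np

# Point this at whichever annotation pickle your config uses.
with open('data/ntu-fall/ntu-fall_xsub_train.pkl', 'rb') as f:
    annos = pickle.load(f)

kp = np.asarray(annos[0]['keypoint'])  # assumed shape: (M, T, V, C)
print('min:', kp.min(), 'max:', kp.max())
# Pixel-scale values (e.g. 0..1920) mean the keypoints are not yet
# normalized. Values already in [0, 1] or [-1, 1] mean the extractor
# (MediaPipe here) normalized them, and a second normalization in the
# pipeline would squash them further.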

MABatin commented 1 year ago

> MediaPipe already normalizes the skeleton coordinates, and MMAction2 applies another normalization on top. […]

I see. I am using a YOLOv7 pose model to extract pose information, which doesn't normalize the keypoints, so double normalization may not be an issue in my case.