open-mmlab / mmskeleton

An OpenMMLAB toolbox for human pose estimation, skeleton-based action recognition, and action synthesis.
Apache License 2.0
2.92k stars · 1.03k forks

Preparing custom dataset from Videos #303

Open rashidch opened 4 years ago

rashidch commented 4 years ago

Hey, I want to prepare a custom dataset from videos covering these actions:

- CALL: answer phone call
- COUG: cough
- DRIN: drink water
- SCRA: scratch head
- SNEE: sneeze
- STRE: stretch arms
- WAVE: wave hand
- WIPE: wipe glasses

I am using this dataset: https://web.bii.a-star.edu.sg/~chengli/FluRecognition.html
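For the `category_annotation` file that `build_dataset_example.yaml` expects, a small helper can generate it for these eight classes. The schema below mirrors mmskeleton's `resource/category_annotation_example.json` as I understand it (`{"categories": [...], "annotations": {file: {"category_id": i}}}`); the video filenames are made up for illustration, so verify the schema against the repo before relying on it.

```python
import json

# Hypothetical helper: build the category_annotation JSON for the 8 classes.
CATEGORIES = ["CALL", "COUG", "DRIN", "SCRA", "SNEE", "STRE", "WAVE", "WIPE"]

def make_annotation(video_labels):
    """video_labels: dict mapping video filename -> category name."""
    return {
        "categories": CATEGORIES,
        "annotations": {
            name: {"category_id": CATEGORIES.index(cat)}
            for name, cat in video_labels.items()
        },
    }

# Example with invented filenames:
annotation = make_annotation({"S001_CALL_01.avi": "CALL",
                              "S001_SNEE_01.avi": "SNEE"})
print(json.dumps(annotation, indent=2))
```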

Can you explain the following terms from build_dataset_example.yaml?

[screenshot of build_dataset_example.yaml]

How should I calculate image_size, pixel_std, image_mean, and image_std for this video dataset?
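For reference, `image_mean` and `image_std` are usually per-channel statistics of the input frames scaled to [0, 1]. This is a sketch (not mmskeleton code) of computing them over frames sampled from the training videos; the frames would typically come from `cv2.VideoCapture`, but the function itself only needs numpy arrays:

```python
import numpy as np

def channel_mean_std(frames):
    """Per-channel mean/std over a list of HxWx3 uint8 frames, scaled to [0, 1]."""
    stack = np.stack(frames).astype(np.float64) / 255.0
    return stack.mean(axis=(0, 1, 2)), stack.std(axis=(0, 1, 2))
```

That said, as far as I know most HRNet-style pose configs simply reuse the ImageNet statistics (mean [0.485, 0.456, 0.406], std [0.229, 0.224, 0.225]) because the backbone was pretrained with them, `pixel_std` is conventionally the COCO person-scale constant 200, and `image_size` must match the pose model's input resolution (e.g. 256x192 for pose_hrnet_w32_256x192).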

I prepared the dataset with the default parameters and started training, but the training loss does not decrease and the accuracy stays at 0.000.

```
INFO:mmcv.runner.runner:Epoch [11][100/840] lr: 0.10000, eta: 0:43:07, time: 0.060, data_time: 0.026, memory: 2344, loss: 2.4426
INFO:mmcv.runner.runner:Epoch [11][200/840] lr: 0.10000, eta: 0:43:02, time: 0.058, data_time: 0.024, memory: 2344, loss: 2.4421
...
INFO:mmcv.runner.runner:Epoch [15][800/840] lr: 0.10000, eta: 0:39:21, time: 0.058, data_time: 0.024, memory: 2344, loss: 2.4419
INFO:mmcv.runner.runner:Epoch(train) [15][18] loss: 2.2971, top1: 0.0000, top5: 0.0000
INFO:mmcv.runner.runner:Epoch [16][100/840] lr: 0.10000, eta: 0:39:07, time: 0.060, data_time: 0.025, memory: 2344, loss: 2.4426
...
INFO:mmcv.runner.runner:Epoch [16][400/840] lr: 0.10000, eta: 0:38:52, time: 0.058, data_time: 0.025, memory: 2344, loss: 2.4410
```

(every logged interval from epoch 11 through 16 shows loss ≈ 2.44; the repeated intermediate lines are omitted here)

I have used the following train.yaml:

```yaml
argparse_cfg:
  gpus:
    bind_to: processor_cfg.gpus
    help: number of gpus
  work_dir:
    bind_to: processor_cfg.work_dir
    help: the dir to save logs and models
  batch_size:
    bind_to: processor_cfg.batch_size
  resume_from:
    bind_to: processor_cfg.resume_from
    help: the checkpoint file to resume from

processor_cfg:
  type: 'processor.recognition.train'
  workers: 2

  # model setting
  model_cfg:
    type: 'models.backbones.ST_GCN_18'
    in_channels: 3
    num_class: 8
    edge_importance_weighting: True
    graph_cfg:
      layout: 'coco'
      strategy: 'spatial'
  loss_cfg:
    type: 'torch.nn.CrossEntropyLoss'

  # dataset setting
  dataset_cfg:
    # training set
    - type: "datasets.DataPipeline"
      data_source:
        type: "datasets.SkeletonLoader"
        data_dir: ./data/symptoms_data/train
        num_track: 2
        num_keypoints: 17
        repeat: 20
      pipeline:
        - {type: "datasets.skeleton.normalize_by_resolution"}
        - {type: "datasets.skeleton.mask_by_visibility"}
        - {type: "datasets.skeleton.pad_zero", size: 150}
        - {type: "datasets.skeleton.random_crop", size: 150}
        - {type: "datasets.skeleton.simulate_camera_moving"}
        - {type: "datasets.skeleton.transpose", order: [0, 2, 1, 3]}
        - {type: "datasets.skeleton.to_tuple"}
    # validation set
    - type: "datasets.DataPipeline"
      data_source:
        type: "datasets.SkeletonLoader"
        data_dir: ./data/symptoms_data/val
        num_track: 2
        num_keypoints: 17
      pipeline:
        - {type: "datasets.skeleton.normalize_by_resolution"}
        - {type: "datasets.skeleton.mask_by_visibility"}
        - {type: "datasets.skeleton.pad_zero", size: 300}
        - {type: "datasets.skeleton.random_crop", size: 300}
        - {type: "datasets.skeleton.transpose", order: [0, 2, 1, 3]}
        - {type: "datasets.skeleton.to_tuple"}

  # dataloader setting
  batch_size: 32
  gpus: 3

  # optimizer setting
  optimizer_cfg:
    type: 'torch.optim.SGD'
    lr: 0.1
    momentum: 0.9
    nesterov: true
    weight_decay: 0.0001

  # runtime setting
  workflow: [['train', 5], ['val', 1]]
  work_dir: ./work_dir/recognition/st_gcn/symptoms_data
  total_epochs: 65
  training_hooks:
    lr_config:
      policy: 'step'
      step: [20, 30, 40, 50]
    log_config:
      interval: 100
      hooks:
```

and build_dataset_example.yaml:

```yaml
processor_cfg:
  type: "processor.skeleton_dataset.build"
  gpus: 1
  worker_per_gpu: 2
  video_dir: data/symptoms_data/videos
  out_dir: "data/symptoms_data/dataset"
  category_annotation: resource/category_annotations_symptoms.json
  detection_cfg:
    model_cfg: configs/mmdet/cascade_rcnn_r50_fpn_1x.py
    checkpoint_file: mmskeleton://mmdet/cascade_rcnn_r50_fpn_20e
    bbox_thre: 0.8
  estimation_cfg:
    model_cfg: configs/pose_estimation/hrnet/pose_hrnet_w32_256x192_test.yaml
    checkpoint_file: mmskeleton://pose_estimation/pose_hrnet_w32_256x192
    data_cfg:
      image_size:

argparse_cfg:
  gpus:
    bind_to: processor_cfg.gpus
    help: number of gpus
  worker_per_gpu:
    bind_to: processor_cfg.worker_per_gpu
    help: number of workers for each gpu
  video_dir:
    bind_to: processor_cfg.video_dir
    help: folder for videos
  category_annotation:
    bind_to: processor_cfg.category_annotation
    help: a json file recording video category annotation
  out_dir:
    bind_to: processor_cfg.out_dir
    help: folder for storing output dataset
  skeleton_model:
    bind_to: processor_cfg.estimation_cfg.model_cfg
  skeleton_checkpoint:
    bind_to: processor_cfg.estimation_cfg.checkpoint_file
  detection_model:
    bind_to: processor_cfg.detection_cfg.model_cfg
  detection_checkpoint:
    bind_to: processor_cfg.detection_cfg.checkpoint_file
```
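For anyone puzzled by what `pad_zero` and `random_crop` do to the skeleton tensors in the pipeline above, here is a sketch of their assumed behavior. The real implementations live in mmskeleton's `datasets.skeleton` module; the axis layout (time on axis 1) is an assumption for illustration:

```python
import numpy as np

def pad_zero(data, size):
    # Zero-pad the temporal axis (assumed axis 1) up to `size` frames;
    # sequences already at least `size` frames long pass through unchanged.
    c, t = data.shape[0], data.shape[1]
    if t >= size:
        return data
    pad = np.zeros((c, size - t) + data.shape[2:], dtype=data.dtype)
    return np.concatenate([data, pad], axis=1)

def random_crop(data, size, rng=np.random):
    # Take a random `size`-frame window; after pad_zero, t >= size always holds.
    t = data.shape[1]
    start = rng.randint(0, t - size + 1)
    return data[:, start:start + size]
```

Under this reading, `pad_zero` followed by `random_crop` with the same `size` yields fixed-length clips: short sequences get zero-padded to exactly `size`, while longer ones contribute a random temporal window.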

Why is the loss not decreasing? How do I get the correct configuration parameters for building a custom dataset?

atomtony commented 4 years ago

Me too.

jiawenhao2015 commented 4 years ago

Same question...

jiawenhao2015 commented 4 years ago

> Why is the loss not decreasing? How do I get the correct configuration parameters for building a custom dataset?

Hello, I have met the same problem: the loss does not decrease. Have you solved it? I am looking forward to your reply, many thanks!

rashidch commented 4 years ago

> Have you solved the problem?

Hey, no!

jiawenhao2015 commented 4 years ago

> Hey, no!

Same here, sadly...

atomtony commented 4 years ago

```
INFO:mmcv.runner.runner:workflow: [('train', 5), ('val', 1)], max: 65 epochs
INFO:mmcv.runner.runner:Epoch [1][100/125] lr: 0.10000, eta: 0:16:44, time: 0.125, data_time: 0.015, memory: 858, loss: 2.3419
...
INFO:mmcv.runner.runner:Epoch(train) [5][6] loss: 2.3335, top1: 0.1458, top5: 0.5312
INFO:mmcv.runner.runner:Epoch(train) [10][6] loss: 2.3405, top1: 0.1250, top5: 0.5104
INFO:mmcv.runner.runner:Epoch(train) [15][6] loss: 2.3434, top1: 0.1250, top5: 0.4792
INFO:mmcv.runner.runner:Epoch(train) [20][6] loss: 2.3341, top1: 0.1250, top5: 0.5312
INFO:mmcv.runner.runner:Epoch(train) [25][6] loss: 2.3479, top1: 0.1250, top5: 0.5000
INFO:mmcv.runner.runner:Epoch(train) [30][6] loss: 2.3513, top1: 0.1250, top5: 0.4896
INFO:mmcv.runner.runner:Epoch(train) [35][6] loss: 2.3447, top1: 0.1458, top5: 0.5521
INFO:mmcv.runner.runner:Epoch(train) [40][6] loss: 2.3345, top1: 0.1458, top5: 0.5417
INFO:mmcv.runner.runner:Epoch(train) [45][6] loss: 2.3390, top1: 0.1042, top5: 0.5208
INFO:mmcv.runner.runner:Epoch(train) [50][6] loss: 2.3346, top1: 0.1354, top5: 0.5000
INFO:mmcv.runner.runner:Epoch(train) [55][6] loss: 2.3434, top1: 0.1458, top5: 0.5208
INFO:mmcv.runner.runner:Epoch(train) [60][6] loss: 2.3233, top1: 0.1458, top5: 0.5312
INFO:mmcv.runner.runner:Epoch(train) [65][6] loss: 2.3542, top1: 0.1250, top5: 0.5104
```

(training loss stays at ≈ 2.34 across all 65 epochs; the per-iteration lines between evaluations are omitted here)

This is my training log: 10 categories, 10 samples per category. Is this training correct?

jiawenhao2015 commented 4 years ago

I found that the core training phase is done in the mmcv module (on my machine it is at /xxxxxxx/miniconda3/lib/python3.7/site-packages/mmcv-0.4.3-py3.7-linux-x86_64.egg/mmcv/runner/runner.py):

```python
def train(self, data_loader, **kwargs):
    self.model.train()
    self.mode = 'train'
    self.data_loader = data_loader
    self._max_iters = self._max_epochs * len(data_loader)
    self.call_hook('before_train_epoch')
    for i, data_batch in enumerate(data_loader):
        self._inner_iter = i
        self.call_hook('before_train_iter')
        outputs = self.batch_processor(
            self.model, data_batch, train_mode=True, **kwargs)
        if not isinstance(outputs, dict):
            raise TypeError('batch_processor() must return a dict')
        if 'log_vars' in outputs:
            self.log_buffer.update(outputs['log_vars'],
                                   outputs['num_samples'])
        self.outputs = outputs

        self.optimizer.zero_grad()
        self.outputs['loss'].backward()
        self.optimizer.step()

        self.call_hook('after_train_iter')
        self._iter += 1

    self.call_hook('after_train_epoch')
    self._epoch += 1
```

The loss backward operation is done by the hook function in /share/jiawenhao/miniconda3/lib/python3.7/site-packages/mmcv-0.4.3-py3.7-linux-x86_64.egg/mmcv/runner/hooks/optimizer.py:

```python
def after_train_iter(self, runner):
    runner.optimizer.zero_grad()
    runner.outputs['loss'].backward()
    if self.grad_clip is not None:
        self.clip_grads(runner.model.parameters())
    runner.optimizer.step()
```

I do not know why this hook never actually runs.

**So I manually added these operations in runner.py, and then the loss could decrease:**

```python
self.optimizer.zero_grad()
self.outputs['loss'].backward()
self.optimizer.step()
```
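The symptom fits that diagnosis: if the zero_grad/backward/step block never executes, the weights never change and the logged loss stays flat, exactly as in the logs above. This is a toy numpy analogue (not mmskeleton or mmcv code) of that effect on a least-squares problem:

```python
import numpy as np

def train_step(w, X, y, lr=0.1, apply_update=True):
    # One SGD step for mean-squared-error least squares, with the
    # gradient computed by hand. `apply_update=False` mimics the bug
    # where the optimizer hook never runs.
    pred = X @ w
    loss = float(np.mean((pred - y) ** 2))
    if apply_update:
        grad = 2.0 * X.T @ (pred - y) / len(y)  # loss.backward() analogue
        w -= lr * grad                          # optimizer.step() analogue
    return loss

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0])

w = np.zeros(4)
flat = [train_step(w, X, y, apply_update=False) for _ in range(5)]
# With the update skipped, every logged loss is identical.

w = np.zeros(4)
losses = [train_step(w, X, y) for _ in range(200)]
# With the update applied, the loss drops steadily toward zero.
```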

atomtony commented 4 years ago

> I found that the core training phase is done in the mmcv module ... so manually add these operations in runner.py, then the loss could decrease ...

Thanks! I modified the code according to your suggestion, and after that the training runs correctly.

atomtony commented 4 years ago

Finally, I modified the training_hooks configuration of the train.yaml file, and the changes are as follows:


```yaml
training_hooks:
  lr_config:
    policy: 'step'
    step: [20, 30, 40, 50]
  log_config:
    interval: 100
    hooks:
      - type: TextLoggerHook
  checkpoint_config:
    interval: 5
  optimizer_config:
    grad_clip:
```
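Adding the `optimizer_config` entry matters because mmcv's Runner never calls backward()/step() itself; it only fires named hook events, and the optimizer hook registered from that config does the actual update. This is a toy sketch of that dispatch pattern (ToyRunner/ToyOptimizerHook are illustrative names, not mmcv's real classes):

```python
class ToyRunner:
    def __init__(self):
        self.hooks = []
        self.steps_applied = 0

    def call_hook(self, name):
        # Fire the named event on every registered hook; hooks that do
        # not implement it are silently skipped.
        for hook in self.hooks:
            getattr(hook, name, lambda runner: None)(self)

    def train_iter(self):
        # Forward pass / loss computation would happen here; the update
        # itself is entirely delegated to hooks.
        self.call_hook('after_train_iter')

class ToyOptimizerHook:
    def after_train_iter(self, runner):
        # In real mmcv: zero_grad(), loss.backward(), optimizer.step().
        runner.steps_applied += 1

runner = ToyRunner()
runner.train_iter()                  # no hooks registered: no update
runner.hooks.append(ToyOptimizerHook())
runner.train_iter()                  # now the "optimizer" runs
```

So with no optimizer hook registered, each iteration completes without touching the model, which is why the loss stayed flat until the hook configuration (or the manual backward/step) was added.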

rashidch commented 4 years ago

> Finally, I modified the training_hooks configuration of the train.yaml file, and the changes are as follows: [training_hooks config as above]

Hey, can you share your complete train.yaml, the final loss value, and the training and test accuracy?

atomtony commented 4 years ago

> Hey, can you share your complete train.yaml and the final loss and accuracy?

This is my training log:

```
INFO:mmcv.runner.runner:workflow: [('train', 5), ('val', 1)], max: 65 epochs
INFO:mmcv.runner.runner:Epoch [1][100/116] lr: 0.10000, eta: 0:17:31, time: 0.141, data_time: 0.007, memory: 456, loss: 2.2317
INFO:mmcv.runner.runner:Epoch [2][100/116] lr: 0.10000, eta: 0:12:11, time: 0.074, data_time: 0.011, memory: 456, loss: 1.7097
INFO:mmcv.runner.runner:Epoch [3][100/116] lr: 0.10000, eta: 0:10:27, time: 0.073, data_time: 0.012, memory: 456, loss: 1.6194
INFO:mmcv.runner.runner:Epoch [4][100/116] lr: 0.10000, eta: 0:09:35, time: 0.074, data_time: 0.011, memory: 456, loss: 1.5610
INFO:mmcv.runner.runner:Epoch [5][100/116] lr: 0.10000, eta: 0:09:03, time: 0.076, data_time: 0.012, memory: 456, loss: 1.4833
INFO:mmcv.runner.runner:Epoch(train) [5][5] loss: 1.4931, top1: 0.3500, top5: 1.0000
...
INFO:mmcv.runner.runner:Epoch(train) [10][5] loss: 1.2552, top1: 0.5125, top5: 1.0000
INFO:mmcv.runner.runner:Epoch(train) [15][5] loss: 0.1492, top1: 1.0000, top5: 1.0000
INFO:mmcv.runner.runner:Epoch(train) [20][5] loss: 0.0752, top1: 1.0000, top5: 1.0000
INFO:mmcv.runner.runner:Epoch(train) [25][5] loss: 0.0363, top1: 1.0000, top5: 1.0000
...
INFO:mmcv.runner.runner:Epoch [65][100/116] lr: 0.00001, eta: 0:00:01, time: 0.076, data_time: 0.011, memory: 456, loss: 0.0032
INFO:mmcv.runner.runner:Epoch(train) [65][5] loss: 0.0310, top1: 1.0000, top5: 1.0000
```

(the loss falls steadily from 2.23 to ≈ 0.003 over 65 epochs; intermediate per-iteration lines are omitted here)

This is my test log:

[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 93/93, 89.6 task/s, elapsed: 1s, ETA: 0s
Top 1: 100.00% Top 5: 100.00%

rashidch commented 4 years ago

Hey, Did you use the same training configuration file as example_dataset?

atomtony commented 4 years ago

Hey, Did you use the same training configuration file as example_dataset?

train.yaml


argparse_cfg:
  gpus:
    bind_to: processor_cfg.gpus
    help: number of gpus
  work_dir:
    bind_to: processor_cfg.work_dir
    help: the dir to save logs and models
  batch_size:
    bind_to: processor_cfg.batch_size
  resume_from:
    bind_to: processor_cfg.resume_from
    help: the checkpoint file to resume from

processor_cfg:
  type: 'processor.recognition.train'
  workers: 16

  # model setting
  model_cfg:
    type: 'models.backbones.ST_GCN_18'
    in_channels: 3
    num_class: 10
    edge_importance_weighting: True
    graph_cfg:
      layout: 'coco'
      strategy: 'spatial'
  loss_cfg:
    type: 'torch.nn.CrossEntropyLoss'

  # dataset setting
  dataset_cfg:
    # training set
    - type: "datasets.DataPipeline"
      data_source:
        type: "datasets.SkeletonLoader"
        data_dir:  ./data/actions_as_space_time_shapes
        num_track: 2
        num_keypoints: 17
        repeat: 20
      pipeline:
        - {type: "datasets.skeleton.normalize_by_resolution"}
        - {type: "datasets.skeleton.mask_by_visibility"}
        - {type: "datasets.skeleton.pad_zero", size: 150 }
        - {type: "datasets.skeleton.random_crop", size: 150 }
        - {type: "datasets.skeleton.simulate_camera_moving"}
        - {type: "datasets.skeleton.transpose", order: [0, 2, 1, 3]}
        - {type: "datasets.skeleton.to_tuple"}

    - type: "datasets.DataPipeline"
      data_source:
        type: "datasets.SkeletonLoader"
        data_dir:  ./data/actions_as_space_time_shapes
        num_track: 2
        num_keypoints: 17
      pipeline:
        - {type: "datasets.skeleton.normalize_by_resolution"}
        - {type: "datasets.skeleton.mask_by_visibility"}
        - {type: "datasets.skeleton.pad_zero", size: 300 }
        - {type: "datasets.skeleton.random_crop", size: 300 }
        - {type: "datasets.skeleton.transpose", order: [0, 2, 1, 3]}
        - {type: "datasets.skeleton.to_tuple"}

  # dataloader setting
  batch_size: 16
  gpus: 4

  # optimizer setting
  optimizer_cfg:
    type: 'torch.optim.SGD'
    lr: 0.1
    momentum: 0.9
    nesterov: true
    weight_decay: 0.0001

  # runtime setting
  workflow: [['train', 5], ['val', 1]]
  work_dir: ./work_dir/recognition/st_gcn/actions_as_space_time_shapes
  total_epochs: 65
  training_hooks:
    lr_config:
      policy: 'step'
      step: [20, 30, 40, 50]
    log_config:
      interval: 100
      hooks:

rashidch commented 4 years ago

Hey, Did you use the same training configuration file as example_dataset?

train.yaml

argparse_cfg:
  gpus:
    bind_to: processor_cfg.gpus
    help: number of gpus
  work_dir:
    bind_to: processor_cfg.work_dir
    help: the dir to save logs and models
  batch_size:
    bind_to: processor_cfg.batch_size
  resume_from:
    bind_to: processor_cfg.resume_from
    help: the checkpoint file to resume from

processor_cfg:
  type: 'processor.recognition.train'
  workers: 16

  # model setting
  model_cfg:
    type: 'models.backbones.ST_GCN_18'
    in_channels: 3
    num_class: 10
    edge_importance_weighting: True
    graph_cfg:
      layout: 'coco'
      strategy: 'spatial'
  loss_cfg:
    type: 'torch.nn.CrossEntropyLoss'

  # dataset setting
  dataset_cfg:
    # training set
    - type: "datasets.DataPipeline"
      data_source:
        type: "datasets.SkeletonLoader"
        data_dir:  ./data/actions_as_space_time_shapes
        num_track: 2
        num_keypoints: 17
        repeat: 20
      pipeline:
        - {type: "datasets.skeleton.normalize_by_resolution"}
        - {type: "datasets.skeleton.mask_by_visibility"}
        - {type: "datasets.skeleton.pad_zero", size: 150 }
        - {type: "datasets.skeleton.random_crop", size: 150 }
        - {type: "datasets.skeleton.simulate_camera_moving"}
        - {type: "datasets.skeleton.transpose", order: [0, 2, 1, 3]}
        - {type: "datasets.skeleton.to_tuple"}

    - type: "datasets.DataPipeline"
      data_source:
        type: "datasets.SkeletonLoader"
        data_dir:  ./data/actions_as_space_time_shapes
        num_track: 2
        num_keypoints: 17
      pipeline:
        - {type: "datasets.skeleton.normalize_by_resolution"}
        - {type: "datasets.skeleton.mask_by_visibility"}
        - {type: "datasets.skeleton.pad_zero", size: 300 }
        - {type: "datasets.skeleton.random_crop", size: 300 }
        - {type: "datasets.skeleton.transpose", order: [0, 2, 1, 3]}
        - {type: "datasets.skeleton.to_tuple"}

  # dataloader setting
  batch_size: 16
  gpus: 4

  # optimizer setting
  optimizer_cfg:
    type: 'torch.optim.SGD'
    lr: 0.1
    momentum: 0.9
    nesterov: true
    weight_decay: 0.0001

  # runtime setting
  workflow: [['train', 5], ['val', 1]]
  work_dir: ./work_dir/recognition/st_gcn/actions_as_space_time_shapes
  total_epochs: 65
  training_hooks:
    lr_config:
      policy: 'step'
      step: [20, 30, 40, 50]
    log_config:
      interval: 100
      hooks:
        - type: TextLoggerHook
    checkpoint_config:
      interval: 5
    optimizer_config:
      grad_clip:
  resume_from:
  load_from:
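As a side note, the `lr_config` above (policy `'step'` with `step: [20, 30, 40, 50]`) multiplies the learning rate by 0.1 after each of those epochs, which matches the lr column in the training logs (0.1, 0.01, 0.001, 0.0001, 0.00001). In plain PyTorch the same schedule can be sketched with `MultiStepLR`; this is an illustrative stand-alone snippet, not mmskeleton code, and the toy parameter is made up:

```python
import torch

# Toy parameter; the schedule only depends on the optimizer, not the model.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([param], lr=0.1, momentum=0.9,
                            nesterov=True, weight_decay=0.0001)
# Equivalent of lr_config: {policy: 'step', step: [20, 30, 40, 50]}
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[20, 30, 40, 50], gamma=0.1)

lrs = []
for epoch in range(65):
    # ... one training epoch would run here ...
    scheduler.step()
    lrs.append(optimizer.param_groups[0]['lr'])
```

After epoch 20 the lr drops to 0.01, after epoch 30 to 0.001, and so on, exactly as seen in the logs.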

Thanks! I will try my training again and show my results.

jiawenhao2015 commented 4 years ago

Finally, I modified the training_hooks configuration of the train.yaml file, and the changes are as follows:

training_hooks: 
    lr_config: 
      policy: 'step' 
      step: [20, 30, 40, 50] 
    log_config: 
      interval: 100 
      hooks: 
        - type: TextLoggerHook 
    checkpoint_config: 
      interval: 5 
    optimizer_config: 
      grad_clip:

Great! Thank you 👍 So the only difference is adding the grad_clip option?

atomtony commented 4 years ago

Finally, I modified the training_hooks configuration of the train.yaml file, and the changes are as follows:

training_hooks: 
    lr_config: 
      policy: 'step' 
      step: [20, 30, 40, 50] 
    log_config: 
      interval: 100 
      hooks: 
        - type: TextLoggerHook 
    checkpoint_config: 
      interval: 5 
    optimizer_config: 
      grad_clip:

Great! Thank you! So the only difference is adding the grad_clip option?

Yes
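For anyone wondering what grad_clip actually is: in mmcv, an empty `grad_clip:` registers `OptimizerHook` with `grad_clip=None`, and that hook is what calls `loss.backward()` and `optimizer.step()`, which is likely why adding the key fixed training at all. A non-empty value such as `{max_norm: 1.0}` is additionally passed as keyword arguments to `torch.nn.utils.clip_grad_norm_`, which rescales gradients whose total norm exceeds the limit. A plain-PyTorch sketch of the clipping part, with a made-up toy model and numbers:

```python
import torch
from torch.nn.utils import clip_grad_norm_

# Toy stand-in model; not mmskeleton's ST-GCN.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

loss = 1000.0 * model(torch.ones(1, 4)).sum()  # large loss -> large gradients
optimizer.zero_grad()
loss.backward()

# With `grad_clip: {max_norm: 1.0}` mmcv's OptimizerHook would do the
# equivalent of this call before optimizer.step():
norm_before = clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()

# Total gradient norm after clipping is at most max_norm.
norm_after = torch.norm(
    torch.stack([p.grad.norm() for p in model.parameters()]))
print(float(norm_before), float(norm_after))
```

`clip_grad_norm_` returns the norm measured before clipping, so `norm_before` is large here while `norm_after` comes out at or below 1.0.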

CamilleMaurice commented 4 years ago

I can't thank the people on this thread who found the error enough!

rashidch commented 4 years ago

I found that the core training phase is done in the mmcv module (on my machine, at /xxxxxxx/miniconda3/lib/python3.7/site-packages/mmcv-0.4.3-py3.7-linux-x86_64.egg/mmcv/runner/runner.py):

    def train(self, data_loader, **kwargs):
        self.model.train()
        self.mode = 'train'
        self.data_loader = data_loader
        self._max_iters = self._max_epochs * len(data_loader)
        self.call_hook('before_train_epoch')
        for i, data_batch in enumerate(data_loader):
            self._inner_iter = i
            self.call_hook('before_train_iter')
            outputs = self.batch_processor(
                self.model, data_batch, train_mode=True, **kwargs)
            if not isinstance(outputs, dict):
                raise TypeError('batch_processor() must return a dict')
            if 'log_vars' in outputs:
                self.log_buffer.update(outputs['log_vars'],
                                       outputs['num_samples'])
            self.outputs = outputs

            self.optimizer.zero_grad()
            self.outputs['loss'].backward()
            self.optimizer.step()

            self.call_hook('after_train_iter')
            self._iter += 1

        self.call_hook('after_train_epoch')
        self._epoch += 1

The loss backward operation is done by the hook function in /share/jiawenhao/miniconda3/lib/python3.7/site-packages/mmcv-0.4.3-py3.7-linux-x86_64.egg/mmcv/runner/hooks/optimizer.py:

    def after_train_iter(self, runner):
        runner.optimizer.zero_grad()
        runner.outputs['loss'].backward()
        if self.grad_clip is not None:
            self.clip_grads(runner.model.parameters())
        runner.optimizer.step()

I do not know why that hook does not actually run. So I manually added these operations in runner.py, and then the loss could decrease:

    self.optimizer.zero_grad()
    self.outputs['loss'].backward()
    self.optimizer.step()

Thanks, I modified the code according to your suggestion; after that, training works correctly.

Hey,

Did anyone get this error after adding code to runner.py? RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.
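For what it's worth, that error is what PyTorch raises when `backward()` runs twice over the same graph, which is exactly what happens if both the manually added lines in runner.py and the OptimizerHook end up calling it. A minimal, self-contained reproduction in plain PyTorch (not mmskeleton code):

```python
import torch

x = torch.ones(2, requires_grad=True)
loss = (x * x).sum()

loss.backward()  # the first backward frees the graph's saved tensors

raised = False
try:
    loss.backward()  # second backward over the same, already-freed graph
except RuntimeError:
    raised = True
print('second backward raised RuntimeError:', raised)
```

So if you keep the manual backward/step lines in runner.py, the OptimizerHook (registered via optimizer_config) must not also call backward; use one or the other.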

rashidch commented 4 years ago

I can't thank the people on this thread who found the error enough!

Hey,

Does it work for you?

jiawenhao2015 commented 4 years ago

That RuntimeError looks like backward() is being called twice somewhere. I have not met it before...

CamilleMaurice commented 4 years ago

The process is still running, the loss function is decreasing, which was not the case before the following modification:

  • Add grad_clip: under optimizer_config: in the training.yaml file

I did not change anything under /share/jiawenhao/miniconda3/lib/python3.7/site-packages/mmcv-0.4.3-py3.7-linux-x86_64.egg/mmcv/runner/hooks/optimizer.py

rashidch commented 4 years ago

The process is still running, the loss function is decreasing, which was not the case before the following modification:

  • Add grad_clip: under optimizer_config: in the training.yaml file

I did not change anything under /share/jiawenhao/miniconda3/lib/python3.7/site-packages/mmcv-0.4.3-py3.7-linux-x86_64.egg/mmcv/runner/hooks/optimizer.py

@CamilleMaurice @jiawenhao2015 Ok. Thank you.

rashidch commented 4 years ago

Hey,

Does anyone have an idea how to get results on a single video with a trained model?

CamilleMaurice commented 4 years ago

@rashidch Have you tried to create a configuration file similar to test.yaml ?

rashidch commented 4 years ago

@rashidch Have you tried to create a configuration file similar to test.yaml ?

Yeah.

CamilleMaurice commented 4 years ago

@rashidch Then you are able to get results on a single video with a trained model using test.yaml, but you are looking for a more flexible way?

rashidch commented 4 years ago

@rashidch Then you are able to get results on a single video with a trained model using test.yaml, but you are looking for a more flexible way?

Right now, I only get test accuracy on the test data. I have not implemented single-video inference yet. I want to implement it, but I was a little busy.

rashidch commented 4 years ago

@rashidch Then you are able to get the result on a single video for a trained model through using test.yaml but you are looking for a more flexible way ?

I want to implement single-video inference where the system shows, frame by frame, the actions recognized in the video.
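In case it helps anyone sketching this: assuming you already have pose estimates for the clip packed in the (batch, channel, frame, keypoint, person) layout the training pipeline produces, single-video inference can be a sliding window over the frame axis with a softmax per window. Everything below is hypothetical glue code, not the mmskeleton API; `model` stands in for a trained ST_GCN_18 loaded from a checkpoint, and the label list is just the action set from the first post:

```python
import torch
import torch.nn.functional as F

# Hypothetical label set, taken from the actions listed at the top of this issue.
LABELS = ['CALL', 'COUG', 'DRIN', 'SCRA', 'SNEE', 'STRE', 'WAVE', 'WIPE']

def recognize_video(model, skeletons, window=150, stride=30):
    """Classify a long clip window by window.

    skeletons: tensor of shape (1, C, T, V, M), e.g. (1, 3, T, 17, 2),
    the same layout the dataset pipeline produces.
    Returns a list of (start_frame, label, confidence) tuples.
    """
    model.eval()
    results = []
    with torch.no_grad():
        for start in range(0, skeletons.size(2) - window + 1, stride):
            scores = model(skeletons[:, :, start:start + window])
            probs = F.softmax(scores, dim=1)
            conf, idx = probs.max(dim=1)
            results.append((start, LABELS[int(idx)], float(conf)))
    return results
```

Each window then maps back to a frame range of the video, so the predicted label can be overlaid on those frames for display.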

vivek87799 commented 4 years ago

Finally, I modified the training_hooks configuration of the train.yaml file, and the changes are as follows:

training_hooks: 
    lr_config: 
      policy: 'step' 
      step: [20, 30, 40, 50] 
    log_config: 
      interval: 100 
      hooks: 
        - type: TextLoggerHook 
    checkpoint_config: 
      interval: 5 
    optimizer_config: 
      grad_clip:

Thank you very much. It worked for me.

Out of curiosity what is grad_clip?

YeTaoY commented 4 years ago

Finally, I modified the training_hooks configuration of the train.yaml file, and the changes are as follows:

training_hooks: 
    lr_config: 
      policy: 'step' 
      step: [20, 30, 40, 50] 
    log_config: 
      interval: 100 
      hooks: 
        - type: TextLoggerHook 
    checkpoint_config: 
      interval: 5 
    optimizer_config: 
      grad_clip:

Thank you very much. It worked for me.

Out of curiosity what is grad_clip?

@vivek87799 Did you get an answer? I want to know what grad_clip is, too.

mytk2012 commented 4 years ago


Hey,

Did anyone get this error after adding code to runner.py? RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

Have you fixed it?

little2Rabbit commented 4 years ago

@rashidch @jiawenhao2015 After training the model and obtaining the test results, how can I print out the classification result for a single video?

2795449476 commented 3 years ago


Hey, did anyone get this error after adding code to runner.py? RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

Have you fixed it? Why does this happen? Can you tell me?

Fanthers commented 3 years ago

Finally, I modified the training_hooks configuration of the train.yaml file, and the changes are as follows:

training_hooks: 
    lr_config: 
      policy: 'step' 
      step: [20, 30, 40, 50] 
    log_config: 
      interval: 100 
      hooks: 
        - type: TextLoggerHook 
    checkpoint_config: 
      interval: 5 
    optimizer_config: 
      grad_clip:

Thank you! It's running!

Fanthers commented 3 years ago

Finally, I modified the training_hooks configuration of the train.yaml file, and the changes are as follows:

training_hooks: 
    lr_config: 
      policy: 'step' 
      step: [20, 30, 40, 50] 
    log_config: 
      interval: 100 
      hooks: 
        - type: TextLoggerHook 
    checkpoint_config: 
      interval: 5 
    optimizer_config: 
      grad_clip:

Can you tell me how big your data set is?