leeyegy / SimCC

[ECCV'2022 Oral] PyTorch implementation for: SimCC: a Simple Coordinate Classification Perspective for Human Pose Estimation (http://arxiv.org/abs/2107.03332). Old name: SimDR

Missing code for TokenPose-S Implementation #14

Closed josebert2 closed 2 years ago

josebert2 commented 2 years ago

Hey, first of all, thanks for the amazing work!

I observed that the model/experiment files for the TokenPose implementation are missing, even though you present those results in your paper and in this repo. Can we hope for those files to be uploaded anytime soon? I was not able to include TokenPose in the present code base by myself.

leeyegy commented 2 years ago

Thanks for your attention.

Once we have time, we will make a complete update to this repo, adding some new features (e.g., more advanced techniques, a TokenPose yaml, etc.). But sorry that it may take a relatively long time (the author has been quite busy recently).

FYI, TokenPose is already released at: https://github.com/leeyegy/TokenPose. To apply this work to TokenPose, you just need to change the training target from a heatmap to the proposed representation and adopt the corresponding loss function, both of which are technically easy to implement.
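The suggested change of training target can be sketched roughly as follows. This is a hedged illustration, not the repo's actual code: the helper names (`simdr_target`, `simdr_loss`) and the one-hot target are simplifications of what the paper calls coordinate classification, with `split_ratio` standing in for the paper's splitting factor (`SIMDR_SPLIT_RATIO` in the yaml below).

```python
# Sketch only: per-axis classification targets and loss in the spirit of
# SimDR/SimCC. The repo's real target generation and loss may differ
# (e.g., label smoothing instead of a plain one-hot target).
import torch
import torch.nn.functional as F

def simdr_target(x, y, img_w, img_h, split_ratio=2.0):
    """One-hot 1-D targets for x- and y-coordinate classification."""
    target_x = torch.zeros(int(img_w * split_ratio))
    target_y = torch.zeros(int(img_h * split_ratio))
    target_x[int(x * split_ratio)] = 1.0
    target_y[int(y * split_ratio)] = 1.0
    return target_x, target_y

def simdr_loss(pred_x, pred_y, target_x, target_y):
    """KL divergence between predicted log-probs and target distributions."""
    loss_x = F.kl_div(F.log_softmax(pred_x, dim=-1), target_x, reduction='batchmean')
    loss_y = F.kl_div(F.log_softmax(pred_y, dim=-1), target_y, reduction='batchmean')
    return loss_x + loss_y
```

The key point is that each keypoint coordinate becomes a 1-D classification problem over discretized bins, instead of a 2-D heatmap regression.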

josebert2 commented 2 years ago

Thanks for the tip, I got the training process of TokenPose up and running! One disadvantage I encountered with TokenPose is that the pretrained models from HRNet are unusable with the current network architecture: apparently the pretrained model comes with a 'final_layer' that does not exist in the TokenPose architecture.

josebert2 commented 2 years ago

Ultimately I encountered another problem: the validation loss does not decrease at all while training the model. So I did some digging and found the following passage in the corresponding SimDR paper:

But for HRNet [29] and TokenPose [17], they have no extra independent modules as the decoder. To apply SimDR to them, we directly append an extra linear layer to the original HRNet and replace the MLP head of TokenPose with a linear layer.

For now I just added an MLP head for both the x and y positions at the end of the TokenPose model, but didn't remove anything. This allows me to train it, but as I said, the model does not learn. I am not entirely sure where those modifications to the model have to be made.

So in the end it isn't enough to simply adapt the loss function and change the training target.

leeyegy commented 2 years ago

Hi there~ Here comes the advice: replace the mlp_head in the original TokenPose with two classifiers (for x- and y-coordinate classification, respectively). Both classifiers accept the output keypoint token embeddings as input and generate the corresponding predictions. Taking the classifier for x-coordinate classification as an example, it is a simple linear layer: `nn.Linear(embedding_size, x_border * splitting_factor)`, where `x_border * splitting_factor` is the number of bins for the x coordinate.
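The head replacement described here can be sketched as below. This is only an illustration under the assumptions stated in the reply: the class name `SimDRHead` and the shape of `keypoint_tokens` are hypothetical, and the real TokenPose module layout may differ.

```python
# Hedged sketch of swapping TokenPose's mlp_head for two SimDR-style
# coordinate classifiers. `embedding_size`, `img_w`, `img_h`, and the
# token tensor layout are assumptions for illustration.
import torch
import torch.nn as nn

class SimDRHead(nn.Module):
    def __init__(self, embedding_size, img_w, img_h, split_ratio=2.0):
        super().__init__()
        # One linear classifier per axis; bins = image side * splitting factor.
        self.x_classifier = nn.Linear(embedding_size, int(img_w * split_ratio))
        self.y_classifier = nn.Linear(embedding_size, int(img_h * split_ratio))

    def forward(self, keypoint_tokens):
        # keypoint_tokens: (batch, num_joints, embedding_size)
        pred_x = self.x_classifier(keypoint_tokens)  # (B, J, img_w * ratio)
        pred_y = self.y_classifier(keypoint_tokens)  # (B, J, img_h * ratio)
        return pred_x, pred_y
```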

Hope this helps, and if there are still any questions, plz let us know~ :)

josebert2 commented 2 years ago

Thank you very much, you were really helpful :) I noticed that after 60-70 epochs the network starts overfitting on the MPII dataset (training-curve screenshot omitted). Up until that point, the PCK accuracy is around 80%. Should I just stop the training process at this point, or is there still room for improvement?

Maybe I did something wrong? What I did was replace the mlp_head with two classifiers, for the x and y coordinates, which gave me two linear layers: `nn.Linear(embedding_size, x_border * splitting_factor)` and `nn.Linear(embedding_size, y_border * splitting_factor)`.

I chose the following values for the MLP heads:

Furthermore, I adopted the files from the SimDR repository that are necessary for the calculations (train/valid, loss, dataset, transforms, ...).

Here is my yaml; maybe you can spot something off:

AUTO_RESUME: true
CUDNN:
  BENCHMARK: true
  DETERMINISTIC: false
  ENABLED: true
DATA_DIR: ''
GPUS: (0,)
OUTPUT_DIR: 'output'
LOG_DIR: 'log'
WORKERS: 24
PRINT_FREQ: 100

DATASET:
  COLOR_RGB: true
  DATASET: mpii
  DATA_FORMAT: jpg
  FLIP: true
  NUM_JOINTS_HALF_BODY: 8
  PROB_HALF_BODY: -1.0
  ROOT: '/home/data/mpii'
  ROT_FACTOR: 30
  SCALE_FACTOR: 0.25
  TEST_SET: valid
  TRAIN_SET: train
MODEL:
  INIT_WEIGHTS: true
  NAME: pose_tokenpose_s
  NUM_JOINTS: 16
  PRETRAINED: ''
  HEAD_INPUT: 192
  # TARGET_TYPE: gaussian
  COORD_REPRESENTATION: 'simdr'
  SIMDR_SPLIT_RATIO: 2.0
  TRANSFORMER_DEPTH: 12
  TRANSFORMER_HEADS: 8
  TRANSFORMER_MLP_RATIO: 3
  POS_EMBEDDING_TYPE: 'sine-full'
  INIT: true
  DIM: 192 # 4*4*3
  PATCH_SIZE:
  - 3
  - 4
  IMAGE_SIZE:
  - 192
  - 256
  HEATMAP_SIZE:
  - 48
  - 64
  SIGMA: 2
  EXTRA:
    PRETRAINED_LAYERS:
    - 'conv1'
    - 'bn1'
    - 'conv2'
    - 'bn2'
    - 'layer1'
    - 'transition1'
    - 'stage2'
    - 'transition2'
    - 'stage3'
    - 'transition3'
    - 'stage4'
    FINAL_CONV_KERNEL: 1
    STAGE2:
      NUM_MODULES: 1
      NUM_BRANCHES: 2
      BLOCK: BASIC
      NUM_BLOCKS:
      - 4
      - 4
      NUM_CHANNELS:
      - 32
      - 64
      FUSE_METHOD: SUM
    STAGE3:
      NUM_MODULES: 4
      NUM_BRANCHES: 3
      BLOCK: BASIC
      NUM_BLOCKS:
      - 4
      - 4
      - 4
      NUM_CHANNELS:
      - 32
      - 64
      - 128
      FUSE_METHOD: SUM
    STAGE4:
      NUM_MODULES: 3
      NUM_BRANCHES: 4
      BLOCK: BASIC
      NUM_BLOCKS:
      - 4
      - 4
      - 4
      - 4
      NUM_CHANNELS:
      - 32
      - 64
      - 128
      - 256
      FUSE_METHOD: SUM
LOSS:
  USE_TARGET_WEIGHT: true
  TYPE: 'NMTNORMCritierion'
  LABEL_SMOOTHING: 0.2
TRAIN:
  BATCH_SIZE_PER_GPU: 16
  SHUFFLE: true
  BEGIN_EPOCH: 0
  END_EPOCH: 100
  OPTIMIZER: adam
  LR: 0.0001
  LR_FACTOR: 0.1
  LR_STEP:
  - 200
  - 260
  WD: 0.0001
  GAMMA1: 0.99
  GAMMA2: 0.0
  MOMENTUM: 0.9
  NESTEROV: false
TEST:
  BATCH_SIZE_PER_GPU: 16
  COCO_BBOX_FILE: '/data/dataset/COCO_2017/COCO_val2017_detections_AP_H_56_person.json'
  BBOX_THRE: 1.0
  IMAGE_THRE: 0.0
  IN_VIS_THRE: 0.2
  MODEL_FILE: ''
  NMS_THRE: 1.0
  OKS_THRE: 0.9
  USE_GT_BBOX: true
  FLIP_TEST: true
  POST_PROCESS: true
  SHIFT_HEATMAP: true
  BLUR_KERNEL: 11
DEBUG:
  DEBUG: true
  SAVE_BATCH_IMAGES_GT: true
  SAVE_BATCH_IMAGES_PRED: true
  SAVE_HEATMAPS_GT: true
  SAVE_HEATMAPS_PRED: true

leeyegy commented 2 years ago

It seems that you changed the original training settings, which may be the reason for this drop. For example:

  1. The recommended batch size for the original TokenPose-S is 4 GPUs x 32 = 128; however, in your setting it's changed to 1 GPU x 16 = 16. The Transformer is a hyper-parameter-sensitive architecture, so plz follow our original settings (e.g., batch size) in TokenPose as much as possible.
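One common single-GPU workaround for the batch-size gap (not suggested in this thread, just a general technique) is gradient accumulation: run several small forward/backward passes before each optimizer step so the effective batch size approaches the recommended 128. A minimal sketch, assuming a per-step batch of 16:

```python
# Hedged sketch: approximate an effective batch size of 128 on one GPU
# by accumulating gradients over 8 steps of batch size 16.
import torch

ACCUM_STEPS = 8  # 16 samples/step x 8 steps = effective batch of 128

def train_epoch(model, loader, optimizer, criterion):
    optimizer.zero_grad()
    for i, (inputs, targets) in enumerate(loader):
        outputs = model(inputs)
        # Scale the loss so the accumulated gradient matches a large batch.
        loss = criterion(outputs, targets) / ACCUM_STEPS
        loss.backward()
        if (i + 1) % ACCUM_STEPS == 0:
            optimizer.step()
            optimizer.zero_grad()
```

Note that this only approximates large-batch training: batch-dependent statistics (e.g., BatchNorm running estimates) still see the small per-step batch.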