mit-han-lab / temporal-shift-module

[ICCV 2019] TSM: Temporal Shift Module for Efficient Video Understanding
https://arxiv.org/abs/1811.08383
MIT License

how to train mobilenetv2 model #39

Open 121649982 opened 4 years ago

121649982 commented 4 years ago

Thank you very much for your codebase. I have trained on my own data with resnet50 successfully, but when I train with mobilenet, the accuracy is very low.

python main.py ucf101 RGB --arch mobilenetv2 --num_segments 8 --gd 20 --lr 0.001 --lr_steps 10 20 --epochs 25 --batch-size 2 -j 16 --dropout 0.8 --consensus_type=avg --eval-freq=1 --shift --shift_div=8 --shift_place=blockres

Freezing BatchNorm2D except the first one.
Epoch: [24][0/104], lr: 0.00001 Time 15.333 (15.333) Data 15.214 (15.214) Loss 0.6946 (0.6946) Prec@1 50.000 (50.000) Prec@5 100.000 (100.000)
Epoch: [24][20/104], lr: 0.00001 Time 0.085 (0.815) Data 0.000 (0.725) Loss 0.6946 (0.6896) Prec@1 50.000 (54.762) Prec@5 100.000 (100.000)
Epoch: [24][40/104], lr: 0.00001 Time 0.084 (0.459) Data 0.000 (0.371) Loss 0.6947 (0.6907) Prec@1 50.000 (53.659) Prec@5 100.000 (100.000)
Epoch: [24][60/104], lr: 0.00001 Time 0.086 (0.336) Data 0.000 (0.250) Loss 0.6946 (0.6894) Prec@1 50.000 (54.918) Prec@5 100.000 (100.000)
Epoch: [24][80/104], lr: 0.00001 Time 0.082 (0.274) Data 0.000 (0.188) Loss 0.6391 (0.6893) Prec@1 100.000 (54.938) Prec@5 100.000 (100.000)
Epoch: [24][100/104], lr: 0.00001 Time 0.084 (0.236) Data 0.000 (0.151) Loss 0.6946 (0.6926) Prec@1 50.000 (51.980) Prec@5 100.000 (100.000)
Test: [0/12] Time 2.424 (2.424) Loss 0.7487 (0.7487) Prec@1 0.000 (0.000) Prec@5 100.000 (100.000)
Testing Results: Prec@1 52.174 Prec@5 100.000 Loss 0.69226
Best Prec@1: 52.174

Why?
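A note on reading the log above: a loss that sits near 0.693 with Prec@1 around 50% is what a classifier produces when it outputs equal probability for two classes. The post does not state how many classes the custom dataset has, so the two-class reading is an assumption here, suggested only by Prec@5 being 100% throughout. Under that assumption, a minimal sketch of the arithmetic:

```python
import math

def uniform_cross_entropy(num_classes: int) -> float:
    """Cross-entropy loss when every class gets probability 1/num_classes."""
    return math.log(num_classes)

# Assumption: the custom dataset has 2 classes (not stated in the post).
chance_loss = uniform_cross_entropy(2)
reported_test_loss = 0.69226  # "Testing Results ... Loss 0.69226" from the log

print(round(chance_loss, 4))                       # 0.6931
print(abs(reported_test_loss - chance_loss) < 0.01)  # True
```

If the dataset really is binary, the reported loss matches chance level, which would mean the network has not learned anything yet rather than learned something slightly wrong.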

tonylins commented 4 years ago

Hi, I have not trained MobileNetV2 on UCF by myself. I would suggest you fine-tune from the Kinetics pre-trained weights since smaller models are generally more difficult to train.

bravewhh commented 4 years ago

Hi, I think your reply is very useful, but I cannot find a pre-trained model for mobilenetv2. Is one provided?

wwdok commented 3 years ago

@bravewhh The code will automatically download the model: [image]

wwdok commented 3 years ago

I ran the following command:

python main.py jester RGB \
     --arch mobilenetv2 --num_segments 8 \
     --gd 20 --lr 0.001 --lr_steps 10 20 --epochs 25 \
     --batch-size 8 -j 8 --dropout 0.8 --consensus_type=avg --eval-freq=1 \
     --shift --shift_div=8 --shift_place=blockres \
     --tune_from=online_demo/mobilenetv2_jester_online.pth.tar

After training for 15 epochs I got: [image]. Then I tried to run this model in online_demo/main_windows.py. Because the checkpoint is in its raw training format, I wrote a function to rename its keys:

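# Used for renaming the keys of a self-trained model
import torch

def rename_state_dict(pth_path):
    # Load on CPU so the checkpoint can be read on a machine without a GPU
    pth = torch.load(pth_path, map_location='cpu')
    state_dict = pth['state_dict']
    new_state_dict = dict()
    for k, v in state_dict.items():
        if k.startswith('module.base_model.'):
            # Strip the 'module.base_model.' prefix and any '.net' segment
            new_state_dict[k.replace('module.base_model.', '').replace('.net', '')] = v
        elif k.startswith('module.new_fc'):
            # The final layer is named 'classifier' in the renamed dict
            new_state_dict[k.replace('module.new_fc', 'classifier').replace('.net', '')] = v

    # Print the renamed keys for inspection
    for k in new_state_dict:
        print(k)

    return new_state_dict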

I then changed torch_module.load_state_dict(torch.load(model_path)) to torch_module.load_state_dict(rename_state_dict(model_path)). Finally I ran the demo and opened the camera, but the result is always "No gesture" no matter what gesture I make, and the "No gesture" score is always very high:

avg_logit is [[ 22.340822    6.5533843  50.539978   -6.065401   -0.741372  -20.378466
   -6.2159    -20.626362  -20.334791    7.939289    5.943706  -23.043537
  -22.903278    4.4982567   9.428827    0.0870559 -26.094301  -22.054436
   -1.2793571  -1.4129375   4.517989   -3.4708936  -0.2565832  14.422895
   16.626217   15.432453   16.409939 ]]
279 frame, recognition result is No gesture
avg_logit is [[ 22.17796      6.2511935   49.412605    -5.6404104   -0.52879286
  -19.771511    -5.904038   -20.105488   -19.830935     7.5743732
    5.7956963  -22.469093   -22.24118      4.3991537    9.207253
    0.12220014 -25.271044   -21.428812    -1.1572013   -1.3778756
    4.4399977   -3.5914447   -0.50066507  13.804983    15.938972
   14.783237    15.780186  ]]
281 frame, recognition result is No gesture
avg_logit is [[ 22.180958     6.1000786   48.909077    -5.4299574   -0.4247962
  -19.500866    -5.760176   -19.859396   -19.579912     7.403986
    5.703137   -22.20191    -21.919268     4.3430147    9.11828
    0.11797364 -24.868698   -21.108702    -1.1050811   -1.3677037
    4.409947    -3.6534653   -0.6306344   13.488837    15.595351
   14.445818    15.466529  ]]
283 frame, recognition result is No gesture
avg_logit is [[ 22.055128     5.826027    47.89605     -5.0477147   -0.2218489
  -18.961802    -5.473299   -19.398214   -19.121082     7.0886984
    5.5682364  -21.691662   -21.338387     4.2532845    8.919545
    0.14981724 -24.133896   -20.533306    -0.9922131   -1.3438832
    4.32492     -3.7541132   -0.83523726  12.927706    14.965378
   13.861613    14.88865   ]]
285 frame, recognition result is No gesture
avg_logit is [[ 21.968521     5.602677    47.0569      -4.7339234   -0.05610415
  -18.5347      -5.2590575  -19.011139   -18.742971     6.8353863
    5.435157   -21.298359   -20.89486      4.1619096    8.77436
    0.17440723 -23.564438   -20.085413    -0.89869905  -1.3255008
    4.2516346   -3.8005748   -0.9630685   12.502383    14.475084
   13.391574    14.422092  ]]
287 frame, recognition result is No gesture
avg_logit is [[ 21.748917     5.5050898   45.88289     -4.499616     0.12401614
  -18.094845    -5.0777416  -18.469624   -18.181099     6.670557
    5.2405853  -21.099339   -20.68506      3.9974198    8.781888
    0.15012178 -23.352407   -19.963812    -0.86198545  -1.3199841
    4.132883    -3.6076944   -0.78165925  12.34535     14.163363
   13.083709    14.054122  ]]
289 frame, recognition result is No gesture
avg_logit is [[ 21.63152      5.3609514   45.644783    -4.3256407    0.17421141
  -17.86486     -4.9670362  -18.401573   -18.158514     6.553738
    5.272513   -20.759995   -20.294794     4.0131555    8.547821
    0.24509555 -22.870964   -19.568396    -0.7856485   -1.3132954
    4.087539    -3.7881541   -1.0165278   11.99528     13.850357
   12.8261795   13.800946  ]]
291 frame, recognition result is No gesture
avg_logit is [[ 21.46786      5.359134    45.42929     -4.2824726    0.16652384
  -17.696587    -4.950445   -18.357807   -18.141844     6.5484796
    5.286404   -20.666899   -20.198881     3.9980087    8.476922
    0.31411338 -22.725739   -19.488968    -0.7481018   -1.3360809
    4.068129    -3.8019323   -1.0203681   11.929721    13.777028
   12.766082    13.7178    ]]
293 frame, recognition result is No gesture
avg_logit is [[ 21.34182      5.3350086   44.971577    -4.2091866    0.20372805
  -17.47935     -4.890396   -18.138481   -17.923254     6.5156717
    5.185838   -20.651243   -20.190723     3.8889291    8.501047
    0.28822786 -22.658953   -19.489553    -0.7470191   -1.2915494
    4.0240216   -3.6551523   -0.8475129   11.91179     13.651511
   12.679753    13.562803  ]]
295 frame, recognition result is No gesture
avg_logit is [[ 21.460854     5.2326274   44.53007     -4.0520663    0.28569117
  -17.326513    -4.831665   -17.851751   -17.625885     6.4168
    5.0186796  -20.650888   -20.191128     3.7345753    8.555279
    0.24853873 -22.55812    -19.406641    -0.7191258   -1.2404392
    3.9580843   -3.5137684   -0.71063286  11.840729    13.461693
   12.496143    13.329902  ]]
297 frame, recognition result is No gesture
avg_logit is [[ 21.543562     5.248145    44.71801     -4.100203     0.22891116
  -17.387608    -4.8854446  -17.913378   -17.694712     6.4584913
    5.0214977  -20.704025   -20.25106      3.7124085    8.571009
    0.23945399 -22.593369   -19.43127     -0.7407799   -1.222097
    3.961918    -3.495773    -0.68020856  11.870217    13.499371
   12.54111     13.37663   ]]
299 frame, recognition result is No gesture
avg_logit is [[ 21.361473     5.3155193   44.3853      -4.093155     0.23724303
  -17.221155    -4.8600945  -17.752514   -17.504498     6.4969296
    4.9664264  -20.84703    -20.431925     3.6308913    8.699615
    0.23598577 -22.709074   -19.605043    -0.7234338   -1.2285644
    3.9669604   -3.3336782   -0.4685365   11.970868    13.506671
   12.561577    13.333607  ]]
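To put a number on "very high", the logits printed for frame 279 can be passed through a softmax. This is only a sketch for reading the log: the values are copied from the first avg_logit line above, and the mapping of index 2 to "No gesture" is inferred from the accompanying log line, not from the label file.

```python
import math

# avg_logit printed for frame 279 in the log above (27 values)
logits = [22.340822, 6.5533843, 50.539978, -6.065401, -0.741372, -20.378466,
          -6.2159, -20.626362, -20.334791, 7.939289, 5.943706, -23.043537,
          -22.903278, 4.4982567, 9.428827, 0.0870559, -26.094301, -22.054436,
          -1.2793571, -1.4129375, 4.517989, -3.4708936, -0.2565832, 14.422895,
          16.626217, 15.432453, 16.409939]

def softmax(values):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
top, runner_up = ranked[0], ranked[1]

print(top, runner_up)                   # 2 0
print(logits[top] - logits[runner_up])  # gap of roughly 28 between the top two logits
```

With a gap that large the softmax is fully saturated on index 2, so averaging over more frames or adding a confidence threshold would not change the prediction. That points at the features reaching the classifier rather than at the post-processing.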

@tonylins could you please give me some insight and advice? Thank you in advance! This is my trained model checkpoint: ckpt.best.pth.tar.zip

lambda765 commented 3 years ago

Have you solved this issue? I met the same problem.

wwdok commented 3 years ago

@hzz765 No. I paused this project afterwards and moved on to another project...

NB-Xie commented 3 years ago

Same issue here. Have you solved it? @hzz765 @wwdok

Also, did you change the bi-directional shift to a uni-directional shift in training? Do you think the problem may be related to this?
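For readers unfamiliar with the distinction, here is a toy sketch of the two shift variants on plain Python lists. It follows the general description in the TSM paper (a bi-directional shift mixes in both the next and the previous frame, a uni-directional shift only the previous one). The function name, the channel layout, and the zero padding at the ends are assumptions made for illustration, not the repository's code.

```python
def temporal_shift(frames, fold_div=8, bidirectional=True):
    """Shift a fraction of channels along time.

    frames: list of T frames, each a list of C channel values.
    Hypothetical layout: the first C // fold_div channels are one "fold",
    the next C // fold_div channels are a second fold.
    """
    num_frames = len(frames)
    fold = len(frames[0]) // fold_div
    shifted = []
    for t in range(num_frames):
        row = list(frames[t])  # channels outside the folds are left unshifted
        for c in range(fold):
            if bidirectional:
                # first fold reads from the next frame (zero at the last frame)
                row[c] = frames[t + 1][c] if t + 1 < num_frames else 0.0
            else:
                # first fold reads from the previous frame (zero at the first frame)
                row[c] = frames[t - 1][c] if t > 0 else 0.0
        if bidirectional:
            for c in range(fold, 2 * fold):
                # second fold reads from the previous frame
                row[c] = frames[t - 1][c] if t > 0 else 0.0
        shifted.append(row)
    return shifted

# Toy input: 3 frames, 8 channels, value = 10 * t + c
frames = [[10.0 * t + c for c in range(8)] for t in range(3)]
bi = temporal_shift(frames, bidirectional=True)
uni = temporal_shift(frames, bidirectional=False)
print(bi[1][:3])   # [20.0, 1.0, 12.0]
print(uni[1][:3])  # [0.0, 11.0, 12.0]
```

The same input frame ends up with different channel contents under the two variants, so weights trained with one variant would see a different feature distribution if they were run with the other.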

@tonylins could you please give us some insight and advice? Thank you.

dengfenglai321 commented 2 years ago

I met the same error. Have you solved the issue?

cherylngsy commented 1 year ago

Could you share how you fine-tuned mobilenetv2 from the pre-trained online demo weights? I ran into a problem at the line sd = sd['state_dict'].
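One possible reading of this problem, offered as a guess since the exact error is not quoted: earlier in the thread the demo file is loaded directly with load_state_dict(torch.load(model_path)), which suggests it holds bare weights with no 'state_dict' wrapper, so indexing it with ['state_dict'] would fail. A minimal sketch of a tolerant unwrap step follows. The helper name unwrap_state_dict is hypothetical, and plain dicts stand in for real checkpoints.

```python
def unwrap_state_dict(checkpoint):
    """Return the weights whether or not they are wrapped under 'state_dict'.

    Hypothetical helper: training checkpoints in this thread are indexed with
    ['state_dict'], while the demo file appears to be a bare mapping of weights.
    """
    if isinstance(checkpoint, dict) and 'state_dict' in checkpoint:
        return checkpoint['state_dict']
    return checkpoint

# Plain dicts used as stand-ins for loaded checkpoints
wrapped = {'epoch': 15, 'state_dict': {'classifier.weight': 'w'}}
bare = {'classifier.weight': 'w'}

print(unwrap_state_dict(wrapped))  # {'classifier.weight': 'w'}
print(unwrap_state_dict(bare))     # {'classifier.weight': 'w'}
```

Even with the unwrap in place, the key names in the demo file may not match what the training script expects, so the key renaming discussed above may still be needed in the opposite direction.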