open-mmlab / mmaction

An open-source toolbox for action understanding based on PyTorch
https://open-mmlab.github.io/
Apache License 2.0

Question about AVA performance #42

Open happyjin opened 5 years ago

happyjin commented 5 years ago

Hi, thanks for your contribution of mmaction, which is an awesome open-source project on GitHub! I have a few questions about the performance of the AVA model in the model zoo. My questions are:

  1. How many epochs do I need to train before reaching the reported 21.2 mAP@0.5?
  2. Can I use the default config file to achieve this score?
  3. If I use the default batch_size=2, my GPU runs out of memory, so I changed the batch size to 1. Should I also change the learning rate, etc.? With the default settings apart from the batch size, I cannot reach 21.2 mAP@0.5 after 12 epochs.
zhaoyue-zephyrus commented 5 years ago

Hi @happyjin

  1. 12 epochs as specified in https://github.com/open-mmlab/mmaction/blob/master/configs/ava/ava_fast_rcnn_nl_r50_c4_1x_kinetics_pretrain_crop.py.

  2. You can use the default setting file to get the score. If you see any difference, let me know in this issue.

  3. The LR should be proportional to your effective batch size (samples per iteration across all GPUs). In your case, try halving your LR and see if it works out.
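
The linear scaling rule described above can be sketched as follows. This is an illustrative helper, not part of mmaction; the reference batch size and learning rate are made-up example numbers:

```python
# Hypothetical helper illustrating the linear LR scaling rule: scale the
# learning rate in proportion to the effective batch size
# (videos_per_gpu * num_gpus) relative to a reference setup.

def scale_lr(base_lr, base_effective_batch, videos_per_gpu, num_gpus):
    """Scale the LR linearly with the effective batch size."""
    effective_batch = videos_per_gpu * num_gpus
    return base_lr * effective_batch / base_effective_batch

# Example: if a reference config uses an effective batch of 16 at lr=0.01,
# halving videos_per_gpu (and hence the effective batch) halves the LR.
print(scale_lr(0.01, 16, 1, 8))  # 0.005
```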

TheShadow29 commented 5 years ago

Hi @zhaoyue-zephyrus, thanks for the amazing repository. I tried the default settings, but with 4 GPUs. I trained for ~2.5 days (12 epochs) with the learning rate halved. Unfortunately, I only get 0.04 mAP.

Things I changed: halved the learning rate, doubled the warmup iterations, and doubled the LR decay steps.

My guess (I have yet to experiment) is that the number of epochs needs to be doubled.

zhaoyue-zephyrus commented 5 years ago

@TheShadow29 Sorry, I don't have enough compute on hand to reproduce that right now. I will look into it a little later.

TheShadow29 commented 5 years ago

@zhaoyue-zephyrus I tried with more epochs (another 10 epochs); the result is still about the same, mAP ~0.03.

I also noticed the following in the log just before the scores are reported: 2019-07-17 10:43:49,392 - INFO - The following classes have no ground truth examples: [ 2 16 18 19 21 23 25 31 32 33 35 39 40 42 44 50 53 55 71 75]. I am not sure how to interpret this.

Also, do you have the log file for the training of ava model? Comparing the logs might reveal something.

Again, cheers for creating such an amazing repository and thank you for your patience.

zhaoyue-zephyrus commented 5 years ago

@TheShadow29

Hi, the line [ 2 16 18 19 21 23 25 31 32 33 35 39 40 42 44 50 53 55 71 75] means that AVA is evaluated on 60 out of the full 80 classes; the listed class IDs simply have no ground-truth examples in the validation set. You don't need to worry about it.
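
For intuition, the reported mAP is the mean of the per-class APs over only the classes that do have ground-truth boxes. This is a toy sketch of that behavior, not the actual AVA evaluation code:

```python
# Toy sketch of how classes without ground truth are handled: they are
# simply excluded from the mean, so mAP is averaged over the remaining
# (here, 60 of 80) classes.

def mean_ap(per_class_ap, classes_without_gt):
    """Average AP over classes that have ground-truth examples."""
    evaluated = [ap for cls, ap in per_class_ap.items()
                 if cls not in classes_without_gt]
    return sum(evaluated) / len(evaluated)

# Toy numbers: class 2 has no ground truth, so only three classes count.
aps = {1: 0.30, 2: 0.00, 3: 0.10, 4: 0.20}
print(mean_ap(aps, {2}))  # roughly 0.20
```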

Before we figure out the training issue, could you please first run the testing code to check if you could reproduce the reported results? I will try to reproduce your configuration soon.

zhaoyue-zephyrus commented 5 years ago

@TheShadow29 BTW, what GPUs are you using? Did you halve videos_per_gpu as well?

TheShadow29 commented 5 years ago

Yes, I ran the testing code with your pretrained model from the model zoo and get ~21 mAP: PascalBoxes_Precision/mAP@0.5IOU = 0.21313359468022483.

I am using 4 GPUs, each a 1080 Ti, with videos_per_gpu=2 and workers_per_gpu=2.

The model and data parts of the config file are as shipped. Here is the rest of the config:

optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=1e-6)
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))
# learning policy
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=1000,
    warmup_ratio=1.0 / 4,
    step=[16, 22])
checkpoint_config = dict(interval=1)
# yapf:disable
log_config = dict(
    interval=50,
    hooks=[
        dict(type='TextLoggerHook'),
        # dict(type='TensorboardLoggerHook')
    ])
# yapf:enable
# runtime settings
total_epochs = 24
dist_params = dict(backend='nccl')
log_level = 'INFO'
work_dir = './work_dirs/ava_fast_rcnn_nl_r50_c4_1x_f32s2_kinetics_pretrain_crop_multiscale'
load_from = None
resume_from = None
workflow = [('train', 1)]

I ran a diff between the original config and my current config; the differences are: lr=0.02, warmup_iters=1000, step=[16, 22], total_epochs=24.
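
Such a diff can also be done programmatically with a small helper. This is a hypothetical sketch, not part of mmaction, and the "original" values below are illustrative placeholders, not the actual shipped config:

```python
# Compare two flat config dicts and report every key whose value differs.

def config_diff(original, modified):
    """Return {key: (original_value, modified_value)} for changed keys."""
    keys = set(original) | set(modified)
    return {k: (original.get(k), modified.get(k))
            for k in sorted(keys)
            if original.get(k) != modified.get(k)}

# Illustrative values only; check the shipped config for the real ones.
orig_cfg = dict(lr=0.01, warmup_iters=500, step=[8, 11], total_epochs=12)
my_cfg = dict(lr=0.02, warmup_iters=1000, step=[16, 22], total_epochs=24)
print(config_diff(orig_cfg, my_cfg))
```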

leaderj1001 commented 4 years ago

How can I achieve 21.3 mAP? I am using 8 GPUs, CUDA 10.0, Python 3.7, and PyTorch 1.1.0. Thank you.

guancheng817 commented 4 years ago

@TheShadow29 Have you solved your problem? I cannot reproduce the 21.3 mAP result. Thanks.

guancheng817 commented 4 years ago

> How can I achieve 21.3 mAP? I am using 8 GPUs, CUDA 10.0, Python 3.7, and PyTorch 1.1.0. Thank you.

Hi, have you ever reproduced the 21.3 mAP result?

leaderj1001 commented 4 years ago

> > How can I achieve 21.3 mAP? I am using 8 GPUs, CUDA 10.0, Python 3.7, and PyTorch 1.1.0. Thank you.
>
> Hi, have you ever reproduced the 21.3 mAP result?

No, it does not get over 16.35 mAP. Have you tried training it?

guancheng817 commented 4 years ago

> > > How can I achieve 21.3 mAP? I am using 8 GPUs, CUDA 10.0, Python 3.7, and PyTorch 1.1.0. Thank you.
> >
> > Hi, have you ever reproduced the 21.3 mAP result?
>
> No, it does not get over 0.16 mAP. Have you tried training it?

I only get 0.3~0.4 mAP. When I use another network as the backbone, I can get about 12 mAP.