This is quite strange. On my 8x 1080Ti machine, TridentNet R101 with SyncBN uses 1000% CPU and 45 GB of memory, and runs at 7.9 samples/s:
09-08 21:49:41 MEM usage: 12687 MiB
09-08 21:49:41 Initialized bbox_cls_logit_bias as bias: 0.0
09-08 21:49:41 Initialized bbox_cls_logit_weight as ["normal", {"sigma": 0.01}]: 0.0099972505
09-08 21:49:41 Initialized bbox_reg_delta_bias as bias: 0.0
09-08 21:49:41 Initialized bbox_reg_delta_weight as ["normal", {"sigma": 0.001}]: 0.0009971757
09-08 21:49:41 Initialized rpn_bbox_delta_bias as bias: 0.0
09-08 21:49:41 Initialized rpn_bbox_delta_weight as ["normal", {"sigma": 0.01}]: 0.010014005
09-08 21:49:41 Initialized rpn_cls_logit_bias as bias: 0.0
09-08 21:49:41 Initialized rpn_cls_logit_weight as ["normal", {"sigma": 0.01}]: 0.009995574
09-08 21:49:41 Initialized rpn_conv_3x3_bias as bias: 0.0
09-08 21:49:41 Initialized rpn_conv_3x3_weight as ["normal", {"sigma": 0.01}]: 0.009975786
09-08 21:49:41 Initialized stage3_unit21_conv2_offset_bias as bias: 0.0
09-08 21:49:41 Initialized stage3_unit21_conv2_offset_weight as weight: 0.029449591
09-08 21:49:41 Initialized stage3_unit22_conv2_offset_bias as bias: 0.0
09-08 21:49:41 Initialized stage3_unit22_conv2_offset_weight as weight: 0.029397728
09-08 21:49:41 Initialized stage3_unit23_conv2_offset_bias as bias: 0.0
09-08 21:49:41 Initialized stage3_unit23_conv2_offset_weight as weight: 0.029486561
[13:49:42] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
09-08 21:51:07 Epoch[0] Batch [20] Speed: 7.89 samples/sec Train-RpnAcc=0.528209, RpnL1=0.912266, RcnnAcc=0.570690, RcnnL1=2.843624,
09-08 21:52:08 Epoch[0] Batch [40] Speed: 7.87 samples/sec Train-RpnAcc=0.616592, RpnL1=0.828870, RcnnAcc=0.757691, RcnnL1=2.826370,
Could you please add more information about your platform?
Maybe you can use the built-in profiler to find the bottleneck.
Turn on the profile option under config -> General -> profile, then share the profile.json from your experiment directory so I can help you locate the problem.
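For reference, a minimal sketch of what that flag could look like in a SimpleDet-style config file; apart from the profile flag under General mentioned above, the field names below are illustrative assumptions, not the exact contents of the released config:

```python
# Hypothetical excerpt of a SimpleDet-style config
# (e.g. config/tridentnet_r101v2c4_c5_multiscale_addminival_3x_fp16.py).
# Only the `profile` flag is taken from the discussion above;
# the other fields are illustrative assumptions.
class General:
    name = "tridentnet_r101v2c4_c5_multiscale_addminival_3x_fp16"
    batch_image = 1      # images per GPU
    fp16 = True          # mixed-precision training
    profile = True       # write profile.json into the experiment directory
```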
During training, I get this error:
ValueError: could not broadcast input array from shape (334,5) into shape (100,5)
max_num_gt=100, but the actual number of gt bboxes in that image is 334. However, training still continues. Is it necessary to fix this error (or bug)?
@louielu1027 You should set max_num_gt to the maximum number of gt boxes in your dataset.
@huangzehao Thanks for the reply. What I mean is that with the current setting of 100, training throws this error but still continues afterwards (it does not stop). In that case, is it still necessary to set max_num_gt to the largest gt count?
@RogerChern Hi, I have got the profile.json, but it is too large and I don't know how to analyze this file...
Open your Chrome browser, type chrome://tracing/ in the address bar, and then load the profile.json.
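If the trace is too large for chrome://tracing, a rough alternative is to use MXNet's profiler API directly and print an aggregated per-operator summary instead of the full timeline. This is a generic mx.profiler sketch, not SimpleDet's built-in profile path:

```python
# Sketch: generate and summarize an MXNet profile programmatically,
# as an alternative to loading a huge profile.json in chrome://tracing.
import mxnet as mx

mx.profiler.set_config(profile_all=True,
                       aggregate_stats=True,      # keep per-operator summaries
                       filename='profile.json')   # chrome://tracing compatible trace
mx.profiler.set_state('run')

# ... run a few training batches here ...

mx.profiler.set_state('stop')
print(mx.profiler.dumps())   # text table of aggregated per-operator time
```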
You must increase max_num_gt, or the final prediction will only have 100 boxes.
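For context, the error comes from copying each image's gt boxes into a fixed-size (max_num_gt, 5) buffer. The snippet below is a minimal, illustrative reproduction of that pattern (variable names are assumptions, not SimpleDet's actual loader code):

```python
# Minimal reproduction of the padding pattern behind the reported error.
import numpy as np

max_num_gt = 100
gt_bbox = np.random.rand(334, 5)   # 334 gt boxes: (x1, y1, x2, y2, class)

padded = np.full((max_num_gt, 5), -1, dtype=np.float32)  # fixed-size gt buffer
try:
    padded[:len(gt_bbox)] = gt_bbox   # 334 rows into a 100-row slice
except ValueError as e:
    # raises the ValueError reported above:
    # (334,5) cannot be broadcast into (100,5)
    print(e)

# Raising max_num_gt above the largest per-image gt count avoids the error:
max_num_gt = 400
padded = np.full((max_num_gt, 5), -1, dtype=np.float32)
padded[:len(gt_bbox)] = gt_bbox       # fits: 334 <= 400
```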
python detection_test.py --config config/tridentnet_r101v2c4_c5_multiscale_addminival_3x_fp16.py should work.
On Wed, Sep 18, 2019 at 11:59 AM louielu1027 notifications@github.com wrote:
@RogerChern How do I use the config file "tridentnet_r101v2c4_c5_multiscale_addminival_3x_fp16.py" to test on COCO test-dev? Could you release your testing code? I have some problems with multi-scale testing on a single image.
This is my training command:
python detection_train.py --config config/tridentnet_r101v2c4_c5_multiscale_addminival_3x_fp16.py
[03:27:36] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
09-07 11:34:06 Epoch[0] Batch [20] Speed: 1.29 samples/sec Train-RpnAcc=0.554313, RpnL1=1.112343, RcnnAcc=0.540504, RcnnL1=2.836474,
09-07 11:39:05 Epoch[0] Batch [40] Speed: 1.60 samples/sec Train-RpnAcc=0.673116, RpnL1=0.962950, RcnnAcc=0.734650, RcnnL1=2.822075,
09-07 11:44:00 Epoch[0] Batch [60] Speed: 1.63 samples/sec Train-RpnAcc=0.753687, RpnL1=0.881134, RcnnAcc=0.796921, RcnnL1=2.820415,
09-07 11:50:30 Epoch[0] Batch [80] Speed: 1.23 samples/sec Train-RpnAcc=0.796221, RpnL1=0.812162, RcnnAcc=0.828212, RcnnL1=2.809139,
09-07 11:59:41 Epoch[0] Batch [100] Speed: 0.87 samples/sec Train-RpnAcc=0.820873, RpnL1=0.768616, RcnnAcc=0.846827, RcnnL1=2.811466,
09-07 12:05:51 Epoch[0] Batch [120] Speed: 1.30 samples/sec Train-RpnAcc=0.837998, RpnL1=0.729496, RcnnAcc=0.860605, RcnnL1=2.807163,
09-07 12:13:06 Epoch[0] Batch [140] Speed: 1.10 samples/sec Train-RpnAcc=0.850185, RpnL1=0.700992, RcnnAcc=0.869596, RcnnL1=2.804221,
09-07 12:20:44 Epoch[0] Batch [160] Speed: 1.05 samples/sec Train-RpnAcc=0.859989, RpnL1=0.677671, RcnnAcc=0.875914, RcnnL1=2.799287,
09-07 12:28:19 Epoch[0] Batch [180] Speed: 1.05 samples/sec Train-RpnAcc=0.867247, RpnL1=0.662773, RcnnAcc=0.880708, RcnnL1=2.793266,
09-07 12:36:31 Epoch[0] Batch [200] Speed: 0.97 samples/sec Train-RpnAcc=0.873369, RpnL1=0.647463, RcnnAcc=0.884404, RcnnL1=2.789041,
09-07 12:44:33 Epoch[0] Batch [220] Speed: 1.00 samples/sec Train-RpnAcc=0.878552, RpnL1=0.635126, RcnnAcc=0.887782, RcnnL1=2.782216,
09-07 12:51:30 Epoch[0] Batch [240] Speed: 1.15 samples/sec Train-RpnAcc=0.882332, RpnL1=0.627403, RcnnAcc=0.890268, RcnnL1=2.776619,
09-07 12:59:11 Epoch[0] Batch [260] Speed: 1.04 samples/sec Train-RpnAcc=0.885757, RpnL1=0.616748, RcnnAcc=0.892246, RcnnL1=2.770003,
I use 8 GPUs (1080Ti) and 16 CPUs. Why is it so slow? I have no idea...
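One quick way to confirm whether CPU-side data loading is starving the GPUs is to watch GPU utilization while training runs. This is a generic pynvml sketch, not part of SimpleDet; low and spiky utilization would point to an input-pipeline bottleneck:

```python
# Sketch: poll GPU utilization while training runs in another process.
# Requires `pip install nvidia-ml-py3` (pynvml); not part of SimpleDet.
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

for _ in range(30):                      # sample once per second for ~30 s
    utils = [pynvml.nvmlDeviceGetUtilizationRates(h).gpu for h in handles]
    print(" ".join(f"GPU{i}:{u:3d}%" for i, u in enumerate(utils)))
    time.sleep(1)

pynvml.nvmlShutdown()
```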