tianweiy / CenterPoint

Train resulting as NaN loss #396

Open OuYaozhong opened 1 year ago

OuYaozhong commented 1 year ago

Hi, I am trying to train from scratch to reproduce the training results of the paper.

However, the training loss becomes NaN after several hours.

I built the environment following INSTALL.md, using the nuScenes dataset.

Environment:

GPU Driver:

image

PyTorch:

ffmpeg          4.3      hf484d3e_0                        pytorch
pytorch         2.0.1    py3.9_cuda11.8_cudnn8.7.0_0       pytorch
pytorch-cuda    11.8     h7e8668a_5                        pytorch
pytorch-mutex   1.0      cuda                              pytorch
torchtriton     2.0.0    py39                              pytorch
torchvision     0.15.2   py39_cu118                        pytorch

CUDA_HOME:

$ echo $CUDA_HOME
/usr/local/cuda-11.8/

Some modifications:

image

train.py

image

requirement.txt

image

[For both deform_pool_cuda.cpp and deform_conv_cuda.cpp, substitute all "AT_CHECK" with "TORCH_CHECK".]

deform_pool_cuda.cpp

image

deform_conv_cuda.cpp

image
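The macro rename described above can also be applied mechanically rather than by hand. The following is a minimal sketch, not part of the CenterPoint repository: the helper names `replace_at_check` and `patch_file` are hypothetical, and the file paths are left to the caller because the post does not give them. It matches `AT_CHECK` only as a whole identifier, so longer names that merely contain it are left untouched.

```python
import re
import sys
from pathlib import Path

# Whole-identifier match: "AT_CHECK_FOO" and "MY_AT_CHECK" are not rewritten.
MACRO_RE = re.compile(r"\bAT_CHECK\b")


def replace_at_check(source: str) -> tuple[str, int]:
    """Return the rewritten source text and the number of substitutions made."""
    return MACRO_RE.subn("TORCH_CHECK", source)


def patch_file(path: Path) -> int:
    """Rewrite one file in place (hypothetical helper); return the substitution count."""
    patched, count = replace_at_check(path.read_text())
    if count:
        path.write_text(patched)
    return count


if __name__ == "__main__":
    # Paths come from the command line; nothing is patched if none are given.
    for name in sys.argv[1:]:
        print(name, patch_file(Path(name)))
```

Saved under any script name, it would be run with the paths to deform_pool_cuda.cpp and deform_conv_cuda.cpp as arguments, followed by a rebuild of the extension.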

Log:

-> Command: (mvp) $ torchrun --nproc_per_node=2 ./tools/train.py ./configs/mvp/nusc_centerpoint_voxelnet_0075voxel_fix_bn_z_scale_virtual.py

-> Log file CenterPoint/work_dirs/nusc_centerpoint_voxelnet_0075voxel_fix_bn_z_scale_virtual/20230709_212542.log
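To find the first non-finite loss in a log file like this one without reading it by eye, a short scanner can help. This is a minimal sketch that assumes the line layout of the log excerpt in this post (`Epoch [e/E][i/N]` header lines followed by `task : [...], loss: ...` lines). The function name `first_non_finite` and both regular expressions are illustrative and not part of CenterPoint.

```python
import math
import re

# Both patterns are assumptions derived from the log lines shown in this post.
EPOCH_RE = re.compile(r"(\S+ \S+) - INFO - Epoch \[(\d+)/\d+\]\[(\d+)/\d+\]")
LOSS_RE = re.compile(r"- INFO - task : (\[.*?\]), loss: (\S+?),")


def first_non_finite(lines):
    """Return (timestamp, epoch, iteration, task) for the first NaN/inf loss, or None."""
    current = None  # most recent "Epoch [...][...]" header seen so far
    for line in lines:
        header = EPOCH_RE.search(line)
        if header:
            current = (header.group(1), int(header.group(2)), int(header.group(3)))
            continue
        entry = LOSS_RE.search(line)
        if entry and current is not None:
            value = float(entry.group(2))  # float() accepts "nan" and "inf"
            if not math.isfinite(value):
                return current + (entry.group(1),)
    return None
```

Passing an open file object for the log (for example `first_non_finite(open(path))`, where `path` is the log file) returns the first affected entry. In the excerpt in this post, the first `nan` appears in the entry for iteration 6840 of epoch 1, logged at 02:07:07. Entries are written every 5 iterations, so this presumably only narrows the failure to the iterations since the previous entry.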

2023-07-09 21:25:42,078 - INFO - Start running, host: ..., work_dir: ...CenterPoint/work_dirs/nusc_centerpoint_voxelnet_0075voxel_fix_bn_z_scale_virtual
2023-07-09 21:25:42,078 - INFO - workflow: [('train', 1)], max: 20 epochs
2023-07-09 21:27:00,427 - INFO - Epoch [1/20][5/15448]  lr: 0.00010, eta: 56 days, 0:44:16, time: 15.669, data_time: 6.412, transfer_time: 0.023, forward_time: 0.589, loss_parse_time: 0.000 memory: 7003, 
2023-07-09 21:27:00,427 - INFO - task : ['car'], loss: 16.5108, hm_loss: 5.3816, loc_loss: 44.5170, loc_loss_elem: ['3.9960', '4.4768', '13.8939', '3.5633', '3.9826', '2.5209', '3.9625', '4.9418', '6.3818', '3.9209'], num_positive: 54.0000
2023-07-09 21:27:00,428 - INFO - task : ['truck', 'construction_vehicle'], loss: 33.3630, hm_loss: 21.8048, loc_loss: 46.2328, loc_loss_elem: ['4.5750', '7.1211', '7.1248', '5.1617', '3.3515', '5.1795', '4.3006', '8.2093', '5.3650', '5.8522'], num_positive: 27.6000
2023-07-09 21:27:00,428 - INFO - task : ['bus', 'trailer'], loss: 46.8869, hm_loss: 39.0528, loc_loss: 31.3363, loc_loss_elem: ['3.1959', '4.1191', '4.2974', '2.6856', '3.9514', '3.1382', '4.2657', '7.0581', '4.3812', '3.3026'], num_positive: 22.2000
2023-07-09 21:27:00,428 - INFO - task : ['barrier'], loss: 38.3416, hm_loss: 19.3048, loc_loss: 76.1475, loc_loss_elem: ['17.8052', '7.7582', '11.3469', '7.6502', '6.7226', '6.7809', '6.5041', '6.0544', '10.2082', '5.3636'], num_positive: 38.0000
2023-07-09 21:27:00,428 - INFO - task : ['motorcycle', 'bicycle'], loss: 32.1542, hm_loss: 19.0562, loc_loss: 52.3918, loc_loss_elem: ['6.7211', '6.6712', '15.9350', '3.7224', '3.3566', '4.6889', '4.7669', '5.6455', '3.9710', '5.2431'], num_positive: 41.2000
2023-07-09 21:27:00,428 - INFO - task : ['pedestrian', 'traffic_cone'], loss: 42.4973, hm_loss: 25.1576, loc_loss: 69.3589, loc_loss_elem: ['6.9013', '6.1598', '14.6911', '4.6068', '5.6175', '5.6351', '9.9897', '10.2892', '8.2892', '13.4024'], num_positive: 28.4000

2023-07-09 21:27:07,797 - INFO - Epoch [1/20][10/15448] lr: 0.00010, eta: 30 days, 15:36:38, time: 1.474, data_time: 0.578, transfer_time: 0.021, forward_time: 0.369, loss_parse_time: 0.000 memory: 7527, 
2023-07-09 21:27:07,797 - INFO - task : ['car'], loss: 10.8206, hm_loss: 4.2498, loc_loss: 26.2833, loc_loss_elem: ['2.5260', '2.8105', '7.5488', '2.3111', '2.5731', '1.5947', '2.7838', '3.1014', '3.0734', '2.6687'], num_positive: 63.4000
2023-07-09 21:27:07,797 - INFO - task : ['truck', 'construction_vehicle'], loss: 19.5815, hm_loss: 13.6572, loc_loss: 23.6973, loc_loss_elem: ['2.4139', '2.9423', '3.6419', '2.8161', '1.9625', '3.2340', '2.6705', '4.2424', '2.6038', '2.7002'], num_positive: 29.0000
2023-07-09 21:27:07,797 - INFO - task : ['bus', 'trailer'], loss: 24.2755, hm_loss: 17.1798, loc_loss: 28.3831, loc_loss_elem: ['3.7832', '3.0236', '5.5242', '2.1372', '3.5866', '2.3797', '2.9376', '3.9165', '3.2531', '3.3248'], num_positive: 23.8000
2023-07-09 21:27:07,797 - INFO - task : ['barrier'], loss: 23.1140, hm_loss: 12.7906, loc_loss: 41.2938, loc_loss_elem: ['7.4719', '3.3597', '8.9418', '3.7879', '4.3072', '3.5921', '3.7401', '4.2947', '4.8814', '3.3449'], num_positive: 28.6000
2023-07-09 21:27:07,797 - INFO - task : ['motorcycle', 'bicycle'], loss: 18.0098, hm_loss: 9.4185, loc_loss: 34.3654, loc_loss_elem: ['3.6293', '4.8634', '9.6247', '2.7023', '2.6011', '3.8894', '3.2744', '3.7622', '2.2092', '3.4387'], num_positive: 40.0000
2023-07-09 21:27:07,797 - INFO - task : ['pedestrian', 'traffic_cone'], loss: 22.2361, hm_loss: 11.5873, loc_loss: 42.5953, loc_loss_elem: ['3.4575', '4.7063', '8.8348', '3.5326', '3.2648', '3.1636', '6.7733', '4.9222', '6.1251', '7.1714'], num_positive: 37.8000

2023-07-09 21:27:20,877 - INFO - Epoch [1/20][15/15448] lr: 0.00010, eta: 23 days, 13:14:31, time: 2.616, data_time: 1.968, transfer_time: 0.020, forward_time: 0.243, loss_parse_time: 0.000 memory: 7766, 
2023-07-09 21:27:20,878 - INFO - task : ['car'], loss: 13.5274, hm_loss: 5.7041, loc_loss: 31.2934, loc_loss_elem: ['3.8299', '3.2160', '7.3882', '3.1934', '2.9355', '2.7282', '3.0090', '3.5887', '3.2064', '3.4763'], num_positive: 42.0000
2023-07-09 21:27:20,878 - INFO - task : ['truck', 'construction_vehicle'], loss: 14.0053, hm_loss: 9.6582, loc_loss: 17.3887, loc_loss_elem: ['1.9723', '2.1698', '2.6261', '1.7503', '1.6135', '2.1626', '2.1163', '2.3670', '1.9544', '2.2428'], num_positive: 36.6000
2023-07-09 21:27:20,878 - INFO - task : ['bus', 'trailer'], loss: 21.6122, hm_loss: 15.7104, loc_loss: 23.6072, loc_loss_elem: ['2.6558', '2.9196', '3.7883', '2.3317', '2.9741', '2.3649', '3.3527', '2.7984', '2.3547', '2.9879'], num_positive: 22.4000
2023-07-09 21:27:20,878 - INFO - task : ['barrier'], loss: 11.5021, hm_loss: 6.7298, loc_loss: 19.0889, loc_loss_elem: ['2.9508', '2.0841', '3.5892', '1.8986', '1.7477', '1.4164', '2.2023', '1.8822', '2.8069', '1.7783'], num_positive: 34.2000
2023-07-09 21:27:20,878 - INFO - task : ['motorcycle', 'bicycle'], loss: 13.0546, hm_loss: 8.3679, loc_loss: 18.7469, loc_loss_elem: ['2.0107', '2.9339', '4.1940', '1.6526', '1.2773', '1.7768', '2.2384', '2.7843', '1.8571', '2.0400'], num_positive: 41.0000
2023-07-09 21:27:20,878 - INFO - task : ['pedestrian', 'traffic_cone'], loss: 18.9593, hm_loss: 12.1488, loc_loss: 27.2421, loc_loss_elem: ['2.8225', '3.0878', '4.6823', '2.6078', '3.1110', '2.6306', '3.2760', '3.4880', '2.9377', '4.0097'], num_positive: 29.8000

2023-07-09 21:27:34,247 - INFO - Epoch [1/20][20/15448] lr: 0.00010, eta: 20 days, 1:15:53, time: 2.673, data_time: 1.715, transfer_time: 0.021, forward_time: 0.437, loss_parse_time: 0.000 memory: 7766, 
2023-07-09 21:27:34,247 - INFO - task : ['car'], loss: 7.3018, hm_loss: 4.2332, loc_loss: 12.2747, loc_loss_elem: ['1.0438', '1.2418', '3.3763', '0.7369', '1.7604', '0.6061', '1.6393', '2.1004', '1.5989', '1.1626'], num_positive: 49.4000
2023-07-09 21:27:34,247 - INFO - task : ['truck', 'construction_vehicle'], loss: 14.1740, hm_loss: 9.7614, loc_loss: 17.6503, loc_loss_elem: ['1.6838', '2.6925', '2.3009', '1.5748', '1.7685', '2.0444', '1.6115', '2.5455', '1.7810', '2.9730'], num_positive: 30.6000
2023-07-09 21:27:34,247 - INFO - task : ['bus', 'trailer'], loss: 22.9734, hm_loss: 15.6939, loc_loss: 29.1177, loc_loss_elem: ['1.9653', '2.6295', '4.8794', '2.6741', '4.8540', '2.0434', '4.9209', '3.8473', '4.8242', '3.4941'], num_positive: 19.0000
2023-07-09 21:27:34,247 - INFO - task : ['barrier'], loss: 11.5642, hm_loss: 5.5589, loc_loss: 24.0213, loc_loss_elem: ['4.3741', '2.1053', '4.1135', '2.3074', '2.2710', '1.8462', '2.1599', '2.2775', '3.9803', '2.1360'], num_positive: 32.8000
2023-07-09 21:27:34,247 - INFO - task : ['motorcycle', 'bicycle'], loss: 11.9095, hm_loss: 7.4250, loc_loss: 17.9382, loc_loss_elem: ['1.7692', '2.8661', '3.0344', '1.6079', '1.0668', '1.8657', '2.4863', '3.1576', '2.1147', '2.4845'], num_positive: 42.0000
2023-07-09 21:27:34,247 - INFO - task : ['pedestrian', 'traffic_cone'], loss: 13.4539, hm_loss: 8.4542, loc_loss: 19.9990, loc_loss_elem: ['2.2295', '2.9165', '4.0986', '1.7874', '2.1307', '1.4518', '2.2649', '1.8076', '2.1415', '2.4287'], num_positive: 40.6000

.......

2023-07-10 02:06:40,629 - INFO - Epoch [1/20][6830/15448]   lr: 0.00011, eta: 8 days, 15:09:08, time: 2.241, data_time: 0.637, transfer_time: 0.018, forward_time: 1.306, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:06:40,629 - INFO - task : ['car'], loss: 1.5883, hm_loss: 1.0870, loc_loss: 2.0051, loc_loss_elem: ['0.1995', '0.2067', '0.2430', '0.0918', '0.0753', '0.0880', '0.7774', '0.9751', '0.3889', '0.3613'], num_positive: 30.4000
2023-07-10 02:06:40,629 - INFO - task : ['truck', 'construction_vehicle'], loss: 2.0589, hm_loss: 1.4553, loc_loss: 2.4147, loc_loss_elem: ['0.2154', '0.2262', '0.3614', '0.1535', '0.1547', '0.1764', '0.1666', '0.3449', '0.5382', '0.4866'], num_positive: 35.4000
2023-07-10 02:06:40,629 - INFO - task : ['bus', 'trailer'], loss: 1.8101, hm_loss: 1.1679, loc_loss: 2.5687, loc_loss_elem: ['0.2064', '0.2135', '0.3522', '0.1034', '0.0990', '0.1240', '0.6667', '1.1368', '0.5485', '0.5609'], num_positive: 22.4000
2023-07-10 02:06:40,629 - INFO - task : ['barrier'], loss: 1.8850, hm_loss: 1.2924, loc_loss: 2.3704, loc_loss_elem: ['0.1753', '0.1766', '0.2282', '0.1886', '0.2550', '0.1378', '0.0398', '0.0670', '0.7153', '0.4721'], num_positive: 10.0000
2023-07-10 02:06:40,629 - INFO - task : ['motorcycle', 'bicycle'], loss: 1.3836, hm_loss: 0.8034, loc_loss: 2.3206, loc_loss_elem: ['0.1548', '0.1649', '0.1787', '0.1698', '0.1068', '0.1270', '0.6427', '0.9525', '0.5001', '0.5994'], num_positive: 39.6000
2023-07-10 02:06:40,629 - INFO - task : ['pedestrian', 'traffic_cone'], loss: 1.4710, hm_loss: 0.8603, loc_loss: 2.4427, loc_loss_elem: ['0.1435', '0.1566', '0.2066', '0.2206', '0.2492', '0.1684', '0.2490', '0.2981', '0.5378', '0.6507'], num_positive: 33.4000

2023-07-10 02:06:51,529 - INFO - Epoch [1/20][6835/15448]   lr: 0.00011, eta: 8 days, 15:07:52, time: 2.180, data_time: 0.267, transfer_time: 0.019, forward_time: 1.627, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:06:51,529 - INFO - task : ['car'], loss: 1.4195, hm_loss: 0.9496, loc_loss: 1.8796, loc_loss_elem: ['0.1860', '0.2031', '0.2257', '0.0755', '0.0716', '0.0957', '0.4902', '0.4735', '0.4142', '0.4150'], num_positive: 57.4000
2023-07-10 02:06:51,529 - INFO - task : ['truck', 'construction_vehicle'], loss: 2.0764, hm_loss: 1.4540, loc_loss: 2.4893, loc_loss_elem: ['0.2232', '0.2075', '0.3721', '0.1433', '0.1466', '0.1621', '0.4089', '0.4865', '0.5134', '0.5420'], num_positive: 30.6000
2023-07-10 02:06:51,529 - INFO - task : ['bus', 'trailer'], loss: 2.2752, hm_loss: 1.5786, loc_loss: 2.7862, loc_loss_elem: ['0.2167', '0.2029', '0.4624', '0.0892', '0.1081', '0.1145', '0.9012', '1.2101', '0.5217', '0.6485'], num_positive: 20.4000
2023-07-10 02:06:51,529 - INFO - task : ['barrier'], loss: 1.6134, hm_loss: 1.0253, loc_loss: 2.3525, loc_loss_elem: ['0.1645', '0.1437', '0.1958', '0.1344', '0.2058', '0.1068', '0.0269', '0.0427', '0.7815', '0.6062'], num_positive: 12.4000
2023-07-10 02:06:51,529 - INFO - task : ['motorcycle', 'bicycle'], loss: 1.2025, hm_loss: 0.6419, loc_loss: 2.2425, loc_loss_elem: ['0.1503', '0.1640', '0.1258', '0.1651', '0.0968', '0.1134', '0.6563', '0.8597', '0.5131', '0.6108'], num_positive: 42.8000
2023-07-10 02:06:51,529 - INFO - task : ['pedestrian', 'traffic_cone'], loss: 1.2541, hm_loss: 0.6807, loc_loss: 2.2936, loc_loss_elem: ['0.1378', '0.1408', '0.1899', '0.1861', '0.2169', '0.1329', '0.3152', '0.3045', '0.6090', '0.5562'], num_positive: 40.0000

2023-07-10 02:07:07,907 - INFO - Epoch [1/20][6840/15448]   lr: 0.00011, eta: 8 days, 15:10:38, time: 3.276, data_time: 0.328, transfer_time: 0.019, forward_time: 2.667, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:07:07,907 - INFO - task : ['car'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 46.6000
2023-07-10 02:07:07,907 - INFO - task : ['truck', 'construction_vehicle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 31.4000
2023-07-10 02:07:07,907 - INFO - task : ['bus', 'trailer'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 22.6000
2023-07-10 02:07:07,907 - INFO - task : ['barrier'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 37.4000
2023-07-10 02:07:07,907 - INFO - task : ['motorcycle', 'bicycle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 39.0000
2023-07-10 02:07:07,907 - INFO - task : ['pedestrian', 'traffic_cone'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 44.8000

2023-07-10 02:07:18,348 - INFO - Epoch [1/20][6845/15448]   lr: 0.00011, eta: 8 days, 15:09:01, time: 2.088, data_time: 0.341, transfer_time: 0.020, forward_time: 1.452, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:07:18,349 - INFO - task : ['car'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 46.6000
2023-07-10 02:07:18,349 - INFO - task : ['truck', 'construction_vehicle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 32.8000
2023-07-10 02:07:18,349 - INFO - task : ['bus', 'trailer'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 23.4000
2023-07-10 02:07:18,349 - INFO - task : ['barrier'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 11.6000
2023-07-10 02:07:18,349 - INFO - task : ['motorcycle', 'bicycle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 40.2000
2023-07-10 02:07:18,349 - INFO - task : ['pedestrian', 'traffic_cone'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 37.0000

2023-07-10 02:07:28,850 - INFO - Epoch [1/20][6850/15448]   lr: 0.00011, eta: 8 days, 15:07:28, time: 2.100, data_time: 0.304, transfer_time: 0.020, forward_time: 1.506, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:07:28,850 - INFO - task : ['car'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 40.4000
2023-07-10 02:07:28,850 - INFO - task : ['truck', 'construction_vehicle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 36.2000
2023-07-10 02:07:28,851 - INFO - task : ['bus', 'trailer'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 25.0000
2023-07-10 02:07:28,851 - INFO - task : ['barrier'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 20.0000
2023-07-10 02:07:28,851 - INFO - task : ['motorcycle', 'bicycle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 43.0000
2023-07-10 02:07:28,851 - INFO - task : ['pedestrian', 'traffic_cone'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 51.6000

2023-07-10 02:07:38,815 - INFO - Epoch [1/20][6855/15448]   lr: 0.00011, eta: 8 days, 15:05:31, time: 1.993, data_time: 0.209, transfer_time: 0.019, forward_time: 1.508, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:07:38,815 - INFO - task : ['car'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 38.6000
2023-07-10 02:07:38,815 - INFO - task : ['truck', 'construction_vehicle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 30.0000
2023-07-10 02:07:38,815 - INFO - task : ['bus', 'trailer'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 22.8000
2023-07-10 02:07:38,815 - INFO - task : ['barrier'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 19.2000
2023-07-10 02:07:38,815 - INFO - task : ['motorcycle', 'bicycle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 40.8000
2023-07-10 02:07:38,815 - INFO - task : ['pedestrian', 'traffic_cone'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 36.0000

2023-07-10 02:07:56,825 - INFO - Epoch [1/20][6860/15448]   lr: 0.00011, eta: 8 days, 15:09:28, time: 3.602, data_time: 0.233, transfer_time: 0.019, forward_time: 3.080, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:07:56,825 - INFO - task : ['car'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 41.2000
2023-07-10 02:07:56,825 - INFO - task : ['truck', 'construction_vehicle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 32.0000
2023-07-10 02:07:56,825 - INFO - task : ['bus', 'trailer'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 21.2000
2023-07-10 02:07:56,825 - INFO - task : ['barrier'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 21.8000
2023-07-10 02:07:56,826 - INFO - task : ['motorcycle', 'bicycle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 41.0000
2023-07-10 02:07:56,826 - INFO - task : ['pedestrian', 'traffic_cone'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 30.8000

2023-07-10 02:08:06,961 - INFO - Epoch [1/20][6865/15448]   lr: 0.00011, eta: 8 days, 15:07:39, time: 2.027, data_time: 0.507, transfer_time: 0.020, forward_time: 1.235, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:08:06,961 - INFO - task : ['car'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 46.4000
2023-07-10 02:08:06,962 - INFO - task : ['truck', 'construction_vehicle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 32.0000
2023-07-10 02:08:06,962 - INFO - task : ['bus', 'trailer'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 21.8000
2023-07-10 02:08:06,962 - INFO - task : ['barrier'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 35.6000
2023-07-10 02:08:06,962 - INFO - task : ['motorcycle', 'bicycle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 40.8000
2023-07-10 02:08:06,962 - INFO - task : ['pedestrian', 'traffic_cone'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 44.0000

2023-07-10 02:08:16,884 - INFO - Epoch [1/20][6870/15448]   lr: 0.00011, eta: 8 days, 15:05:40, time: 1.984, data_time: 0.392, transfer_time: 0.019, forward_time: 1.312, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:08:16,884 - INFO - task : ['car'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 50.4000
2023-07-10 02:08:16,884 - INFO - task : ['truck', 'construction_vehicle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 33.8000
2023-07-10 02:08:16,884 - INFO - task : ['bus', 'trailer'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 20.8000
2023-07-10 02:08:16,884 - INFO - task : ['barrier'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 25.8000
2023-07-10 02:08:16,884 - INFO - task : ['motorcycle', 'bicycle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 38.0000
2023-07-10 02:08:16,884 - INFO - task : ['pedestrian', 'traffic_cone'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 27.8000

2023-07-10 02:08:27,032 - INFO - Epoch [1/20][6875/15448]   lr: 0.00011, eta: 8 days, 15:03:51, time: 2.030, data_time: 0.397, transfer_time: 0.019, forward_time: 1.355, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:08:27,032 - INFO - task : ['car'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 46.4000
2023-07-10 02:08:27,032 - INFO - task : ['truck', 'construction_vehicle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 31.4000
2023-07-10 02:08:27,032 - INFO - task : ['bus', 'trailer'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 22.6000
2023-07-10 02:08:27,032 - INFO - task : ['barrier'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 19.6000
2023-07-10 02:08:27,032 - INFO - task : ['motorcycle', 'bicycle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 40.4000
2023-07-10 02:08:27,032 - INFO - task : ['pedestrian', 'traffic_cone'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 30.0000

2023-07-10 02:08:45,481 - INFO - Epoch [1/20][6880/15448]   lr: 0.00011, eta: 8 days, 15:08:07, time: 3.690, data_time: 0.764, transfer_time: 0.019, forward_time: 2.643, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:08:45,481 - INFO - task : ['car'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 50.2000
2023-07-10 02:08:45,481 - INFO - task : ['truck', 'construction_vehicle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 35.0000
2023-07-10 02:08:45,481 - INFO - task : ['bus', 'trailer'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 21.0000
2023-07-10 02:08:45,481 - INFO - task : ['barrier'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 23.8000
2023-07-10 02:08:45,481 - INFO - task : ['motorcycle', 'bicycle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 40.4000
2023-07-10 02:08:45,481 - INFO - task : ['pedestrian', 'traffic_cone'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 33.0000

2023-07-10 02:08:56,070 - INFO - Epoch [1/20][6885/15448]   lr: 0.00011, eta: 8 days, 15:06:38, time: 2.118, data_time: 0.302, transfer_time: 0.020, forward_time: 1.524, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:08:56,071 - INFO - task : ['car'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 50.6000
2023-07-10 02:08:56,071 - INFO - task : ['truck', 'construction_vehicle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 31.6000
2023-07-10 02:08:56,071 - INFO - task : ['bus', 'trailer'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 23.0000
2023-07-10 02:08:56,071 - INFO - task : ['barrier'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 43.4000
2023-07-10 02:08:56,071 - INFO - task : ['motorcycle', 'bicycle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 41.0000
2023-07-10 02:08:56,071 - INFO - task : ['pedestrian', 'traffic_cone'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 49.8000

2023-07-10 02:09:05,727 - INFO - Epoch [1/20][6890/15448]   lr: 0.00011, eta: 8 days, 15:04:28, time: 1.931, data_time: 0.227, transfer_time: 0.020, forward_time: 1.411, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:09:05,728 - INFO - task : ['car'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 45.8000
2023-07-10 02:09:05,728 - INFO - task : ['truck', 'construction_vehicle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 30.2000
2023-07-10 02:09:05,728 - INFO - task : ['bus', 'trailer'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 22.0000
2023-07-10 02:09:05,728 - INFO - task : ['barrier'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 30.0000
2023-07-10 02:09:05,728 - INFO - task : ['motorcycle', 'bicycle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 40.4000
2023-07-10 02:09:05,728 - INFO - task : ['pedestrian', 'traffic_cone'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 39.6000

2023-07-10 02:09:15,867 - INFO - Epoch [1/20][6895/15448]   lr: 0.00011, eta: 8 days, 15:02:39, time: 2.028, data_time: 0.142, transfer_time: 0.020, forward_time: 1.608, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:09:15,867 - INFO - task : ['car'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 44.2000
2023-07-10 02:09:15,867 - INFO - task : ['truck', 'construction_vehicle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 31.2000
2023-07-10 02:09:15,867 - INFO - task : ['bus', 'trailer'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 23.2000
2023-07-10 02:09:15,867 - INFO - task : ['barrier'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 27.4000
2023-07-10 02:09:15,867 - INFO - task : ['motorcycle', 'bicycle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 40.4000
2023-07-10 02:09:15,867 - INFO - task : ['pedestrian', 'traffic_cone'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 28.8000

2023-07-10 02:09:33,447 - INFO - Epoch [1/20][6900/15448]   lr: 0.00011, eta: 8 days, 15:06:17, time: 3.516, data_time: 0.200, transfer_time: 0.019, forward_time: 3.026, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:09:33,448 - INFO - task : ['car'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 62.0000
2023-07-10 02:09:33,448 - INFO - task : ['truck', 'construction_vehicle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 32.8000
2023-07-10 02:09:33,448 - INFO - task : ['bus', 'trailer'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 19.0000
2023-07-10 02:09:33,448 - INFO - task : ['barrier'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 12.6000
2023-07-10 02:09:33,448 - INFO - task : ['motorcycle', 'bicycle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 40.2000
2023-07-10 02:09:33,448 - INFO - task : ['pedestrian', 'traffic_cone'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 46.4000

2023-07-10 02:09:43,988 - INFO - Epoch [1/20][6905/15448]   lr: 0.00011, eta: 8 days, 15:04:45, time: 2.108, data_time: 0.194, transfer_time: 0.019, forward_time: 1.637, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:09:43,989 - INFO - task : ['car'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 46.8000
2023-07-10 02:09:43,989 - INFO - task : ['truck', 'construction_vehicle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 30.0000
2023-07-10 02:09:43,989 - INFO - task : ['bus', 'trailer'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 20.4000
2023-07-10 02:09:43,989 - INFO - task : ['barrier'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 41.2000
2023-07-10 02:09:43,989 - INFO - task : ['motorcycle', 'bicycle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 39.0000
2023-07-10 02:09:43,989 - INFO - task : ['pedestrian', 'traffic_cone'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 50.8000

2023-07-10 02:09:54,640 - INFO - Epoch [1/20][6910/15448]   lr: 0.00011, eta: 8 days, 15:03:19, time: 2.130, data_time: 0.172, transfer_time: 0.019, forward_time: 1.671, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:09:54,641 - INFO - task : ['car'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 47.6000
2023-07-10 02:09:54,641 - INFO - task : ['truck', 'construction_vehicle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 32.6000
2023-07-10 02:09:54,641 - INFO - task : ['bus', 'trailer'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 21.8000
2023-07-10 02:09:54,641 - INFO - task : ['barrier'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 34.8000
2023-07-10 02:09:54,641 - INFO - task : ['motorcycle', 'bicycle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 42.6000
2023-07-10 02:09:54,641 - INFO - task : ['pedestrian', 'traffic_cone'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 47.0000

2023-07-10 02:10:04,009 - INFO - Epoch [1/20][6915/15448]   lr: 0.00011, eta: 8 days, 15:00:57, time: 1.874, data_time: 0.090, transfer_time: 0.019, forward_time: 1.504, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:10:04,009 - INFO - task : ['car'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 46.8000
2023-07-10 02:10:04,009 - INFO - task : ['truck', 'construction_vehicle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 33.0000
2023-07-10 02:10:04,009 - INFO - task : ['bus', 'trailer'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 20.4000
2023-07-10 02:10:04,009 - INFO - task : ['barrier'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 20.8000
2023-07-10 02:10:04,009 - INFO - task : ['motorcycle', 'bicycle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 39.6000
2023-07-10 02:10:04,009 - INFO - task : ['pedestrian', 'traffic_cone'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 42.8000

2023-07-10 02:10:22,903 - INFO - Epoch [1/20][6920/15448]   lr: 0.00011, eta: 8 days, 15:05:31, time: 3.779, data_time: 0.239, transfer_time: 0.020, forward_time: 3.257, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:10:22,903 - INFO - task : ['car'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 34.0000
2023-07-10 02:10:22,903 - INFO - task : ['truck', 'construction_vehicle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 34.4000
2023-07-10 02:10:22,903 - INFO - task : ['bus', 'trailer'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 23.0000
2023-07-10 02:10:22,903 - INFO - task : ['barrier'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 41.2000
2023-07-10 02:10:22,903 - INFO - task : ['motorcycle', 'bicycle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 40.8000
2023-07-10 02:10:22,903 - INFO - task : ['pedestrian', 'traffic_cone'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 34.8000

2023-07-10 02:10:33,636 - INFO - Epoch [1/20][6925/15448]   lr: 0.00011, eta: 8 days, 15:04:09, time: 2.147, data_time: 0.043, transfer_time: 0.020, forward_time: 1.819, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:10:33,637 - INFO - task : ['car'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 47.4000
2023-07-10 02:10:33,637 - INFO - task : ['truck', 'construction_vehicle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 33.8000
2023-07-10 02:10:33,637 - INFO - task : ['bus', 'trailer'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 24.0000
2023-07-10 02:10:33,637 - INFO - task : ['barrier'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 25.6000
2023-07-10 02:10:33,637 - INFO - task : ['motorcycle', 'bicycle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 42.2000
2023-07-10 02:10:33,637 - INFO - task : ['pedestrian', 'traffic_cone'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 38.0000

2023-07-10 02:10:42,867 - INFO - Epoch [1/20][6930/15448]   lr: 0.00011, eta: 8 days, 15:01:41, time: 1.846, data_time: 0.041, transfer_time: 0.019, forward_time: 1.519, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:10:42,867 - INFO - task : ['car'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 36.4000
2023-07-10 02:10:42,868 - INFO - task : ['truck', 'construction_vehicle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 34.2000
2023-07-10 02:10:42,868 - INFO - task : ['bus', 'trailer'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 20.4000
2023-07-10 02:10:42,868 - INFO - task : ['barrier'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 30.4000
2023-07-10 02:10:42,868 - INFO - task : ['motorcycle', 'bicycle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 41.2000
2023-07-10 02:10:42,868 - INFO - task : ['pedestrian', 'traffic_cone'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 45.8000

2023-07-10 02:10:55,155 - INFO - Epoch [1/20][6935/15448]   lr: 0.00011, eta: 8 days, 15:01:26, time: 2.458, data_time: 0.087, transfer_time: 0.020, forward_time: 2.091, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:10:55,156 - INFO - task : ['car'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 48.8000
2023-07-10 02:10:55,156 - INFO - task : ['truck', 'construction_vehicle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 30.0000
2023-07-10 02:10:55,156 - INFO - task : ['bus', 'trailer'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 20.8000
2023-07-10 02:10:55,156 - INFO - task : ['barrier'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 23.0000
2023-07-10 02:10:55,156 - INFO - task : ['motorcycle', 'bicycle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 40.6000
2023-07-10 02:10:55,156 - INFO - task : ['pedestrian', 'traffic_cone'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 40.6000

Then the program ends with the following error:

[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1049167, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1803980 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1049166, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1803981 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1575427 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 1575428) of binary: .../.conda/envs/mvp/bin/python
Traceback (most recent call last):
  File ".../.conda/envs/mvp/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.0.1', 'console_scripts', 'torchrun')())
  File "..../.conda/envs/mvp/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File .....conda/envs/mvp/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "....../.conda/envs/mvp/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "......./.conda/envs/mvp/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "......./.conda/envs/mvp/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
========================================================
./tools/train.py FAILED
--------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-07-10_02:41:36
  host      : AI-3090
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 1575428)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 1575428
========================================================

Could anyone help me solve this problem? @tianweiy

tianweiy commented 1 year ago

I suggest using OpenPCDet: https://github.com/open-mmlab/OpenPCDet. This codebase is not actively maintained, so newer versions of torch / cuda / apex may cause unknown issues.
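
For anyone who still wants to debug this in the current codebase, a first step might be to find the exact iteration where the loss first became non-finite, so the last healthy values before it can be inspected. The sketch below scans a training log for that point. It assumes the log format shown in the excerpt above (`Epoch [e/E][i/N]` lines followed by `task : [...], loss: ...` lines). The function name `first_nonfinite` is hypothetical and not part of CenterPoint, and the script only locates the step; it does not explain why the loss diverged.

```python
import math
import re

# Patterns are assumptions based on the log excerpt in this issue;
# adjust them if your logger prints a different format.
ITER_RE = re.compile(r"Epoch \[(\d+)/\d+\]\[(\d+)/\d+\]")
TASK_RE = re.compile(r"task : (\[.*?\]), loss: (\S+?),")


def first_nonfinite(lines):
    """Return (epoch, iteration, task) for the first NaN/inf loss, or None."""
    current = None  # (epoch, iteration) of the most recent "Epoch [...]" line
    for line in lines:
        m = ITER_RE.search(line)
        if m:
            current = (int(m.group(1)), int(m.group(2)))
            continue
        m = TASK_RE.search(line)
        if m and current is not None:
            try:
                loss = float(m.group(2))  # float("nan") and float("inf") parse fine
            except ValueError:
                continue  # skip lines whose loss field is not a number
            if not math.isfinite(loss):
                return current + (m.group(1),)
    return None
```

To use it, pass an open log file, for example `first_nonfinite(open(path))` where `path` points at the `.log` file in the work directory, then read the few entries just before the reported iteration.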