mit-han-lab / bevfusion

[ICRA'23] BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation
https://bevfusion.mit.edu
Apache License 2.0

Slow training speed #192

Closed xxxxhh closed 1 year ago

xxxxhh commented 1 year ago

Hi, thanks for your wonderful paper and code!

I am currently trying to reproduce the camera-only 3D object detection results (configs/nuscenes/det/centerhead/lssfpn/camera/256x704/swint/default.yaml).

However, when training the model on an 8×V100 (32 GB) machine, training is quite slow: the reported ETA is more than 5 days. Is that normal? Could you share the training time on your machines, and do you have any suggestions for speeding up training?

Thanks!

```
2022-10-17 18:10:44,756 - mmdet3d - INFO - Epoch [1][1850/2575] lr: 2.158e-05, eta: 5 days, 12:01:11, time: 9.773, data_time: 0.882, memory: 24562, loss/object/heatmap/task0: 1.6029, loss/object/bbox/task0: 0.6542, loss/object/heatmap/task1: 2.2840, loss/object/bbox/task1: 0.7268, loss/object/heatmap/task2: 2.2729, loss/object/bbox/task2: 0.7697, loss/object/heatmap/task3: 1.6434, loss/object/bbox/task3: 0.5922, loss/object/heatmap/task4: 1.9640, loss/object/bbox/task4: 0.6748, loss/object/heatmap/task5: 1.7826, loss/object/bbox/task5: 0.7014, loss: 15.6690, grad_norm: 28.0689
2022-10-17 18:18:40,129 - mmdet3d - INFO - Epoch [1][1900/2575] lr: 2.167e-05, eta: 5 days, 11:51:48, time: 9.507, data_time: 2.949, memory: 24562, loss/object/heatmap/task0: 1.5837, loss/object/bbox/task0: 0.6480, loss/object/heatmap/task1: 2.3231, loss/object/bbox/task1: 0.7300, loss/object/heatmap/task2: 2.2558, loss/object/bbox/task2: 0.7732, loss/object/heatmap/task3: 1.5614, loss/object/bbox/task3: 0.5765, loss/object/heatmap/task4: 2.0219, loss/object/bbox/task4: 0.6763, loss/object/heatmap/task5: 1.8135, loss/object/bbox/task5: 0.6997, loss: 15.6632, grad_norm: 27.9137
2022-10-17 18:26:48,149 - mmdet3d - INFO - Epoch [1][1950/2575] lr: 2.175e-05, eta: 5 days, 11:47:50, time: 9.760, data_time: 0.626, memory: 24562, loss/object/heatmap/task0: 1.5738, loss/object/bbox/task0: 0.6474, loss/object/heatmap/task1: 2.2906, loss/object/bbox/task1: 0.7330, loss/object/heatmap/task2: 2.2763, loss/object/bbox/task2: 0.7724, loss/object/heatmap/task3: 1.6024, loss/object/bbox/task3: 0.5850, loss/object/heatmap/task4: 1.9927, loss/object/bbox/task4: 0.6809, loss/object/heatmap/task5: 1.7860, loss/object/bbox/task5: 0.6982, loss: 15.6386, grad_norm: 28.2060
2022-10-17 18:34:30,180 - mmdet3d - INFO - Epoch [1][2000/2575] lr: 2.184e-05, eta: 5 days, 11:32:57, time: 9.241, data_time: 0.095, memory: 24562, loss/object/heatmap/task0: 1.5733, loss/object/bbox/task0: 0.6478, loss/object/heatmap/task1: 2.2377, loss/object/bbox/task1: 0.7236, loss/object/heatmap/task2: 2.2541, loss/object/bbox/task2: 0.7560, loss/object/heatmap/task3: 1.5284, loss/object/bbox/task3: 0.5873, loss/object/heatmap/task4: 1.9164, loss/object/bbox/task4: 0.6813, loss/object/heatmap/task5: 1.7515, loss/object/bbox/task5: 0.6947, loss: 15.3519, grad_norm: 29.4954
2022-10-17 18:42:25,542 - mmdet3d - INFO - Epoch [1][2050/2575] lr: 2.194e-05, eta: 5 days, 11:23:46, time: 9.507, data_time: 0.290, memory: 24566, loss/object/heatmap/task0: 1.5809, loss/object/bbox/task0: 0.6428, loss/object/heatmap/task1: 2.2866, loss/object/bbox/task1: 0.7285, loss/object/heatmap/task2: 2.2543, loss/object/bbox/task2: 0.7554, loss/object/heatmap/task3: 1.5566, loss/object/bbox/task3: 0.5840, loss/object/heatmap/task4: 2.0189, loss/object/bbox/task4: 0.6876, loss/object/heatmap/task5: 1.7689, loss/object/bbox/task5: 0.7011, loss: 15.5657, grad_norm: 29.4789
2022-10-17 18:50:32,387 - mmdet3d - INFO - Epoch [1][2100/2575] lr: 2.203e-05, eta: 5 days, 11:19:09, time: 9.737, data_time: 0.096, memory: 24566, loss/object/heatmap/task0: 1.5695, loss/object/bbox/task0: 0.6435, loss/object/heatmap/task1: 2.2889, loss/object/bbox/task1: 0.7271, loss/object/heatmap/task2: 2.3042, loss/object/bbox/task2: 0.7449, loss/object/heatmap/task3: 1.5119, loss/object/bbox/task3: 0.5835, loss/object/heatmap/task4: 1.9646, loss/object/bbox/task4: 0.6840, loss/object/heatmap/task5: 1.7536, loss/object/bbox/task5: 0.6963, loss: 15.4719, grad_norm: 29.8489
2022-10-17 18:58:16,105 - mmdet3d - INFO - Epoch [1][2150/2575] lr: 2.213e-05, eta: 5 days, 11:05:32, time: 9.275, data_time: 1.364, memory: 24566, loss/object/heatmap/task0: 1.5683, loss/object/bbox/task0: 0.6385, loss/object/heatmap/task1: 2.2457, loss/object/bbox/task1: 0.7245, loss/object/heatmap/task2: 2.2663, loss/object/bbox/task2: 0.7795, loss/object/heatmap/task3: 1.5618, loss/object/bbox/task3: 0.5917, loss/object/heatmap/task4: 1.9616, loss/object/bbox/task4: 0.6787, loss/object/heatmap/task5: 1.7494, loss/object/bbox/task5: 0.6960, loss: 15.4620, grad_norm: 30.1794
2022-10-17 19:06:22,347 - mmdet3d - INFO - Epoch [1][2200/2575] lr: 2.223e-05, eta: 5 days, 11:00:35, time: 9.725, data_time: 1.621, memory: 24566, loss/object/heatmap/task0: 1.5545, loss/object/bbox/task0: 0.6409, loss/object/heatmap/task1: 2.2360, loss/object/bbox/task1: 0.7243, loss/object/heatmap/task2: 2.1954, loss/object/bbox/task2: 0.7479, loss/object/heatmap/task3: 1.5084, loss/object/bbox/task3: 0.5702, loss/object/heatmap/task4: 1.8994, loss/object/bbox/task4: 0.6754, loss/object/heatmap/task5: 1.7521, loss/object/bbox/task5: 0.6913, loss: 15.1957, grad_norm: 29.0443
2022-10-17 19:14:27,081 - mmdet3d - INFO - Epoch [1][2250/2575] lr: 2.233e-05, eta: 5 days, 10:54:57, time: 9.695, data_time: 3.186, memory: 24566, loss/object/heatmap/task0: 1.5628, loss/object/bbox/task0: 0.6432, loss/object/heatmap/task1: 2.2464, loss/object/bbox/task1: 0.7225, loss/object/heatmap/task2: 2.2495, loss/object/bbox/task2: 0.7644, loss/object/heatmap/task3: 1.4747, loss/object/bbox/task3: 0.5784, loss/object/heatmap/task4: 1.9596, loss/object/bbox/task4: 0.6867, loss/object/heatmap/task5: 1.7299, loss/object/bbox/task5: 0.6900, loss: 15.3080, grad_norm: 28.5894
2022-10-17 19:22:21,470 - mmdet3d - INFO - Epoch [1][2300/2575] lr: 2.243e-05, eta: 5 days, 10:45:31, time: 9.488, data_time: 0.240, memory: 24566, loss/object/heatmap/task0: 1.5476, loss/object/bbox/task0: 0.6448, loss/object/heatmap/task1: 2.2374, loss/object/bbox/task1: 0.7172, loss/object/heatmap/task2: 2.2728, loss/object/bbox/task2: 0.7746, loss/object/heatmap/task3: 1.4440, loss/object/bbox/task3: 0.5588, loss/object/heatmap/task4: 1.9005, loss/object/bbox/task4: 0.6773, loss/object/heatmap/task5: 1.7471, loss/object/bbox/task5: 0.6935, loss: 15.2155, grad_norm: 29.9565
2022-10-17 19:30:20,785 - mmdet3d - INFO - Epoch [1][2350/2575] lr: 2.254e-05, eta: 5 days, 10:37:52, time: 9.586, data_time: 0.088, memory: 24566, loss/object/heatmap/task0: 1.5471, loss/object/bbox/task0: 0.6408, loss/object/heatmap/task1: 2.1999, loss/object/bbox/task1: 0.7151, loss/object/heatmap/task2: 2.1772, loss/object/bbox/task2: 0.7400, loss/object/heatmap/task3: 1.4451, loss/object/bbox/task3: 0.5609, loss/object/heatmap/task4: 1.8230, loss/object/bbox/task4: 0.6651, loss/object/heatmap/task5: 1.7286, loss/object/bbox/task5: 0.6908, loss: 14.9337, grad_norm: 28.4202
2022-10-17 19:38:22,441 - mmdet3d - INFO - Epoch [1][2400/2575] lr: 2.265e-05, eta: 5 days, 10:31:00, time: 9.633, data_time: 0.099, memory: 24566, loss/object/heatmap/task0: 1.5580, loss/object/bbox/task0: 0.6445, loss/object/heatmap/task1: 2.2365, loss/object/bbox/task1: 0.7167, loss/object/heatmap/task2: 2.1697, loss/object/bbox/task2: 0.7552, loss/object/heatmap/task3: 1.5406, loss/object/bbox/task3: 0.5801, loss/object/heatmap/task4: 1.8532, loss/object/bbox/task4: 0.6714, loss/object/heatmap/task5: 1.7370, loss/object/bbox/task5: 0.6878, loss: 15.1507, grad_norm: 28.7996
2022-10-17 19:46:05,453 - mmdet3d - INFO - Epoch [1][2450/2575] lr: 2.276e-05, eta: 5 days, 10:17:52, time: 9.260, data_time: 0.083, memory: 24566, loss/object/heatmap/task0: 1.5370, loss/object/bbox/task0: 0.6406, loss/object/heatmap/task1: 2.2416, loss/object/bbox/task1: 0.7221, loss/object/heatmap/task2: 2.0932, loss/object/bbox/task2: 0.7401, loss/object/heatmap/task3: 1.4566, loss/object/bbox/task3: 0.5787, loss/object/heatmap/task4: 1.8866, loss/object/bbox/task4: 0.6784, loss/object/heatmap/task5: 1.7390, loss/object/bbox/task5: 0.6900, loss: 15.0037, grad_norm: 27.7580
```

kentang-mit commented 1 year ago

No, that is not expected. We expect training to finish within 24 hours if you use RTX 3090s. V100s should not be that slow; you may want to check whether your training procedure is bottlenecked by CPU-side code (e.g., data loading).
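One rough way to check this is to time the dataloader by itself and compare the result against the per-iteration `time` reported in the training log. A minimal, generic PyTorch sketch (this is not part of the BEVFusion codebase, and `build_my_dataloader` is only a placeholder for however you construct the training dataloader):

```python
import time

def profile_dataloader(dataloader, num_iters=50):
    # Measures CPU-side loading + augmentation only (no forward/backward pass).
    it = iter(dataloader)
    start = time.perf_counter()
    for _ in range(num_iters):
        next(it)
    elapsed = time.perf_counter() - start
    print(f"data loading: {elapsed / num_iters:.3f} s/iter")

# loader = build_my_dataloader()   # placeholder, depends on your config
# profile_dataloader(loader)
# If this number is close to the total iteration time in the log, the pipeline
# is CPU/disk bound; if it is much smaller, the bottleneck is on the GPU side.
```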

xxxxhh commented 1 year ago

Got it! By the way, what are the training time and memory cost for the LiDAR+camera (ConvFuser) fusion setting given in the config folder?

kentang-mit commented 1 year ago

For the current version of the codebase, I think the memory cost is slightly more than 32 GB. However, if you reduce the batch size to 2, it should fit on a 32 GB GPU, and there will be no performance degradation if it is configured properly. The overall training time on 8×A6000 is around 35 hours (including LiDAR-only pretraining).

xxxxhh commented 1 year ago

Got it. By the way, if I would like to train the fusion method, do I need to train a LiDAR-only model and then train the fusion-based one based on the pretrained LiDAR-only model?

kentang-mit commented 1 year ago

Yes, your understanding is correct. Alternatively (to save training time), you can also load the pretrained LiDAR-only checkpoint provided by us.

IAMShashankk commented 1 year ago

> No it's not quite expected. We expect training to finish within 24 hours if you use RTX3090. V100 should not be that slow, and probably you can check out whether your training procedure is bottlenecked by CPU code.

Hi, I am using 6 RTX A6000 GPUs, 4 CPU cores per GPU, 45 GB of memory per GPU (270 GB in total), and around 500 GB of CPU memory. I can see that my GPUs are utilized at around 80% while the CPUs are only at around 30%. The log shows that training on the nuScenes dataset will take several days. Please see the logs of a few iterations:

```
2022-10-13 21:26:56,025 - mmcv - INFO - Reducer buckets have been rebuilt in this iteration.
2022-10-13 21:32:05,793 - mmdet3d - INFO - Epoch [1][50/3433] lr: 2.000e-05, eta: 5 days, 9:47:59, time: 6.811, data_time: 0.326, memory: 25805, loss/object/heatmap/task0: 143.1401, loss/object/bbox/task0: 1.3793, loss/object/heatmap/task1: 1678.6350, loss/object/bbox/task1: 1.7243, loss/object/heatmap/task2: 4226.2582, loss/object/bbox/task2: 1.7923, loss/object/heatmap/task3: 813.0423, loss/object/bbox/task3: 1.1332, loss/object/heatmap/task4: 4783.0548, loss/object/bbox/task4: 1.1030, loss/object/heatmap/task5: 527.8169, loss/object/bbox/task5: 1.2341, loss: 12180.3138, grad_norm: nan
2022-10-13 21:35:50,785 - mmdet3d - INFO - Epoch [1][100/3433] lr: 2.000e-05, eta: 4 days, 11:41:46, time: 4.499, data_time: 0.047, memory: 25807, loss/object/heatmap/task0: 2.9395, loss/object/bbox/task0: 0.8122, loss/object/heatmap/task1: 9.7462, loss/object/bbox/task1: 0.9775, loss/object/heatmap/task2: 17.4622, loss/object/bbox/task2: 0.9802, loss/object/heatmap/task3: 7.1500, loss/object/bbox/task3: 0.7631, loss/object/heatmap/task4: 15.1377, loss/object/bbox/task4: 0.8237, loss/object/heatmap/task5: 4.8527, loss/object/bbox/task5: 0.8841, loss: 62.5292, grad_norm: 467.7770
...
2022-10-20 12:34:20,618 - mmdet3d - INFO - Epoch [10][3100/3433] lr: 9.392e-05, eta: 6 days, 8:57:24, time: 20.331, data_time: 12.233, memory: 25810, loss/object/heatmap/task0: 0.8217, loss/object/bbox/task0: 0.3159, loss/object/heatmap/task1: 0.6908, loss/object/bbox/task1: 0.3447, loss/object/heatmap/task2: 0.3862, loss/object/bbox/task2: 0.3349, loss/object/heatmap/task3: 0.5433, loss/object/bbox/task3: 0.3280, loss/object/heatmap/task4: 0.3042, loss/object/bbox/task4: 0.3096, loss/object/heatmap/task5: 0.9230, loss/object/bbox/task5: 0.5598, loss: 5.8622, grad_norm: 17.9851
```

After a few epochs, the ETA increases. I am using samples_per_gpu=6 and workers_per_gpu=4.

Could you please advise me on how to improve the training time? You mentioned above that training should complete within 24 hours, but that has never happened for me.

kentang-mit commented 1 year ago

I think the problem might be related to your CPU/disk. In our experiments we use 96-thread AMD CPUs and SSD to store all the data. If you are using slow disks (or even NFS), the data loading time could easily bottleneck the entire pipeline.
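A quick way to sanity-check raw disk throughput is to time how long it takes to read a batch of raw data files. A minimal sketch (not from the repo; the path below assumes the standard nuScenes layout, so adjust it to wherever your data actually lives):

```python
import glob, os, time

data_root = "data/nuscenes/samples/LIDAR_TOP"   # assumed layout, change as needed
files = glob.glob(os.path.join(data_root, "*"))[:200]

start, total_bytes = time.perf_counter(), 0
for path in files:
    with open(path, "rb") as f:
        total_bytes += len(f.read())
elapsed = time.perf_counter() - start
print(f"read {total_bytes / 1e6:.0f} MB in {elapsed:.1f} s "
      f"({total_bytes / 1e6 / elapsed:.0f} MB/s)")
# A local SSD typically reports hundreds of MB/s here; a slow disk or NFS mount
# will be far lower and shows up as large data_time spikes in the training log.
```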

xxxxhh commented 1 year ago


Actually, I ran into the same problem as @IAMShashankk, even though I used 8×A100 GPUs with 96-thread CPUs. In my case (the only modification is downsampling the input images by a factor of 2), the training time is still more than 5 days.

According to the training log, data_time does not seem to be the bottleneck on my machine, so I am not sure why the training time is so long.

```
2022-10-20 12:32:46,368 - mmdet3d - INFO - Epoch [4][50/3862] lr: 3.799e-04, eta: 5 days, 1:28:16, time: 6.161, data_time: 0.315, memory: 10779, loss/object/lossheatmap: 0.8057, loss/object/layer-1_losscls: 0.1367, loss/object/layer-1_loss_bbox: 0.9308, stats/object/matched_ious: 0.4581, loss: 1.8733, grad_norm: 3.6592
2022-10-20 12:38:47,635 - mmdet3d - INFO - Epoch [4][100/3862] lr: 3.820e-04, eta: 5 days, 1:25:19, time: 7.225, data_time: 0.009, memory: 10779, loss/object/lossheatmap: 0.7955, loss/object/layer-1_losscls: 0.1367, loss/object/layer-1_loss_bbox: 0.9237, stats/object/matched_ious: 0.4612, loss: 1.8560, grad_norm: 3.6972
2022-10-20 12:44:08,612 - mmdet3d - INFO - Epoch [4][150/3862] lr: 3.841e-04, eta: 5 days, 1:18:37, time: 6.420, data_time: 0.009, memory: 10779, loss/object/lossheatmap: 0.7862, loss/object/layer-1_losscls: 0.1354, loss/object/layer-1_loss_bbox: 0.9142, stats/object/matched_ious: 0.4624, loss: 1.8358, grad_norm: 3.5528
2022-10-20 12:49:37,344 - mmdet3d - INFO - Epoch [4][200/3862] lr: 3.862e-04, eta: 5 days, 1:12:38, time: 6.575, data_time: 0.009, memory: 10779, loss/object/lossheatmap: 0.7925, loss/object/layer-1_losscls: 0.1353, loss/object/layer-1_loss_bbox: 0.9377, stats/object/matched_ious: 0.4625, loss: 1.8655, grad_norm: 3.3618
2022-10-20 12:55:06,685 - mmdet3d - INFO - Epoch [4][250/3862] lr: 3.884e-04, eta: 5 days, 1:06:42, time: 6.587, data_time: 0.009, memory: 10779, loss/object/lossheatmap: 0.7960, loss/object/layer-1_losscls: 0.1368, loss/object/layer-1_loss_bbox: 0.9186, stats/object/matched_ious: 0.4615, loss: 1.8513, grad_norm: 3.5264
2022-10-20 13:01:21,101 - mmdet3d - INFO - Epoch [4][300/3862] lr: 3.905e-04, eta: 5 days, 1:04:55, time: 7.488, data_time: 0.070, memory: 10779, loss/object/lossheatmap: 0.7945, loss/object/layer-1_losscls: 0.1362, loss/object/layer-1_loss_bbox: 0.9268, stats/object/matched_ious: 0.4619, loss: 1.8576, grad_norm: 3.5538
2022-10-20 13:06:54,299 - mmdet3d - INFO - Epoch [4][350/3862] lr: 3.927e-04, eta: 5 days, 0:59:20, time: 6.664, data_time: 0.009, memory: 10779, loss/object/lossheatmap: 0.8071, loss/object/layer-1_losscls: 0.1375, loss/object/layer-1_loss_bbox: 0.9497, stats/object/matched_ious: 0.4594, loss: 1.8943, grad_norm: 3.4423
2022-10-20 13:12:38,149 - mmdet3d - INFO - Epoch [4][400/3862] lr: 3.948e-04, eta: 5 days, 0:54:43, time: 6.877, data_time: 0.010, memory: 10779, loss/object/lossheatmap: 0.8078, loss/object/layer-1_losscls: 0.1385, loss/object/layer-1_loss_bbox: 0.9136, stats/object/matched_ious: 0.4572, loss: 1.8600, grad_norm: 3.5403
2022-10-20 13:17:56,426 - mmdet3d - INFO - Epoch [4][450/3862] lr: 3.969e-04, eta: 5 days, 0:47:47, time: 6.366, data_time: 0.011, memory: 10779, loss/object/lossheatmap: 0.8005, loss/object/layer-1_losscls: 0.1357, loss/object/layer-1_loss_bbox: 0.9419, stats/object/matched_ious: 0.4541, loss: 1.8781, grad_norm: 3.4593
2022-10-20 13:23:38,857 - mmdet3d - INFO - Epoch [4][500/3862] lr: 3.991e-04, eta: 5 days, 0:43:01, time: 6.849, data_time: 0.009, memory: 10779, loss/object/lossheatmap: 0.8033, loss/object/layer-1_losscls: 0.1362, loss/object/layer-1_loss_bbox: 0.9479, stats/object/matched_ious: 0.4600, loss: 1.8874, grad_norm: 3.4067
2022-10-20 13:29:08,462 - mmdet3d - INFO - Epoch [4][550/3862] lr: 4.013e-04, eta: 5 days, 0:37:07, time: 6.592, data_time: 0.010, memory: 10779, loss/object/lossheatmap: 0.7831, loss/object/layer-1_losscls: 0.1326, loss/object/layer-1_loss_bbox: 0.9276, stats/object/matched_ious: 0.4664, loss: 1.8434, grad_norm: 3.3896
2022-10-20 13:34:23,396 - mmdet3d - INFO - Epoch [4][600/3862] lr: 4.034e-04, eta: 5 days, 0:29:55, time: 6.299, data_time: 0.010, memory: 10779, loss/object/lossheatmap: 0.7967, loss/object/layer-1_losscls: 0.1332, loss/object/layer-1_loss_bbox: 0.9301, stats/object/matched_ious: 0.4662, loss: 1.8601, grad_norm: 3.5238
2022-10-20 13:39:53,440 - mmdet3d - INFO - Epoch [4][650/3862] lr: 4.056e-04, eta: 5 days, 0:24:03, time: 6.601, data_time: 0.010, memory: 10779, loss/object/lossheatmap: 0.7989, loss/object/layer-1_losscls: 0.1332, loss/object/layer-1_loss_bbox: 0.9318, stats/object/matched_ious: 0.4624, loss: 1.8640, grad_norm: 3.5998
2022-10-20 13:45:53,603 - mmdet3d - INFO - Epoch [4][700/3862] lr: 4.078e-04, eta: 5 days, 0:20:51, time: 7.203, data_time: 0.009, memory: 10779, loss/object/lossheatmap: 0.7929, loss/object/layer-1_losscls: 0.1340, loss/object/layer-1_loss_bbox: 0.9187, stats/object/matched_ious: 0.4638, loss: 1.8456, grad_norm: 3.2548
2022-10-20 13:50:39,675 - mmdet3d - INFO - Epoch [4][750/3862] lr: 4.099e-04, eta: 5 days, 0:11:08, time: 5.722, data_time: 0.010, memory: 10779, loss/object/lossheatmap: 0.8011, loss/object/layer-1_losscls: 0.1360, loss/object/layer-1_loss_bbox: 0.9165, stats/object/matched_ious: 0.4625, loss: 1.8535, grad_norm: 3.4077
2022-10-20 13:56:53,391 - mmdet3d - INFO - Epoch [4][800/3862] lr: 4.121e-04, eta: 5 days, 0:09:06, time: 7.474, data_time: 0.010, memory: 10779, loss/object/lossheatmap: 0.7956, loss/object/layer-1_losscls: 0.1332, loss/object/layer-1_loss_bbox: 0.9380, stats/object/matched_ious: 0.4619, loss: 1.8668, grad_norm: 3.3312
2022-10-20 14:02:07,286 - mmdet3d - INFO - Epoch [4][850/3862] lr: 4.143e-04, eta: 5 days, 0:01:51, time: 6.278, data_time: 0.008, memory: 10779, loss/object/lossheatmap: 0.8000, loss/object/layer-1_losscls: 0.1359, loss/object/layer-1_loss_bbox: 0.9103, stats/object/matched_ious: 0.4698, loss: 1.8462, grad_norm: 3.5621
```

kentang-mit commented 1 year ago

I don't think the data_time displayed by mmdetection3d is very accurate. Would you please also try out some other models (e.g., FCOS3D) with the official mmdet3d and check whether the training time is as expected? (I remember it typically takes somewhere between half a day and 16 hours.)

BHC1205 commented 1 year ago

> Got it. By the way, if I would like to train the fusion method, do I need to train a LiDAR-only model and then train the fusion-based one based on the pretrained LiDAR-only model?

Does that mean I need to train a camera-only model and a LiDAR-only model, load the parameters of both branches (frozen or not?), and then train only the fusion module? I wonder if this is the right understanding.

sunnyHelen commented 1 year ago

> For the current version of the codebase I think the memory cost is slightly more than 32G. However if you reduce the batch size to 2 it should fit in a 32G GPU and there will not be performance degradation if configured properly. The overall training time on 8xA6000 is around 35 hours (including LiDAR-only pretraining).

If my GPUs can only fit the model with batch_size 2, how should I change the configuration to maintain the performance?

kentang-mit commented 1 year ago

@BHC1205,

Loading our pretrained LiDAR-only model and the 2D detector pretrained on nuImages will be sufficient.
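For illustration only, this initialization amounts to partially loading checkpoints into the fusion model. The plain-PyTorch sketch below is not the repo's actual loading code (which is driven by the config and command-line options); the checkpoint names and attribute paths are placeholders:

```python
import torch

def load_partial(model, ckpt_path):
    state = torch.load(ckpt_path, map_location="cpu")
    state = state.get("state_dict", state)
    # strict=False keeps whatever keys match and leaves the rest (e.g. the fuser
    # and the camera branch when loading a LiDAR-only checkpoint) at random init.
    missing, unexpected = model.load_state_dict(state, strict=False)
    print(f"{ckpt_path}: {len(missing)} missing, {len(unexpected)} unexpected keys")

# fusion_model = build_fusion_model(cfg)                        # placeholder
# load_partial(fusion_model, "pretrained/lidar-only-det.pth")   # assumed file name
# load_partial(fusion_model.encoders.camera.backbone,           # assumed attribute path
#              "pretrained/swint-nuimages-pretrained.pth")
```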

@sunnyHelen,

Let's set

data:
    samples_per_gpu: 2

Best, Haotian

sunnyHelen commented 1 year ago

Yeah. I mean, how should we adjust the training configuration (e.g., the learning rate) to maintain performance if we set the batch size to 2?

kentang-mit commented 1 year ago

Hi @sunnyHelen,

One way you can try is to add gradient accumulation, which is natively supported by this codebase.

optimizer_config:
  cumulative_iters: 2

It will only work for Swin-T image backbones. You need large batch sizes if BN is present in your image backbone.
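With cumulative_iters: 2, gradients from two consecutive small batches are accumulated before each optimizer step, so the effective batch size is samples_per_gpu × num_gpus × cumulative_iters; halving samples_per_gpu while setting cumulative_iters: 2 therefore keeps the effective batch size, and with it the learning-rate schedule, unchanged. A toy sketch of the idea (the codebase applies this through an optimizer hook; the loop below is only a conceptual illustration, not its implementation):

```python
import torch
from torch import nn

# Toy model and data just to make the loop runnable.
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
batches = [torch.randn(2, 4) for _ in range(8)]   # "small" batches of 2 samples
cumulative_iters = 2

optimizer.zero_grad()
for i, x in enumerate(batches):
    loss = model(x).pow(2).mean() / cumulative_iters   # scale so gradients average
    loss.backward()                                    # gradients accumulate across iters
    if (i + 1) % cumulative_iters == 0:
        optimizer.step()        # one update per `cumulative_iters` small batches
        optimizer.zero_grad()
```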

Best, Haotian

sunnyHelen commented 1 year ago

Got it. Thanks a lot.

kentang-mit commented 1 year ago

No problem. I'll close this issue since it is resolved.