prs-eth / PCAccumulation

[ECCV 2022] Dynamic 3D Scene Analysis by Point Cloud Accumulation
https://shengyuh.github.io/pcaccumulation/index.html
MIT License
120 stars 10 forks

Experiment settings for model training. #5

Closed supersyq closed 1 year ago

supersyq commented 1 year ago

Hi, thanks for sharing this wonderful project. I am currently training the network by running

python main.py configs/waymo/waymo.yaml 4 1 --misc.mode=train --path.dataset_base_local=$YOUR_DATASET_FOLDER

After two epochs of training, the following errors occur:

CUDA out of memory. Tried to allocate 406.00 MiB (GPU 0; 23.65 GiB total capacity; 21.61 GiB already allocated; 356.31 MiB free; 22.04 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
10%| 283/2859 [07:52<1:12:56, 1.70s/it]
CUDA out of memory. Tried to allocate 406.00 MiB (GPU 0; 23.65 GiB total capacity; 21.92 GiB already allocated; 96.31 MiB free; 22.29 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
10%| 284/2859 [07:53<1:01:19, 1.43s/it]
CUDA out of memory. Tried to allocate 406.00 MiB (GPU 0; 23.65 GiB total capacity; 21.59 GiB already allocated; 346.31 MiB free; 22.05 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
10%| 284/2859 [07:54<1:11:43, 1.67s/it]
Traceback (most recent call last):
  File "main.py", line 79, in <module>
    trainer.train()
  File "/test/Flow-experiments/PCAccumulation-main/libs/trainer.py", line 258, in train
    self.inference_one_epoch(epoch, 'train')
  File "/test/Flow-experiments/PCAccumulation-main/libs/trainer.py", line 243, in inference_one_epoch
    self.update_tensorboard(stats_meter, curr_iter, phase)
  File "/test/Flow-experiments/PCAccumulation-main/libs/trainer.py", line 119, in update_tensorboard
    stats, message = compute_mean_iou_recall_precision(stats_meter['mos_metric'], self.mos_mapping)
TypeError: 'NoneType' object is not subscriptable

It seems that all iterations are skipped because of the CUDA out-of-memory errors. I also set the batch size to 2, but the same problem occurs. Do you have any suggestions for solving this? (The training runs on Python 3.8.8, PyTorch 1.12.0+cu116, and an NVIDIA TITAN RTX GPU.) Besides, when I increased the hyper-parameter iter_size from 1 to 2, the problem seemed to be solved. But I am worried that this could negatively affect model training, so that I could not reproduce the experimental results in the paper. Would you share more details about the parameter settings for model training?
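As an aside, the error message itself points at allocator fragmentation and suggests setting max_split_size_mb via the PYTORCH_CUDA_ALLOC_CONF environment variable. A minimal sketch of trying that (128 MiB is an arbitrary example value, not a setting recommended by the authors):

```shell
# Cap the size of cached allocator blocks to reduce fragmentation,
# as suggested by the PyTorch OOM message. Tune the value as needed.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
echo "$PYTORCH_CUDA_ALLOC_CONF"
```

Training would then be launched from the same shell so the variable is inherited by the Python process.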

ShengyuH commented 1 year ago

hi, I actually use the same GPU card and didn't experience any issues. I am not sure it's really a CUDA out-of-memory problem; I would remove the try ... except and see what the actual error is. If it really is a CUDA memory problem, then you can decrease the batch_size and proportionally increase iter_size such that batch_size times iter_size stays the same. This guarantees that for each optimisation step, the gradients are computed over the same number of samples.
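The batch_size/iter_size trade-off described above is standard gradient accumulation. A minimal PyTorch sketch (the toy model, data, and train_step helper are illustrative, not part of this repository): losses from iter_size smaller batches are scaled and back-propagated before a single optimiser step, so each step still averages gradients over batch_size times iter_size samples.

```python
import torch
import torch.nn as nn

def train_step(model, optimizer, batches, iter_size):
    """Accumulate gradients over `iter_size` mini-batches, then step once."""
    optimizer.zero_grad()
    total_loss = 0.0
    for inputs, targets in batches[:iter_size]:
        loss = nn.functional.mse_loss(model(inputs), targets)
        # Scale so the accumulated gradient matches the gradient of one
        # big (batch_size * iter_size)-sample batch.
        (loss / iter_size).backward()
        total_loss += loss.item()
    optimizer.step()  # one optimisation step for all accumulated batches
    return total_loss / iter_size

# Toy usage: two mini-batches of 2 samples behave like one batch of 4.
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
batches = [(torch.randn(2, 4), torch.randn(2, 1)) for _ in range(2)]
loss = train_step(model, optimizer, batches, iter_size=2)
```

Memory usage per backward pass is set by the mini-batch size alone, which is why halving batch_size while doubling iter_size avoids the OOM without changing the effective samples per update.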

ShengyuH commented 1 year ago

Close due to inactivity.