Closed · divided7 closed this issue 3 months ago
I set bs=4 but still get OOM on a 4090Ti
```python
config = edict()
config.margin_list = (1.0, 0.0, 0.4)
config.network = "r100"
config.resume = False
config.output = None
config.embedding_size = 512
config.sample_rate = 1.0
config.interclass_filtering_threshold = 0
config.fp16 = True
config.weight_decay = 5e-4
config.batch_size = 4
config.optimizer = "sgd"
config.ngpus = 1
config.lr = (0.1 * config.batch_size * config.ngpus) / 1024
config.verbose = 2000
config.dali = False
config.rec = "../webface260m/webface42m_arcface"
config.num_classes = 2059906
config.num_image = 42474558
config.num_epoch = 20
config.warmup_epoch = 0
config.val_targets = []
config.eta_scale = 0.1
config.eta_t = 0.1
config.eta_theta = 0.1
config.ratio = 0.75
```
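As a sanity check (my own, not part of the original report), the linear lr-scaling formula in the config evaluates to the value that appears later in the training log:

```python
# Linear lr scaling as written in the config: lr = (0.1 * batch_size * ngpus) / 1024
batch_size = 4
ngpus = 1
lr = (0.1 * batch_size * ngpus) / 1024
print(lr)  # → 0.000390625, matching the "lr 0.000390625" line in the log below
```

So the tiny learning rate in the log is expected with bs=4 on a single GPU, not a bug in itself.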
I use the config above and run `python train_v2.py configs/wf42m_r100`. The terminal output:
```
$ python train_v2.py configs/wf42m_r100
Detected local_rank: 0
Training: 2024-07-29 14:37:26,808-rank_id: 0
/home/luyuxi/miniconda3/envs/face/lib/python3.8/site-packages/torch/nn/parallel/distributed.py:1772: UserWarning: You passed find_unused_parameters=true to DistributedDataParallel, `_set_static_graph` will detect unused parameters automatically, so you do not need to set find_unused_parameters=true, just be sure these unused parameters will not change during training loop while calling `_set_static_graph`.
  warnings.warn(
Training: 2024-07-29 14:37:46,623-: margin_list                    (1.0, 0.0, 0.4)
Training: 2024-07-29 14:37:46,623-: network                        r100
Training: 2024-07-29 14:37:46,623-: resume                         False
Training: 2024-07-29 14:37:46,623-: save_all_states                False
Training: 2024-07-29 14:37:46,623-: output                         work_dirs/wf42m_r100
Training: 2024-07-29 14:37:46,623-: embedding_size                 512
Training: 2024-07-29 14:37:46,623-: sample_rate                    1.0
Training: 2024-07-29 14:37:46,623-: interclass_filtering_threshold 0
Training: 2024-07-29 14:37:46,623-: fp16                           True
Training: 2024-07-29 14:37:46,623-: batch_size                     4
Training: 2024-07-29 14:37:46,623-: optimizer                      sgd
Training: 2024-07-29 14:37:46,623-: lr                             0.000390625
Training: 2024-07-29 14:37:46,623-: momentum                       0.9
Training: 2024-07-29 14:37:46,623-: weight_decay                   0.0005
Training: 2024-07-29 14:37:46,623-: verbose                        2000
Training: 2024-07-29 14:37:46,623-: frequent                       10
Training: 2024-07-29 14:37:46,623-: dali                           False
Training: 2024-07-29 14:37:46,623-: dali_aug                       False
Training: 2024-07-29 14:37:46,623-: gradient_acc                   1
Training: 2024-07-29 14:37:46,623-: seed                           2048
Training: 2024-07-29 14:37:46,623-: num_workers                    2
Training: 2024-07-29 14:37:46,623-: wandb_key                      XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Training: 2024-07-29 14:37:46,623-: suffix_run_name                None
Training: 2024-07-29 14:37:46,623-: using_wandb                    False
Training: 2024-07-29 14:37:46,623-: wandb_entity                   entity
Training: 2024-07-29 14:37:46,623-: wandb_project                  project
Training: 2024-07-29 14:37:46,623-: wandb_log_all                  True
Training: 2024-07-29 14:37:46,623-: save_artifacts                 False
Training: 2024-07-29 14:37:46,623-: wandb_resume                   False
Training: 2024-07-29 14:37:46,623-: ngpus                          1
Training: 2024-07-29 14:37:46,623-: rec                            ../webface260m/webface42m_arcface
Training: 2024-07-29 14:37:46,623-: num_classes                    2059906
Training: 2024-07-29 14:37:46,623-: num_image                      42474558
Training: 2024-07-29 14:37:46,623-: num_epoch                      20
Training: 2024-07-29 14:37:46,623-: warmup_epoch                   0
Training: 2024-07-29 14:37:46,623-: val_targets                    []
Training: 2024-07-29 14:37:46,623-: eta_scale                      0.1
Training: 2024-07-29 14:37:46,623-: eta_t                          0.1
Training: 2024-07-29 14:37:46,623-: eta_theta                      0.1
Training: 2024-07-29 14:37:46,623-: ratio                          0.75
Training: 2024-07-29 14:37:46,623-: total_batch_size               4
Training: 2024-07-29 14:37:46,623-: warmup_step                    0
Training: 2024-07-29 14:37:46,623-: total_step                     212372780
/home/luyuxi/miniconda3/envs/face/lib/python3.8/site-packages/torch/nn/functional.py:4289: UserWarning: Default grid_sample and affine_grid behavior has changed to align_corners=False since 1.3.0. Please specify align_corners=True if the old behavior is desired. See the documentation of grid_sample for details.
  warnings.warn(
/home/luyuxi/miniconda3/envs/face/lib/python3.8/site-packages/torch/nn/functional.py:4227: UserWarning: Default grid_sample and affine_grid behavior has changed to align_corners=False since 1.3.0. Please specify align_corners=True if the old behavior is desired. See the documentation of grid_sample for details.
  warnings.warn(
Traceback (most recent call last):
  File "train_v2.py", line 271, in <module>
    main(parser.parse_args())
  File "train_v2.py", line 201, in main
    amp.scale(loss).backward()
  File "/home/luyuxi/miniconda3/envs/face/lib/python3.8/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/home/luyuxi/miniconda3/envs/face/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.93 GiB (GPU 0; 23.62 GiB total capacity; 20.46 GiB already allocated; 801.38 MiB free; 20.71 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
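A back-of-the-envelope estimate (my own reading of the failure, not something confirmed in this thread): with `num_classes = 2059906` and `embedding_size = 512`, the classification head's class-center matrix is so large that batch size barely matters. Notably, one fp32 buffer of that matrix's shape is exactly the 3.93 GiB the traceback says it tried to allocate:

```python
# Estimate the memory of the ArcFace classification head alone.
# Assumption (mine): the head keeps an fp32 weight matrix W of shape
# (num_classes, embedding_size), plus a gradient buffer and an SGD momentum
# buffer of the same shape.
num_classes = 2_059_906
embedding_size = 512
bytes_per_fp32 = 4

weight_gb = num_classes * embedding_size * bytes_per_fp32 / 1024**3
grad_gb = weight_gb        # gradient buffer allocated during backward()
momentum_gb = weight_gb    # SGD momentum buffer (momentum = 0.9 in the log)
total_gb = weight_gb + grad_gb + momentum_gb

print(f"one buffer: {weight_gb:.2f} GiB")   # → one buffer: 3.93 GiB
print(f"head total: {total_gb:.2f} GiB")    # → head total: 11.79 GiB
```

If this reading is right, shrinking `batch_size` cannot fix the OOM; `config.sample_rate < 1.0` (the partial-FC sampling knob already present in the config above) is presumably the intended way to cut the head's memory on a single 24 GiB GPU.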
Okay, the batch size is correct~