nv-tlabs / XCube

[CVPR 2024 Highlight] XCube: Large-Scale 3D Generative Modeling using Sparse Voxel Hierarchies
https://research.nvidia.com/labs/toronto-ai/xcube/

GPU memory cost #19

Open VLadImirluren opened 4 months ago

VLadImirluren commented 4 months ago

Could you share the GPU memory cost for the different stages and datasets?

I ran into a problem: training the fine VAE on an 80 GB GPU with batch_size=1 OOMs at epoch 0.
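
In case it is useful while waiting for official numbers, the peak allocation of each stage can be logged per batch with a small callback. This is a sketch assuming the pytorch_lightning (1.8+) Callback API; `PeakMemoryLogger` and the metric name are illustrative, not part of XCube:

```python
import pytorch_lightning as pl
import torch


class PeakMemoryLogger(pl.Callback):
    """Log the peak CUDA allocation (in GB) of every training batch."""

    def on_train_batch_start(self, trainer, pl_module, batch, batch_idx):
        torch.cuda.reset_peak_memory_stats()  # start a fresh measurement window

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
        pl_module.log("peak_mem_gb", peak_gb, prog_bar=True)


# usage: pl.Trainer(callbacks=[PeakMemoryLogger(), ...])
```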

VLadImirluren commented 4 months ago

I ran into this problem:

(screenshot of the error message)

I know you have written code to skip OOM batches:

(screenshot of the OOM-skip code)

BUT it only prints a warning: nothing errors out, nothing fails or terminates, the GPU memory is never released, and training does not move forward...

How should I deal with this problem? (Other than handling it by hand; and since I am reproducing the results reported in the paper, loading your checkpoint is not a solution either.)
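
For what it's worth, an OOM-skip guard only helps if the partially built graph and the cached allocator blocks are actually released before the next batch. Below is a rough sketch of that pattern in plain PyTorch; the function and variable names are illustrative, this is not the repository's actual training step:

```python
import gc
import torch


def run_step_with_oom_guard(model, batch, optimizer):
    """One optimization step; on CUDA OOM, release the partial graph instead of hanging."""
    try:
        loss = model(batch)                    # stand-in for the real forward + loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        return loss.detach()
    except RuntimeError as e:
        if "out of memory" not in str(e):
            raise                              # not an OOM: don't swallow it
        print("WARNING: CUDA OOM, skipping this batch")
        optimizer.zero_grad(set_to_none=True)  # drop any gradients already produced
        del batch
        gc.collect()                           # free Python references to the partial graph
        torch.cuda.empty_cache()               # hand cached blocks back so the next batch can fit
        return None
```

Note that this only covers a clean `out of memory` error; once CUDA reports an illegal memory access, the context is corrupted and the process generally has to be killed and restarted, which matches the hang described above.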

VLadImirluren commented 4 months ago

Fine-tuning from your checkpoint still fails with a RuntimeError:

```
nohup: ignoring input
2024-07-27 14:39:26.122 | INFO | __main__:<module>:171 - This is train_auto.py! Please note that you should use 300 instead of 300.0 for resuming.
git root error: Cmd('git') failed due to: exit code(128)
  cmdline: git rev-parse --show-toplevel
  stderr: 'fatal: detected dubious ownership in repository at '/mnt/pfs/users/dengken/code/XCube'
To add an exception for this directory, call:

        git config --global --add safe.directory /mnt/pfs/users/dengken/code/XCube'
```

```
wandb: Currently logged in as: 13532152291 (13532152291-sun-yat-sen-university). Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.17.3
wandb: Run data is saved locally in ../wandb/wandb/run-20240727_143927-afca2fj3
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run chair_VAE_sparse/512_to_128-kld-1.0
wandb: ⭐️ View project at https://wandb.ai/13532152291-sun-yat-sen-university/xcube-shapenet
wandb: 🚀 View run at https://wandb.ai/13532152291-sun-yat-sen-university/xcube-shapenet/runs/afca2fj3
[rank: 0] Global seed set to 0
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
2024-07-27 14:39:42.125 | INFO | xcube.modules.autoencoding.sunet:__init__:241 - latent dim: 8
[rank: 0] Global seed set to 0
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
dk-process-data-master-0:84258:84258 [0] NCCL INFO Bootstrap : Using eth0:172.16.28.236<0>
dk-process-data-master-0:84258:84258 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
dk-process-data-master-0:84258:84258 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
dk-process-data-master-0:84258:84258 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
dk-process-data-master-0:84258:84258 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
dk-process-data-master-0:84258:84258 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.19.3+cuda12.3
Restoring states from the checkpoint path at /mnt/pfs/users/dengken/code/XCube/checkpoints/chair_download/fine_vae/last.ckpt
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:1452: UserWarning: Be aware that when using `ckpt_path`, callbacks used to create the checkpoint need to be provided during `Trainer` instantiation. Please add the following callbacks: ["ModelCheckpoint{'monitor': 'val_step', 'mode': 'max', 'every_n_train_steps': 5000, 'every_n_epochs': 0, 'train_time_interval': None}"].
  rank_zero_warn(
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name    | Type                | Params
-------------------------------------------------
0 | encoder | Encoder             | 1.2 K
1 | unet    | StructPredictionNet | 3.8 M
2 | loss    | Loss                | 0
-------------------------------------------------
3.8 M     Trainable params
0         Non-trainable params
3.8 M     Total params
15.203    Total estimated model params size (MB)
Restored all states from the checkpoint file at /mnt/pfs/users/dengken/code/XCube/checkpoints/chair_download/fine_vae/last.ckpt
```

```
======= MODEL HYPER-PARAMETERS ======= <<<<
exec: null
include: null
test_set_shuffle: false
batch_size: 1
accumulate_grad_batches: 32
visualize: false
name: shapenet/chair_VAE_sparse
model: autoencoder
tree_depth: 3
voxel_size:
- 0.0025
- 0.0025
- 0.0025
resolution: 512
use_fvdb_loader: true
use_hash_tree: true
use_input_normal: true
use_input_semantic: false
use_input_intensity: false
cut_ratio: 16
kl_weight: 1.0
normalize_kld: true
enable_anneal: false
kl_weight_min: 1.0e-07
kl_weight_max: 1.0
anneal_star_iter: 0
anneal_end_iter: 70000
supervision:
  structure_weight: 20.0
  normal_weight: 300.0
  color_weight: 0.0
  semantic_weight: 0.0
optimizer: Adam
learning_rate:
  init: 0.0001
  decay_mult: 0.7
  decay_step: 50000
  clip: 1.0e-06
weight_decay: 0.0
grad_clip: 0.5
network:
  encoder:
    c_dim: 32
  unet:
    target: StructPredictionNet
    params:
      in_channels: 32
      num_blocks: 3
      f_maps: 32
      neck_dense_type: UNCHANGED
      neck_bound:
      - 64
      - 64
      - 64
      num_res_blocks: 1
      use_residual: false
      order: gcr
      is_add_dec: false
      use_attention: false
      use_checkpoint: false
_shapenet_path: ../data/shapenet/
_shapenet_categories:
- '03001627'
_shapenet_custom_name: shapenet
train_dataset: ShapeNetDataset
train_val_num_workers: 0
train_kwargs:
  onet_base_path: ../data/shapenet/
  resolution: 512
  categories:
  - '03001627'
  custom_name: shapenet
  split: train
  random_seed: 0
val_dataset: ShapeNetDataset
val_kwargs:
  onet_base_path: ../data/shapenet/
  resolution: 512
  categories:
  - '03001627'
  custom_name: shapenet
  split: val
  random_seed: fixed
test_dataset: ShapeNetDataset
test_num_workers: 0
test_kwargs:
  onet_base_path: ../data/shapenet/
  resolution: 512
  categories:
  - '03001627'
  custom_name: shapenet
  split: test
  random_seed: fixed
remain_h: false
pretrained_weight: null
use_input_color: false
with_color_branch: false
with_normal_branch: true
with_semantic_branch: false
====================================== <<<<
```

```
Sanity Checking: 0it [00:00, ?it/s]
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument (try 128 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
dk-process-data-master-0:84258:86092 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
dk-process-data-master-0:84258:86092 [0] NCCL INFO P2P plugin IBext
dk-process-data-master-0:84258:86092 [0] NCCL INFO NET/IB : No device found.
dk-process-data-master-0:84258:86092 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
dk-process-data-master-0:84258:86092 [0] NCCL INFO NET/IB : No device found.
dk-process-data-master-0:84258:86092 [0] NCCL INFO NET/Socket : Using [0]eth0:172.16.28.236<0>
dk-process-data-master-0:84258:86092 [0] NCCL INFO Using non-device net plugin version 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Using network Socket
dk-process-data-master-0:84258:86092 [0] NCCL INFO comm 0x561eb43b3e00 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId ad000 commId 0x68b3dc29606196e0 - Init START
dk-process-data-master-0:84258:86092 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
dk-process-data-master-0:84258:86092 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff,00000000
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 00/32 : 0
[... NCCL INFO Channel 01/32 through 31/32 : 0, repeated for all 32 channels ...]
dk-process-data-master-0:84258:86092 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [... repeated for channels 1-31 ...]
dk-process-data-master-0:84258:86092 [0] NCCL INFO P2P Chunksize set to 131072
dk-process-data-master-0:84258:86092 [0] NCCL INFO Connected all rings
dk-process-data-master-0:84258:86092 [0] NCCL INFO Connected all trees
dk-process-data-master-0:84258:86092 [0] NCCL INFO 32 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
dk-process-data-master-0:84258:86092 [0] NCCL INFO comm 0x561eb43b3e00 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId ad000 commId 0x68b3dc29606196e0 - Init COMPLETE

Sanity Checking DataLoader 0:   0%|          | 0/2 [00:00<?, ?it/s]
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/utilities/data.py:84: UserWarning: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 1016724. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.
  warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:232: UserWarning: You called `self.log('val_step', ...)` in your `validation_step` but the value needs to be floating point. Converting it to torch.float32.
  warning_cache.warn(
Sanity Checking DataLoader 0:  50%|█████     | 1/2 [00:02<00:02, 2.27s/it]
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/utilities/data.py:84: UserWarning: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 507080. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.
  warning_cache.warn(
Sanity Checking DataLoader 0: 100%|██████████| 2/2 [00:02<00:00, 1.32s/it]
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use `self.log('val_metric/struct-acc-2', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
  warning_cache.warn(
[... the same sync_dist PossibleUserWarning repeated for val_metric/struct-acc-{1,0}, val_loss/struct-{2,1,0}, val_loss/normal, val_loss/kld, val_loss/mu-0, val_loss/logvar-0, val_loss/kld-true-0, val_loss/kld-total-0, val_loss and val_step ...]
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument (try 128 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
```

```
Training: 594it [00:00, ?it/s]
Epoch 100:   0%|          | 0/6271 [00:00<?, ?it/s]
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:232: UserWarning: You called `self.log('val_step', ...)` in your `training_step` but the value needs to be floating point. Converting it to torch.float32.
  warning_cache.warn(
Epoch 100:   0%|          | 1/6271 [00:01<3:04:54, 1.77s/it, loss=18.8, v_num=2fj3]
[... per-step progress lines for steps 2-30 elided ...]
Epoch 100:   0%|          | 31/6271 [00:29<1:37:20, 1.07it/s, loss=22.8, v_num=2fj3]
[rank0]:[W reducer.cpp:1360] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
Traceback (most recent call last):
  File "/mnt/pfs/users/dengken/code/XCube/train.py", line 407, in <module>
    trainer.fit(net_model, ckpt_path=last_ckpt_path)
  File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
    results = self._run_stage()
  File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1191, in _run_stage
    self._run_train()
  File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1214, in _run_train
    self.fit_loop.run()
  File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 213, in advance
    batch_output = self.batch_loop.run(kwargs)
  File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
    outputs = self.optimizer_loop.run(optimizers, kwargs)
  File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 202, in advance
    result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
  File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 249, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
  File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 370, in _optimizer_step
    self.trainer._call_lightning_module_hook(
  File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1356, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/core/module.py", line 1742, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 169, in step
    step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
  File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 280, in optimizer_step
    optimizer_output = super().optimizer_step(optimizer, opt_idx, closure, model, **kwargs)
  File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 234, in optimizer_step
    return self.precision_plugin.optimizer_step(
  File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 119, in optimizer_step
    return optimizer.step(closure=closure, **kwargs)
  File "/root/miniconda3/lib/python3.9/site-packages/torch/optim/lr_scheduler.py", line 75, in wrapper
    return wrapped(*args, **kwargs)
  File "/root/miniconda3/lib/python3.9/site-packages/torch/optim/optimizer.py", line 385, in wrapper
    out = func(*args, **kwargs)
  File "/root/miniconda3/lib/python3.9/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/root/miniconda3/lib/python3.9/site-packages/torch/optim/adamw.py", line 187, in step
    adamw(
  File "/root/miniconda3/lib/python3.9/site-packages/torch/optim/adamw.py", line 339, in adamw
    func(
  File "/root/miniconda3/lib/python3.9/site-packages/torch/optim/adamw.py", line 549, in _multi_tensor_adamw
    torch._foreach_lerp_(device_exp_avgs, device_grads, 1 - beta1)
RuntimeError: The size of tensor a (32) must match the size of tensor b (36) at non-singleton dimension 1
Training Finished. Best path = ../wandb/xcube-shapenet/afca2fj3/checkpoints/epoch=000100-step=000029700.ckpt
wandb: 🚀 View run chair_VAE_sparse/512_to_128-kld-1.0 at: https://wandb.ai/13532152291-sun-yat-sen-university/xcube-shapenet/runs/afca2fj3
wandb: ⭐️ View project at: https://wandb.ai/13532152291-sun-yat-sen-university/xcube-shapenet
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ../wandb/wandb/run-20240727_143927-afca2fj3/logs
wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with `wandb.require("core")`! See https://wandb.me/wandb-core for more information.
Exception ignored in: <function tqdm.__del__ at 0x7f5980fb2ca0>
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/tqdm/std.py", line 1152, in __del__
  File "/root/miniconda3/lib/python3.9/site-packages/tqdm/std.py", line 1306, in close
  File "/root/miniconda3/lib/python3.9/site-packages/tqdm/std.py", line 1499, in display
  File "/root/miniconda3/lib/python3.9/site-packages/tqdm/std.py", line 1155, in __str__
  File "/root/miniconda3/lib/python3.9/site-packages/tqdm/std.py", line 1457, in format_dict
TypeError: cannot unpack non-iterable NoneType object
dk-process-data-master-0:84258:86123 [0] NCCL INFO [Service thread] Connection closed by localRank 0
dk-process-data-master-0:84258:84258 [0] NCCL INFO comm 0x561eb43b3e00 rank 0 nranks 1 cudaDev 0 busId ad000 - Abort COMPLETE
```
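
For context (not an official fix): a shape mismatch inside AdamW's foreach update, `tensor a (32)` vs `tensor b (36)`, typically means the optimizer state restored via `ckpt_path` was created for parameters with different shapes than the current model (for example, a config difference that changes a layer's channel count), so the restored `exp_avg` buffers no longer line up with the gradients. A hedged workaround is to load only the model weights and let Lightning build fresh optimizer state; the helper below assumes a standard Lightning checkpoint layout and is not XCube's own API:

```python
import torch


def load_weights_only(net_model, ckpt_path: str):
    """Load only the model weights from a Lightning checkpoint, discarding optimizer state."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    missing, unexpected = net_model.load_state_dict(ckpt["state_dict"], strict=False)
    print("missing keys:", missing)
    print("unexpected keys:", unexpected)
    return net_model


# One would then call trainer.fit(net_model) WITHOUT ckpt_path, so Lightning creates
# fresh optimizer/scheduler state instead of restoring exp_avg tensors whose shapes
# no longer match the current parameters.
```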

xrenaa commented 3 months ago

Hi, could you watch your GPU memory usage with `watch nvidia-smi`? Note that the first error is an illegal memory access, while the second error is `The size of tensor a (32) must match the size of tensor b (36) at non-singleton dimension 1`. By the way, I am also using an 80 GB A100. If it truly is an OOM issue, I would suggest removing the sample that triggers the OOM.
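
If you do end up dropping the offending samples, one lightweight option is to wrap the dataset and substitute a known-small item whenever a sample exceeds a size budget. This is a sketch only; the `points` field name and the threshold are placeholders, not XCube's actual dataset API:

```python
from torch.utils.data import Dataset


class SizeFilteredDataset(Dataset):
    """Wrap a dataset and swap over-budget samples for a known-small fallback item."""

    def __init__(self, base: Dataset, max_points: int = 1_000_000, fallback_idx: int = 0):
        self.base = base
        self.max_points = max_points
        self.fallback_idx = fallback_idx

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        sample = self.base[idx]
        # 'points' is a placeholder key; use whatever field actually carries the point/voxel payload.
        if sample["points"].shape[0] > self.max_points:
            return self.base[self.fallback_idx]  # avoid the OOM-triggering sample
        return sample
```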

VLadImirluren commented 3 months ago

> Hi, could you watch your GPU memory usage with `watch nvidia-smi`? Note that the first error is an illegal memory access, while the second error is `The size of tensor a (32) must match the size of tensor b (36) at non-singleton dimension 1`. By the way, I am also using an 80 GB A100. If it truly is an OOM issue, I would suggest removing the sample that triggers the OOM.

I always monitor it with `watch -n 0.1 nvidia-smi`.

For the first error: I have hit it many times. After it occurs, the program prints `[rank0]:[W CUDAGuardImpl.h:115] Warning: CUDA warning: an illegal memory access was encountered (function destroyEvent)` followed by `Aborted (core dumped)`, `watch -n 0.1 nvidia-smi` freezes (screenshot), and the GPU memory is not released, so I have to run `pkill -f python` to free it.

For the second problem, I don't know the reason, because I just load the checkpoint and fine-tune it (no code changed at all). The checkpoint is the one you provide, so I have no way to verify whether it has any problem for fine-tuning.

I am not sure it is an OOM problem, because `[rank0]:[W CUDAGuardImpl.h:115] Warning: CUDA warning: an illegal memory access was encountered (function destroyEvent)` gives no information about it, and there is not even a log file. I believe you have met this problem before; could you please give me some guidance or pointers?
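
One general CUDA debugging step that may help here (not specific to XCube): with asynchronous kernel launches, an illegal memory access is reported at a later, unrelated call such as `destroyEvent`, so the message carries no useful location. Making launches synchronous usually turns it into a stack trace that points at the failing op, at the cost of slower training:

```python
# Equivalent to launching the script as: CUDA_LAUNCH_BLOCKING=1 python train.py ...
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before the first CUDA call

import torch  # imported afterwards so the CUDA runtime sees the setting
```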

For now, I will try the solution you suggested of removing the sample that triggers the OOM.

Thanks for your reply!

VLadImirluren commented 3 months ago

> Hi, could you watch your GPU memory usage with `watch nvidia-smi`? Note that the first error is an illegal memory access, while the second error is `The size of tensor a (32) must match the size of tensor b (36) at non-singleton dimension 1`. By the way, I am also using an 80 GB A100. If it truly is an OOM issue, I would suggest removing the sample that triggers the OOM.

The link to the ShapeNet dataset is empty. Please at least release the data list...

xrenaa commented 3 months ago

Could you try https://drive.google.com/file/d/1PQmSomS1B7UR7wNuqp5RtgkdXo7stKzG/view?usp=sharing?

VLadImirluren commented 3 months ago

> Hi, could you watch your GPU memory usage with `watch nvidia-smi`? Note that the first error is an illegal memory access, while the second error is `The size of tensor a (32) must match the size of tensor b (36) at non-singleton dimension 1`. By the way, I am also using an 80 GB A100. If it truly is an OOM issue, I would suggest removing the sample that triggers the OOM.

I just tried your suggestion, and it does not solve the problem: the issue is not tied to any single sample, it is about the whole effective batch (batch_size * gradient_accumulation).

VLadImirluren commented 3 months ago

> Could you try https://drive.google.com/file/d/1PQmSomS1B7UR7wNuqp5RtgkdXo7stKzG/view?usp=sharing?

I sent an access request; please check it. Thanks!

LeoDarcy commented 1 month ago

> > Hi, could you watch your GPU memory usage with `watch nvidia-smi`? Note that the first error is an illegal memory access, while the second error is `The size of tensor a (32) must match the size of tensor b (36) at non-singleton dimension 1`. By the way, I am also using an 80 GB A100. If it truly is an OOM issue, I would suggest removing the sample that triggers the OOM.
>
> I just tried your suggestion, and it does not solve the problem: the issue is not tied to any single sample, it is about the whole effective batch (batch_size * gradient_accumulation).

Hi, I have the same problem. Have you solved it? I have tried reducing the batch size and removing some samples, but it doesn't work; it occurs in the second epoch.

VLadImirluren commented 1 month ago

> > > Hi, could you watch your GPU memory usage with `watch nvidia-smi`? Note that the first error is an illegal memory access, while the second error is `The size of tensor a (32) must match the size of tensor b (36) at non-singleton dimension 1`. By the way, I am also using an 80 GB A100. If it truly is an OOM issue, I would suggest removing the sample that triggers the OOM.
> >
> > I just tried your suggestion, and it does not solve the problem: the issue is not tied to any single sample, it is about the whole effective batch (batch_size * gradient_accumulation).
>
> Hi, I have the same problem. Have you solved it? I have tried reducing the batch size and removing some samples, but it doesn't work; it occurs in the second epoch.

NO!