VLadImirluren opened this issue 4 months ago (status: Open)
I ran into the following problem: I know you have written code to skip OOM batches, but it only prints a warning (no error, no failure, no termination), the GPU memory is never released, and training does not move forward.
How should I deal with this? (Removing samples by hand is not an option, and since I am reproducing the results reported in the paper, simply loading your checkpoint is not a solution either.)
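For context, the OOM-skipping pattern I mean usually looks like the sketch below. This is my own minimal reconstruction, not the actual XCube code; `compute_loss` is a placeholder. The point is that a warning alone does not release memory unless the partial gradients and cached blocks are also dropped:

```python
import torch
import pytorch_lightning as pl

class OOMSkippingModule(pl.LightningModule):
    """Illustrative sketch only; not the real XCube training module."""

    def training_step(self, batch, batch_idx):
        try:
            # Placeholder for the real forward pass + loss computation.
            return self.compute_loss(batch)
        except torch.cuda.OutOfMemoryError:
            # Printing a warning keeps the half-built autograd graph and the
            # allocator's cached blocks alive; they must be dropped explicitly.
            self.print(f"Skipping batch {batch_idx} due to CUDA OOM")
            self.zero_grad(set_to_none=True)   # drop partial gradients
            torch.cuda.empty_cache()           # return cached blocks to the driver
            return None                        # Lightning then skips this optimizer step
```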
Fine-tuning from your checkpoint still hits a RuntimeError. Full log:
nohup: ignoring input
2024-07-27 14:39:26.122 | INFO | main:
git root error: Cmd('git') failed due to: exit code(128)
  cmdline: git rev-parse --show-toplevel
  stderr: 'fatal: detected dubious ownership in repository at '/mnt/pfs/users/dengken/code/XCube'
To add an exception for this directory, call:
    git config --global --add safe.directory /mnt/pfs/users/dengken/code/XCube'
wandb: Use `wandb login --relogin` to force relogin
wandb: - Waiting for wandb.init()...
wandb: \ Waiting for wandb.init()...
wandb: Tracking run with wandb version 0.17.3
wandb: Run data is saved locally in ../wandb/wandb/run-20240727_143927-afca2fj3
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run chair_VAE_sparse/512_to_128-kld-1.0
wandb: ⭐️ View project at https://wandb.ai/13532152291-sun-yat-sen-university/xcube-shapenet
wandb: 🚀 View run at https://wandb.ai/13532152291-sun-yat-sen-university/xcube-shapenet/runs/afca2fj3
[rank: 0] Global seed set to 0
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
2024-07-27 14:39:42.125 | INFO | xcube.modules.autoencoding.sunet:__init__:241 - latent dim: 8
[rank: 0] Global seed set to 0
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
dk-process-data-master-0:84258:84258 [0] NCCL INFO Bootstrap : Using eth0:172.16.28.236<0>
dk-process-data-master-0:84258:84258 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
dk-process-data-master-0:84258:84258 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
dk-process-data-master-0:84258:84258 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
dk-process-data-master-0:84258:84258 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
dk-process-data-master-0:84258:84258 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.19.3+cuda12.3
Restoring states from the checkpoint path at /mnt/pfs/users/dengken/code/XCube/checkpoints/chair_download/fine_vae/last.ckpt
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:1452: UserWarning: Be aware that when using `ckpt_path`, callbacks used to create the checkpoint need to be provided during `Trainer` instantiation. Please add the following callbacks: ["ModelCheckpoint{'monitor': 'val_step', 'mode': 'max', 'every_n_train_steps': 5000, 'every_n_epochs': 0, 'train_time_interval': None}"].
  rank_zero_warn(
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
3.8 M     Trainable params
0         Non-trainable params
3.8 M     Total params
15.203    Total estimated model params size (MB)
Restored all states from the checkpoint file at /mnt/pfs/users/dengken/code/XCube/checkpoints/chair_download/fine_vae/last.ckpt
======= MODEL HYPER-PARAMETERS ======= <<<<
exec: null
include: null
test_set_shuffle: false
batch_size: 1
accumulate_grad_batches: 32
visualize: false
name: shapenet/chair_VAE_sparse
model: autoencoder
tree_depth: 3
voxel_size:
- 0.0025
- 0.0025
- 0.0025
resolution: 512
use_fvdb_loader: true
use_hash_tree: true
use_input_normal: true
use_input_semantic: false
use_input_intensity: false
cut_ratio: 16
kl_weight: 1.0
normalize_kld: true
enable_anneal: false
kl_weight_min: 1.0e-07
kl_weight_max: 1.0
anneal_star_iter: 0
anneal_end_iter: 70000
supervision:
  structure_weight: 20.0
  normal_weight: 300.0
  color_weight: 0.0
  semantic_weight: 0.0
optimizer: Adam
learning_rate:
  init: 0.0001
  decay_mult: 0.7
  decay_step: 50000
  clip: 1.0e-06
weight_decay: 0.0
grad_clip: 0.5
network:
  encoder:
    c_dim: 32
  unet:
    target: StructPredictionNet
    params:
      in_channels: 32
      num_blocks: 3
      f_maps: 32
      neck_dense_type: UNCHANGED
      neck_bound:
      - 64
      - 64
      - 64
      num_res_blocks: 1
      use_residual: false
      order: gcr
      is_add_dec: false
      use_attention: false
      use_checkpoint: false
_shapenet_path: ../data/shapenet/
_shapenet_categories:
- '03001627'
_shapenet_custom_name: shapenet
train_dataset: ShapeNetDataset
train_val_num_workers: 0
train_kwargs:
  onet_base_path: ../data/shapenet/
  resolution: 512
  categories:
  - '03001627'
  custom_name: shapenet
  split: train
  random_seed: 0
val_dataset: ShapeNetDataset
val_kwargs:
  onet_base_path: ../data/shapenet/
  resolution: 512
  categories:
  - '03001627'
  custom_name: shapenet
  split: val
  random_seed: fixed
test_dataset: ShapeNetDataset
test_num_workers: 0
test_kwargs:
  onet_base_path: ../data/shapenet/
  resolution: 512
  categories:
  - '03001627'
  custom_name: shapenet
  split: test
  random_seed: fixed
remain_h: false
pretrained_weight: null
use_input_color: false
with_color_branch: false
with_normal_branch: true
with_semantic_branch: false
====================================== <<<<
Sanity Checking: 0it [00:00, ?it/s]
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument (try 128 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
dk-process-data-master-0:84258:86092 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
dk-process-data-master-0:84258:86092 [0] NCCL INFO P2P plugin IBext
dk-process-data-master-0:84258:86092 [0] NCCL INFO NET/IB : No device found.
dk-process-data-master-0:84258:86092 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
dk-process-data-master-0:84258:86092 [0] NCCL INFO NET/IB : No device found.
dk-process-data-master-0:84258:86092 [0] NCCL INFO NET/Socket : Using [0]eth0:172.16.28.236<0>
dk-process-data-master-0:84258:86092 [0] NCCL INFO Using non-device net plugin version 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Using network Socket
dk-process-data-master-0:84258:86092 [0] NCCL INFO comm 0x561eb43b3e00 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId ad000 commId 0x68b3dc29606196e0 - Init START
dk-process-data-master-0:84258:86092 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
dk-process-data-master-0:84258:86092 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff,00000000
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 00/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 01/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 02/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 03/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 04/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 05/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 06/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 07/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 08/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 09/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 10/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 11/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 12/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 13/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 14/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 15/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 16/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 17/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 18/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 19/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 20/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 21/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 22/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 23/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 24/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 25/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 26/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 27/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 28/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 29/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 30/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 31/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
dk-process-data-master-0:84258:86092 [0] NCCL INFO P2P Chunksize set to 131072
dk-process-data-master-0:84258:86092 [0] NCCL INFO Connected all rings
dk-process-data-master-0:84258:86092 [0] NCCL INFO Connected all trees
dk-process-data-master-0:84258:86092 [0] NCCL INFO 32 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
dk-process-data-master-0:84258:86092 [0] NCCL INFO comm 0x561eb43b3e00 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId ad000 commId 0x68b3dc29606196e0 - Init COMPLETE
Sanity Checking: 0%| | 0/2 [00:00<?, ?it/s]
Sanity Checking DataLoader 0:   0%| | 0/2 [00:00<?, ?it/s]
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/utilities/data.py:84: UserWarning: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 1016724. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.
  warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:232: UserWarning: You called `self.log('val_step', ...)` in your `validation_step` but the value needs to be floating point. Converting it to torch.float32.
  warning_cache.warn(
Sanity Checking DataLoader 0:  50%|█████     | 1/2 [00:02<00:02, 2.27s/it]
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/utilities/data.py:84: UserWarning: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 507080. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.
  warning_cache.warn(
Sanity Checking DataLoader 0: 100%|██████████| 2/2 [00:02<00:00, 1.32s/it]
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use `self.log('val_metric/struct-acc-2', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
  warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use `self.log('val_metric/struct-acc-1', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
  warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use `self.log('val_metric/struct-acc-0', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
  warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use `self.log('val_loss/struct-2', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
  warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use `self.log('val_loss/struct-1', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
  warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use `self.log('val_loss/struct-0', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
  warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use `self.log('val_loss/normal', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
  warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use `self.log('val_loss/kld', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
  warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use `self.log('val_loss/mu-0', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
  warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use `self.log('val_loss/logvar-0', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
  warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use `self.log('val_loss/kld-true-0', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
  warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use `self.log('val_loss/kld-total-0', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
  warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use `self.log('val_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
  warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use `self.log('val_step', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
  warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument (try 128 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Training: 594it [00:00, ?it/s]
Training: 0%| | 0/6271 [00:00<00:00, -20590219.64it/s]
Epoch 100:   0%| | 0/6271 [00:00<?, ?it/s]
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:232: UserWarning: You called `self.log('val_step', ...)` in your `training_step` but the value needs to be floating point. Converting it to torch.float32.
  warning_cache.warn(
Epoch 100: 0%| | 1/6271 [00:01<3:04:46, 1.77s/it]
Epoch 100: 0%| | 1/6271 [00:01<3:04:54, 1.77s/it, loss=18.8, v_num=2fj3]
Epoch 100: 0%| | 2/6271 [00:03<3:27:35, 1.99s/it, loss=18.8, v_num=2fj3]
Epoch 100: 0%| | 2/6271 [00:03<3:27:39, 1.99s/it, loss=26.3, v_num=2fj3]
Epoch 100: 0%| | 3/6271 [00:04<2:45:00, 1.58s/it, loss=26.3, v_num=2fj3]
Epoch 100: 0%| | 3/6271 [00:04<2:45:02, 1.58s/it, loss=23.6, v_num=2fj3]
Epoch 100: 0%| | 4/6271 [00:05<2:14:35, 1.29s/it, loss=23.6, v_num=2fj3]
Epoch 100: 0%| | 4/6271 [00:05<2:14:37, 1.29s/it, loss=23.9, v_num=2fj3]
Epoch 100: 0%| | 5/6271 [00:05<1:56:01, 1.11s/it, loss=23.9, v_num=2fj3]
Epoch 100: 0%| | 5/6271 [00:05<1:56:03, 1.11s/it, loss=23.5, v_num=2fj3]
Epoch 100: 0%| | 6/6271 [00:07<2:12:47, 1.27s/it, loss=23.5, v_num=2fj3]
Epoch 100: 0%| | 6/6271 [00:07<2:12:48, 1.27s/it, loss=24.2, v_num=2fj3]
Epoch 100: 0%| | 7/6271 [00:08<2:06:06, 1.21s/it, loss=24.2, v_num=2fj3]
Epoch 100: 0%| | 7/6271 [00:08<2:06:07, 1.21s/it, loss=24.2, v_num=2fj3]
Epoch 100: 0%| | 8/6271 [00:08<1:56:14, 1.11s/it, loss=24.2, v_num=2fj3]
Epoch 100: 0%| | 8/6271 [00:08<1:56:15, 1.11s/it, loss=23.1, v_num=2fj3]
Epoch 100: 0%| | 9/6271 [00:09<1:52:52, 1.08s/it, loss=23.1, v_num=2fj3]
Epoch 100: 0%| | 9/6271 [00:09<1:52:53, 1.08s/it, loss=23, v_num=2fj3]
Epoch 100: 0%| | 10/6271 [00:10<1:52:31, 1.08s/it, loss=23, v_num=2fj3]
Epoch 100: 0%| | 10/6271 [00:10<1:52:31, 1.08s/it, loss=22.9, v_num=2fj3]
Epoch 100: 0%| | 11/6271 [00:11<1:48:02, 1.04s/it, loss=22.9, v_num=2fj3]
Epoch 100: 0%| | 11/6271 [00:11<1:48:02, 1.04s/it, loss=22.5, v_num=2fj3]
Epoch 100: 0%| | 12/6271 [00:11<1:42:54, 1.01it/s, loss=22.5, v_num=2fj3]
Epoch 100: 0%| | 12/6271 [00:11<1:42:54, 1.01it/s, loss=23, v_num=2fj3]
Epoch 100: 0%| | 13/6271 [00:12<1:39:30, 1.05it/s, loss=23, v_num=2fj3]
Epoch 100: 0%| | 13/6271 [00:12<1:39:30, 1.05it/s, loss=22.7, v_num=2fj3]
Epoch 100: 0%| | 14/6271 [00:13<1:41:52, 1.02it/s, loss=22.7, v_num=2fj3]
Epoch 100: 0%| | 14/6271 [00:13<1:41:52, 1.02it/s, loss=22.8, v_num=2fj3]
Epoch 100: 0%| | 15/6271 [00:14<1:38:20, 1.06it/s, loss=22.8, v_num=2fj3]
Epoch 100: 0%| | 15/6271 [00:14<1:38:21, 1.06it/s, loss=22.2, v_num=2fj3]
Epoch 100: 0%| | 16/6271 [00:15<1:42:53, 1.01it/s, loss=22.2, v_num=2fj3]
Epoch 100: 0%| | 16/6271 [00:15<1:42:53, 1.01it/s, loss=23.2, v_num=2fj3]
Epoch 100: 0%| | 17/6271 [00:16<1:40:30, 1.04it/s, loss=23.2, v_num=2fj3]
Epoch 100: 0%| | 17/6271 [00:16<1:40:30, 1.04it/s, loss=22.7, v_num=2fj3]
Epoch 100: 0%| | 18/6271 [00:18<1:45:25, 1.01s/it, loss=22.7, v_num=2fj3]
Epoch 100: 0%| | 18/6271 [00:18<1:45:25, 1.01s/it, loss=23.6, v_num=2fj3]
Epoch 100: 0%| | 19/6271 [00:19<1:47:14, 1.03s/it, loss=23.6, v_num=2fj3]
Epoch 100: 0%| | 19/6271 [00:19<1:47:15, 1.03s/it, loss=23.5, v_num=2fj3]
Epoch 100: 0%| | 20/6271 [00:20<1:44:26, 1.00s/it, loss=23.5, v_num=2fj3]
Epoch 100: 0%| | 20/6271 [00:20<1:44:26, 1.00s/it, loss=23.4, v_num=2fj3]
Epoch 100: 0%| | 21/6271 [00:21<1:44:16, 1.00s/it, loss=23.4, v_num=2fj3]
Epoch 100: 0%| | 21/6271 [00:21<1:44:17, 1.00s/it, loss=23.3, v_num=2fj3]
Epoch 100: 0%| | 22/6271 [00:22<1:45:58, 1.02s/it, loss=23.3, v_num=2fj3]
Epoch 100: 0%| | 22/6271 [00:22<1:45:59, 1.02s/it, loss=22.7, v_num=2fj3]
Epoch 100: 0%| | 23/6271 [00:22<1:43:30, 1.01it/s, loss=22.7, v_num=2fj3]
Epoch 100: 0%| | 23/6271 [00:22<1:43:30, 1.01it/s, loss=22.8, v_num=2fj3]
Epoch 100: 0%| | 24/6271 [00:24<1:44:31, 1.00s/it, loss=22.8, v_num=2fj3]
Epoch 100: 0%| | 24/6271 [00:24<1:44:31, 1.00s/it, loss=23.1, v_num=2fj3]
Epoch 100: 0%| | 25/6271 [00:24<1:42:20, 1.02it/s, loss=23.1, v_num=2fj3]
Epoch 100: 0%| | 25/6271 [00:24<1:42:21, 1.02it/s, loss=22.9, v_num=2fj3]
Epoch 100: 0%| | 26/6271 [00:25<1:40:29, 1.04it/s, loss=22.9, v_num=2fj3]
Epoch 100: 0%| | 26/6271 [00:25<1:40:30, 1.04it/s, loss=22.6, v_num=2fj3]
Epoch 100: 0%| | 27/6271 [00:25<1:38:16, 1.06it/s, loss=22.6, v_num=2fj3]
Epoch 100: 0%| | 27/6271 [00:25<1:38:16, 1.06it/s, loss=22.5, v_num=2fj3]
Epoch 100: 0%| | 28/6271 [00:26<1:36:59, 1.07it/s, loss=22.5, v_num=2fj3]
Epoch 100: 0%| | 28/6271 [00:26<1:36:59, 1.07it/s, loss=22.6, v_num=2fj3]
Epoch 100: 0%| | 29/6271 [00:26<1:35:29, 1.09it/s, loss=22.6, v_num=2fj3]
Epoch 100: 0%| | 29/6271 [00:26<1:35:30, 1.09it/s, loss=22.2, v_num=2fj3]
Epoch 100: 0%| | 30/6271 [00:27<1:34:08, 1.10it/s, loss=22.2, v_num=2fj3]
Epoch 100: 0%| | 30/6271 [00:27<1:34:08, 1.10it/s, loss=22.3, v_num=2fj3]
Epoch 100: 0%| | 31/6271 [00:29<1:37:20, 1.07it/s, loss=22.3, v_num=2fj3]
Epoch 100: 0%| | 31/6271 [00:29<1:37:20, 1.07it/s, loss=22.8, v_num=2fj3]
[rank0]:[W reducer.cpp:1360] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
Traceback (most recent call last):
File "/mnt/pfs/users/dengken/code/XCube/train.py", line 407, in wandb.require("core")
! See https://wandb.me/wandb-core for more information.
Exception ignored in: <function tqdm.__del__ at 0x7f5980fb2ca0>
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/tqdm/std.py", line 1152, in __del__
  File "/root/miniconda3/lib/python3.9/site-packages/tqdm/std.py", line 1306, in close
  File "/root/miniconda3/lib/python3.9/site-packages/tqdm/std.py", line 1499, in display
  File "/root/miniconda3/lib/python3.9/site-packages/tqdm/std.py", line 1155, in __str__
  File "/root/miniconda3/lib/python3.9/site-packages/tqdm/std.py", line 1457, in format_dict
TypeError: cannot unpack non-iterable NoneType object
dk-process-data-master-0:84258:86123 [0] NCCL INFO [Service thread] Connection closed by localRank 0
dk-process-data-master-0:84258:84258 [0] NCCL INFO comm 0x561eb43b3e00 rank 0 nranks 1 cudaDev 0 busId ad000 - Abort COMPLETE
Hi, could you watch your GPU memory usage with `watch nvidia-smi`? The first error is an `illegal memory access` and the second error is `The size of tensor a (32) must match the size of tensor b (36) at non-singleton dimension 1`.
Btw, I am also using an 80GB A100. If the problem truly is an OOM issue, I would suggest removing the sample that triggers OOM.
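If you do go that route, a generic way to drop specific offending samples (just a sketch; it assumes an index-based dataset, and the bad indices are placeholders rather than values from the repo) is to wrap the dataset in a `torch.utils.data.Subset`:

```python
from torch.utils.data import Subset

# Hypothetical indices of samples observed to trigger OOM; fill in your own.
BAD_INDICES = {1234, 5678}

def drop_oom_samples(dataset):
    """Return a view of `dataset` with the known-bad indices removed (sketch only)."""
    keep = [i for i in range(len(dataset)) if i not in BAD_INDICES]
    return Subset(dataset, keep)
```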
> Hi, could you watch your GPU memory usage with `watch nvidia-smi`? The first error is an `illegal memory access` and the second error is `The size of tensor a (32) must match the size of tensor b (36) at non-singleton dimension 1`. Btw, I am also using an 80GB A100. If the problem truly is an OOM issue, I would suggest removing the sample that triggers OOM.
I always monitor with `watch -n 0.1 nvidia-smi`.
For the first error, I have hit it many times. When it occurs, the program prints `[rank0]:[W CUDAGuardImpl.h:115] Warning: CUDA warning: an illegal memory access was encountered (function destroyEvent) Aborted (core dumped)`, the `watch -n 0.1 nvidia-smi` output freezes, and the GPU memory is not released, so I have to `pkill -f python` to free it.
For the second problem, I don't know the reason: I just load the checkpoint and fine-tune it without changing any code. The checkpoint is the one you provided, so I have no way to tell whether it has any problem for fine-tuning.
I am not sure it is an OOM problem, because the illegal-memory-access warning gives no information about it, not even a log file. I believe you have met this problem before; could you please give me some instructions or hints about it?
For now, I will try the solution you suggested and remove the sample that triggers OOM.
Thanks for your reply.
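Besides `watch -n 0.1 nvidia-smi`, it may also help to log the allocator's own view of memory at the start of every step, to see whether usage is actually climbing before the illegal-memory-access crash. A small helper like this (my own sketch, not part of the repo) is enough:

```python
import torch

def log_cuda_memory(tag: str, device: int = 0) -> None:
    """Print allocated vs. reserved CUDA memory in GiB (debugging sketch only)."""
    gib = 1024 ** 3
    allocated = torch.cuda.memory_allocated(device) / gib   # live tensors
    reserved = torch.cuda.memory_reserved(device) / gib     # cached by the allocator
    peak = torch.cuda.max_memory_allocated(device) / gib    # high-water mark
    print(f"[{tag}] allocated={allocated:.2f} GiB | reserved={reserved:.2f} GiB | peak={peak:.2f} GiB")

# Example: call log_cuda_memory(f"step {batch_idx}") at the top of training_step and
# torch.cuda.reset_peak_memory_stats() right after each optimizer step.
```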
> Hi, could you watch your GPU memory usage with `watch nvidia-smi`? The first error is an `illegal memory access` and the second error is `The size of tensor a (32) must match the size of tensor b (36) at non-singleton dimension 1`. Btw, I am also using an 80GB A100. If the problem truly is an OOM issue, I would suggest removing the sample that triggers OOM.
The link to the ShapeNet dataset is empty. Please at least release the list...
> Hi, could you watch your GPU memory usage with `watch nvidia-smi`? The first error is an `illegal memory access` and the second error is `The size of tensor a (32) must match the size of tensor b (36) at non-singleton dimension 1`. Btw, I am also using an 80GB A100. If the problem truly is an OOM issue, I would suggest removing the sample that triggers OOM.
I just tried your suggestion, and it does not solve the problem. The issue is not tied to any single sample; it is about the whole effective batch (batch_size * gradient_accumulation).
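If the blow-up really comes from the combined size of a batch rather than from one sample, one workaround is to skip a step whenever the total input size exceeds a budget, before any activation memory is allocated. The sketch below assumes the batch exposes a concatenated point tensor under `batch["points"]`, which is not necessarily how XCube packs its batches:

```python
# Purely illustrative threshold; tune it to your GPU.
MAX_POINTS_PER_STEP = 1_500_000

def exceeds_point_budget(batch, max_points: int = MAX_POINTS_PER_STEP) -> bool:
    """Return True if the (assumed) concatenated point tensor is too large for one step."""
    return batch["points"].shape[0] > max_points

# Inside training_step, before the forward pass:
#     if exceeds_point_budget(batch):
#         return None   # skip the step instead of hitting OOM mid-backward
```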
Could you try https://drive.google.com/file/d/1PQmSomS1B7UR7wNuqp5RtgkdXo7stKzG/view?usp=sharing?
I have requested access; please check it. Thanks.
> Hi, could you watch your GPU memory usage with `watch nvidia-smi`? The first error is an `illegal memory access` and the second error is `The size of tensor a (32) must match the size of tensor b (36) at non-singleton dimension 1`. Btw, I am also using an 80GB A100. If the problem truly is an OOM issue, I would suggest removing the sample that triggers OOM.
>
> I just tried your suggestion, and it does not solve the problem. The issue is not tied to any single sample; it is about the whole effective batch (batch_size * gradient_accumulation).
Hi, I have the same problem. Have you solved it? I have tried reducing the batch size and removing some samples, but it doesn't work. It occurs in the second epoch.
> Hi, could you watch your GPU memory usage with `watch nvidia-smi`? The first error is an `illegal memory access` and the second error is `The size of tensor a (32) must match the size of tensor b (36) at non-singleton dimension 1`. Btw, I am also using an 80GB A100. If the problem truly is an OOM issue, I would suggest removing the sample that triggers OOM.
>
> I just tried your suggestion, and it does not solve the problem. The issue is not tied to any single sample; it is about the whole effective batch (batch_size * gradient_accumulation).
>
> Hi, I have the same problem. Have you solved it? I have tried reducing the batch size and removing some samples, but it doesn't work. It occurs in the second epoch.
NO!
Could you share the memory cost at the different stages and on the different datasets?
I am hitting the problem that training the fine VAE on an 80GB GPU with batch_size=1 still OOMs at epoch 0.
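To collect those numbers myself, I am measuring the peak memory of a single forward/backward pass per stage and dataset. A rough sketch, assuming the model call returns a scalar loss (which may not match the real XCube interface):

```python
import torch

def peak_memory_of_step(model, batch, device: str = "cuda:0") -> float:
    """Run one forward/backward pass and return the peak allocated memory in GiB (sketch)."""
    torch.cuda.reset_peak_memory_stats(device)
    loss = model(batch)                 # assumed: returns a scalar loss
    loss.backward()
    torch.cuda.synchronize(device)
    return torch.cuda.max_memory_allocated(device) / 1024 ** 3
```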