nv-tlabs / XCube

[CVPR 2024 Highlight] XCube: Large-Scale 3D Generative Modeling using Sparse Voxel Hierarchies
https://research.nvidia.com/labs/toronto-ai/xcube/

Error occurs when training diffusion #18

Open VLadImirluren opened 1 month ago

VLadImirluren commented 1 month ago

At first, I tried to train the coarse VAE using the given command:

```bash
python train.py ./configs/shapenet/chair/train_vae_16x16x16_dense.yaml --wname 16x16x16-kld-0.03_dim-16 --max_epochs 100 --cut_ratio 16 --gpus 1 --batch_size 16
```

Because my GPU setup is different (I have a single A800, while the paper uses 8 × V100), I changed the batch size to 16 and set gradient accumulation to 2.
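Concretely, the adjusted VAE command looked roughly like this (a sketch; I am assuming `--accumulate_grad_batches` is the right override, as in the diffusion command below):

```bash
python train.py ./configs/shapenet/chair/train_vae_16x16x16_dense.yaml \
    --wname 16x16x16-kld-0.03_dim-16 --max_epochs 100 --cut_ratio 16 \
    --gpus 1 --batch_size 16 --accumulate_grad_batches 2
```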

After the coarse VAE trained successfully, I tried to train the coarse diffusion using the given command (again with only the batch size and gradient accumulation changed):

```bash
python train.py ./configs/shapenet/chair/train_diffusion_16x16x16_dense.yaml --wname 16x16x16_kld-0.03 --eval_interval 5 --gpus 1 --batch_size 8 --accumulate_grad_batches 32
```

But an error occurs:

```
2024-07-19 15:47:45.053 | INFO | __main__:<module>:171 - This is train_auto.py! Please note that you should use 300 instead of 300.0 for resuming.
git root error: Cmd('git') failed due to: exit code(128)
  cmdline: git rev-parse --show-toplevel
  stderr: 'fatal: detected dubious ownership in repository at '/mnt/pfs/users/dengken/code/XCube'
To add an exception for this directory, call:

    git config --global --add safe.directory /mnt/pfs/users/dengken/code/XCube'

wandb: Currently logged in as: 13532152291 (13532152291-sun-yat-sen-university). Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.17.3
wandb: Run data is saved locally in ../wandb/wandb/run-20240719_154747-rk4p0a77
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run chair_diffusion_dense/16x16x16_kld-0.03
wandb: ⭐️ View project at https://wandb.ai/13532152291-sun-yat-sen-university/xcube-shapenet
wandb: 🚀 View run at https://wandb.ai/13532152291-sun-yat-sen-university/xcube-shapenet/runs/rk4p0a77
[rank: 0] Global seed set to 0
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/utilities/distributed.py:258: LightningDeprecationWarning: `pytorch_lightning.utilities.distributed.rank_zero_only` has been deprecated in v1.8.1 and will be removed in v2.0.0. You can import it from `pytorch_lightning.utilities` instead.
  rank_zero_deprecation(
2024-07-19 15:48:01.165 | INFO | xcube.modules.autoencoding.sunet:__init__:240 - latent dim: 16
Traceback (most recent call last):
  File "/mnt/pfs/users/dengken/code/XCube/train.py", line 380, in <module>
    net_model = net_module(model_args)
  File "/mnt/pfs/users/dengken/code/XCube/xcube/models/diffusion.py", line 84, in __init__
    self.vae = self.load_first_stage_from_pretrained().eval()
  File "/root/miniconda3/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/pfs/users/dengken/code/XCube/xcube/models/diffusion.py", line 264, in load_first_stage_from_pretrained
    return net_module.load_from_checkpoint(args_ckpt, hparams=model_args)
  File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/core/saving.py", line 139, in load_from_checkpoint
    return _load_from_checkpoint(
  File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/core/saving.py", line 188, in _load_from_checkpoint
    return _load_state(cls, checkpoint, strict=strict, **kwargs)
  File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/core/saving.py", line 247, in _load_state
    keys = obj.load_state_dict(checkpoint["state_dict"], strict=strict)
  File "/root/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2153, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Model:
    size mismatch for unet.pre_kl_bottleneck.pre_kl_bottleneck_1.SingleConv1.Conv.weight: copying a param with shape torch.Size([64, 512, 3, 3, 3]) from checkpoint, the shape in current model is torch.Size([32, 512, 3, 3, 3]).
    size mismatch for unet.pre_kl_bottleneck.pre_kl_bottleneck_1.SingleConv2.GroupNorm.weight: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
    size mismatch for unet.pre_kl_bottleneck.pre_kl_bottleneck_1.SingleConv2.GroupNorm.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
    size mismatch for unet.pre_kl_bottleneck.pre_kl_bottleneck_1.SingleConv2.Conv.weight: copying a param with shape torch.Size([64, 64, 3, 3, 3]) from checkpoint, the shape in current model is torch.Size([32, 32, 3, 3, 3]).
    size mismatch for unet.post_kl_bottleneck.post_kl_bottleneck_0.SingleConv1.GroupNorm.weight: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([16]).
    size mismatch for unet.post_kl_bottleneck.post_kl_bottleneck_0.SingleConv1.GroupNorm.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([16]).
    size mismatch for unet.post_kl_bottleneck.post_kl_bottleneck_0.SingleConv1.Conv.weight: copying a param with shape torch.Size([512, 32, 3, 3, 3]) from checkpoint, the shape in current model is torch.Size([512, 16, 3, 3, 3]).
```

And there is no error when I use the pretrained VAE checkpoint you provide.

Could you please help me? Thanks!
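In case it helps to diagnose, the hyperparameters stored in my trained VAE checkpoint can be dumped like this (a minimal sketch, assuming the checkpoint keeps them under PyTorch Lightning's usual `hyper_parameters` key and that `cut_ratio` is recorded there; both are assumptions, not verified against the XCube code):

```python
# Hypothetical debugging sketch (not code from the XCube repo): print the
# hyperparameters stored inside the trained VAE checkpoint so they can be
# compared against what the diffusion config will instantiate.
import torch

# Placeholder path for the checkpoint written by the VAE training run.
ckpt = torch.load("path/to/vae/last.ckpt", map_location="cpu")

# PyTorch Lightning stores save_hyperparameters() output under this key;
# whether "cut_ratio" appears there is an assumption.
hparams = ckpt.get("hyper_parameters", {})
print(hparams.get("cut_ratio"))
```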

VLadImirluren commented 1 month ago

This is not a problem with git: I solved the git error, but the error above still persists.

VLadImirluren commented 1 month ago

In other words, training the VAE with the command you provided and then running the diffusion training command you provided results in this error.

tanghaotommy commented 1 month ago

You might want to change the cut_ratio when training the diffusion model. The command you used had cut_ratio=16 for training the VAE, while the default for diffusion training is 32; you might want to change that to 16 as well.
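Concretely, that would mean rerunning your diffusion command with the override added (a sketch, assuming the diffusion config accepts the same `--cut_ratio` flag as the VAE config):

```bash
python train.py ./configs/shapenet/chair/train_diffusion_16x16x16_dense.yaml \
    --wname 16x16x16_kld-0.03 --eval_interval 5 --gpus 1 \
    --batch_size 8 --accumulate_grad_batches 32 --cut_ratio 16
```

This would also explain the exact shapes in your traceback: halving cut_ratio from 32 to 16 presumably doubles the affected channel widths, which matches the 64-vs-32 and 32-vs-16 mismatches.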

VLadImirluren commented 1 month ago

> You might want to change the cut_ratio when training the diffusion model. The command you used had cut_ratio=16 for training the VAE, while the default for diffusion training is 32; you might want to change that to 16 as well.

But the latent dimension given in the paper is 16. Will the performance be significantly affected by a different latent dimension?

xrenaa commented 1 month ago

> > You might want to change the cut_ratio when training the diffusion model. The command you used had cut_ratio=16 for training the VAE, while the default for diffusion training is 32; you might want to change that to 16 as well.
>
> But the latent dimension given in the paper is 16. Will the performance be significantly affected by a different latent dimension?

Hi, thanks for giving it a try! 16 or 8 does not make a big difference. I will fix the instructions.

VLadImirluren commented 1 month ago

> > > You might want to change the cut_ratio when training the diffusion model. The command you used had cut_ratio=16 for training the VAE, while the default for diffusion training is 32; you might want to change that to 16 as well.
> >
> > But the latent dimension given in the paper is 16. Will the performance be significantly affected by a different latent dimension?
>
> Hi, thanks for giving it a try! 16 or 8 does not make a big difference. I will fix the instructions.

Thanks