nv-tlabs / LION

Latent Point Diffusion Models for 3D Shape Generation

training loss nan #9

Closed Zhiyuan-R closed 1 year ago

Zhiyuan-R commented 1 year ago

Hi, I trained the VAE model as described in the README, but the training loss becomes NaN. I use 4 GPUs and a batch size of 40, and I kept everything else the same as in the repo.
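As a sanity check on this setup, the sketch below works out the effective global batch size. It assumes that the batch size of 40 applies per GPU and that the training set is sharded evenly across ranks in the style of PyTorch's `DistributedSampler` (each rank gets `ceil(N / world_size)` samples). Neither assumption is confirmed in this thread. The 2458 training files come from the log posted further down, and the helper name `iters_per_epoch` is hypothetical.

```python
import math


def iters_per_epoch(num_samples: int, world_size: int, per_gpu_batch: int,
                    drop_last: bool = True) -> int:
    """Iterations per epoch on one rank, assuming even sharding across ranks."""
    # Assumption: each rank sees ceil(N / world_size) samples per epoch.
    per_rank = math.ceil(num_samples / world_size)
    if drop_last:
        return per_rank // per_gpu_batch
    return math.ceil(per_rank / per_gpu_batch)


world_size = 4       # "4 gpu" in the report
per_gpu_batch = 40   # assumed to be per GPU, not global
num_train = 2458     # training files reported in the log

print("iterations per epoch:", iters_per_epoch(num_train, world_size, per_gpu_batch))
print("effective global batch:", world_size * per_gpu_batch)
```

Under these assumptions the result is 15 iterations per epoch, which matches the `Niter/epo=15` reported in the log and would mean each optimizer step sees 160 shapes in total.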

ZENGXH commented 1 year ago

Are you using the ShapeNet dataset as well? Can you share the training log here?
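A shared training log can be scanned for the first non-finite loss with a few lines of Python. This is a minimal sketch that assumes each progress record contains `E<epoch> iter[...] | [Loss] <value>`, which is the shape of the progress records in the log posted in this thread. The function names and the abbreviated sample string are illustrative only and are not part of the LION code base.

```python
import math
import re
from typing import List, Optional, Tuple

# Matches progress records such as: "| E6 iter[ 14/ 15] | [Loss] nan |"
LOSS_RE = re.compile(r"\bE(\d+) iter\[[^\]]*\] \| \[Loss\] (\S+)")


def parse_losses(log_text: str) -> List[Tuple[int, float]]:
    """Return (epoch, loss) pairs in the order they appear in the log."""
    return [(int(epoch), float(value)) for epoch, value in LOSS_RE.findall(log_text)]


def first_bad_epoch(log_text: str) -> Optional[int]:
    """Return the first epoch whose logged loss is NaN or infinite, else None."""
    for epoch, loss in parse_losses(log_text):
        if not math.isfinite(loss):
            return epoch
    return None


# Abbreviated sample: two progress records with most fields left out.
sample = (
    "[R0] | E5 iter[ 14/ 15] | [Loss] 51566519.44 | [step] 89 "
    "[R0] | E6 iter[ 14/ 15] | [Loss] nan | [step] 104"
)
print(parse_losses(sample))
print(first_bad_epoch(sample))
```

Because the regular expression does not depend on line breaks, it also works on a log whose records have been pasted as one run of text.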

Zhiyuan-R commented 1 year ago

Yes! I use ShapeNetCore.v2.PC15k (downloaded from PVD).
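For context on how this data enters training: the log in this thread reports `normalize_global=True`, a mean of shape `(1, 1, 3)` and a std of shape `(1, 1, 1)`. The sketch below illustrates one normalization consistent with those shapes, assuming a per-axis mean over all points and a single scalar standard deviation over all coordinates. It is a plain-Python illustration on toy data, not the repository's loader, and `global_normalize` is a hypothetical name.

```python
import statistics


def global_normalize(shapes):
    """Normalize point clouds with one per-axis mean and one scalar std.

    shapes: list of point clouds, each a list of (x, y, z) tuples.
    Returns (normalized_shapes, mean, std).
    """
    points = [p for shape in shapes for p in shape]
    # Per-axis mean over every point of every shape (3 values).
    mean = [statistics.fmean(p[axis] for p in points) for axis in range(3)]
    # Single population std over all coordinates (1 value), an assumption here.
    std = statistics.pstdev([c for p in points for c in p])
    normalized = [
        [tuple((p[axis] - mean[axis]) / std for axis in range(3)) for p in shape]
        for shape in shapes
    ]
    return normalized, mean, std


# Toy data: two "shapes" with two points each.
toy = [
    [(0.0, 0.0, 0.0), (1.0, 2.0, 3.0)],
    [(2.0, 4.0, 6.0), (1.0, 2.0, 3.0)],
]
normalized, mean, std = global_normalize(toy)
print("mean per axis:", mean)
print("scalar std:", std)
```

Checking that the normalized coordinates stay in a modest range (the log reports roughly -4.3 to 4.2) is a quick way to rule out a data-scaling problem before looking at the model.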

Zhiyuan-R commented 1 year ago

2023-01-25 00:17:56.473 | INFO | main:get_args:205 - EXP_ROOT: ./exp + exp name: 0125/car/3dbf3ah_hvae_lion_B40, save dir: ./exp/0125/car/3dbf3ah_hvae_lion_B40 2023-01-25 00:17:56.490 | INFO | main:get_args:210 - save config at ./exp/0125/car/3dbf3ah_hvae_lion_B40/cfg.yml 2023-01-25 00:17:56.491 | INFO | main:get_args:213 - log dir: ./exp/0125/car/3dbf3ah_hvae_lion_B40 2023-01-25 00:17:56.491 | INFO | main::227 - In Rank=0 2023-01-25 00:17:56.491 | INFO | main::233 - Node rank 0, local proc 0, global proc 0 2023-01-25 00:17:56.503 | INFO | main::227 - In Rank=1 2023-01-25 00:17:56.504 | INFO | main::233 - Node rank 0, local proc 1, global proc 1 2023-01-25 00:17:56.515 | INFO | main::227 - In Rank=2 2023-01-25 00:17:56.516 | INFO | main::233 - Node rank 0, local proc 2, global proc 2 2023-01-25 00:17:56.528 | INFO | main::227 - In Rank=3 2023-01-25 00:17:56.529 | INFO | main::233 - Node rank 0, local proc 3, global proc 3 2023-01-25 00:17:56.541 | INFO | main::241 - join 3 2023-01-25 00:17:56.651 | DEBUG | utils.utils:init_processes:1140 - set port as 6011 2023-01-25 00:17:56.652 | INFO | utils.utils:init_processes:1151 - init_process: rank=0, world_size=4 2023-01-25 00:17:56.663 | DEBUG | utils.utils:init_processes:1140 - set port as 6011 2023-01-25 00:17:56.664 | INFO | utils.utils:init_processes:1151 - init_process: rank=1, world_size=4 2023-01-25 00:17:56.679 | DEBUG | utils.utils:init_processes:1140 - set port as 6011 2023-01-25 00:17:56.680 | INFO | utils.utils:init_processes:1151 - init_process: rank=2, world_size=4 2023-01-25 00:17:56.715 | DEBUG | utils.utils:init_processes:1140 - set port as 6011 2023-01-25 00:17:56.716 | INFO | utils.utils:init_processes:1151 - init_process: rank=3, world_size=4 2023-01-25 00:17:57.827 | INFO | main:main:29 - use trainer: trainers.hvae_trainer 2023-01-25 00:17:57.831 | INFO | main:main:29 - use trainer: trainers.hvae_trainer 2023-01-25 00:17:57.832 | INFO | main:main:29 - use trainer: trainers.hvae_trainer 2023-01-25 
00:17:57.836 | INFO | main:main:29 - use trainer: trainers.hvae_trainer 2023-01-25 00:18:01.625 | INFO | utils.utils:common_init:466 - [common-init] at rank=2, seed=1 2023-01-25 00:18:01.626 | INFO | utils.utils:init:339 - rank=2, init writer as a blackhole 2023-01-25 00:18:01.626 | INFO | utils.utils:common_init:510 - [common-init] DONE 2023-01-25 00:18:01.670 | INFO | utils.utils:common_init:466 - [common-init] at rank=3, seed=1 2023-01-25 00:18:01.671 | INFO | utils.utils:init:339 - rank=3, init writer as a blackhole 2023-01-25 00:18:01.671 | INFO | utils.utils:common_init:510 - [common-init] DONE 2023-01-25 00:18:01.691 | INFO | utils.utils:common_init:466 - [common-init] at rank=0, seed=1 2023-01-25 00:18:01.691 | INFO | utils.utils:common_init:466 - [common-init] at rank=1, seed=1 2023-01-25 00:18:01.692 | INFO | utils.utils:init:331 - Not init TFB 2023-01-25 00:18:01.692 | INFO | utils.utils:init:339 - rank=1, init writer as a blackhole 2023-01-25 00:18:01.692 | INFO | utils.utils:common_init:510 - [common-init] DONE 2023-01-25 00:18:01.693 | INFO | utils.utils:common_init:510 - [common-init] DONE 2023-01-25 00:18:06.292 | INFO | utils.model_helper:import_model:106 - import: models.shapelatent_modules.PointNetPlusEncoder 2023-01-25 00:18:06.292 | INFO | utils.model_helper:import_model:106 - import: models.shapelatent_modules.PointNetPlusEncoder 2023-01-25 00:18:06.293 | INFO | utils.model_helper:import_model:106 - import: models.shapelatent_modules.PointNetPlusEncoder 2023-01-25 00:18:06.298 | INFO | utils.model_helper:import_model:106 - import: models.shapelatent_modules.PointNetPlusEncoder 2023-01-25 00:18:06.308 | INFO | models.shapelatent_modules:init:29 - [Encoder] zdim=128, out_sigma=True; force_att: 0 2023-01-25 00:18:06.309 | INFO | utils.model_helper:import_model:106 - import: models.latent_points_ada.PointTransPVC 2023-01-25 00:18:06.311 | INFO | models.latent_points_ada:init:38 - [Build Unet] extra_feature_channels=0, input_dim=3 2023-01-25 
00:18:06.313 | INFO | models.shapelatent_modules:init:29 - [Encoder] zdim=128, out_sigma=True; force_att: 0 2023-01-25 00:18:06.314 | INFO | utils.model_helper:import_model:106 - import: models.latent_points_ada.PointTransPVC 2023-01-25 00:18:06.317 | INFO | models.latent_points_ada:init:38 - [Build Unet] extra_feature_channels=0, input_dim=3 2023-01-25 00:18:06.318 | INFO | models.shapelatent_modules:init:29 - [Encoder] zdim=128, out_sigma=True; force_att: 0 2023-01-25 00:18:06.318 | INFO | utils.model_helper:import_model:106 - import: models.latent_points_ada.PointTransPVC 2023-01-25 00:18:06.321 | INFO | models.latent_points_ada:init:38 - [Build Unet] extra_feature_channels=0, input_dim=3 2023-01-25 00:18:06.329 | INFO | models.shapelatent_modules:init:29 - [Encoder] zdim=128, out_sigma=True; force_att: 0 2023-01-25 00:18:06.329 | INFO | utils.model_helper:import_model:106 - import: models.latent_points_ada.PointTransPVC 2023-01-25 00:18:06.332 | INFO | models.latent_points_ada:init:38 - [Build Unet] extra_feature_channels=0, input_dim=3 2023-01-25 00:18:06.457 | INFO | utils.model_helper:import_model:106 - import: models.latent_points_ada.LatentPointDecPVC 2023-01-25 00:18:06.458 | INFO | models.latent_points_ada:init:241 - [Build Dec] point_dim=3, context_dim=1 2023-01-25 00:18:06.458 | INFO | models.latent_points_ada:init:38 - [Build Unet] extra_feature_channels=1, input_dim=3 2023-01-25 00:18:06.473 | INFO | utils.model_helper:import_model:106 - import: models.latent_points_ada.LatentPointDecPVC 2023-01-25 00:18:06.474 | INFO | models.latent_points_ada:init:241 - [Build Dec] point_dim=3, context_dim=1 2023-01-25 00:18:06.474 | INFO | models.latent_points_ada:init:38 - [Build Unet] extra_feature_channels=1, input_dim=3 2023-01-25 00:18:06.478 | INFO | utils.model_helper:import_model:106 - import: models.latent_points_ada.LatentPointDecPVC 2023-01-25 00:18:06.479 | INFO | models.latent_points_ada:init:241 - [Build Dec] point_dim=3, context_dim=1 2023-01-25 
00:18:06.479 | INFO | models.latent_points_ada:init:38 - [Build Unet] extra_feature_channels=1, input_dim=3 2023-01-25 00:18:06.505 | INFO | utils.model_helper:import_model:106 - import: models.latent_points_ada.LatentPointDecPVC 2023-01-25 00:18:06.505 | INFO | models.latent_points_ada:init:241 - [Build Dec] point_dim=3, context_dim=1 2023-01-25 00:18:06.505 | INFO | models.latent_points_ada:init:38 - [Build Unet] extra_feature_channels=1, input_dim=3 2023-01-25 00:18:06.594 | INFO | models.vae_adain:init:50 - [Build Model] style_encoder: models.shapelatent_modules.PointNetPlusEncoder, encoder: models.latent_points_ada.PointTransPVC, decoder: models.latent_points_ada.LatentPointDecPVC 2023-01-25 00:18:06.610 | INFO | models.vae_adain:init:50 - [Build Model] style_encoder: models.shapelatent_modules.PointNetPlusEncoder, encoder: models.latent_points_ada.PointTransPVC, decoder: models.latent_points_ada.LatentPointDecPVC 2023-01-25 00:18:06.613 | INFO | models.vae_adain:init:50 - [Build Model] style_encoder: models.shapelatent_modules.PointNetPlusEncoder, encoder: models.latent_points_ada.PointTransPVC, decoder: models.latent_points_ada.LatentPointDecPVC 2023-01-25 00:18:06.640 | INFO | models.vae_adain:init:50 - [Build Model] style_encoder: models.shapelatent_modules.PointNetPlusEncoder, encoder: models.latent_points_ada.PointTransPVC, decoder: models.latent_points_ada.LatentPointDecPVC 2023-01-25 00:18:06.655 | INFO | trainers.hvae_trainer:init:53 - broadcast_params: device=cuda:2 2023-01-25 00:18:06.663 | INFO | trainers.hvae_trainer:init:53 - broadcast_params: device=cuda:1 2023-01-25 00:18:06.669 | INFO | trainers.hvae_trainer:init:53 - broadcast_params: device=cuda:3 2023-01-25 00:18:06.689 | INFO | trainers.base_trainer:build_other_module:712 - no other module to build 2023-01-25 00:18:06.689 | INFO | trainers.hvae_trainer:init:58 - waitting for barrier, device=cuda:2 2023-01-25 00:18:06.696 | INFO | trainers.hvae_trainer:init:53 - broadcast_params: 
device=cuda:0 2023-01-25 00:18:06.704 | INFO | trainers.base_trainer:build_other_module:712 - no other module to build 2023-01-25 00:18:06.704 | INFO | trainers.hvae_trainer:init:58 - waitting for barrier, device=cuda:3 2023-01-25 00:18:06.705 | INFO | trainers.base_trainer:build_other_module:712 - no other module to build 2023-01-25 00:18:06.705 | INFO | trainers.hvae_trainer:init:58 - waitting for barrier, device=cuda:1 2023-01-25 00:18:06.728 | INFO | trainers.base_trainer:build_other_module:712 - no other module to build 2023-01-25 00:18:06.729 | INFO | trainers.hvae_trainer:init:58 - waitting for barrier, device=cuda:0 2023-01-25 00:18:06.729 | INFO | trainers.hvae_trainer:init:60 - pass barrier, device=cuda:0 2023-01-25 00:18:06.729 | INFO | trainers.hvae_trainer:init:60 - pass barrier, device=cuda:2 2023-01-25 00:18:06.729 | INFO | trainers.hvae_trainer:init:60 - pass barrier, device=cuda:1 2023-01-25 00:18:06.729 | INFO | trainers.hvae_trainer:init:60 - pass barrier, device=cuda:3 2023-01-25 00:18:06.729 | INFO | trainers.base_trainer:build_data:152 - start build_data 2023-01-25 00:18:06.729 | INFO | trainers.base_trainer:build_data:152 - start build_data 2023-01-25 00:18:06.729 | INFO | trainers.base_trainer:build_data:152 - start build_data 2023-01-25 00:18:06.729 | INFO | trainers.base_trainer:build_data:152 - start build_data 2023-01-25 00:18:09.476 | INFO | datasets.pointflow_datasets:get_datasets:333 - get_datasets: tr_sample_size=2048, te_sample_size=2048; random_subsample=1 normalize_global=True normalize_std_per_axix=False normalize_per_shape=False recenter_per_shape=False 2023-01-25 00:18:09.477 | INFO | datasets.pointflow_datasets:init:108 - [DATA] cat: car, split: train, full path: ./data/ShapeNetCore.v2.PC15k/; norm global=True, norm-box=False 2023-01-25 00:18:09.478 | INFO | datasets.pointflow_datasets:get_datasets:333 - get_datasets: tr_sample_size=2048, te_sample_size=2048; random_subsample=1 normalize_global=True 
normalize_std_per_axix=False normalize_per_shape=False recenter_per_shape=False 2023-01-25 00:18:09.478 | INFO | datasets.pointflow_datasets:init:108 - [DATA] cat: car, split: train, full path: ./data/ShapeNetCore.v2.PC15k/; norm global=True, norm-box=False 2023-01-25 00:18:09.487 | INFO | datasets.pointflow_datasets:init:157 - [DATA] number of file [2458] under: ./data/ShapeNetCore.v2.PC15k/02958343/train 2023-01-25 00:18:09.487 | INFO | datasets.pointflow_datasets:init:157 - [DATA] number of file [2458] under: ./data/ShapeNetCore.v2.PC15k/02958343/train 2023-01-25 00:18:09.619 | INFO | datasets.pointflow_datasets:get_datasets:333 - get_datasets: tr_sample_size=2048, te_sample_size=2048; random_subsample=1 normalize_global=True normalize_std_per_axix=False normalize_per_shape=False recenter_per_shape=False 2023-01-25 00:18:09.619 | INFO | datasets.pointflow_datasets:init:108 - [DATA] cat: car, split: train, full path: ./data/ShapeNetCore.v2.PC15k/; norm global=True, norm-box=False 2023-01-25 00:18:09.626 | INFO | datasets.pointflow_datasets:init:157 - [DATA] number of file [2458] under: ./data/ShapeNetCore.v2.PC15k/02958343/train 2023-01-25 00:18:09.781 | INFO | datasets.pointflow_datasets:get_datasets:333 - get_datasets: tr_sample_size=2048, te_sample_size=2048; random_subsample=1 normalize_global=True normalize_std_per_axix=False normalize_per_shape=False recenter_per_shape=False 2023-01-25 00:18:09.781 | INFO | datasets.pointflow_datasets:init:108 - [DATA] cat: car, split: train, full path: ./data/ShapeNetCore.v2.PC15k/; norm global=True, norm-box=False 2023-01-25 00:18:09.787 | INFO | datasets.pointflow_datasets:init:157 - [DATA] number of file [2458] under: ./data/ShapeNetCore.v2.PC15k/02958343/train 2023-01-25 00:18:11.014 | INFO | datasets.pointflow_datasets:init:170 - [DATA] Load data time: 1.5s | dir: ['02958343'] | sample_with_replacement: 1; num points: 2458 2023-01-25 00:18:11.125 | INFO | datasets.pointflow_datasets:init:170 - [DATA] Load data time: 
1.5s | dir: ['02958343'] | sample_with_replacement: 1; num points: 2458 2023-01-25 00:18:11.149 | INFO | datasets.pointflow_datasets:init:170 - [DATA] Load data time: 1.7s | dir: ['02958343'] | sample_with_replacement: 1; num points: 2458 2023-01-25 00:18:11.199 | INFO | datasets.pointflow_datasets:init:170 - [DATA] Load data time: 1.4s | dir: ['02958343'] | sample_with_replacement: 1; num points: 2458 2023-01-25 00:18:12.484 | INFO | datasets.pointflow_datasets:init:234 - [DATA] normalize_global: mean=[0.00131747 0.00735971 0.02350355], std=[0.1634924] 2023-01-25 00:18:12.618 | INFO | datasets.pointflow_datasets:init:234 - [DATA] normalize_global: mean=[0.00131747 0.00735971 0.02350355], std=[0.1634924] 2023-01-25 00:18:12.801 | INFO | datasets.pointflow_datasets:init:234 - [DATA] normalize_global: mean=[0.00131747 0.00735971 0.02350355], std=[0.1634924] 2023-01-25 00:18:12.810 | INFO | datasets.pointflow_datasets:init:234 - [DATA] normalize_global: mean=[0.00131747 0.00735971 0.02350355], std=[0.1634924] 2023-01-25 00:18:13.351 | INFO | datasets.pointflow_datasets:init:241 - [DATA] shape=(2458, 15000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=4.166, min=-4.333; num-pts=2048 2023-01-25 00:18:13.375 | INFO | datasets.pointflow_datasets:init:108 - [DATA] cat: car, split: val, full path: ./data/ShapeNetCore.v2.PC15k/; norm global=True, norm-box=False 2023-01-25 00:18:13.376 | INFO | datasets.pointflow_datasets:init:157 - [DATA] number of file [352] under: ./data/ShapeNetCore.v2.PC15k/02958343/val 2023-01-25 00:18:13.450 | INFO | datasets.pointflow_datasets:init:241 - [DATA] shape=(2458, 15000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=4.166, min=-4.333; num-pts=2048 2023-01-25 00:18:13.479 | INFO | datasets.pointflow_datasets:init:108 - [DATA] cat: car, split: val, full path: ./data/ShapeNetCore.v2.PC15k/; norm global=True, norm-box=False 2023-01-25 00:18:13.480 | INFO | datasets.pointflow_datasets:init:157 - [DATA] number of file [352] under: 
./data/ShapeNetCore.v2.PC15k/02958343/val 2023-01-25 00:18:13.534 | INFO | datasets.pointflow_datasets:init:170 - [DATA] Load data time: 0.2s | dir: ['02958343'] | sample_with_replacement: 1; num points: 352 2023-01-25 00:18:13.615 | INFO | datasets.pointflow_datasets:init:241 - [DATA] shape=(2458, 15000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=4.166, min=-4.333; num-pts=2048 2023-01-25 00:18:13.623 | INFO | datasets.pointflow_datasets:init:241 - [DATA] shape=(2458, 15000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=4.166, min=-4.333; num-pts=2048 2023-01-25 00:18:13.639 | INFO | datasets.pointflow_datasets:init:108 - [DATA] cat: car, split: val, full path: ./data/ShapeNetCore.v2.PC15k/; norm global=True, norm-box=False 2023-01-25 00:18:13.640 | INFO | datasets.pointflow_datasets:init:157 - [DATA] number of file [352] under: ./data/ShapeNetCore.v2.PC15k/02958343/val 2023-01-25 00:18:13.646 | INFO | datasets.pointflow_datasets:init:108 - [DATA] cat: car, split: val, full path: ./data/ShapeNetCore.v2.PC15k/; norm global=True, norm-box=False 2023-01-25 00:18:13.647 | INFO | datasets.pointflow_datasets:init:157 - [DATA] number of file [352] under: ./data/ShapeNetCore.v2.PC15k/02958343/val 2023-01-25 00:18:13.648 | INFO | datasets.pointflow_datasets:init:170 - [DATA] Load data time: 0.2s | dir: ['02958343'] | sample_with_replacement: 1; num points: 352 2023-01-25 00:18:13.676 | INFO | datasets.pointflow_datasets:init:241 - [DATA] shape=(352, 15000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=4.002, min=-4.059; num-pts=2048 2023-01-25 00:18:13.677 | INFO | datasets.pointflow_datasets:get_data_loaders:398 - [Batch Size] train=40, test=10; drop-last=1 2023-01-25 00:18:13.683 | INFO | trainers.hvae_trainer:init:75 - done init trainer @cuda:2 2023-01-25 00:18:13.794 | INFO | datasets.pointflow_datasets:init:241 - [DATA] shape=(352, 15000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=4.002, min=-4.059; num-pts=2048 2023-01-25 00:18:13.795 | INFO 
| datasets.pointflow_datasets:get_data_loaders:398 - [Batch Size] train=40, test=10; drop-last=1 2023-01-25 00:18:13.801 | INFO | trainers.hvae_trainer:init:75 - done init trainer @cuda:0 2023-01-25 00:18:13.842 | INFO | datasets.pointflow_datasets:init:170 - [DATA] Load data time: 0.2s | dir: ['02958343'] | sample_with_replacement: 1; num points: 352 2023-01-25 00:18:13.863 | INFO | datasets.pointflow_datasets:init:170 - [DATA] Load data time: 0.2s | dir: ['02958343'] | sample_with_replacement: 1; num points: 352 2023-01-25 00:18:14.004 | INFO | datasets.pointflow_datasets:init:241 - [DATA] shape=(352, 15000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=4.002, min=-4.059; num-pts=2048 2023-01-25 00:18:14.005 | INFO | datasets.pointflow_datasets:get_data_loaders:398 - [Batch Size] train=40, test=10; drop-last=1 2023-01-25 00:18:14.010 | INFO | trainers.hvae_trainer:init:75 - done init trainer @cuda:3 2023-01-25 00:18:14.040 | INFO | datasets.pointflow_datasets:init:241 - [DATA] shape=(352, 15000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=4.002, min=-4.059; num-pts=2048 2023-01-25 00:18:14.042 | INFO | datasets.pointflow_datasets:get_data_loaders:398 - [Batch Size] train=40, test=10; drop-last=1 2023-01-25 00:18:14.053 | INFO | trainers.hvae_trainer:init:75 - done init trainer @cuda:1 2023-01-25 00:18:14.394 | INFO | trainers.base_trainer:prepare_vis_data:676 - [prepare_vis_data] len of train_loader: 15 2023-01-25 00:18:14.655 | INFO | trainers.base_trainer:prepare_vis_data:676 - [prepare_vis_data] len of train_loader: 15 2023-01-25 00:18:14.924 | INFO | trainers.base_trainer:prepare_vis_data:676 - [prepare_vis_data] len of train_loader: 15 2023-01-25 00:18:14.959 | INFO | trainers.base_trainer:prepare_vis_data:676 - [prepare_vis_data] len of train_loader: 15 2023-01-25 00:18:15.220 | INFO | trainers.base_trainer:prepare_vis_data:691 - tr_x: torch.Size([16, 2048, 3]), m_pcs: torch.Size([16, 1, 3]), s_pcs: torch.Size([16, 1, 1]), val_x: torch.Size([16, 
2048, 3]) 2023-01-25 00:18:15.247 | INFO | main:main:46 - param size = 22.402731M 2023-01-25 00:18:15.249 | INFO | main:main:68 - not find any checkpoint: ./exp/0125/car/3dbf3ah_hvae_lion_B40/checkpoints, (exist=False), or snapshot ./exp/0125/car/3dbf3ah_hvae_lion_B40/checkpoints/snapshot, (exist=False) 2023-01-25 00:18:15.250 | INFO | trainers.base_trainer:train_epochs:173 - [rank=2] Start epoch: 0 End epoch: 800, batch-size=40 | Niter/epo=15 | log freq=15, viz freq 6000, val freq 200 2023-01-25 00:18:15.580 | INFO | trainers.base_trainer:prepare_vis_data:691 - tr_x: torch.Size([16, 2048, 3]), m_pcs: torch.Size([16, 1, 3]), s_pcs: torch.Size([16, 1, 1]), val_x: torch.Size([16, 2048, 3]) 2023-01-25 00:18:15.614 | INFO | main:main:46 - param size = 22.402731M 2023-01-25 00:18:15.615 | INFO | trainers.base_trainer:set_writer:57 -

./exp/0125/car/3dbf3ah_hvae_lion_B40

2023-01-25 00:18:15.622 | INFO | main:main:68 - not find any checkpoint: ./exp/0125/car/3dbf3ah_hvae_lion_B40/checkpoints, (exist=False), or snapshot ./exp/0125/car/3dbf3ah_hvae_lion_B40/checkpoints/snapshot, (exist=False)
2023-01-25 00:18:15.637 | INFO | trainers.base_trainer:train_epochs:173 - [rank=0] Start epoch: 0 End epoch: 800, batch-size=40 | Niter/epo=15 | log freq=15, viz freq 6000, val freq 200
2023-01-25 00:18:15.845 | INFO | trainers.base_trainer:prepare_vis_data:691 - tr_x: torch.Size([16, 2048, 3]), m_pcs: torch.Size([16, 1, 3]), s_pcs: torch.Size([16, 1, 1]), val_x: torch.Size([16, 2048, 3])
2023-01-25 00:18:15.865 | INFO | main:main:46 - param size = 22.402731M
2023-01-25 00:18:15.865 | INFO | main:main:68 - not find any checkpoint: ./exp/0125/car/3dbf3ah_hvae_lion_B40/checkpoints, (exist=False), or snapshot ./exp/0125/car/3dbf3ah_hvae_lion_B40/checkpoints/snapshot, (exist=False)
2023-01-25 00:18:15.866 | INFO | trainers.base_trainer:train_epochs:173 - [rank=1] Start epoch: 0 End epoch: 800, batch-size=40 | Niter/epo=15 | log freq=15, viz freq 6000, val freq 200
2023-01-25 00:18:15.904 | INFO | trainers.base_trainer:prepare_vis_data:691 - tr_x: torch.Size([16, 2048, 3]), m_pcs: torch.Size([16, 1, 3]), s_pcs: torch.Size([16, 1, 1]), val_x: torch.Size([16, 2048, 3])
2023-01-25 00:18:15.947 | INFO | main:main:46 - param size = 22.402731M
2023-01-25 00:18:15.948 | INFO | main:main:68 - not find any checkpoint: ./exp/0125/car/3dbf3ah_hvae_lion_B40/checkpoints, (exist=False), or snapshot ./exp/0125/car/3dbf3ah_hvae_lion_B40/checkpoints/snapshot, (exist=False)
2023-01-25 00:18:15.949 | INFO | trainers.base_trainer:train_epochs:173 - [rank=3] Start epoch: 0 End epoch: 800, batch-size=40 | Niter/epo=15 | log freq=15, viz freq 6000, val freq 200
2023-01-25 00:18:38.808 | INFO | trainers.common_fun:validate_inspect_noprior:104 - writer: none
2023-01-25 00:19:01.551 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E0 iter[ 14/ 15] | [Loss] 1053511558071768433312137216.00 | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 14 | url none | [time] 0.8m (~10h) |[best] 0 -100.000x1e-2
2023-01-25 00:19:26.789 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E1 iter[ 14/ 15] | [Loss] 52998332140144.68 | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 29 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:19:52.065 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E2 iter[ 14/ 15] | [Loss] 2302480512959926.50 | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 44 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:20:17.565 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E3 iter[ 14/ 15] | [Loss] 2568395833090570240.00 | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 59 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:20:43.245 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E4 iter[ 14/ 15] | [Loss] 17809658881111334949003391401984.00 | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 74 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:21:09.074 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E5 iter[ 14/ 15] | [Loss] 51566519.44 | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 89 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:21:34.569 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E6 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 104 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:22:00.025 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E7 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 119 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:22:25.365 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E8 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 134 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:22:50.734 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E9 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 149 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:23:16.079 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E10 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 164 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:23:41.553 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E11 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 179 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:24:07.110 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E12 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 194 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:24:32.557 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E13 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 209 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:24:58.175 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E14 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 224 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:25:23.746 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E15 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 239 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:25:49.360 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E16 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 254 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:26:14.849 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E17 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 269 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:26:40.303 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E18 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 284 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:27:05.658 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E19 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 299 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:27:30.983 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E20 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 314 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:27:56.319 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E21 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 329 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:28:21.645 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E22 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 344 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:28:47.099 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E23 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 359 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:29:12.520 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E24 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 374 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:29:38.024 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E25 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 389 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:30:03.487 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E26 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 404 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:30:28.738 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E27 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 419 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:30:53.995 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E28 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 434 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:31:19.211 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E29 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 449 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:31:44.334 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E30 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 464 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:32:09.531 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E31 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 479 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:32:34.830 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E32 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 494 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:33:00.156 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E33 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 509 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:33:25.532 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E34 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 524 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:33:51.046 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E35 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 539 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:34:16.399 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E36 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 554 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:34:41.735 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E37 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 569 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:35:07.293 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E38 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 584 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:35:32.838 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E39 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 599 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:35:58.269 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E40 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 614 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:36:23.576 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E41 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 629 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:36:48.992 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E42 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 644 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:37:14.421 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E43 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 659 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:37:39.900 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E44 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 674 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:38:05.390 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E45 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 689 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:38:30.775 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E46 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 704 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:38:56.245 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E47 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 719 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:39:21.554 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E48 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 734 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:39:46.962 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E49 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 749 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:40:12.541 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E50 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 764 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:40:37.824 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E51 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 779 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:41:03.322 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E52 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 794 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:41:28.677 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E53 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 809 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:41:54.058 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E54 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 824 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:42:19.352 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E55 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 839 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:42:44.698 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E56 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 854 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:43:10.033 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E57 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 869 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:43:35.285 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E58 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 884 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:44:00.556 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E59 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 899 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:44:26.083 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E60 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 914 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:44:51.436 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E61 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 929 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:45:16.718 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E62 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 944 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:45:42.198 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E63 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 959 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:46:07.561 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E64 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 974 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:46:32.978 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E65 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 989 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:46:58.470 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E66 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1004 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:47:23.890 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E67 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1019 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:47:49.273 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E68 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1034 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:48:14.624 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E69 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1049 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2
2023-01-25 00:48:39.986 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E70 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 
1064 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2 2023-01-25 00:48:40.001 | INFO | trainers.base_trainer:save:106 - save model as : ./exp/0125/car/3dbf3ah_hvae_lion_B40/checkpoints/snapshot_bak 2023-01-25 00:49:06.715 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E71 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1079 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2 2023-01-25 00:49:31.999 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E72 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1094 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2 2023-01-25 00:49:57.406 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E73 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1109 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2 2023-01-25 00:50:22.712 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E74 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1124 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2 2023-01-25 00:50:48.098 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E75 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1139 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2 2023-01-25 00:51:13.579 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E76 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1154 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2 2023-01-25 00:51:39.065 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E77 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1169 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2 2023-01-25 00:52:04.374 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E78 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1184 | url none | [time] 0.4m (~5h) |[best] 0 
-100.000x1e-2 2023-01-25 00:52:29.969 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E79 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1199 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2 2023-01-25 00:52:55.488 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E80 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1214 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2 2023-01-25 00:53:20.759 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E81 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1229 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2 2023-01-25 00:53:46.118 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E82 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1244 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2 2023-01-25 00:54:11.518 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E83 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1259 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2 2023-01-25 00:54:36.914 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E84 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1274 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2 2023-01-25 00:55:02.136 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E85 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1289 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2 2023-01-25 00:55:27.793 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E86 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1304 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2 2023-01-25 00:55:53.190 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E87 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1319 | url 
none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2 2023-01-25 00:56:18.534 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E88 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1334 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2 2023-01-25 00:56:44.018 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E89 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1349 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2 2023-01-25 00:57:09.309 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E90 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1364 | url none | [time] 0.4m (~4h) |[best] 0 -100.000x1e-2 2023-01-25 00:57:34.684 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E91 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1379 | url none | [time] 0.4m (~4h) |[best] 0 -100.000x1e-2 2023-01-25 00:57:59.997 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E92 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1394 | url none | [time] 0.4m (~4h) |[best] 0 -100.000x1e-2 2023-01-25 00:58:25.479 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E93 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1409 | url none | [time] 0.4m (~5h) |[best] 0 -100.000x1e-2 2023-01-25 00:58:50.932 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E94 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1424 | url none | [time] 0.4m (~4h) |[best] 0 -100.000x1e-2 2023-01-25 00:59:16.326 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E95 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1439 | url none | [time] 0.4m (~4h) |[best] 0 -100.000x1e-2 2023-01-25 00:59:41.795 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E96 iter[ 14/ 15] | [Loss] nan | [exp] 
./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1454 | url none | [time] 0.4m (~4h) |[best] 0 -100.000x1e-2 2023-01-25 01:00:07.162 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E97 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1469 | url none | [time] 0.4m (~4h) |[best] 0 -100.000x1e-2 2023-01-25 01:00:32.569 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E98 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1484 | url none | [time] 0.4m (~4h) |[best] 0 -100.000x1e-2 2023-01-25 01:00:58.136 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E99 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1499 | url none | [time] 0.4m (~4h) |[best] 0 -100.000x1e-2 2023-01-25 01:01:23.533 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E100 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1514 | url none | [time] 0.4m (~4h) |[best] 0 -100.000x1e-2 2023-01-25 01:01:48.939 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E101 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1529 | url none | [time] 0.4m (~4h) |[best] 0 -100.000x1e-2 2023-01-25 01:02:14.562 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E102 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1544 | url none | [time] 0.4m (~4h) |[best] 0 -100.000x1e-2 2023-01-25 01:02:39.900 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E103 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1559 | url none | [time] 0.4m (~4h) |[best] 0 -100.000x1e-2 2023-01-25 01:03:05.674 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E104 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1574 | url none | [time] 0.4m (~4h) |[best] 0 -100.000x1e-2 2023-01-25 01:03:31.050 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | 
E105 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1589 | url none | [time] 0.4m (~4h) |[best] 0 -100.000x1e-2 2023-01-25 01:03:56.486 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E106 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1604 | url none | [time] 0.4m (~4h) |[best] 0 -100.000x1e-2 2023-01-25 01:04:21.979 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E107 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1619 | url none | [time] 0.4m (~4h) |[best] 0 -100.000x1e-2 2023-01-25 01:04:47.400 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E108 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1634 | url none | [time] 0.4m (~4h) |[best] 0 -100.000x1e-2 2023-01-25 01:05:12.816 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E109 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1649 | url none | [time] 0.4m (~4h) |[best] 0 -100.000x1e-2 2023-01-25 01:05:38.353 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E110 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1664 | url none | [time] 0.4m (~4h) |[best] 0 -100.000x1e-2 2023-01-25 01:06:03.822 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E111 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1679 | url none | [time] 0.4m (~4h) |[best] 0 -100.000x1e-2 2023-01-25 01:06:29.280 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E112 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1694 | url none | [time] 0.4m (~4h) |[best] 0 -100.000x1e-2 2023-01-25 01:06:54.803 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E113 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1709 | url none | [time] 0.4m (~4h) |[best] 0 -100.000x1e-2 2023-01-25 01:07:20.158 | INFO | 
trainers.base_trainer:train_epochs:256 - [R0] | E114 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1724 | url none | [time] 0.4m (~4h) |[best] 0 -100.000x1e-2 2023-01-25 01:07:45.551 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E115 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1739 | url none | [time] 0.4m (~4h) |[best] 0 -100.000x1e-2 2023-01-25 01:08:11.027 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E116 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1754 | url none | [time] 0.4m (~4h) |[best] 0 -100.000x1e-2 2023-01-25 01:08:36.365 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E117 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1769 | url none | [time] 0.4m (~4h) |[best] 0 -100.000x1e-2 2023-01-25 01:09:01.709 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E118 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1784 | url none | [time] 0.4m (~4h) |[best] 0 -100.000x1e-2 2023-01-25 01:09:27.067 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E119 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1799 | url none | [time] 0.4m (~4h) |[best] 0 -100.000x1e-2 2023-01-25 01:09:52.533 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E120 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1814 | url none | [time] 0.4m (~4h) |[best] 0 -100.000x1e-2 2023-01-25 01:10:18.148 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E121 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1829 | url none | [time] 0.4m (~4h) |[best] 0 -100.000x1e-2 2023-01-25 01:10:43.401 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E122 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1844 | url none | [time] 0.4m (~4h) |[best] 0 
-100.000x1e-2 2023-01-25 01:11:08.755 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E123 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1859 | url none | [time] 0.4m (~4h) |[best] 0 -100.000x1e-2 2023-01-25 01:11:34.165 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E124 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1874 | url none | [time] 0.4m (~4h) |[best] 0 -100.000x1e-2 2023-01-25 01:11:59.572 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E125 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1889 | url none | [time] 0.4m (~4h) |[best] 0 -100.000x1e-2 2023-01-25 01:12:24.968 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E126 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1904 | url none | [time] 0.4m (~4h) |[best] 0 -100.000x1e-2 2023-01-25 01:12:50.169 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E127 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1919 | url none | [time] 0.4m (~4h) |[best] 0 -100.000x1e-2 2023-01-25 01:13:15.662 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E128 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1934 | url none | [time] 0.4m (~4h) |[best] 0 -100.000x1e-2 2023-01-25 01:13:41.159 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E129 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/0125/car/3dbf3ah_hvae_lion_B40 | [step] 1949 | url none | [time] 0.4m (~4h) |[best] 0 -100.000x1e-2
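
A log like the one above can be scanned for the first entry whose loss is no longer finite, which narrows down where training diverged. The following is a minimal sketch that assumes the entry layout seen in this log (`E<epoch> iter[...] | [Loss] <value> | ... | [step] <n>`); the helper name `first_nonfinite` is hypothetical and not part of the LION code base.

```python
import math
import re

# Matches one progress entry, e.g.
# "E36 iter[ 14/ 15] | [Loss] nan | [exp] ./exp/... | [step] 554"
# DOTALL lets an entry span a wrapped line in a pasted log.
ENTRY = re.compile(
    r"E(?P<epoch>\d+)\s+iter\[\s*\d+/\s*\d+\]\s*\|\s*\[Loss\]\s*(?P<loss>\S+)"
    r"\s*\|.*?\[step\]\s*(?P<step>\d+)",
    re.DOTALL,
)


def first_nonfinite(log_text):
    """Return (epoch, step) of the first entry with a NaN/inf loss, or None."""
    for match in ENTRY.finditer(log_text):
        try:
            loss = float(match.group("loss"))  # float() accepts "nan" and "inf"
        except ValueError:
            continue  # skip entries whose loss field is not a number
        if not math.isfinite(loss):
            return int(match.group("epoch")), int(match.group("step"))
    return None
```

Calling `first_nonfinite(open(path).read())` on the full log file (with `path` pointing at wherever the log is saved) returns the epoch and step of the first bad entry, or `None` if every logged loss is finite.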

Zhiyuan-R commented 1 year ago

And below is my config

bash_name: ''
clipforge:
  clip_model: ViT-B/32
  enable: 0
  feat_dim: 512
cmt: lion
comet_key: ''
data:
  batch_size: 40
  batch_size_test: 10
  cates: car
  clip_forge_enable: 0
  clip_model: ViT-B/32
  cond_on_cat: 0
  cond_on_voxel: 0
  data_dir: data/ShapeNetCore.v2.PC15k
  dataset_scale: 1
  dataset_type: shapenet15k
  eval_test_split: 0
  input_dim: -1
  is_encode_whole_dataset_trainer: 0
  nclass: 55
  noise_std: 0.1
  noise_std_min: -1.0
  noise_type: normal
  normalize_global: true
  normalize_per_shape: false
  normalize_range: false
  normalize_shape_box: false
  normalize_std_per_axis: false
  num_workers: 4
  random_subsample: 1
  recenter_per_shape: false
  sample_with_replacement: 1
  te_max_sample_points: 2048
  tr_max_sample_points: 2048
  train_drop_last: 1
  type: datasets.pointflow_datasets
  voxel_size: 0.1
ddpm:
  add_point_feat: true
  attn:

ZENGXH commented 1 year ago

Hi, I tried VAE training with batch size 40 on 4 GPUs and also got a similar NaN issue. However, the same training code works with batch size 32. The reason is not clear to me; it seems the training somehow does not work with a batch size of 40 or larger. While I look into this, perhaps you can try batch size 32 for now? Sorry about that!
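
Independently of the batch size, it can help to stop a run as soon as the loss turns non-finite instead of letting it continue for many epochs. Below is a minimal sketch of such a guard; `check_loss` and `NonFiniteLossError` are hypothetical names for illustration and are not part of the LION trainers.

```python
import math


class NonFiniteLossError(RuntimeError):
    """Raised when the training loss becomes NaN or infinite."""


def check_loss(loss_value, step):
    """Return loss_value unchanged, or raise if it is NaN/inf.

    loss_value is expected to be a plain Python float; with PyTorch this
    would typically be the result of loss.item().
    """
    if not math.isfinite(loss_value):
        raise NonFiniteLossError(
            f"loss is {loss_value} at step {step}; stopping the run"
        )
    return loss_value
```

In a PyTorch training loop this could be called as `check_loss(loss.item(), step)` right after the forward pass. For tracing the operation that produced the NaN, `torch.autograd.set_detect_anomaly(True)` is another option, at the cost of slower training.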

Zhiyuan-R commented 1 year ago

Thanks for your hard work! I cannot believe you ran it yourself! That is so nice of you! Have a good night!