nv-tlabs / GET3D

Training freezes after tick 0 #158

Closed mahmud30tibn closed 3 months ago

mahmud30tibn commented 5 months ago

I ran the training command on 10 chair objects from the ShapeNet dataset, without any split.

python train_3d.py --outdir='log.txt' --data='./Chair_image/img/03001627/' --camera_path ./Chair_image/camera/ --gpus=1 --batch=4 --gamma=400 --data_camera_mode shapenet_chair --dmtet_scale 0.8 --use_shapenet_split 1 --one_3d_generator 1 --fp32 0 --use_shapenet_split 0

Training hangs after running only tick 0. The log says:

==> start
==> use shapenet dataset
==> use shapenet folder number 9
==> use image path: ./Chair_image/img/03001627/, num images: 216
==> launch training

Training options: { "G_kwargs": { "class_name": "training.networks_get3d.GeneratorDMTETMesh", "z_dim": 512, "w_dim": 512, "mapping_kwargs": { "num_layers": 8 }, "iso_surface": "dmtet", "one_3d_generator": true, "n_implicit_layer": 1, "deformation_multiplier": 1.0, "use_style_mixing": true, "dmtet_scale": 0.8, "feat_channel": 16, "mlp_latent_channel": 32, "tri_plane_resolution": 256, "n_views": 1, "render_type": "neural_render", "use_tri_plane": true, "tet_res": 90, "geometry_type": "conv3d", "data_camera_mode": "shapenet_chair", "channel_base": 32768, "channel_max": 512, "fused_modconv_default": "inference_only" }, "D_kwargs": { "class_name": "training.networks_get3d.Discriminator", "block_kwargs": { "freeze_layers": 0 }, "mapping_kwargs": {}, "epilogue_kwargs": { "mbstd_group_size": 4 }, "data_camera_mode": "shapenet_chair", "add_camera_cond": true, "channel_base": 32768, "channel_max": 512, "architecture": "skip" }, "G_opt_kwargs": { "class_name": "torch.optim.Adam", "betas": [ 0, 0.99 ], "eps": 1e-08, "lr": 0.002 }, "D_opt_kwargs": { "class_name": "torch.optim.Adam", "betas": [ 0, 0.99 ], "eps": 1e-08, "lr": 0.002 }, "loss_kwargs": { "class_name": "training.loss.StyleGAN2Loss", "gamma_mask": 400.0, "r1_gamma": 400.0, "lambda_flexicubes_surface_reg": 0.5, "lambda_flexicubes_weights_reg": 0.1, "style_mixing_prob": 0.9, "pl_weight": 0.0 }, "data_loader_kwargs": { "pin_memory": true, "prefetch_factor": 2, "num_workers": 3 }, "inference_vis": false, "training_set_kwargs": { "class_name": "training.dataset.ImageFolderDataset", "path": "./Chair_image/img/03001627/", "use_labels": false, "max_size": 216, "xflip": false, "resolution": 1024, "data_camera_mode": "shapenet_chair", "add_camera_cond": true, "camera_path": "./Chair_image/camera/", "split": "all", "random_seed": 0 }, "resume_pretrain": null, "D_reg_interval": 16, "num_gpus": 1, "batch_size": 4, "batch_gpu": 4, "metrics": [ "fid50k" ], "total_kimg": 20000, "kimg_per_tick": 1, "image_snapshot_ticks": 50, 
"network_snapshot_ticks": 200, "random_seed": 0, "ema_kimg": 1.25, "G_reg_interval": 4, "run_dir": "log.txt/00024-stylegan2--gpus1-batch4-gamma400" }

Output directory: log.txt/00024-stylegan2--gpus1-batch4-gamma400
Number of GPUs: 1
Batch size: 4 images
Training duration: 20000 kimg
Dataset path: ./Chair_image/img/03001627/
Dataset size: 216 images
Dataset resolution: 1024
Dataset labels: False
Dataset x-flips: False

Creating output directory...
Launching processes...
Setting up PyTorch plugin "upfirdn2d_plugin"... Done.
Setting up PyTorch plugin "bias_act_plugin"... Done.
Setting up PyTorch plugin "filtered_lrelu_plugin"... Done.
Loading training set...
==> use shapenet dataset
==> use shapenet folder number 9
==> use image path: ./Chair_image/img/03001627/, num images: 216

Num images: 216
Image shape: [3, 1024, 1024]
Label shape: [0]

Constructing networks...
Setting up augmentation...
Distributing across 1 GPUs...
Setting up training phases...
Exporting sample images...
Initializing logs...
Skipping tfevents export: No module named 'tensorboard'
Training for 20000 kimg...

tick 0 kimg 0.0 time 28s sec/tick 13.6 sec/kimg 3399.39 maintenance 14.9
==> start visualization
/home/tibnmahm/3d/GET3D-master/training/networks_get3d.py:467: UserWarning: torch.range is deprecated and will be removed in a future release because its behavior is inconsistent with Python's range builtin. Instead, use torch.arange, which produces values in [start, end).
  camera_theta = torch.range(0, n_camera - 1, device=self.device).unsqueeze(dim=-1) / n_camera * math.pi * 2.0
==> saved visualization
Evaluating metrics...
==> use shapenet dataset
==> use shapenet folder number 9
==> use image path: ./Chair_image/img/03001627/, num images: 216
==> preparing the cache for fid scores
{'class_name': 'training.dataset.ImageFolderDataset', 'path': './Chair_image/img/03001627/', 'use_labels': False, 'max_size': None, 'xflip': False, 'resolution': 1024, 'data_camera_mode': 'shapenet_chair', 'add_camera_cond': True, 'camera_path': './Chair_image/camera/', 'split': 'all', 'random_seed': 0}
0%| | 0/4 [00:00<?, ?it/s]/home/tibnmahm/anaconda3/envs/get3d_2/lib/python3.8/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
100%|##########| 4/4 [00:13<00:00, 3.41s/it]

Any suggestions on how to resolve this?

mahmud30tibn commented 3 months ago

It turns out the first epoch simply takes a long time on a single GPU (about 1 hour); with 4 GPUs it took about 10 minutes. Closing the issue, as it was not an error.
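
For anyone who hits the same apparent freeze, here is a rough back-of-the-envelope sketch of what the numbers in the log imply. It assumes a tick ends once `kimg_per_tick` × 1000 images have been processed, as in StyleGAN-style training loops, and that the whole hour went into training iterations. The thread does not confirm either assumption, and the log also shows a `fid50k` evaluation starting at tick 0, which may account for part of the wait. The helper names are hypothetical and not part of GET3D.

```python
def iterations_per_tick(kimg_per_tick: float, batch_size: int) -> int:
    """Iterations needed to process kimg_per_tick thousand images (assumed tick rule)."""
    images_per_tick = int(kimg_per_tick * 1000)
    return -(-images_per_tick // batch_size)  # ceiling division


def seconds_per_iteration(tick_seconds: float, kimg_per_tick: float, batch_size: int) -> float:
    """Average time per iteration if the whole tick were spent on training steps."""
    return tick_seconds / iterations_per_tick(kimg_per_tick, batch_size)


# Values from the posted log: "kimg_per_tick": 1 and "batch_size": 4.
print(iterations_per_tick(kimg_per_tick=1, batch_size=4))  # 250

# Roughly 1 hour on a single GPU, as reported in the closing comment.
print(seconds_per_iteration(tick_seconds=3600, kimg_per_tick=1, batch_size=4))  # 14.4
```

So a silent gap of an hour between tick 0 and tick 1 is consistent with slow but steady progress rather than a deadlock. Checking GPU utilization during the gap is one way to tell the two apart.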