nv-tlabs / GET3D


Trouble training in Windows #91

Closed Bathsheba closed 1 year ago

Bathsheba commented 1 year ago

Update: Talking to the duck worked, so I'm closing this. Thanks, duck.

I'm running Ubuntu under Windows 11, with Docker. I had no problem accessing CUDA; Docker handled it automatically. Inference works well and I can use all the functions: generate, interpolate, create textured meshes. My graphics card is a GTX 1070 with only 8 GB, so it's pretty neat that this works!

The training command and output are below. I realize that I probably don't have enough memory, but this crash doesn't look like a memory issue. I will look more closely at the code, but any hint would be greatly appreciated.

Thanks, Bathsheba

Edit: This may be a torch version issue; cf. https://github.com/NVlabs/stylegan3/issues/188. In grid_sample_gradfix.py:62, making this change:

-op = torch._C._jit_get_operation('aten::grid_sampler_2d_backward')
+op, _ = torch._C._jit_get_operation('aten::grid_sampler_2d_backward')

changes the error to:

RuntimeError: aten::grid_sampler_2d_backward() is missing value for argument 'output_mask'. Declaration: aten::grid_sampler_2d_backward(Tensor grad_output, Tensor input, Tensor grid, int interpolation_mode, int padding_mode, bool align_corners, bool[2] output_mask) -> (Tensor, Tensor)

Edit: That was it. I consulted https://github.com/pytorch/pytorch/issues/75018 and some threads about how aten::convolution_backward replaces aten::cudnn_convolution_backward_weight. A few small edits fixed it; details on request.
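For anyone hitting the same two errors, here is a minimal sketch of the kind of edit the messages above point to. It is not necessarily the exact change made in this thread. The helper names `unwrap_jit_op` and `grid_sampler_backward_args` are hypothetical, and the sketch assumes only what the errors show: the lookup may return a tuple whose first element is the op, and the op may require a trailing two-element `output_mask`.

```python
def unwrap_jit_op(result):
    """Return the callable op from a torch._C._jit_get_operation(...) result.

    Assumption: some torch versions return the op directly, others return a
    tuple whose first element is the op (hence "'tuple' object is not callable").
    """
    if isinstance(result, tuple):
        return result[0]
    return result


def grid_sampler_backward_args(grad_output, input, grid,
                               needs_input_grad=(True, True),
                               with_output_mask=True):
    """Build the positional arguments for aten::grid_sampler_2d_backward.

    The first six values mirror the original call in grid_sample_gradfix.py:
    (grad_output, input, grid, 0, 0, False). When with_output_mask is True, a
    bool[2] mask is appended, matching the declaration in the RuntimeError.
    """
    args = [grad_output, input, grid, 0, 0, False]
    if with_output_mask:
        # One flag per returned gradient: (grad_input, grad_grid).
        args.append([bool(flag) for flag in needs_input_grad])
    return tuple(args)
```

In grid_sample_gradfix.py this would be used roughly as `op = unwrap_jit_op(torch._C._jit_get_operation('aten::grid_sampler_2d_backward'))` followed by `grad_input, grad_grid = op(*grid_sampler_backward_args(grad_output, input, grid))`. Whether the mask is needed depends on the installed torch version, so check the declaration printed in the error before applying it.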

/workspace/vol/GET3D# python train_3d.py --outdir=/train_logs --data=../renders_motorbike_512/img/03790512 --camera_path ../renders_motorbike_512/camera --gpus=1 --batch=4 --gamma=80 --data_camera_mode shapenet_motorbike --dmtet_scale 1.0 --use_shapenet_split 0 --fp32 0 --latent_dim 128 --img_res 256 --feat_channel 4 --tri_plane_resolution 64 --cbase 8192 --cmax 256
==> start
==> use shapenet dataset
==> use ts shapenet motorbike split all
==> use shapenet folder number 164
==> use image path: ../renders_motorbike_512/img/03790512, num images: 3936
==> launch training

Training options: { "G_kwargs": { "class_name": "training.networks_get3d.GeneratorDMTETMesh", "z_dim": 128, "w_dim": 128, "mapping_kwargs": { "num_layers": 8 }, "one_3d_generator": true, "n_implicit_layer": 1, "deformation_multiplier": 1.0, "use_style_mixing": true, "dmtet_scale": 1.0, "feat_channel": 4, "mlp_latent_channel": 32, "tri_plane_resolution": 64, "n_views": 1, "render_type": "neural_render", "use_tri_plane": true, "tet_res": 90, "geometry_type": "conv3d", "data_camera_mode": "shapenet_motorbike", "channel_base": 8192, "channel_max": 256, "fused_modconv_default": "inference_only" }, "D_kwargs": { "class_name": "training.networks_get3d.Discriminator", "block_kwargs": { "freeze_layers": 0 }, "mapping_kwargs": {}, "epilogue_kwargs": { "mbstd_group_size": 4 }, "data_camera_mode": "shapenet_motorbike", "add_camera_cond": true, "channel_base": 8192, "channel_max": 256, "architecture": "skip" }, "G_opt_kwargs": { "class_name": "torch.optim.Adam", "betas": [ 0, 0.99 ], "eps": 1e-08, "lr": 0.002 }, "D_opt_kwargs": { "class_name": "torch.optim.Adam", "betas": [ 0, 0.99 ], "eps": 1e-08, "lr": 0.002 }, "loss_kwargs": { "class_name": "training.loss.StyleGAN2Loss", "gamma_mask": 80.0, "r1_gamma": 80.0, "style_mixing_prob": 0.9, "pl_weight": 0.0 }, "data_loader_kwargs": { "pin_memory": true, "prefetch_factor": 2, "num_workers": 3 }, "inference_vis": false, "training_set_kwargs": { "class_name": "training.dataset.ImageFolderDataset", "path": "../renders_motorbike_512/img/03790512", "use_labels": false, "max_size": 3936, "xflip": false, "resolution": 256, "data_camera_mode": "shapenet_motorbike", "add_camera_cond": true, "camera_path": "../renders_motorbike_512/camera", "split": "all", "random_seed": 0 }, "resume_pretrain": null, "D_reg_interval": 16, "num_gpus": 1, "batch_size": 4, "batch_gpu": 4, "metrics": [ "fid50k" ], "total_kimg": 20000, "kimg_per_tick": 1, "image_snapshot_ticks": 50, "network_snapshot_ticks": 200, "random_seed": 0, "ema_kimg": 1.25, 
"G_reg_interval": 4, "run_dir": "/train_logs/00042-stylegan2-03790512-gpus1-batch4-gamma80" }

Output directory: /train_logs/00042-stylegan2-03790512-gpus1-batch4-gamma80
Number of GPUs: 1
Batch size: 4 images
Training duration: 20000 kimg
Dataset path: ../renders_motorbike_512/img/03790512
Dataset size: 3936 images
Dataset resolution: 256
Dataset labels: False
Dataset x-flips: False

Creating output directory...
Launching processes...
Setting up PyTorch plugin "upfirdn2d_plugin"... Done.
Setting up PyTorch plugin "bias_act_plugin"... Done.
Setting up PyTorch plugin "filtered_lrelu_plugin"... Done.
Loading training set...
==> use shapenet dataset
==> use ts shapenet motorbike split all
==> use shapenet folder number 164
==> use image path: ../renders_motorbike_512/img/03790512, num images: 3936

Num images: 3936
Image shape: [3, 256, 256]
Label shape: [0]

Constructing networks...
Setting up augmentation...
Distributing across 1 GPUs...
Setting up training phases...
Exporting sample images...
Initializing logs...
Training for 20000 kimg...

Traceback (most recent call last):
  File "train_3d.py", line 331, in <module>
    main()  # pylint: disable=no-value-for-parameter
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "train_3d.py", line 325, in main
    launch_training(c=c, desc=desc, outdir=opts.outdir, dry_run=opts.dry_run)
  File "train_3d.py", line 103, in launch_training
    subprocess_fn(rank=0, c=c, temp_dir=temp_dir)
  File "train_3d.py", line 49, in subprocess_fn
    training_loop_3d.training_loop(rank=rank, **c)
  File "/workspace/vol/GET3D/training/training_loop_3d.py", line 288, in training_loop
    loss.accumulate_gradients(
  File "/workspace/vol/GET3D/training/loss.py", line 141, in accumulate_gradients
    loss_Gmain.mean().mul(gain).backward()
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/function.py", line 253, in apply
    return user_fn(self, *args)
  File "/workspace/vol/GET3D/torch_utils/ops/grid_sample_gradfix.py", line 53, in backward
    grad_input, grad_grid = _GridSample2dBackward.apply(grad_output, input, grid)
  File "/workspace/vol/GET3D/torch_utils/ops/grid_sample_gradfix.py", line 63, in forward
    grad_input, grad_grid = op(grad_output, input, grid, 0, 0, False)
TypeError: 'tuple' object is not callable

iszihan commented 1 year ago

Did you solve this problem?

Bathsheba commented 1 year ago

I did. Per the edits above, it was a torch version compatibility problem. Following the threads linked there, I was able to fix it with a few small edits.