Closed by leeooo001 2 years ago
This looks more like a pytorch / pytorch_lightning related error. Note how the error is raised in "torch\nn\parallel\distributed.py". Have you made sure that your pytorch installation is working properly? Also, which versions of pytorch and pytorch-lightning are you using?
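A quick way to report the installed versions, as a minimal sketch: it only reads package metadata from the active environment, and the helper name `installed_version` is made up for illustration.

```python
from importlib import metadata


def installed_version(package: str) -> str:
    """Return the installed version of a package, or a marker if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"


# Distribution names as used by pip.
for name in ("torch", "pytorch-lightning"):
    print(f"{name}: {installed_version(name)}")
```

Run it inside the same conda or virtual environment that is used for optimize_nha.py, otherwise it reports a different installation.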
Did you manage to solve the issue? Feel free to reopen if it persists.
I'm having the same error on Windows. I'm trying to find a solution.
| Name | Type | Params
-------------------------------------------------------
0 | _flame | FlameHead | 0
1 | _offset_mlp | OffsetMLP | 616 K
2 | _normal_encoder | SirenNormalEncoder | 542 K
3 | _texture | TextureMLP | 1.8 M
4 | _explFeatures | MultiTexture | 4.5 M
5 | _leaky_hinge | LeakyHingeLoss | 0
6 | _masked_L1 | MaskedCriterion | 0
-------------------------------------------------------
7.9 M Trainable params
0 Non-trainable params
7.9 M Total params
31.494 Total estimated model params size (MB)
Epoch 0: 0%| | 0/5 [00:38<?, ?it/s]
Traceback (most recent call last):
File "D:\MOCAP\neural-head-avatars\python_scripts\optimize_nha.py", line 11, in <module>
train_pl_module(NHAOptimizer, RealDataModule)
File "d:\mocap\neural-head-avatars\nha\optimization\train_pl_module.py", line 88, in train_pl_module
trainer.fit(model,
File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 498, in fit
self.dispatch()
File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 545, in dispatch
self.accelerator.start_training(self)
File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 73, in start_training
self.training_type_plugin.start_training(trainer)
File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\pytorch_lightning\plugins\training_type\training_type_plugin.py", line 114, in start_training
self._results = trainer.run_train()
File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 636, in run_train
self.train_loop.run_training_epoch()
File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 493, in run_training_epoch
batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 658, in run_training_batch
self._curr_step_result = self.training_step(
File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 293, in training_step
training_step_output = self.trainer.accelerator.training_step(args)
File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 156, in training_step
return self.training_type_plugin.training_step(*args)
File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\pytorch_lightning\plugins\training_type\ddp.py", line 294, in training_step
return self.model(*args, **kwargs)
File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\torch\nn\parallel\distributed.py", line 878, in forward
self._sync_params()
File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\torch\nn\parallel\distributed.py", line 1379, in _sync_params
self._distributed_broadcast_coalesced(
File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\torch\nn\parallel\distributed.py", line 1334, in _distributed_broadcast_coalesced
dist._broadcast_coalesced(
RuntimeError: Invalid scalar type
(neural) D:\MOCAP\neural-head-avatars>
@leeooo001 I ended up making it work by changing a couple of things. In optimize_nha.py I added two lines to force the script not to use NCCL, and in optimize_avatar.ini I removed ddp from the distributed_backend and accelerator settings.
At least for me, the issue on Windows seemed to be caused by the distributed process. It took me 2 days to solve this, and it was all inside the configuration :D
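The two lines added to optimize_nha.py did not survive in this thread. As an assumption, not a confirmed copy of that change, one way to steer PyTorch Lightning 1.x away from NCCL is to set the `PL_TORCH_DISTRIBUTED_BACKEND` environment variable to `gloo` before the trainer is created:

```python
import os

# Assumption: PyTorch Lightning 1.x reads this variable when it selects the
# torch.distributed backend. NCCL is generally not shipped in Windows builds
# of PyTorch, while gloo is.
os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "gloo"

print(os.environ["PL_TORCH_DISTRIBUTED_BACKEND"])
```

For this to have any effect, the lines would have to run before `train_pl_module(...)` is called in optimize_nha.py.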
Dear philgras: can you help me?
Z:\python\philgras_neural_head_avatars\neural_head_avatars2>Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\python python_scripts/optimize_nha.py --config configs/optimize_avatar.ini
Start Model training with the following configuration:
Command Line Args: --config configs/optimize_avatar.ini
Config File (configs/optimize_avatar.ini):
image_log_period: 20
num_sanity_val_steps:0
gpus: 1
distributed_backend:ddp
accelerator: ddp
default_root_dir: demo/optimized_avatars
data_path: demo/input_video
split_config: configs/split.json
tracking_results_path:demo/input_video2/tracking_0/tracked_flame_params.npz
data_worker: 8
load_lmk: true
load_seg: true
load_camera: true
load_flame: true
load_normal: true
load_parsing: true
train_batch_size: [4, 2, 2]
validation_batch_size:[2, 2, 2]
epochs_offset: 150
epochs_texture: 50
epochs_joint: 50
flame_lr: [0.001, 0.01, 0.0002]
offset_lr: [1e-05, 1e-05, 2e-06]
tex_lr: [0.0001, 5e-05, 2e-05]
spatial_blur_sigma:0.01
offset_hidden_layers:6
offset_hidden_feats:128
texture_hidden_feats:256
texture_hidden_layers:8
d_normal_encoding: 32
d_normal_encoding_hidden:128
n_normal_encoding_hidden:2
subdivide_mesh: 1
flame_noise: .1
soft_clip_sigma: 0.1
body_part_weights: configs/body_part_weights.json
w_rgb: [0, 1, 0.05]
w_perc: [0, 10, 0.5]
w_norm: [0.02, 0.02, 0.02]
w_edge: [10.0, 10.0, 10.0]
w_eye_closed: [100000.0, 100000.0, 100000.0]
w_semantic_ear: [0.1, 0.1, 0.1]
w_semantic_eye: [0.1, 0.1, 0.1]
w_semantic_hair: [[0.1, 50], [0.01, 100]]
w_silh: [[0.01, 50], [0.1, 100]]
w_lap: [[0.05, 50], [0.05, 100]]
w_surface_reg: [0.0001, 0.0001, 0.0001]
w_lmk: [0.01, 0.1, 0]
w_shape_reg: [0.001, 0.001, 0.001]
w_expr_reg: [0.001, 0.001, 0.001]
w_pose_reg: [0.001, 0.001, 0.001]
texture_weight_decay:[0.0001, 0.0001, 5e-06]
Defaults:
--texture_d_hidden_dynamic:128
--texture_n_hidden_dynamic:1
--glob_rot_noise: 5.0
--semantics_blur: 3
--w_semantic_mouth:[0.1, 0.1, 0.1]
--logger: True
--checkpoint_callback:True
--gradient_clip_val:0
--process_position:0
--num_nodes: 1
--num_processes: 1
--auto_select_gpus:False
--tpu_cores: <function _gpus_arg_default at 0x00000254E4989280>
--overfit_batches: 0.0
--track_grad_norm: -1
--check_val_every_n_epoch:1
--fast_dev_run: False
--accumulate_grad_batches:1
--limit_train_batches:1.0
--limit_val_batches:1.0
--limit_test_batches:1.0
--limit_predict_batches:1.0
--val_check_interval:1.0
--flush_logs_every_n_steps:100
--log_every_n_steps:50
--sync_batchnorm: False
--precision: 32
--weights_summary: top
--benchmark: False
--deterministic: False
--reload_dataloaders_every_epoch:False
--auto_lr_find: False
--replace_sampler_ddp:True
--terminate_on_nan:False
--auto_scale_batch_size:False
--prepare_data_per_node:True
--amp_backend: native
--amp_level: O2
--move_metrics_to_cpu:False
--multiple_trainloader_mode:max_size_cycle
--stochastic_weight_avg:False
--checkpoint_file:
[05/20 04:39:43 nha.data.real]: Collected real training dataset containing: 201 samples.
[05/20 04:39:43 nha.data.real]: Collected real validation dataset containing: 100 samples.
Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\utilities\distributed.py:51: UserWarning: ModelCheckpoint(save_last=True, monitor=None) is a redundant configuration. You can save the last checkpoint with ModelCheckpoint(save_top_k=None, monitor=None).
warnings.warn(*args, **kwargs)
Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch3d-0.6.1-py3.9-win-amd64.egg\pytorch3d\structures\meshes.py:1108: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
self._edges_packed = torch.stack([u // V, u % V], dim=1)
[05/20 04:40:10 nha.optimization.train_pl_module]: Running the offset-optimization stage.
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
| Name | Type | Params
-------------------------------------------------------
0 | _flame | FlameHead | 0
1 | _offset_mlp | OffsetMLP | 616 K
2 | _normal_encoder | SirenNormalEncoder | 542 K
3 | _texture | TextureMLP | 1.8 M
4 | _explFeatures | MultiTexture | 4.5 M
5 | _leaky_hinge | LeakyHingeLoss | 0
6 | _masked_L1 | MaskedCriterion | 0
-------------------------------------------------------
7.9 M Trainable params
0 Non-trainable params
7.9 M Total params
31.623 Total estimated model params size (MB)
Epoch 0: 0%| | 0/101 [00:44<?, ?it/s]
Traceback (most recent call last):
File "Z:\python\philgras_neural_head_avatars\neural_head_avatars2\python_scripts\optimize_nha.py", line 12, in <module>
train_pl_module(NHAOptimizer, RealDataModule)
File "Z:\python/philgras_neural_head_avatars/neural_head_avatars2\nha\optimization\train_pl_module.py", line 89, in train_pl_module
trainer.fit(model,
File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 498, in fit
self.dispatch()
File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 545, in dispatch
self.accelerator.start_training(self)
File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 73, in start_training
self.training_type_plugin.start_training(trainer)
File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\plugins\training_type\training_type_plugin.py", line 114, in start_training
self._results = trainer.run_train()
File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 636, in run_train
self.train_loop.run_training_epoch()
File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 493, in run_training_epoch
batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 658, in run_training_batch
self._curr_step_result = self.training_step(
File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 293, in training_step
training_step_output = self.trainer.accelerator.training_step(args)
File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 156, in training_step
return self.training_type_plugin.training_step(*args)
File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\plugins\training_type\ddp.py", line 294, in training_step
return self.model(*args, **kwargs)
File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\torch\nn\parallel\distributed.py", line 878, in forward
self._sync_params()
File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\torch\nn\parallel\distributed.py", line 1379, in _sync_params
self._distributed_broadcast_coalesced(
File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\torch\nn\parallel\distributed.py", line 1334, in _distributed_broadcast_coalesced
dist._broadcast_coalesced(
RuntimeError: Invalid scalar type