optimize_avatar error - Githubissues

leeooo001 commented 2 years ago

Dear philgras, can you help me?

Z:\python\philgras_neural_head_avatars\neural_head_avatars2>Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\python python_scripts/optimize_nha.py --config configs/optimize_avatar.ini
Start Model training with the following configuration:
Command Line Args:   --config configs/optimize_avatar.ini
Config File (configs/optimize_avatar.ini):
  image_log_period: 20
  num_sanity_val_steps:0
  gpus: 1
  distributed_backend:ddp
  accelerator: ddp
  default_root_dir: demo/optimized_avatars
  data_path: demo/input_video
  split_config: configs/split.json
  tracking_results_path:demo/input_video2/tracking_0/tracked_flame_params.npz
  data_worker: 8
  load_lmk: true
  load_seg: true
  load_camera: true
  load_flame: true
  load_normal: true
  load_parsing: true
  train_batch_size: [4, 2, 2]
  validation_batch_size:[2, 2, 2]
  epochs_offset: 150
  epochs_texture: 50
  epochs_joint: 50
  flame_lr: [0.001, 0.01, 0.0002]
  offset_lr: [1e-05, 1e-05, 2e-06]
  tex_lr: [0.0001, 5e-05, 2e-05]
  spatial_blur_sigma:0.01
  offset_hidden_layers:6
  offset_hidden_feats:128
  texture_hidden_feats:256
  texture_hidden_layers:8
  d_normal_encoding: 32
  d_normal_encoding_hidden:128
  n_normal_encoding_hidden:2
  subdivide_mesh: 1
  flame_noise: .1
  soft_clip_sigma: 0.1
  body_part_weights: configs/body_part_weights.json
  w_rgb: [0, 1, 0.05]
  w_perc: [0, 10, 0.5]
  w_norm: [0.02, 0.02, 0.02]
  w_edge: [10.0, 10.0, 10.0]
  w_eye_closed: [100000.0, 100000.0, 100000.0]
  w_semantic_ear: [0.1, 0.1, 0.1]
  w_semantic_eye: [0.1, 0.1, 0.1]
  w_semantic_hair: [[0.1, 50], [0.01, 100]]
  w_silh: [[0.01, 50], [0.1, 100]]
  w_lap: [[0.05, 50], [0.05, 100]]
  w_surface_reg: [0.0001, 0.0001, 0.0001]
  w_lmk: [0.01, 0.1, 0]
  w_shape_reg: [0.001, 0.001, 0.001]
  w_expr_reg: [0.001, 0.001, 0.001]
  w_pose_reg: [0.001, 0.001, 0.001]
  texture_weight_decay:[0.0001, 0.0001, 5e-06]
Defaults:
  --texture_d_hidden_dynamic:128
  --texture_n_hidden_dynamic:1
  --glob_rot_noise: 5.0
  --semantics_blur: 3
  --w_semantic_mouth:[0.1, 0.1, 0.1]
  --logger: True
  --checkpoint_callback:True
  --gradient_clip_val:0
  --process_position:0
  --num_nodes: 1
  --num_processes: 1
  --auto_select_gpus:False
  --tpu_cores: <function _gpus_arg_default at 0x00000254E4989280>
  --overfit_batches: 0.0
  --track_grad_norm: -1
  --check_val_every_n_epoch:1
  --fast_dev_run: False
  --accumulate_grad_batches:1
  --limit_train_batches:1.0
  --limit_val_batches:1.0
  --limit_test_batches:1.0
  --limit_predict_batches:1.0
  --val_check_interval:1.0
  --flush_logs_every_n_steps:100
  --log_every_n_steps:50
  --sync_batchnorm: False
  --precision: 32
  --weights_summary: top
  --benchmark: False
  --deterministic: False
  --reload_dataloaders_every_epoch:False
  --auto_lr_find: False
  --replace_sampler_ddp:True
  --terminate_on_nan:False
  --auto_scale_batch_size:False
  --prepare_data_per_node:True
  --amp_backend: native
  --amp_level: O2
  --move_metrics_to_cpu:False
  --multiple_trainloader_mode:max_size_cycle
  --stochastic_weight_avg:False
  --checkpoint_file:

[05/20 04:39:43 nha.data.real]: Collected real training dataset containing: 201 samples.
[05/20 04:39:43 nha.data.real]: Collected real validation dataset containing: 100 samples.
Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\utilities\distributed.py:51: UserWarning: ModelCheckpoint(save_last=True, monitor=None) is a redundant configuration. You can save the last checkpoint with ModelCheckpoint(save_top_k=None, monitor=None).
  warnings.warn(*args, **kwargs)
Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch3d-0.6.1-py3.9-win-amd64.egg\pytorch3d\structures\meshes.py:1108: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  self._edges_packed = torch.stack([u // V, u % V], dim=1)
[05/20 04:40:10 nha.optimization.train_pl_module]: Running the offset-optimization stage.
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1

  | Name            | Type               | Params
-------------------------------------------------------
0 | _flame          | FlameHead          | 0
1 | _offset_mlp     | OffsetMLP          | 616 K
2 | _normal_encoder | SirenNormalEncoder | 542 K
3 | _texture        | TextureMLP         | 1.8 M
4 | _explFeatures   | MultiTexture       | 4.5 M
5 | _leaky_hinge    | LeakyHingeLoss     | 0
6 | _masked_L1      | MaskedCriterion    | 0
-------------------------------------------------------

7.9 M     Trainable params
0         Non-trainable params
7.9 M     Total params
31.623    Total estimated model params size (MB)
Epoch 0:   0%|          | 0/101 [00:44<?, ?it/s]
Traceback (most recent call last):
  File "Z:\python\philgras_neural_head_avatars\neural_head_avatars2\python_scripts\optimize_nha.py", line 12, in <module>
    train_pl_module(NHAOptimizer, RealDataModule)
  File "Z:\python/philgras_neural_head_avatars/neural_head_avatars2\nha\optimization\train_pl_module.py", line 89, in train_pl_module
    trainer.fit(model,
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 498, in fit
    self.dispatch()
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 545, in dispatch
    self.accelerator.start_training(self)
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 73, in start_training
    self.training_type_plugin.start_training(trainer)
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\plugins\training_type\training_type_plugin.py", line 114, in start_training
    self._results = trainer.run_train()
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 636, in run_train
    self.train_loop.run_training_epoch()
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 493, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 658, in run_training_batch
    self._curr_step_result = self.training_step(
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 293, in training_step
    training_step_output = self.trainer.accelerator.training_step(args)
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 156, in training_step
    return self.training_type_plugin.training_step(*args)
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\plugins\training_type\ddp.py", line 294, in training_step
    return self.model(*args, **kwargs)
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\torch\nn\parallel\distributed.py", line 878, in forward
    self._sync_params()
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\torch\nn\parallel\distributed.py", line 1379, in _sync_params
    self._distributed_broadcast_coalesced(
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\torch\nn\parallel\distributed.py", line 1334, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: Invalid scalar type

malteprinzler commented 2 years ago

This looks more like a PyTorch / pytorch_lightning related error. Note how the error is raised in "torch\nn\parallel\distributed.py". Have you made sure that your PyTorch installation is working properly? Also, which versions of pytorch and pytorch-lightning are you using?
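For anyone answering the version question, a minimal sketch that reads the installed versions from package metadata without importing the libraries themselves. The helper name `report_versions` is hypothetical, and the distribution names `torch` and `pytorch-lightning` are assumed to match how the packages were installed.

```python
from importlib import metadata


def report_versions(packages=("torch", "pytorch-lightning")):
    """Return a dict mapping each distribution name to its installed version.

    A package that is not installed maps to None instead of raising.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in report_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Pasting this output into the issue alongside the traceback makes it easier to tell whether the failure is tied to a particular release.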

philgras commented 2 years ago

Did you manage to solve the issue? Feel free to reopen if it persists.

carlosedubarreto commented 2 years ago

I'm having the same error on Windows. I'm trying to find the solution.

  | Name            | Type               | Params
-------------------------------------------------------
0 | _flame          | FlameHead          | 0
1 | _offset_mlp     | OffsetMLP          | 616 K
2 | _normal_encoder | SirenNormalEncoder | 542 K
3 | _texture        | TextureMLP         | 1.8 M
4 | _explFeatures   | MultiTexture       | 4.5 M
5 | _leaky_hinge    | LeakyHingeLoss     | 0
6 | _masked_L1      | MaskedCriterion    | 0
-------------------------------------------------------
7.9 M     Trainable params
0         Non-trainable params
7.9 M     Total params
31.494    Total estimated model params size (MB)
Epoch 0:   0%|                                                                                   | 0/5 [00:38<?, ?it/s]
Traceback (most recent call last):
  File "D:\MOCAP\neural-head-avatars\python_scripts\optimize_nha.py", line 11, in <module>
    train_pl_module(NHAOptimizer, RealDataModule)
  File "d:\mocap\neural-head-avatars\nha\optimization\train_pl_module.py", line 88, in train_pl_module
    trainer.fit(model,
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 498, in fit
    self.dispatch()
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 545, in dispatch
    self.accelerator.start_training(self)
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 73, in start_training
    self.training_type_plugin.start_training(trainer)
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\pytorch_lightning\plugins\training_type\training_type_plugin.py", line 114, in start_training
    self._results = trainer.run_train()
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 636, in run_train
    self.train_loop.run_training_epoch()
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 493, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 658, in run_training_batch
    self._curr_step_result = self.training_step(
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 293, in training_step
    training_step_output = self.trainer.accelerator.training_step(args)
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 156, in training_step
    return self.training_type_plugin.training_step(*args)
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\pytorch_lightning\plugins\training_type\ddp.py", line 294, in training_step
    return self.model(*args, **kwargs)
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\torch\nn\parallel\distributed.py", line 878, in forward
    self._sync_params()
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\torch\nn\parallel\distributed.py", line 1379, in _sync_params
    self._distributed_broadcast_coalesced(
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\torch\nn\parallel\distributed.py", line 1334, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: Invalid scalar type

(neural) D:\MOCAP\neural-head-avatars>

carlosedubarreto commented 2 years ago

@leeooo001 I ended up making it work by changing a couple of things. In optimize_nha.py, I added these 2 lines to force the script not to use NCCL

and changed the settings in optimize_avatar.ini: basically, I removed the ddp from distributed_backend and accelerator.
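The two lines added to optimize_nha.py did not survive in this thread, so the following is only a sketch of one way to steer the process group away from NCCL, which PyTorch's Windows builds do not ship. It assumes the installed pytorch_lightning release honors the `PL_TORCH_DISTRIBUTED_BACKEND` environment variable; that should be checked against the version in use. The helper name `force_gloo_backend` is hypothetical.

```python
import os

# Assumption: the installed pytorch_lightning version reads this variable
# when it initializes the distributed process group. Verify for your release.
BACKEND_ENV_VAR = "PL_TORCH_DISTRIBUTED_BACKEND"


def force_gloo_backend(env=os.environ):
    """Request the gloo backend instead of NCCL and return the value that was set."""
    env[BACKEND_ENV_VAR] = "gloo"
    return env[BACKEND_ENV_VAR]


if __name__ == "__main__":
    # This would have to run before the trainer is created.
    print(force_gloo_backend())
```

If DDP is removed from the configuration entirely, no process group is created and this setting likely has no effect.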

At least for me, it seemed that the issue on Windows was due to the distributed processing. It took me 2 days to solve this, and it was all in the configuration :D
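The configuration change described above can be sketched as a small helper that drops the `distributed_backend` and `accelerator` entries when they are set to `ddp`. The function name `strip_ddp_settings` is hypothetical, and the sketch assumes one `key: value` or `key = value` pair per line, as the logged configuration suggests; editing the two lines by hand in optimize_avatar.ini achieves the same thing.

```python
import re

# Keys that select distributed training in the logged configuration.
DDP_KEYS = {"distributed_backend", "accelerator"}

# Matches "key: value" or "key = value" with optional surrounding whitespace.
_SETTING = re.compile(r"^\s*(\w+)\s*[:=]\s*(\S+)\s*$")


def strip_ddp_settings(config_text):
    """Return config_text without the lines that set a DDP key to 'ddp'."""
    kept = []
    for line in config_text.splitlines():
        match = _SETTING.match(line)
        if match and match.group(1) in DDP_KEYS and match.group(2) == "ddp":
            continue  # drop this setting so training runs in a single process
        kept.append(line)
    return "\n".join(kept)


if __name__ == "__main__":
    sample = "gpus = 1\ndistributed_backend = ddp\naccelerator = ddp\ndata_worker = 8"
    print(strip_ddp_settings(sample))
```

With `gpus: 1` left in place, training should then run on the single GPU without spawning a distributed process group.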

philgras / neural-head-avatars

optimize_avatar error #12
