open-mmlab / MMGEN-FaceStylor

Apache License 2.0

Training error #24

Open MikoSamey opened 1 year ago

MikoSamey commented 1 year ago

Hello, I hit an error while training on the web172 dataset: it reports that gradients cannot be propagated back. I haven't modified the code (apart from some paths). How can I fix this?

```
[W reducer.cpp:1303] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
Traceback (most recent call last):
  File "tools/train.py", line 161, in <module>
    main()
  File "tools/train.py", line 151, in main
    train_model(model,
  File "/root/mmgeneration/mmgen/apis/train.py", line 207, in train_model
    runner.run(data_loaders, cfg.workflow, cfg.total_iters)
  File "/root/mmgeneration/mmgen/core/runners/dynamic_iterbased_runner.py", line 285, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/root/mmgeneration/mmgen/core/runners/dynamic_iterbased_runner.py", line 215, in train
    outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
  File "/root/mmgeneration/mmgen/core/ddp_wrapper.py", line 123, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/root/autodl-tmp/MMGEN-FaceStylor/agilegan/transfer.py", line 414, in train_step
    loss_gen, log_vars_g, source_results = self._get_gen_loss(data_dict)
  File "/root/autodl-tmp/MMGEN-FaceStylor/agilegan/transfer.py", line 121, in _get_gen_loss
    loss_ = loss_module(outputs_dict)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/mmgeneration/mmgen/models/losses/gen_auxiliary_loss.py", line 264, in forward
    path_penalty, self.mean_path_length, _ = gen_path_regularizer(
  File "/root/mmgeneration/mmgen/models/losses/gen_auxiliary_loss.py", line 102, in gen_path_regularizer
    grad = autograd.grad(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 234, in grad
    return Variable._execution_engine.run_backward(
RuntimeError: One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3913) of binary: /root/miniconda3/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```

I am using a cloud server; my mmcv-full version is 1.6.0 and my mmgen version is 0.7.2. Could this issue be related to the versions?
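For what it's worth, the final `RuntimeError` comes from the `torch.autograd.grad` call inside `gen_path_regularizer`: one of the tensors passed as `inputs` did not participate in computing the output, and by default `autograd.grad` treats that as an error. Below is a minimal standalone sketch (plain PyTorch, not MMGeneration code) that reproduces the message and shows what the suggested `allow_unused=True` flag does; whether setting it in `gen_path_regularizer` is actually the right fix for this training run is a separate question.

```python
import torch
from torch import autograd

x = torch.randn(3, requires_grad=True)
y = torch.randn(3, requires_grad=True)  # deliberately unused below
out = (x * 2).sum()

# Reproduces the reported error: `y` was never used to compute `out`.
try:
    autograd.grad(out, [x, y], retain_graph=True)
    raised = False
except RuntimeError:
    raised = True

# With allow_unused=True, the unused input's gradient comes back as None
# instead of raising.
gx, gy = autograd.grad(out, [x, y], allow_unused=True)
print(raised)     # True
print(gx)         # d(out)/dx = 2 everywhere
print(gy is None) # True
```

In the real failure this usually means some generator parameter (or latent) handed to the path-length regularizer is disconnected from the graph, which also lines up with the DDP `find_unused_parameters` warning at the top of the log.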

mm-assistant[bot] commented 1 year ago

We recommend using English or English & Chinese for issues so that we could have broader discussion.

xuguozhi commented 1 year ago

I have met the same issue. Have you solved it?