training problem - Githubissues

haofanwang commented 2 years ago

I hit an error at Step 1 when running python -u main.py --config configs/sep_vqvae.yaml --train:

Traceback (most recent call last):
  File "main.py", line 56, in <module>
    main()
  File "main.py", line 40, in main
    agent.train()
  File "/share/yanzhen/Bailando/motion_vqvae.py", line 94, in train
    loss.backward()
  File "/root/anaconda3/envs/workspace/lib/python3.8/site-packages/torch/_tensor.py", line 363, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/root/anaconda3/envs/workspace/lib/python3.8/site-packages/torch/autograd/__init__.py", line 166, in backward
    grad_tensors_ = _make_grads(tensors, grad_tensors_, is_grads_batched=False)
  File "/root/anaconda3/envs/workspace/lib/python3.8/site-packages/torch/autograd/__init__.py", line 67, in _make_grads
    raise RuntimeError("grad can be implicitly created only for scalar outputs")
RuntimeError: grad can be implicitly created only for scalar outputs

After printing the loss, it looks like tensor([0.2667, 0.2735, 0.2687, 0.2584, 0.2701, 0.2697, 0.2571, 0.2658], device='cuda:0', grad_fn=<GatherBackward>), so do I need to apply a mean or sum operation?
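For reference, the grad_fn=<GatherBackward> on the printed loss suggests the model is wrapped in nn.DataParallel, which gathers one loss value per GPU into a vector (eight values here). The following is a minimal sketch, not the repository's code: it uses a stand-in tensor with the printed values in place of the real model output, and runs on CPU. It shows that backward() fails on the vector and succeeds once the loss is reduced to a scalar.

```python
import torch

# Stand-in for the gathered per-GPU losses (values copied from the printed tensor).
# In the real run this tensor would come out of the model, not be a leaf tensor.
per_gpu_loss = torch.tensor(
    [0.2667, 0.2735, 0.2687, 0.2584, 0.2701, 0.2697, 0.2571, 0.2658],
    requires_grad=True,
)

# backward() on a non-scalar tensor raises the RuntimeError from the traceback.
raised = False
try:
    per_gpu_loss.backward()
except RuntimeError as err:
    raised = True
    print(err)

# Reducing to a scalar first makes the implicit gradient well-defined.
loss = per_gpu_loss.mean()
loss.backward()

# With mean(), each of the 8 entries receives a gradient of 1/8.
print(per_gpu_loss.grad)
```

Using sum() instead of mean() would also give a scalar, but it scales the gradient by the number of GPUs, so the effective learning rate would change with the device count.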

However, even after taking the mean, training still seems problematic: the loss decreases normally, but in the eval stage the output quants are all zero. Any suggestions?

The training log is attached for reference.

log.txt

@lisiyao21

aleeyang commented 2 years ago

@haofanwang Have you solved this problem? I'm running into the same issue.

uk9921 commented 2 years ago

@lisiyao21 @aleeyang Are you using multiple GPUs for training? Specifying the GPU index might help: CUDA_VISIBLE_DEVICES=0 python -u main.py --config configs/sep_vqvae.yaml --train
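The same restriction can be applied from Python when the training command is launched by another script. This is a rough sketch under assumptions: the helper name single_gpu_env is hypothetical and not part of the repository, and the command is the one quoted in this thread. The variable has to be set before the training process initializes CUDA, which is why it is placed in the child process's environment.

```python
import os

def single_gpu_env(gpu_index=0):
    """Return a copy of the current environment that exposes only one GPU.

    Hypothetical helper for illustration; not part of the Bailando code.
    """
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env

# Training command as quoted in the thread.
cmd = ["python", "-u", "main.py", "--config", "configs/sep_vqvae.yaml", "--train"]
env = single_gpu_env(0)

# To launch the run with only GPU 0 visible:
#   import subprocess
#   subprocess.run(cmd, env=env, check=True)
```

With a single visible device there is only one replica, so the gathered loss would have one element rather than eight; reducing it with mean() before backward() remains harmless.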

lisiyao21 / Bailando

training problem #15