sony / ai-research-code


NVC-Net Training #46

Closed Rcwt closed 2 years ago

Rcwt commented 2 years ago

Hi, thanks for releasing the code for NVC-Net. I've got two questions:

Firstly, when trying to train on multiple GPUs, I run into the following error:

Failed `it != items_.end()`: Any of [cudnn:float, cuda:float, cpu:float] could not be found in []
No communicator found. Running with a single process. If you run this with MPI processes, all processes will perform totally same.

which basically means it's only running on one GPU. In fact, I get the same error simply by running the following:

import nnabla.communicators as C
from nnabla.ext_utils import get_extension_context

# Create a cuDNN extension context and try to build the multi-process communicator.
ctx = get_extension_context("cudnn", device_id='0')
C.MultiProcessDataParallelCommunicator(ctx)

I know this is probably more of an nnabla issue, but as a PyTorch user I'm not sure where to get help with nnabla.

Secondly, is it normal for the content preservation loss g_loss_con to be 0.0 for the first few epochs? I'm finding that the encoder basically encodes everything to the same vector in the hidden dimension, hence the loss is 0.0. For reference, I'm using the VCTK dataset processed with the provided script and default parameters.

Thanks a lot!

TomonobuTsujikawa commented 2 years ago

Hi, for the first question, we received another report with the same error; please check https://github.com/sony/nnabla-ext-cuda/issues/367. I hope the install page or the Docker container will help you.
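For reference, the usual initialization pattern for multi-GPU training with nnabla looks roughly like the sketch below. This is only a sketch, assuming nnabla-ext-cuda is installed with distributed (MPI/NCCL) support and the script is launched with mpirun; it is not taken from the NVC-Net training code.

import nnabla as nn
import nnabla.communicators as C
from nnabla.ext_utils import get_extension_context

# Create the communicator on a cuDNN extension context.
ctx = get_extension_context("cudnn")
comm = C.MultiProcessDataParallelCommunicator(ctx)
comm.init()

# Each MPI process picks its own GPU from its rank, then sets the default context.
ctx = get_extension_context("cudnn", device_id=str(comm.rank))
nn.set_default_context(ctx)

This would then be launched with something like mpirun -n 4 python main.py. The "No communicator found" message usually means the installed nnabla-ext-cuda package was built without this multi-GPU support.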

For the second question, please wait a moment; I will ask the developer to look into it.

bacnguyencong-sony commented 2 years ago

Secondly, is it normal for the content preservation loss g_loss_con to be 0.0 for the first few epochs? I'm finding that the encoder basically encodes everything to the same vector in the hidden dimension, hence the loss is 0.0

Yes, it's expected that the content preservation loss is close to 0.0, because we want to preserve the content as much as possible even at the beginning of training. Note that the output of the content encoder is normalized. Therefore, once training stabilizes, we should have different content codes for different inputs.
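To illustrate why the value can be exactly 0.0 early on: if the (normalized) content codes of the input and the converted output are identical, their distance, and hence the preservation loss, is zero. A toy sketch in plain numpy, not the actual NVC-Net loss code; the encoder below is hypothetical:

import numpy as np

def collapsed_encoder(x):
    # Hypothetical content encoder that ignores its input and always
    # returns the same L2-normalized vector.
    c = np.ones(4)
    return c / np.linalg.norm(c)

x_input = np.random.randn(16000)      # original utterance (dummy data)
x_converted = np.random.randn(16000)  # converted utterance (dummy data)

# Distance between the two content codes: exactly 0.0 when the encoder collapses.
g_loss_con = np.mean(np.abs(collapsed_encoder(x_input) - collapsed_encoder(x_converted)))
print(g_loss_con)  # 0.0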

Rcwt commented 2 years ago

Thanks, will see how the training goes.

Closing this for now