royorel / StyleSDF


Problem training full pipeline #13

Open · boduan1 opened 2 years ago

boduan1 commented 2 years ago

Hello royorel! First, thanks for your previous suggestion about the volume rendering part; it works for me now.

But I then ran into a problem with the full pipeline part. When I use 1 GPU everything works fine, but when I switch to 2 or 4 GPUs, it throws an error right at the start (error screenshot attached). Do you know what the problem might be? (There is no problem when I use 2 or 4 GPUs for the volume rendering part.)

Moreover, could you also give some instructions on how to reproduce the evaluation from the paper? Thanks!

royorel commented 2 years ago

Hi @boduan1,

That might be related to the PyTorch version you're using; it could be an issue with updates to the torch.distributed package. Are you using PyTorch 1.9.0 as mentioned in the README?
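As a quick sanity check (a generic snippet, not part of the repo), you can print what your environment actually provides:

import torch

print(torch.__version__)          # the README assumes 1.9.0
print(torch.version.cuda)         # CUDA version PyTorch was built against
print(torch.cuda.device_count())  # number of visible GPUs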

wosecz commented 1 year ago

I ran into the same problem when training the full pipeline with multiple GPUs. You are right that the cause is the PyTorch version. However, I am unable to install PyTorch 1.9.0 because I cannot downgrade the CUDA version (11.6) in the cluster environment. Could you please give me some suggestions for running the full pipeline with CUDA 11.6?

royorel commented 1 year ago

Hi @wosecz,

That issue lies in PyTorch's distributed training package. My suggestion would be to re-implement that part of the training code so that it works with the latest PyTorch version.
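For reference, a minimal sketch of what such a re-implementation could target on a recent PyTorch, launched with torchrun (setup_distributed and wrap_model are illustrative names, not functions from this repo):

import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def setup_distributed():
    # torchrun sets LOCAL_RANK for each spawned process
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    return local_rank


def wrap_model(model, local_rank):
    model = model.cuda(local_rank)
    # broadcast_buffers=False is common in StyleGAN2-style training loops;
    # find_unused_parameters=True may be needed if some submodules are
    # skipped in a given forward pass.
    return DDP(
        model,
        device_ids=[local_rank],
        broadcast_buffers=False,
        find_unused_parameters=True,
    )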

wosecz commented 1 year ago

Thank you for your reply! I checked the training code and found that the problem comes from the autograd.grad call in g_path_regularize in losses.py. The full pipeline training code runs successfully with g_reg_every=0 in a PyTorch 1.9.0 environment; however, omitting the generator regularization leads to performance degradation. I'm still trying to work around the version issue with the autograd functions. Thank you again for your help!
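For context, g_path_regularize follows the standard StyleGAN2 path-length penalty; a sketch of that pattern (variable names assumed to mirror losses.py, not a verbatim copy) is:

import math

import torch
from torch import autograd


def g_path_regularize(fake_img, latents, mean_path_length, decay=0.01):
    # Random image-space direction, scaled so the Jacobian-vector product
    # is resolution independent.
    noise = torch.randn_like(fake_img) / math.sqrt(
        fake_img.shape[2] * fake_img.shape[3]
    )
    # This autograd.grad call is where the multi-GPU run reportedly fails
    # on newer PyTorch versions: latents must be part of the graph and
    # require grad for the gradient to be returned.
    grad, = autograd.grad(
        outputs=(fake_img * noise).sum(), inputs=latents, create_graph=True
    )
    path_lengths = torch.sqrt(grad.pow(2).sum(2).mean(1))
    path_mean = mean_path_length + decay * (path_lengths.mean() - mean_path_length)
    path_penalty = (path_lengths - path_mean).pow(2).mean()
    return path_penalty, path_mean.detach(), path_lengths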

DhlinV commented 1 year ago

Thanks for the interesting work! I also ran into the same problem during training. I solved it by turning the "latent" variable into a leaf node, so it can receive a gradient. You just need to add

if return_latents:
    # Detach latent and re-mark it as a leaf that requires grad, so
    # autograd.grad in g_path_regularize can return its gradient.
    latent = latent.detach()
    latent.requires_grad_(True)

in the forward method of the Decoder class in model.py (Line 618), before the fake image is synthesized. Then it works.

sen-mao commented 2 months ago


The fix from @DhlinV above works for me 👍🏻