openai / guided-diffusion


What is the 'zero_module' used for? [Question] #21

Closed santodante closed 2 years ago

santodante commented 2 years ago

First of all, thanks for an extraordinary paper - so many interesting details!! Also, thanks for open sourcing the code.

I have a few ideas I want to test and I'm trying to understand all the parts of the code. Most of it is clear and well commented, but I can't seem to figure out the reasoning behind the 'zero_module' you have in a few places in the guided-diffusion/guided_diffusion/unet.py file?

def zero_module(module):
    """
    Zero out the parameters of a module and return it.
    """
    for p in module.parameters():
        p.detach().zero_()
    return module

I couldn't find anything in the paper or online to explain why this is used.

I'm also curious why you used a custom mixed precision training instead of using PyTorch's mixed precision training (torch.cuda.amp.autocast)?

XinYu-Andy commented 2 years ago


In my opinion, this may be a weight-initialization trick, since I find this function is usually applied to the last layer before a skip connection.
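For illustration, here is a tiny, hypothetical residual block (not the actual ResBlock in unet.py) showing why that matters: with the output conv zeroed, the block reduces to the identity at initialization, and the residual branch only starts contributing once gradients update it.

import torch
import torch.nn as nn

def zero_module(module):
    """Zero out the parameters of a module and return it."""
    for p in module.parameters():
        p.detach().zero_()
    return module

class TinyResBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.in_conv = nn.Conv2d(channels, channels, 3, padding=1)
        # Zero-initialized output conv: contributes nothing until trained.
        self.out_conv = zero_module(nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, x):
        h = torch.relu(self.in_conv(x))
        return x + self.out_conv(h)  # identity mapping at initialization

x = torch.randn(1, 8, 16, 16)
print(torch.allclose(TinyResBlock(8)(x), x))  # True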

ShoufaChen commented 2 years ago

Without .detach(), you will get an error:

RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
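For reference, a minimal reproduction (assuming only a stock PyTorch install): in-place ops on leaf tensors that require grad are rejected by autograd, while the detached view shares the same storage, so zeroing it still zeroes the real parameter.

import torch
import torch.nn as nn

conv = nn.Conv2d(4, 4, 3)

try:
    for p in conv.parameters():
        p.zero_()  # in-place op on a leaf tensor that requires grad
except RuntimeError as e:
    print(e)  # a leaf Variable that requires grad is being used in an in-place operation.

for p in conv.parameters():
    p.detach().zero_()  # detached view shares storage, so this zeroes the actual parameter

print(conv.weight.abs().sum().item())  # 0.0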

unixpickle commented 2 years ago

The zero_module() function is used to initialize certain modules to zero. I believe this initialization scheme was also used in the Denoising Diffusion Probabilistic Models (2020) paper (e.g. here).

torch.cuda.amp wasn't in a stable state when we started working on this project, but it is likely suitable for a project like this by now. We don't want to change the code more than necessary, since this is mostly an archival research codebase.
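For anyone curious, the stock PyTorch alternative mentioned above looks roughly like this. This is only a sketch of the standard autocast/GradScaler pattern, not how this repo's custom fp16 code works, and it assumes a CUDA device is available.

import torch

model = torch.nn.Linear(128, 128).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    x = torch.randn(32, 128, device="cuda")
    opt.zero_grad()
    with torch.cuda.amp.autocast():   # run eligible ops in half precision
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()     # scale the loss to avoid fp16 gradient underflow
    scaler.step(opt)                  # unscales grads and skips the step on overflow
    scaler.update()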

chenaoxuan commented 1 year ago


Hello, I found this method applied in ControlNet, where it is called a "zero convolution", a trick used to improve training.
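For context, the "zero convolution" is essentially the same idea: a 1x1 convolution initialized to all zeros, so the control branch adds nothing to the frozen backbone at the start of training. A rough sketch (a hypothetical helper, not ControlNet's actual code):

import torch.nn as nn

def zero_conv(channels):
    # 1x1 conv whose weight and bias start at zero; its output is zero until trained
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

# Usage idea: backbone_features + zero_conv(c)(control_features)
# equals backbone_features at initialization.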

RichardSunnyMeng commented 9 months ago

So why is .detach() needed?
