unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

RMS Norm weight somehow has no grad after backward? #832

Open fahadh4ilyas opened 2 months ago

fahadh4ilyas commented 2 months ago

I'm testing the kernel to check the speed of each step, comparing unsloth's fast_rms_layernorm with OpenChat's rms_norm. Here is my script:

import torch
from unsloth.kernels.rms_layernorm import fast_rms_layernorm

class FastLlamaRMSNorm(torch.nn.Module):
    def __init__(self, hidden_size, eps):
        """
        FastLlamaRMSNorm wraps unsloth's fast_rms_layernorm (equivalent to T5LayerNorm)
        """
        super().__init__()

        self.weight = torch.nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        return fast_rms_layernorm(self, hidden_states)

@torch.jit.script  # type: ignore
def rms_norm(hidden_states: torch.Tensor, weight: torch.Tensor, variance_epsilon: float):
    input_dtype = hidden_states.dtype
    hidden_states = hidden_states.to(torch.float32)

    variance = (hidden_states * hidden_states).mean(-1, keepdim=True)
    hidden_states = hidden_states * torch.rsqrt(variance + variance_epsilon)
    return weight * hidden_states.to(input_dtype)

class UnpaddedLlamaRMSNorm(torch.nn.Module):
    def __init__(self, hidden_size, eps):
        """
        UnpaddedLlamaRMSNorm is equivalent to T5LayerNorm
        """
        super().__init__()

        self.weight = torch.nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        return rms_norm(hidden_states, self.weight, self.variance_epsilon)

rms_openchat = UnpaddedLlamaRMSNorm(4096, 1e-5).to('cuda')
rms_unsloth = FastLlamaRMSNorm(4096, 1e-5).to('cuda')

X = torch.randn((8192, 4096), device='cuda')

# Reference implementation: the weight receives a gradient here, as expected.
Y_openchat = rms_openchat(X)
Y_openchat.mean().backward()
grad_openchat = rms_openchat.weight.grad.clone().detach()
rms_openchat.zero_grad()

# unsloth kernel: weight.grad stays None after backward, see the error below.
Y_unsloth = rms_unsloth(X)
Y_unsloth.mean().backward()
grad_unsloth = rms_unsloth.weight.grad.clone().detach()
rms_unsloth.zero_grad()

But I got this error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[1], line 51
     49 Y_unsloth = rms_unsloth(X)
     50 Y_unsloth.mean().backward()
---> 51 grad_unsloth = rms_unsloth.weight.grad.clone().detach()
     52 rms_unsloth.zero_grad()

AttributeError: 'NoneType' object has no attribute 'clone'
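
Printing the gradients directly (re-using the modules defined above) shows the same thing:

rms_openchat(X).mean().backward()
rms_unsloth(X).mean().backward()
print(rms_openchat.weight.grad is None)  # False: the reference weight gets a gradient
print(rms_unsloth.weight.grad is None)   # True: the unsloth weight does not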

This means that after the backward pass, the RMS Norm weight has no grad and would not be updated during training. Is this intentional?

danielhanchen commented 2 months ago

Apologies, yes, the fast RMS layernorm doesn't create any gradients for the RMS weights, to speed things up.
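
For context, here is a minimal sketch (illustrative only, not the actual Triton kernel) of what that means on the autograd side: if the custom backward returns None for the weight's gradient, PyTorch never populates weight.grad, which is exactly the symptom above.

import torch

class SketchRMSNormFn(torch.autograd.Function):
    # Sketch of an RMS norm whose backward skips the weight gradient.
    @staticmethod
    def forward(ctx, X, W, eps):
        Xf = X.float()
        r = torch.rsqrt(Xf.pow(2).mean(-1, keepdim=True) + eps)
        ctx.save_for_backward(X, W, r)
        return (W * (Xf * r)).to(X.dtype)

    @staticmethod
    def backward(ctx, dY):
        X, W, r = ctx.saved_tensors
        Xf, dYW = X.float(), dY.float() * W.float()
        n = X.shape[-1]
        # Standard RMSNorm gradient w.r.t. the input.
        dX = r * dYW - (r.pow(3) / n) * Xf * (dYW * Xf).sum(-1, keepdim=True)
        # Returning None for W means W.grad is never written.
        return dX.to(X.dtype), None, None

W = torch.nn.Parameter(torch.ones(4096, device='cuda'))
X = torch.randn((8192, 4096), device='cuda')
SketchRMSNormFn.apply(X, W, 1e-5).mean().backward()
print(W.grad)  # None, same symptom as with fast_rms_layernorm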

fahadh4ilyas commented 2 months ago

> Apologies, yes, the fast RMS layernorm doesn't create any gradients for the RMS weights, to speed things up.

Oh, will it cause anything unexpected? Or is training accuracy unaffected by it?

danielhanchen commented 2 months ago

Oh, normal LoRA training does not train the RMS LayerNorm weights.
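
As a quick check (a sketch only; `model` is assumed to already be a LoRA-wrapped model, e.g. via FastLanguageModel.get_peft_model), you can list which parameters are actually trainable:

# `model` is assumed to be a LoRA-wrapped model (e.g. from FastLanguageModel.get_peft_model).
for name, param in model.named_parameters():
    if param.requires_grad:
        print(name)
# Typically only the lora_A / lora_B adapter weights show up here; the
# layernorm weight parameters stay frozen, so the missing RMS norm gradient
# does not affect LoRA training.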

fahadh4ilyas commented 2 months ago

> Oh, normal LoRA training does not train the RMS LayerNorm weights.

What about full finetuning? Or does this repo not support full finetuning?

danielhanchen commented 2 months ago

Currently not, sorry.