linkedin / Liger-Kernel

Efficient Triton Kernels for LLM Training
https://arxiv.org/pdf/2410.10989

Are you even allowed to do these ops inplace? #254

Closed mayank31398 closed 1 month ago

mayank31398 commented 1 month ago

https://github.com/linkedin/Liger-Kernel/blob/58fd2bc85073fdb010164426c9b159cd8a0e9542/src/liger_kernel/ops/swiglu.py#L59-L60

Let's take a custom autograd function:

import torch

class Exponential(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        out = torch.exp(x)
        # save the *output* rather than the input: d/dx exp(x) = exp(x) = out
        ctx.save_for_backward(out)
        return out

    @staticmethod
    def backward(ctx, out_grad):
        (out,) = ctx.saved_tensors
        x_grad = out_grad * out
        return x_grad
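
On its own this is a perfectly valid pattern (saving the output is the usual way to express exp's derivative); for example, it passes a quick gradcheck:

x = torch.randn(8, dtype=torch.double, requires_grad=True)
print(torch.autograd.gradcheck(Exponential.apply, (x,)))  # True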

But if we have an op like swiglu that modifies its inputs in the backward pass:

x = some tensor
x_exp = Exponential.apply(x)
y = swiglu(x_exp, x_exp)
loss = some_loss(y, target)

Now during backprop we would see incorrect behaviour, right? The custom autograd function Exponential saves its output (rather than its input) for backward, and since swiglu's backward overwrites that very tensor in place, Exponential's backward would end up multiplying the incoming gradient by clobbered values.
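
To make this concrete, here is a minimal CPU-only sketch. InplaceSiLUMul below is a hypothetical stand-in that mimics a kernel writing its gradients into the saved input buffers during backward (it is not the actual Liger code), and it reuses the Exponential class above, with two separate activations so tensor aliasing doesn't muddy the picture:

import torch
import torch.nn.functional as F

class InplaceSiLUMul(torch.autograd.Function):
    @staticmethod
    def forward(ctx, a, b):
        ctx.save_for_backward(a, b)
        return F.silu(a) * b

    @staticmethod
    def backward(ctx, grad_out):
        a, b = ctx.saved_tensors
        sig = torch.sigmoid(a)
        da = grad_out * b * sig * (1 + a * (1 - sig))
        db = grad_out * F.silu(a)
        # write the gradients into the input buffers through .data, which
        # bypasses the version counter -- roughly what a raw kernel write does
        a.data.copy_(da)
        b.data.copy_(db)
        return a, b

x = torch.randn(4, dtype=torch.double, requires_grad=True)
g = Exponential.apply(x)  # saves its *output* for backward
u = Exponential.apply(x)
InplaceSiLUMul.apply(g, u).sum().backward()

# reference: identical math, but the silu-mul backward is out-of-place
x_ref = x.detach().clone().requires_grad_(True)
y_ref = F.silu(Exponential.apply(x_ref)) * Exponential.apply(x_ref)
y_ref.sum().backward()

print(torch.allclose(x.grad, x_ref.grad))  # False: the saved outputs were clobbered

Nothing errors out here; the gradient of x is just silently wrong, precisely because Exponential trusted its saved output to stay intact.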

mayank31398 commented 1 month ago

Hey guys, any clarification regarding this?

shivam15s commented 1 month ago

I talked with the team about this, and it seems like they intentionally designed it this way. The inconsistency you observe is fairly uncommon in an LLM training loop.

mayank31398 commented 1 month ago

I see, @shivam15s thanks for the clarification.

mayank31398 commented 1 month ago

It might be better to just do it out-of-place and let torch.compile with Dynamo take care of it.
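
Something along these lines (a rough sketch; whether Inductor actually fuses it as well as the handwritten kernel is a separate question):

import torch
import torch.nn.functional as F

# out-of-place silu * mul: no buffers are reused, so autograd's saved
# tensors are never touched, and dynamo/inductor is free to fuse the
# elementwise chain on its own
@torch.compile
def swiglu_out_of_place(a, b):
    return F.silu(a) * b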