Hey guys, any clarification regarding this?
I talked with the team about this, and it seems like they intentionally designed it this way. The inconsistency you observe is fairly uncommon in an LLM training loop.
I see, @shivam15s, thanks for the clarification.
It might be better to just do it out-of-place and let torch.compile with Dynamo take care of it.
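For illustration, a minimal plain-PyTorch sketch of the out-of-place version (not the actual Liger Triton kernel; the name `swiglu_out_of_place` is made up here), assuming the Dynamo/Inductor stack handles the fusion on its own:

```python
import torch
import torch.nn.functional as F

# Out-of-place SwiGLU sketch: no manual reuse of the input buffers, so
# autograd's saved tensors are never mutated. Fusion is left to
# torch.compile (Dynamo + Inductor).
@torch.compile
def swiglu_out_of_place(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return F.silu(a) * b
```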
https://github.com/linkedin/Liger-Kernel/blob/58fd2bc85073fdb010164426c9b159cd8a0e9542/src/liger_kernel/ops/swiglu.py#L59-L60
Let's take a custom autograd function:
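For concreteness, a minimal sketch of such a function, modeled on the Exp example in the PyTorch autograd docs; the key point is that it saves its *output*, not its input, for backward:

```python
import torch

class Exponential(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        result = x.exp()
        # Save the *output* for backward, which is valid because
        # d/dx exp(x) = exp(x).
        ctx.save_for_backward(result)
        return result

    @staticmethod
    def backward(ctx, grad_output):
        (result,) = ctx.saved_tensors
        return grad_output * result
```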
and if we have an op like swiglu that modifies its inputs in the backward pass:
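Roughly this pattern (a pure-PyTorch sketch of the in-place behaviour, not the actual Liger code; in Liger the write happens inside the Triton kernel, which is why autograd's version-counter check doesn't catch it):

```python
import torch
import torch.nn.functional as F

class InplaceSwiGLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, a, b):
        ctx.save_for_backward(a, b)
        return F.silu(a) * b

    @staticmethod
    def backward(ctx, grad_output):
        a_saved, b_saved = ctx.saved_tensors
        # Work on raw .data views so the in-place writes below bypass
        # autograd's version counter, the way a raw kernel store would.
        a, b = a_saved.data, b_saved.data
        sig = torch.sigmoid(a)
        da = grad_output * b * (sig + a * sig * (1 - sig))
        db = grad_output * a * sig
        # Reuse the input buffers to hold the gradients (in-place), in the
        # spirit of the linked swiglu.py backward, then return them.
        a.copy_(da)
        b.copy_(db)
        return a, b
```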
Now during backprop, we would see incorrect behaviour, right? Because the custom autograd function Exponential saves the output for backprop here instead of saving the input for backprop.
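Wiring the two sketches above together, a hypothetical repro would silently produce a wrong gradient for `x`, because `Exponential.backward` reads its saved output after the swiglu backward has already overwritten it:

```python
import torch

x = torch.randn(4, 8, requires_grad=True)
b = torch.randn(4, 8, requires_grad=True)

y = Exponential.apply(x)          # Exponential saves y = exp(x) for backward
out = InplaceSwiGLU.apply(y, b)   # its backward clobbers y in place
out.sum().backward()

# x.grad is now grad_y * (overwritten y) rather than grad_y * exp(x),
# and no error is raised because the mutation bypassed the version counter.
```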