Oh, it looks like you may need to switch back to logsigmoid; -exp is not stable yet.
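(For context, "logsigmoid" vs "-exp" here refers to how the log-decay fed into the kernel is parameterized. Below is a minimal sketch of the two options; the variable names and shapes are illustrative and not taken from the library's actual code.)

```python
import torch
import torch.nn.functional as F

# raw output of the decay projection (illustrative shape)
w_raw = torch.randn(2, 1024, 512, requires_grad=True)

# "-exp" parameterization: log-decay = -exp(w_raw), i.e. decay = exp(-exp(w_raw)) in (0, 1).
# Its gradient is -exp(w_raw), which can overflow for large w_raw -- the suspected
# source of inf gradients mentioned below.
log_w_exp = -torch.exp(w_raw)

# "logsigmoid" parameterization: log-decay = logsigmoid(w_raw), i.e. decay = sigmoid(w_raw) in (0, 1).
# Its gradient, sigmoid(-w_raw), stays in (0, 1), so it cannot blow up.
log_w_sig = F.logsigmoid(w_raw)
```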
This works; the loss is very stable, with essentially no error.
So the problem is most likely this update.
This update fixes potential NaNs during inference, so I don't think that is the issue. It is possibly due to a potential inf gradient of -exp; I will check it, thank you.
RWKV-PEFT has added fla, and it is currently usable. But as soon as I switch to the new fla, the loss goes to NaN. If fla gets further updates, please let me know and I can test them.
I don't know why fla's rwkv6 is actually not faster than the CUDA kernel; when I tested gla earlier it was much faster.
Have you compared the kernel speeds?
I'll find time to benchmark it. By the way, there is another problem: when I do state tuning, swapping in the fla kernel raises an error. It seems to be because the state's gradient is not saved, so I'd like to ask how to solve this.
You can enable gradient for h0 manually.
Would taking h0 as a learnable parameter be OK? Something like h0 = nn.Parameter(torch.zeros(key_dim, head_dim)).
It runs fine when I use the CUDA kernel, but not with fla; normally the state's gradient is saved automatically by the kernel during the computation.
One more point: here I have frozen all other weights and kept gradients only for the state.
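(A minimal sketch of the setup being discussed: register h0 as a learnable parameter and freeze every other weight. The module and attribute names and the shapes are illustrative, not RWKV-PEFT's actual code.)

```python
import torch
import torch.nn as nn

num_heads, key_dim, head_dim = 32, 64, 64  # illustrative sizes

class StateTunedBlock(nn.Module):
    def __init__(self):
        super().__init__()
        # Learnable initial state; registering it as a Parameter makes it show up
        # in .parameters() so the optimizer can update it.
        self.h0 = nn.Parameter(torch.zeros(num_heads, key_dim, head_dim))
        self.proj = nn.Linear(head_dim, head_dim)  # stand-in for the frozen weights

model = StateTunedBlock()

# Freeze everything except the state.
for name, p in model.named_parameters():
    p.requires_grad = name.endswith("h0")

# Optimize only the parameters that still require grad (i.e. h0).
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-2
)
```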
I see; currently there is no access to the grad of the states. We will add an option later.
thank you
@JL-er Hi, check it out https://github.com/sustcsonglin/flash-linear-attention/commit/1547448b998a163fdb33c49266da699db13f2dc8
Now we do not truncate the grad of the h states for RWKV6, for ease of state tuning. Do contact us if you meet any bugs or numerical stability issues :-D
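(A quick way to verify the linked commit is to check that gradients now reach the initial state through the kernel. The sketch below assumes the chunk_rwkv6 interface takes (r, k, v, w, u) plus initial_state/output_final_state keywords and returns the output and final state; argument names, tensor layout, and return values may differ across fla versions.)

```python
import torch
from fla.ops.rwkv6 import chunk_rwkv6  # fused_recurrent_rwkv6 can be checked the same way

B, H, T, K, V = 1, 4, 128, 64, 64  # illustrative sizes

r = torch.randn(B, H, T, K, device="cuda")
k = torch.randn(B, H, T, K, device="cuda")
v = torch.randn(B, H, T, V, device="cuda")
w = -torch.exp(torch.randn(B, H, T, K, device="cuda"))  # log-decay
u = torch.randn(H, K, device="cuda")                     # bonus term
h0 = torch.zeros(B, H, K, V, device="cuda", requires_grad=True)  # learnable initial state

o, ht = chunk_rwkv6(r, k, v, w, u, initial_state=h0, output_final_state=True)
o.sum().backward()
print(h0.grad is not None)  # should be True once grads of h states are no longer truncated
```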
Testing on rwkv-peft works perfectly; clipping is no longer needed. However, infctx training at 6000 ctx len occasionally went to NaN before (I will retest). Thank you very much.
The version I'm using now is the one from a few days ago, and the loss is normal.