Closed — liuliu closed this issue 2 months ago
Closing as stale. A contribution was merged into MFA v1, then removed when the entire repo was rewritten from scratch. Users can now easily edit and recompile the code to change which registers hold which data types.
An accuracy issue arises during integration with the SSD-1B model. q and k can be large enough that qk exceeds the half-precision range. Normally this is fine because the scale is applied to q, or to both q and k, as `new_q = sqrt(scale) * q, new_k = sqrt(scale) * k`. However, the MFA attention kernel implementation applies alpha only after the qk matmul is done, which causes NaN issues. This can be reproduced with the tensors extracted from the SSD-1B computation and the following s4nnc code:
The reprod_tensor.sqlite3 file is attached here, split into reprod_tensor.split.sqlite3.zip and reprod_tensor.split.sqlite3.z01.zip (please rename the sqlite3.z01.zip file to sqlite3.z01 to work around GitHub's file size limitation).