huggingface / nanotron
Minimalistic large language model 3D-parallelism training
Apache License 2.0 · 1.14k stars · 107 forks
Core attention #168 (Open)
zzhhjjj opened 4 months ago

zzhhjjj commented 4 months ago
Replace `flash_attn_varlen_func` with `flash_attn_func`, since we are not using `cu_seqlens_q` or `cu_seqlens_k`.

This gives a ~2% increase in training speed: tested with a 7B-parameter model, throughput went from 358 GFLOPS to 367 GFLOPS.
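For context, a minimal sketch of why the varlen variant buys nothing here. `flash_attn_varlen_func` exists for batches of sequences with *different* lengths, packed into one tensor and delimited by cumulative-length arrays (`cu_seqlens_q` / `cu_seqlens_k`). When every sequence in the batch has the same length, those boundaries are just uniform multiples of the sequence length, so the plain batched `flash_attn_func` suffices and skips the varlen bookkeeping. The helper below is illustrative, not code from nanotron:

```python
# Illustrative sketch (not nanotron code): what cu_seqlens degenerates to
# when all sequences in the batch share one fixed length.

def cu_seqlens_for_fixed_length(batch_size: int, seq_len: int) -> list[int]:
    """Cumulative sequence boundaries that flash_attn_varlen_func expects.

    For fixed-length batches this is just [0, L, 2L, ..., B*L]: there is no
    variable-length information to exploit, so the batched flash_attn_func
    (which takes q/k/v shaped [batch, seqlen, nheads, headdim] directly)
    does the same work without the extra indexing overhead.
    """
    return [i * seq_len for i in range(batch_size + 1)]

# With batch=4 and seq_len=2048, the boundaries are perfectly uniform:
print(cu_seqlens_for_fixed_length(4, 2048))  # [0, 2048, 4096, 6144, 8192]
```

The observed ~2% speedup is consistent with removing this per-batch boundary handling from the attention call on a fixed-length training setup.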