huggingface / nanotron

Minimalistic large language model 3D-parallelism training
Apache License 2.0
1.14k stars · 107 forks

Core attention #168

Open zzhhjjj opened 4 months ago

zzhhjjj commented 4 months ago

Replace `flash_attn_varlen_func` with `flash_attn_func` because:

  1. We are not using `cu_seqlens_q` or `cu_seqlens_k`.
  2. ~2% increase in training speed: tested with a 7B-parameter model, throughput rose from 358 GFLOPS to 367 GFLOPS.
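For illustration, a sketch of what this call-site change could look like, using the flash-attn 2 API names; the tensor variable names (`query_states`, etc.) and extra arguments are assumptions, not the exact nanotron code:

```diff
- # varlen variant: expects packed (total_tokens, nheads, headdim) tensors
- # plus cumulative sequence-length offsets for each batch element
- attn_output = flash_attn_varlen_func(
-     query_states, key_states, value_states,
-     cu_seqlens_q=cu_seqlens_q, cu_seqlens_k=cu_seqlens_k,
-     max_seqlen_q=max_seqlen_q, max_seqlen_k=max_seqlen_k,
-     causal=True,
- )
+ # fixed-length variant: takes batched (batch, seqlen, nheads, headdim)
+ # tensors directly, no cu_seqlens bookkeeping needed
+ attn_output = flash_attn_func(
+     query_states, key_states, value_states,
+     causal=True,
+ )
```

Since `flash_attn_varlen_func` operates on packed sequences, switching to `flash_attn_func` also lets any unpad/repad bookkeeping be dropped when every sequence in the batch has the same length, which is where the speedup comes from.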