huggingface / nanotron

Minimalistic large language model 3D-parallelism training
Apache License 2.0
1.14k stars · 107 forks

Core attention #168

Open zzhhjjj opened 4 months ago

zzhhjjj commented 4 months ago

Replace `flash_attn_varlen_func` with `flash_attn_func` because:

  1. We are not using `cu_seqlens_q` or `cu_seqlens_k`.
  2. ~2% increase in training speed: tested with a 7B-parameter model, throughput rose from 358 GFLOPS to 367 GFLOPS.
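For illustration, a sketch of what this call-site change could look like, using the flash-attn 2 API names; the tensor variable names (`query_states`, etc.) and extra arguments are assumptions, not the exact nanotron code:

```diff
- # varlen variant: expects packed (total_tokens, nheads, headdim) tensors
- # plus cumulative sequence-length offsets for each batch element
- attn_output = flash_attn_varlen_func(
-     query_states, key_states, value_states,
-     cu_seqlens_q=cu_seqlens_q, cu_seqlens_k=cu_seqlens_k,
-     max_seqlen_q=max_seqlen_q, max_seqlen_k=max_seqlen_k,
-     causal=True,
- )
+ # fixed-length variant: takes batched (batch, seqlen, nheads, headdim)
+ # tensors directly, no cu_seqlens bookkeeping needed
+ attn_output = flash_attn_func(
+     query_states, key_states, value_states,
+     causal=True,
+ )
```

Since `flash_attn_varlen_func` operates on packed sequences, switching to `flash_attn_func` also lets any unpad/repad bookkeeping be dropped when every sequence in the batch has the same length, which is where the speedup comes from.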