sangmichaelxie / doremi

PyTorch implementation of DoReMi, a method for optimizing the data mixture weights in language modeling datasets
https://arxiv.org/abs/2305.10429
MIT License

Question about Flash-attention version. #12

Status: Closed (kiseliu closed this issue 11 months ago)

kiseliu commented 11 months ago

Hi, thanks for sharing this code base.

Which version of Flash-attention does this repo use? The latest version of Flash-attention is incompatible with transformers==4.27.2.

kiseliu commented 11 months ago

Thanks for the v1 updates. I can now run the code.
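
For anyone hitting the same mismatch, here is a minimal sketch for checking which versions are installed before running the code. It assumes the packages are installed under the PyPI distribution names `flash-attn` and `transformers`, and it reads "v1" in the comment above as a Flash-attention 1.x release; that reading is an assumption, not something the thread confirms. The helper names (`installed_version`, `major_version`, `report`) are hypothetical and not part of this repo.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def major_version(version_string):
    """Return the leading integer of a plain version string such as '1.0.9'."""
    return int(version_string.split(".")[0])


def report():
    """Print the installed versions and flag a Flash-attention major version other than 1."""
    # Distribution names are assumptions; adjust them if your environment differs.
    for dist_name in ("flash-attn", "transformers"):
        version = installed_version(dist_name)
        print(f"{dist_name}: {version if version is not None else 'not installed'}")

    flash_version = installed_version("flash-attn")
    # Assumption: the working setup in this thread uses a 1.x release.
    if flash_version is not None and major_version(flash_version) != 1:
        print("flash-attn is not a 1.x release; it may not match transformers==4.27.2")


if __name__ == "__main__":
    report()
```

Running the script prints one line per package, plus a warning when the installed Flash-attention is not a 1.x release. It only reads package metadata, so it works even if importing `flash_attn` itself fails.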