mosaicml / llm-foundry

LLM training code for Databricks foundation models
https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
Apache License 2.0

Train with attention mask #1183

Open · germanjke opened 1 month ago

germanjke commented 1 month ago

Hi,

Llama 3 was trained like this:

We trained the models on sequences of 8,192 tokens, using a mask to ensure self-attention does not cross document boundaries.
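To make that concrete, here is a rough standalone PyTorch sketch (not code from llm-foundry) of how such a block-diagonal causal mask can be built from per-token document ids in a packed sequence:

```python
import torch

def make_document_mask(sequence_id: torch.Tensor) -> torch.Tensor:
    """Build a block-diagonal causal attention mask from per-token document ids.

    sequence_id: (batch, seq_len) integer tensor where tokens belonging to the
    same packed document share the same id.
    Returns a boolean mask of shape (batch, 1, seq_len, seq_len) that is True
    where attention is allowed.
    """
    # Allow attention only between tokens from the same document...
    same_doc = sequence_id.unsqueeze(-1) == sequence_id.unsqueeze(-2)  # (B, S, S)
    # ...and only to earlier (or equal) positions, i.e. causal.
    seq_len = sequence_id.size(-1)
    causal = torch.tril(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=sequence_id.device)
    )
    return (same_doc & causal).unsqueeze(1)  # extra dim broadcasts over heads

# Example: two documents packed into one 6-token sequence.
seq_id = torch.tensor([[0, 0, 0, 1, 1, 1]])
mask = make_document_mask(seq_id)
# mask[0, 0] is block-diagonal: tokens 3-5 cannot attend to tokens 0-2.
```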

I see you have something like this in mpt_modeling.py here.

Could you please tell me how to define this in the train config?

Thanks

dakinggg commented 1 month ago

Hey, we have not implemented the attention masking you are describing for models other than MPT variants.
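For MPT models, my understanding is that it is controlled by the `attn_uses_sequence_id` flag in the model's `attn_config`, together with marking document boundaries in the dataloader so per-token sequence ids can be derived. Roughly like the sketch below; the exact field names are from memory and may not match the current config schema, so please check the MPT config reference:

```yaml
model:
  name: mpt_causal_lm
  attn_config:
    attn_uses_sequence_id: true  # build the per-document attention mask

train_loader:
  name: text
  dataset:
    # Token id that marks document boundaries in the packed stream,
    # used to derive sequence ids for the mask (assumed field name).
    eos_token_id: 0
```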