germanjke opened 1 month ago
Hi,

Llama 3 trains like this:

> We trained the models on sequences of 8,192 tokens, using a mask to ensure self-attention does not cross document boundaries.

I see you have something like this in `mpt_modeling.py`. Could you please tell me how to enable this in the train config?

Thanks
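For context, here is a minimal, illustrative sketch of the mask the quote describes: a causal mask intersected with a "same document" mask built from per-token sequence IDs. The function name and shapes are my own, not llm-foundry's actual implementation:

```python
import torch

def document_boundary_mask(sequence_id: torch.Tensor) -> torch.Tensor:
    """Build a boolean attention mask that blocks attention across document
    boundaries within a packed sequence.

    sequence_id: (batch, seq_len) tensor where tokens from the same document
    share the same id, e.g. [0, 0, 0, 1, 1].
    Returns: (batch, 1, seq_len, seq_len) mask, True where attention is allowed.
    """
    # Tokens may attend to each other only if they belong to the same document...
    same_doc = sequence_id.unsqueeze(-1) == sequence_id.unsqueeze(-2)  # (b, s, s)
    # ...and only to earlier (or the same) positions, i.e. a causal mask.
    seq_len = sequence_id.size(-1)
    causal = torch.tril(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=sequence_id.device)
    )
    return (same_doc & causal).unsqueeze(1)  # add a head dimension

# Example: two packed documents of lengths 3 and 2 in one sequence.
seq_id = torch.tensor([[0, 0, 0, 1, 1]])
print(document_boundary_mask(seq_id)[0, 0].int())
```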
Hey, we have not implemented the attention masking you are describing for models other than MPT variants.
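For MPT variants, my understanding (based on the `attn_uses_sequence_id` flag in `mpt_modeling.py` / `MPTConfig`) is that the masking is switched on through the model's `attn_config`, with the dataloader supplying the `sequence_id` tensor derived from document separators. A hedged sketch of the relevant parts of a train YAML; the exact keys are assumptions and should be checked against your llm-foundry version:

```yaml
model:
  name: mpt_causal_lm
  # ... other model settings ...
  attn_config:
    attn_uses_sequence_id: true  # mask attention across packed-document boundaries

train_loader:
  name: text
  dataset:
    # ... dataset settings ...
    # Assumption: setting eos_token_id here lets the collator derive sequence_id
    # from the separator token between concatenated documents.
    eos_token_id: 0
```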