mosaicml / llm-foundry

LLM training code for Databricks foundation models
https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
Apache License 2.0

Allow EOS token for finetuning #1199

Closed: jimwu6 closed this 4 months ago

jimwu6 commented 4 months ago

This is needed to allow the finetuning dataset to be constructed correctly.

dakinggg commented 4 months ago

Where do you see this needed? I'm pretty sure finetuning just uses the eos from the tokenizer.
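For context, here is a minimal sketch (assumed names, not llm-foundry's actual dataloader code) of what "uses the eos from the tokenizer" typically means when building a finetuning example: the tokenizer's EOS id is appended to the tokenized response so the model learns where a completion ends. The "gpt2" tokenizer and the `tokenize_example` helper are illustrative choices only.

```python
# Minimal sketch, not llm-foundry's implementation: terminate a finetuning
# target with the tokenizer's EOS token. "gpt2" is just an illustrative
# tokenizer; tokenize_example is a hypothetical helper.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def tokenize_example(prompt: str, response: str) -> dict:
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]
    response_ids = response_ids + [tokenizer.eos_token_id]  # mark end of completion
    # Loss is only computed on the response; prompt labels are masked with -100.
    labels = [-100] * len(prompt_ids) + response_ids
    return {"input_ids": prompt_ids + response_ids, "labels": labels}
```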

milocress commented 4 months ago

> Where do you see this needed? I'm pretty sure finetuning just uses the eos from the tokenizer.

It looks like it's one of the things `**`-ed (unpacked as keyword arguments) into the superclass. I think there are some cases where omitting this causes an error, e.g.:

[rank2]: ValueError: sequence_id is a required argument when MPT is configured with attn_uses_sequence_id=True and the model is in train mode.
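
For reference, a minimal sketch of what the `sequence_id` that error asks for looks like: when several short examples are packed into one training row and `attn_uses_sequence_id=True`, each token carries the index of the example it came from so attention is masked within examples. The `make_sequence_id` helper below is hypothetical, not llm-foundry code.

```python
# Hypothetical helper, not llm-foundry code: build the per-token sequence_id
# tensor for a packed row from the lengths of the packed examples.
import torch

def make_sequence_id(example_lengths: list[int]) -> torch.Tensor:
    return torch.cat([
        torch.full((length,), idx, dtype=torch.long)
        for idx, length in enumerate(example_lengths)
    ])

print(make_sequence_id([3, 2]))  # tensor([0, 0, 0, 1, 1])
```
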
dakinggg commented 4 months ago

@milocress that should only apply to the pretraining-style dataset. The finetuning-style dataset handles packing and sequence_id on its own, e.g. https://github.com/mosaicml/llm-foundry/blob/fb9a2259e880b0baa3d3523ff42def9ea6c29ce3/llmfoundry/data/packing.py#L155
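
A rough sketch of what "handles packing and sequence_id on its own" could look like; `pack_examples` is an assumed name, not the actual collator in llmfoundry/data/packing.py, which is more involved. The point is that the packing step derives `sequence_id` from the examples it concatenates, so callers never pass it explicitly.

```python
# Assumed-name sketch, not the actual collator in llmfoundry/data/packing.py:
# concatenate finetuning examples into one row and derive sequence_id locally.
import torch

def pack_examples(examples: list[dict], max_seq_len: int) -> dict:
    input_ids, sequence_id = [], []
    for seq_idx, ex in enumerate(examples):
        ids = ex["input_ids"]
        if len(input_ids) + len(ids) > max_seq_len:
            break  # a real collator would start a new packed row instead
        input_ids.extend(ids)
        sequence_id.extend([seq_idx] * len(ids))
    return {
        "input_ids": torch.tensor(input_ids, dtype=torch.long),
        "sequence_id": torch.tensor(sequence_id, dtype=torch.long),
    }
```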