mosaicml / llm-foundry

LLM training code for Databricks foundation models
https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
Apache License 2.0

Add eval_drop_last flag to fix TE eval bug #1247

Open j316chuck opened 1 month ago

j316chuck commented 1 month ago

Description

This PR introduces the eval_drop_last flag, which enables the drop_last flag in the ICL eval PyTorch DataLoaders. With this flag set, every batch yielded by the eval dataloader has exactly eval_batch_size samples. This is necessary because TransformerEngine (TE) requires all input dimensions to be divisible by 8, so we must pass in batches of size 8. Previously, the eval dataloaders returned a final partial batch containing the remainder of the dataset, which caused an error.

For example, if the dataset has length 41 and the batch size is 8, the last batch has size 41 % 8 = 1, which breaks TE. With eval_drop_last enabled, we simply skip this final batch of size 1.
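The batching arithmetic above can be sketched in pure Python; `batch_sizes` is a hypothetical helper (not part of llm-foundry) that mimics PyTorch DataLoader's drop_last semantics:

```python
def batch_sizes(dataset_len, batch_size, drop_last=False):
    """Return the sizes of batches a DataLoader would yield.

    Mimics torch.utils.data.DataLoader semantics: with drop_last=True,
    a final batch smaller than batch_size is skipped entirely.
    """
    sizes = []
    for start in range(0, dataset_len, batch_size):
        size = min(batch_size, dataset_len - start)
        if size < batch_size and drop_last:
            break  # skip the ragged final batch
        sizes.append(size)
    return sizes

# The PR's example: 41 samples with batch size 8.
print(batch_sizes(41, 8))                  # [8, 8, 8, 8, 8, 1] -- last batch of 1 breaks TE
print(batch_sizes(41, 8, drop_last=True))  # [8, 8, 8, 8, 8]    -- ragged batch dropped
```

This also makes the scoring caveat concrete: with drop_last enabled, the dropped samples never reach the metric, so eval scores computed over 40 samples can differ from scores over all 41.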

Note: enabling this flag will change eval scores, since the dropped samples are excluded from the metrics.

Testing

Unit Test: test_icl_task_tokenizer_and_dataloader

Integration Test:

Issues Fixed

https://databricks.atlassian.net/browse/RGENAI-165

b-chu commented 1 month ago

Agreed with Daniel here

snarayan21 commented 3 weeks ago

Can we disable TE layers just for eval if they have this batch size requirement? Or turn off fp8 temporarily?