This PR introduces the `eval_drop_last` flag, which enables the `drop_last` flag in the ICL eval PyTorch `DataLoader`s. This ensures that every batch has exactly `eval_batch_size` examples. The feature is necessary because TransformerEngine's FP8 execution requires input dimensions divisible by 8, so we must pass in batches of size 8. Previously, the eval dataloaders would return the remainder of the dataset in the last batch, which caused an error.
For example, if the dataset had length 41 and the batch size was 8, the last batch would have size 41 % 8 = 1, which breaks TE. With `eval_drop_last` enabled, we simply skip this last batch of size 1.
Note: enabling this flag will result in different eval scores, since the dropped examples are no longer evaluated.
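The batch-size arithmetic above can be sketched in plain Python. This is a minimal illustration of the `drop_last` semantics that `eval_drop_last` toggles (`batch_sizes` is a hypothetical helper mimicking `torch.utils.data.DataLoader` batching, not code from this PR):

```python
def batch_sizes(dataset_len, batch_size, drop_last=False):
    """Sizes of the batches a DataLoader-style loader would yield.

    With drop_last=True, the trailing partial batch of size
    dataset_len % batch_size is skipped entirely.
    """
    full, rem = divmod(dataset_len, batch_size)
    sizes = [batch_size] * full
    if rem and not drop_last:
        sizes.append(rem)
    return sizes

# Dataset of length 41, eval_batch_size of 8:
print(batch_sizes(41, 8))                  # [8, 8, 8, 8, 8, 1] -> size-1 batch breaks TE
print(batch_sizes(41, 8, drop_last=True))  # [8, 8, 8, 8, 8]   -> partial batch skipped
```

This also matches the log excerpts below: 6 eval batches before the change, 5 after.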
Testing
Unit Test: `test_icl_task_tokenizer_and_dataloader`
Integration Test:
Before: fp8-llama3-8b-metamath-4ep-4LEFPw 🔴
Error Traceback:
[Eval batch=1/6] Eval on gsm8k/0-shot data
[Eval batch=2/6] Eval on gsm8k/0-shot data
[Eval batch=3/6] Eval on gsm8k/0-shot data
[Eval batch=4/6] Eval on gsm8k/0-shot data
[Eval batch=5/6] Eval on gsm8k/0-shot data
/usr/lib/python3/dist-packages/composer/core/data_spec.py:37: UserWarning: Cannot split tensor of length 2 into batches of size 8. As it is smaller, no splitting will be done. This may happen on the last batch of a dataset if it is a smaller size than the microbatch size.
warnings.warn(
/usr/lib/python3/dist-packages/composer/core/data_spec.py:26: UserWarning: Cannot split list of length 2 into batches of size 8. As it is smaller, no splitting will be done. This may happen on the last batch of a dataset if it is a smaller size than the microbatch size.
...
[rank6]: File "/usr/lib/python3/dist-packages/transformer_engine/pytorch/utils.py", line 235, in assert_dim_for_fp8_exec
[rank6]: tensor.dim() == 2
[rank6]: AssertionError: FP8 execution requires 2D input matrices with height divisible by 8 and width divisible by 16, but got tensor with dims=[1404, 4096]
After: fp8-llama3-8b-metamath-4ep-0uiOJb ✅
[Eval batch=1/5] Eval on gsm8k/0-shot data
[Eval batch=2/5] Eval on gsm8k/0-shot data
[Eval batch=3/5] Eval on gsm8k/0-shot data
[Eval batch=4/5] Eval on gsm8k/0-shot data
[Eval batch=5/5] Eval on gsm8k/0-shot data:
Eval metrics/gsm8k/0-shot/InContextLearningGenerationExactMatchAccuracy: 0.6016
Reference run: llama3-8b-metamath-4ep-jaIcPX with no skipped batches
Issues Fixed
https://databricks.atlassian.net/browse/RGENAI-165