Open vigneshwaran opened 1 year ago
Hi @vigneshwaran, thanks for discovering this! Please make a pull request with your changes!
Great find! If you can open a PR, we'd love to accept it :)
I have opened PR #2428 with the proposed change. @eracah @mvpatel2000
In the test script, the batch is always expected to have a fixed shape of (batch_size, max_seq_len): https://github.com/mosaicml/composer/blob/b5ed4874be81bb69eca663bd8c733fcc870c315d/tests/datasets/test_in_context_learning_datasets.py#L591
Test error:
So we need to update the test files as well.
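With dynamic padding, the fixed-shape assertion no longer holds. A minimal sketch of what the updated expectation could look like (the helper `check_batch_shape` is hypothetical, not from the linked test file): all rows should share one padded length, and that length should be bounded by `max_seq_len` rather than equal to it.

```python
def check_batch_shape(padded_batch, max_seq_len):
    """Updated test expectation: rows share one length, bounded by max_seq_len."""
    row_lens = {len(row) for row in padded_batch}
    assert len(row_lens) == 1, "all rows must be padded to the same length"
    assert row_lens.pop() <= max_seq_len, "padded length must not exceed max_seq_len"

# A batch padded to its own longest sequence (length 3) passes even though
# max_seq_len is 8.
check_batch_shape([[1, 2, 0], [3, 4, 5]], max_seq_len=8)
```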
I have noticed that llm-foundry/eval takes much longer than lm-evaluation-harness. After digging into the code, I found that padding tokens are appended to every example up to the tokenizer's maximum length.
https://github.com/bmosaicml/composer/blob/1011f90f2653dae103c3837c968071e399b1decc/composer/datasets/in_context_learning_evaluation.py#L418C1-L428C59
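The current behavior can be illustrated with a minimal sketch (the helper `pad_to_fixed_length` is hypothetical; it only mirrors the idea in the linked code, not its actual implementation):

```python
def pad_to_fixed_length(token_ids, max_seq_len, pad_token_id=0):
    """Pad every example to the tokenizer's max length (current behavior)."""
    return [ids + [pad_token_id] * (max_seq_len - len(ids)) for ids in token_ids]

batch = [[5, 6, 7], [8, 9]]
padded = pad_to_fixed_length(batch, max_seq_len=8)
# Every row is length 8 even though the longest real sequence is length 3,
# so the model processes 8/3 ≈ 2.7x more tokens than necessary on this batch.
```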
My proposal:
Instead of padding to max_seq_len, pad only to the longest sequence in the batch.
This improved eval latency by 400% when I used a sequence length of 2048. The gain would be even larger for models trained with longer sequence lengths.
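A minimal sketch of the proposed dynamic padding (again with a hypothetical helper, `pad_to_batch_max`, to show the idea rather than the actual PR diff):

```python
def pad_to_batch_max(token_ids, pad_token_id=0):
    """Pad only to the longest sequence in the batch (proposed behavior)."""
    batch_max = max(len(ids) for ids in token_ids)
    return [ids + [pad_token_id] * (batch_max - len(ids)) for ids in token_ids]

batch = [[5, 6, 7], [8, 9]]
padded = pad_to_batch_max(batch)
# Rows are now length 3 (the batch's longest sequence) instead of max_seq_len.
```

Since attention cost grows with padded length, shrinking each batch to its own maximum directly cuts wasted compute, which is where the speedup comes from.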