meta-llama / llama-recipes

Scripts for fine-tuning Meta Llama3 with composable FSDP & PEFT methods, covering single- and multi-node GPU setups. Supports default & custom datasets for applications such as summarization and Q&A, and a number of inference solutions such as HF TGI and vLLM for local or cloud deployment. Includes demo apps showcasing Meta Llama3 for WhatsApp & Messenger.

Wrong truncation of training examples in alpaca dataset #215

Open YosiMass opened 9 months ago

YosiMass commented 9 months ago

System Info

PyTorch version: 2.0.1+cu117 Is debug build: False CUDA used to build PyTorch: 11.7 ROCM used to build PyTorch: N/A

OS: Red Hat Enterprise Linux release 8.8 (Ootpa) (x86_64) GCC version: (GCC) 10.1.0 Clang version: Could not collect CMake version: version 3.27.4 Libc version: glibc-2.28

Python version: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] (64-bit runtime) Python platform: Linux-4.18.0-477.15.1.el8_8.x86_64-x86_64-with-glibc2.28 Is CUDA available: True CUDA runtime version: 11.6.55 CUDA_MODULE_LOADING set to: LAZY GPU models and configuration: GPU 0: NVIDIA A100-SXM4-80GB Nvidia driver version: 535.54.03 cuDNN version: Could not collect HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True

Information

🐛 Describe the bug

In `datasets/alpaca_dataset`, in `__getitem__`:

The code builds the training example from all three parts (instruction, input, response) and, if the result is longer than `max_words`, simply drops the trailing tokens. As a result, part or all of the response may be removed.
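A minimal sketch of the problem, using illustrative token ID lists rather than the dataset's actual tokenizer output:

```python
# Hypothetical token IDs, for illustration only: the dataset concatenates
# prompt and response, then truncates from the end, so response tokens
# are the first to be dropped.
max_words = 8

prompt = [1, 2, 3, 4, 5, 6, 7]   # instruction + input tokens
response = [8, 9, 10]            # response tokens

example = prompt + response      # 10 tokens total, exceeds max_words
truncated = example[:max_words]  # tail truncation cuts into the response
print(truncated)                 # -> [1, 2, 3, 4, 5, 6, 7, 8]
```

Only one of the three response tokens survives, so the model is trained on an example whose label is mostly missing.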

Error logs

No error message; just incorrect behavior.

Expected behavior

The fix should be to truncate the (instruction + input) part and keep the full response, so that the overall sequence still fits within `max_words`.
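A hedged sketch of that fix, again with illustrative token lists (the variable names are not the repo's actual code):

```python
# Truncate only the prompt so the full response always fits in max_words.
max_words = 8
prompt = [1, 2, 3, 4, 5, 6, 7]  # instruction + input tokens
response = [8, 9, 10]           # response tokens, kept intact

keep = max(max_words - len(response), 0)  # room left for the prompt
example = prompt[:keep] + response        # prompt trimmed, response whole
print(example)                            # -> [1, 2, 3, 4, 5, 8, 9, 10]
```

The response is preserved in full and the prompt absorbs the truncation, at the cost of losing some instruction/input context on very long examples.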

HamidShojanazeri commented 9 months ago

@YosiMass, thanks for opening this ticket. Yes, we have filed a bug on this and are working on a fix.

wukaixingxp commented 1 month ago

Hi! This issue has been solved by this commit. Please check out our latest main and let me know if you have any questions.