[Open] YosiMass opened this issue 9 months ago
@YosiMass, thanks for opening this ticket. Yes, we have filed a bug on this and are working on the fix.
Hi! This issue has been solved by this commit. Please check out our latest main and let me know if there are any questions.
System Info
PyTorch version: 2.0.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Red Hat Enterprise Linux release 8.8 (Ootpa) (x86_64)
GCC version: (GCC) 10.1.0
Clang version: Could not collect
CMake version: version 3.27.4
Libc version: glibc-2.28

Python version: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-4.18.0-477.15.1.el8_8.x86_64-x86_64-with-glibc2.28
Is CUDA available: True
CUDA runtime version: 11.6.55
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A100-SXM4-80GB
Nvidia driver version: 535.54.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
🐛 Describe the bug
In `datasets/alpaca_dataset`, in `__getitem__`:
The code builds the training example from all three parts (instruction, input, response), and if the tokenized sequence is longer than `max_words`, it simply drops the trailing tokens. As a result, the response (the part the model is actually trained to produce) can be partially or entirely truncated away. See the sketch below.
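A minimal sketch of the problematic pattern (paraphrased for illustration, not the exact repository code; the prompt template and the `tokenizer`/`max_words` names are assumptions):

```python
import torch

def build_example(tokenizer, instruction, inp, response, max_words):
    # Prompt and response are tokenized together as one sequence.
    prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{inp}\n\n### Response:\n"
    example = torch.tensor(tokenizer.encode(prompt + response), dtype=torch.int64)
    padding = max_words - example.shape[0]
    if padding > 0:
        # Pad with -1 so padded positions can be masked out of the loss.
        example = torch.cat((example, torch.full((padding,), -1, dtype=torch.int64)))
    elif padding < 0:
        # Truncation cuts from the end, so for a long (instruction + input)
        # the response tokens -- the actual training target -- are what get dropped.
        example = example[:max_words]
    return example
```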
Error logs
No error message; the dataset silently produces incorrect training examples.
Expected behavior
The fix should truncate the (instruction + input) portion and keep the full response, so that the combined sequence still fits within `max_words`. One possible implementation is sketched below.
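A sketch of this fix (an illustration under the same assumed names as above, not necessarily the exact upstream commit):

```python
import torch

def build_example_fixed(tokenizer, instruction, inp, response, max_words):
    prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{inp}\n\n### Response:\n"
    prompt_ids = tokenizer.encode(prompt)
    response_ids = tokenizer.encode(response) + [tokenizer.eos_token_id]
    # Give the response priority: shrink the prompt so the full response fits.
    budget = max(max_words - len(response_ids), 0)
    example = torch.tensor((prompt_ids[:budget] + response_ids)[:max_words], dtype=torch.int64)
    padding = max_words - example.shape[0]
    if padding > 0:
        # Pad with -1 so padded positions can be masked out of the loss.
        example = torch.cat((example, torch.full((padding,), -1, dtype=torch.int64)))
    return example
```

The final `[:max_words]` guards the edge case where the response alone exceeds `max_words`; in that case some truncation of the response is unavoidable.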