iseeyuan opened this issue 3 months ago
Seems like a bug with max_seq_len: https://github.com/pytorch/torchchat/blob/0b001b9dc74c12e136f5ee9b3c19427b9acd24ff/generate.py#L631
When I hack it to 200 (compared to the default 8192), the chat perf is close to that of generate.
This also seems to be isolated to MPS (I don't see as significant a drop on CUDA).
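To illustrate why the cap matters (a minimal sketch with hypothetical shapes, not torchchat's actual attention code): if the KV cache is preallocated to max_seq_len and every decode step attends over the whole buffer, per-token cost scales with the buffer size rather than with the tokens actually in context.

```python
# Minimal sketch (hypothetical shapes, not torchchat's implementation):
# if attention runs over a KV cache preallocated to max_seq_len, each decode
# step costs O(max_seq_len) even when only a few hundred positions are filled.
import time
import torch
import torch.nn.functional as F

device = "mps" if torch.backends.mps.is_available() else "cpu"
n_heads, head_dim = 32, 128

def time_decode_steps(max_seq_len: int, steps: int = 50) -> float:
    # Cache preallocated to max_seq_len; in the real model most of it is masked.
    k = torch.randn(1, n_heads, max_seq_len, head_dim, device=device)
    v = torch.randn(1, n_heads, max_seq_len, head_dim, device=device)
    q = torch.randn(1, n_heads, 1, head_dim, device=device)  # one new token
    start = time.perf_counter()
    for _ in range(steps):
        F.scaled_dot_product_attention(q, k, v)
    if device == "mps":
        torch.mps.synchronize()  # wait for queued GPU work before timing
    return time.perf_counter() - start

print("max_seq_len=8192:", time_decode_steps(8192))
print("max_seq_len=200: ", time_decode_steps(200))
```

On this theory, hacking the value from 8192 down to 200 shrinks the per-step attention work by roughly the same factor, which would match the chat-vs-generate gap above.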
@manuelcandales Looks like your PR might solve this problem for free
@Jack-Khuu, would a shorter max_seq_length be just a hack? If the chat conversation goes beyond the limit, the chat will stop, which limits the user experience for long chat histories.
See good old https://github.com/pytorch/torchchat/issues/783
would a shorter max_seq_length be just a hack?
It would be, which is why we're lucky to have https://github.com/pytorch/torchchat/pull/964. Manuel's changes in PyTorch get picked up by the pin bump and will hopefully resolve the seq_length issues.
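For anyone skimming: one way such a fix can avoid the hack entirely (a hypothetical sketch, not necessarily what that PR does) is to keep the large preallocated cache for long chats but only attend over the prefix that has actually been written, so per-step cost tracks the real context length instead of max_seq_len.

```python
# Hypothetical sketch, not necessarily how pytorch/torchchat#964 works:
# keep the full preallocated cache, but slice attention to the filled prefix.
import torch
import torch.nn.functional as F

def attend_prefix(q, k_cache, v_cache, input_pos: int):
    # k_cache / v_cache: (1, n_heads, max_seq_len, head_dim), valid up to input_pos.
    k = k_cache[:, :, : input_pos + 1]
    v = v_cache[:, :, : input_pos + 1]
    # Cost now scales with input_pos, not the preallocated max_seq_len,
    # so max_seq_length can stay at 8192 without slowing short chats.
    return F.scaled_dot_product_attention(q, k, v)
```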
Seems like I'm still seeing it... Can someone else confirm that they see similar behavior?
Seems like I'm still seeing it... Can someone else confirm that they see similar behavior?
You used the same max tokens flag?
Actually, I just did a test: ~15 t/s for generate and ~1.5 t/s for chat using similar params.
🐛 Describe the bug
For generate on llama3.1 I got 9.1 tok/s, but chat is much slower, around 1.4 tok/s. Test laptop: MacBook Pro with M1 Max, 64 GB memory, Sonoma 14.5.
Details for both generate and chat:
Versions
(torchchat) myuan@myuan-mbp torchchat % python collect_env.py
Collecting environment information...
PyTorch version: 2.5.0.dev20240710
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 14.5 (arm64)
GCC version: Could not collect
Clang version: 15.0.0 (clang-1500.0.40.1)
CMake version: version 3.30.1
Libc version: N/A

Python version: 3.10.0 (default, Mar 3 2022, 03:54:28) [Clang 12.0.0 ] (64-bit runtime)
Python platform: macOS-14.5-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU: Apple M1 Max

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.5.0.dev20240710
[pip3] torchao==0.3.1
[conda] numpy 1.26.4 pypi_0 pypi
[conda] torch 2.5.0.dev20240710 pypi_0 pypi
[conda] torchao 0.3.1 pypi_0 pypi