Open lhl opened 1 month ago
Thanks for testing out the repo @lhl!
Looks like we're hitting an `Error: CUDA error: out of memory` here.
Can you check exporting/generating with the stories15M model, to verify that the flow itself is working?
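A sketch of that check, assuming the export/generate flags from the README (`--output-dso-path` on export; paths are just examples):

```bash
# Export stories15M to a DSO, then generate from it
python3 torchchat.py export stories15M --output-dso-path exportedModels/stories15M.so
python3 torchchat.py generate stories15M --dso-path exportedModels/stories15M.so --prompt "Hello my name is"
```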
Looks like stories15M works:
❯ python3 torchchat.py generate stories15M --dso-path exportedModels/stories15M.so --prompt "Hello my name is"
/home/local/.conda/envs/torchchat/lib/python3.11/site-packages/torchao/ops.py:12: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
return torch.library.impl_abstract(f"{name}")(func)
Note: NumExpr detected 32 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
NumExpr defaulting to 16 threads.
PyTorch version 2.4.0 available.
Warning: checkpoint path ignored because an exported DSO or PTE path specified
Warning: checkpoint path ignored because an exported DSO or PTE path specified
Using device=cuda NVIDIA GeForce RTX 4090
Loading model...
Time to load model: 0.17 seconds
-----------------------------------------------------------
Hello my name is Billy. He is three years old and very curious. He likes to explore new places.
One day, he was walking in the forest when he saw a big, scary bear. He was so scared he wanted to run away, but he couldn't move. Suddenly, he remembered his grandmother's advice, "If you ever be scared, just blink your eyes in life."
Billy blinked his eyes again and the bear blinked back in surprise. The bear started to walk away, but Billy was still scared.
Suddenly, he remembered what his grandmother had said: "If you blink a little bit, the bear won't be mean, but the most important thing is to keep exploring."
Billy knew he had to be brave, so he blinked his eyes. To his surprise, the bear was just a big, friendly bear! It had been
Time for inference 1: 0.72 sec total, time to first token 0.15 sec with sequential prefill, 199 tokens, 278.05 tokens/sec, 3.60 ms/token
Bandwidth achieved: 13.57 GB/s
*** This first iteration will include cold start effects for dynamic import, hardware caches. ***
========================================
Average tokens/sec: 278.05
Memory used: 0.10 GB
(`scripts/build_native.sh aoti` still fails, but that looks like a different bug.)
The 4090 has the full 24GB of VRAM, so it should have no problem fitting an 8B model. It just occurred to me that the issue might be down to Llama 3.1: when compiled, the model might want to allocate the full 128K context up front. Limiting generation with `--max-new-tokens 2048` still results in the CUDA OOM, so maybe there needs to be an option for specifying a token limit for the compiled model.
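As a rough sanity check on that theory, here is back-of-envelope KV-cache math, assuming Llama 3.1 8B's published config (32 layers, 8 KV heads via GQA, head dim 128) and a bf16 cache:

```bash
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * seq_len * 2 (bf16)
python3 -c "print(2 * 32 * 8 * 128 * 131072 * 2 / 1e9, 'GB')"  # ~17.2 GB at the full 128K context
python3 -c "print(2 * 32 * 8 * 128 * 8192 * 2 / 1e9, 'GB')"    # ~1.1 GB at an 8K cap
```

Add roughly 16 GB of bf16 weights and a full 128K cache alone would blow well past 24GB, which matches the OOM; a sequence-length cap applied at export time would avoid it.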
BTW, speaking of `--compile`, I get these errors when I run in torch compile mode and try to generate:
W0804 22:57:20.187000 140149025421120 torch/fx/experimental/symbolic_shapes.py:4449] [0/0] xindex is not in var_ranges, defaulting to unknown range.
(stalls after generating some tokens)
W0804 22:58:16.894000 140149025421120 torch/fx/experimental/symbolic_shapes.py:4449] [0/1] xindex is not in var_ranges, defaulting to unknown range.
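For reference, the invocation is along these lines (the `llama3.1` model alias is my assumption; `--compile` and `--prompt` are the documented flags):

```bash
# Assumed repro: eager model with torch.compile enabled
python3 torchchat.py generate llama3.1 --compile --prompt "Hello my name is"
```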
`--compile-prefill` does not have these errors (but is no faster than not compiling at all):
Time for inference 1: 4.95 sec total, time to first token 0.26 sec with parallel prefill, 199 tokens, 40.17 tokens/sec, 24.89 ms/token
Bandwidth achieved: 645.19 GB/s
...
Average tokens/sec: 40.17
Memory used: 16.30 GB
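The prefill-only variant, again with the model alias assumed:

```bash
# Assumed repro: compile only the prefill step
python3 torchchat.py generate llama3.1 --compile-prefill --prompt "Hello my name is"
```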
I have work/deadlines/travel so I won't be able to really follow up further. I'm assuming anyone doing basic testing will probably run into similar issues; my config (a clean mamba env on a 4090) seems about as vanilla a setup as possible.
I had the same C++ runner issue building the runner for ET/PTE models in #985.
🐛 Describe the bug
I am running an Arch Linux system with a 4090/3090 and up-to-date CUDA 12.5 (Build cuda_12.5.r12.5/compiler.34385749_0). I have created a new mamba env for torchchat and run the install. Regular inferencing (e.g. with `generate`) works fine. I compile an AOTI model per the README:
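A sketch of that export step, assuming the README's `llama3.1` alias and `--output-dso-path` flag:

```bash
# AOTI export per the README (model alias and output path assumed)
python3 torchchat.py export llama3.1 --output-dso-path exportedModels/llama3.1.so
```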
When I try to run with the exported DSO model, it fails with `Error: CUDA error: out of memory`.
I tried the C++ runner as well, but it fails to build.
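For reference, the build step being attempted is the script mentioned above (the related runner failure is in #985):

```bash
# AOTI runner build script from the repo
scripts/build_native.sh aoti
```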
Versions