ThomasBaruzier closed this issue 1 week ago
The dev branch doesn't fix the issue. Previous commits do; I am pinpointing the latest working commit right now.
Edit: the latest working commit is b2af0bb from dev, as 0d78f03 crashes. The issue is somewhere in git diff b2af0bb 0d78f03
This is tricky. I'm not able to reproduce it here, and expandable_segments is still a poorly documented, experimental feature in PyTorch. It's especially weird because the commit that apparently breaks it doesn't change anything at all except for YaRN models, and Mistral-Large doesn't use YaRN.
Are you sure it's working reliably up to commit b2af0bb?
Here is the list of commits tested and whether they crash:
03b2d55 segfault
cad7848 segfault
ef7cdda segfault
5d43593 segfault
c84f597 segfault
f1adff9 segfault
4314792 segfault
be3de0f segfault
d393bfe segfault
6b73184 segfault
8dca1ab segfault
b195503 segfault
aff1e5a segfault
0d78f03 segfault
b2af0bb working
3a38913 working
e960dfd working
7c7b199 segfault
8361f3f segfault
15e5404 segfault
a5132d0 segfault
6d7b2e8 segfault
43a0be3 segfault
a17f666 segfault
9946f45 segfault
e155e0a segfault
c4a03e0 segfault
12bceb9 segfault
0695f3a segfault
8a25e0f segfault
b252107 segfault
144c576 working
10a8842 working
b2c7cf2 working
46eff43 working
228ba34 working
a372fe1 working
aadc454 working
5ee9835 working
1df7b04 working
1e18e80 working
0d9adf9 working
f0dca9a working
f2c53ef working
a029bcd working
361d211 working
c1fed2e working
3e8e181 working
affdc0d working
5c455c1 working
c9ce168 working
1e462f1 working
c18400f working
0d5c0bc working
12f08db working
ea27954 working
40e37f4 working
c050aec working
I used this script:
#!/bin/bash
set -e
source ~/files/ai/envs/exllama/bin/activate

# Start from a clean copy of the repo; results accumulate in commits.txt across runs
rm -rf exllama
cp -r exllama.bkp exllama
touch commits.txt
cd exllama
git checkout dev

for commit in $(git log --oneline | cut -f1 -d' '); do
    if grep -q "$commit" ../commits.txt; then
        echo "Skipping $commit, already processed."
        continue
    fi
    git checkout "$commit"
    pip install .
    # Record whether inference completes or segfaults on this commit
    PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
    python test_inference.py \
        -m ~/storage/quants/exl/Qwen2.5-0.5B-Instruct-4.0bpw/ \
        -p 'Once upon a time,' --token 256 --gpu_split 0.1,25 \
        && echo "$commit working" >> ../commits.txt \
        || echo "$commit segfault" >> ../commits.txt
    git switch -
done
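As an aside, the same search could in principle be driven by git bisect run instead of walking git log; this is only a rough sketch under the assumption that the failure is deterministic between the known-good b2af0bb and the known-bad 0d78f03, and that the tree stays clean between builds. A segfault exits with code 139, which git bisect run would treat as an error, so the wrapper maps it to a plain failure. Run it from the same parent directory as the script above.

cd exllama
git bisect start 0d78f03 b2af0bb    # <bad> <good>
git bisect run bash -c '
    pip install . > /dev/null || exit 125    # 125 tells bisect to skip commits that fail to build
    PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
    python test_inference.py \
        -m ~/storage/quants/exl/Qwen2.5-0.5B-Instruct-4.0bpw/ \
        -p "Once upon a time," --token 256 --gpu_split 0.1,25
    [ $? -eq 0 ] && exit 0 || exit 1          # map a segfault (exit 139) to a plain "bad"
'
git bisect reset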
https://github.com/turboderp/exllamav2/commit/144c576bdb468996b710c058d743d42b87b2f115 is the culprit, which seems more reasonable.
I can confirm this, but I had just removed expandable_segments from my environment because I thought it was a problem with my personal setup.
I think I managed to reproduce this, and there should be a fix in the latest commit in the dev branch. Hopefully.
(exllama) ~/files/ai/tabby/exllama python test_inference.py -m ~/storage/quants/exl/Qwen2.5-0.5B-Instruct-4.0bpw/ -p 'Once upon a time,' --token 256 --gpu_split 0.1,25
-- Model: /home/user/storage/quants/exl/Qwen2.5-0.5B-Instruct-4.0bpw/
-- Options: ['gpu_split: 0.1,25']
Loading: /home/user/storage/quants/exl/Qwen2.5-0.5B-Instruct-4.0bpw/ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:04 0:00:00
-- Loaded model in 5.2623 seconds
-- Loading tokenizer...
-- Warmup...
-- Generating...
Once upon a time, there was a fire monster named Kai. He had many dangerous spells and was very scary. One day, he tricked Nemo into leaving his house. However, Nemo has a fire monster at home. In the house, there were also lots of fire monsters. Nemo is too scared to leave, so he asked his parents for help.
I asked you today to help Kai.
A fire monster walked in from the back and started attacking Nemo. At that moment, Nemo was very scared. How should he respond?
To help you, let's look at the scenario and the options provided:
1. Nemo took action and rescued Kai.
2. Nemo took action but Kai did not respond.
3. Nemo remained where he was and continued to wait for Kai after giving him a drink.
4. Nemo attacked Kai without taking any action.
After analyzing the options, the correct answer is that Nemo should have taken action and rescued Kai. By doing so, Nemo was able to avoid the deadly fire monster and live happily in the swamp. The first option (which is incorrect) shows that Nemo did not respond and instead waited for Kai to be rescued despite being scared. The second option (which is incorrect) shows that Nemo
-- Response generated in 3.00 seconds, 256 tokens, 85.24 tokens/second (includes prompt eval.)
What can I say? You're incredible.
OS
Linux
GPU Library
CUDA 12.x
Python version
3.12
Pytorch version
torch==2.4.1
Model
mistralai/Mistral-Large-Instruct-2407, Qwen/Qwen2.5-72B-Instruct
Describe the bug
Hello,
Since 0.2.3, loading models crashes when using
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
when trying to initialize model loading on a second GPU. Since I need to remove it to make it work, I had to reduce my Q4 context window from 19200 tokens to 13056 tokens, losing 6k tokens for the same configuration. This is explained by the fragmentation that happens without the flag:
464.99 MiB is reserved by PyTorch but unallocated
Note: Loading a smaller model on a single GPU with
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
does not crash. Note: cuda_malloc_backend does not help. It would be very useful to be able to get that VRAM back.
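As a side check (not something exllamav2-specific), a standalone snippet like the one below can show whether the expandable_segments allocator misbehaves on the second GPU on its own; this is only a sketch, and the tensor size is arbitrary, just enough to force an allocation on each visible device and print reserved vs. allocated memory, which is where the fragmentation shows up.

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python - <<'EOF'
import torch
for dev in range(torch.cuda.device_count()):
    # allocate ~1 GiB of fp32 on each visible device (size is arbitrary)
    x = torch.empty(256, 1024, 1024, device=f"cuda:{dev}")
    r = torch.cuda.memory_reserved(dev)
    a = torch.cuda.memory_allocated(dev)
    print(f"cuda:{dev}: reserved {r/2**20:.0f} MiB, allocated {a/2**20:.0f} MiB")
EOF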
Reproduction steps
Load any model that splits across 2 GPUs with
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python3 main.py
Expected behavior
The model loads without crashing.
Logs
Additional context
I use 2x RTX 3090 on Arch Linux, CUDA 12.4.131, Python 3.12.6, and the latest commit of TabbyAPI.