rsoika opened 2 months ago
What GPU are you using? ExLlama isn't great on older GPUs with poor FP16 performance.
I am running on a Linux server with an Intel Core i7-7700 CPU and a GeForce GTX 1080.
For example, the call to `model.load_autosplit(cache)` takes more than 3 minutes. The model I am using is 2.4 GB.
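For reference, this is roughly what I am timing, a minimal sketch assuming the standard exllamav2 loading path (the model directory is a placeholder):

```python
# Minimal sketch: time the lazy-load + autosplit path in exllamav2.
# The model directory is a placeholder for a local EXL2 model.
import time

from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config

config = ExLlamaV2Config()
config.model_dir = "/models/mistral-7b-instruct-exl2"  # placeholder path
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # lazy cache, allocated during load

t0 = time.time()
model.load_autosplit(cache)  # this is the call that takes >3 minutes here
print(f"load_autosplit took {time.time() - t0:.1f} s")
```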
Is this something you would expect in this situation?
Well, I haven't optimized specifically for the 10-series GPUs. Even though the 1080 supports FP16, it runs at about 1/64th the speed of FP32. I have been meaning to add some FP32 fallback kernels to ExLlama, but it's a lot of work and I just haven't found the time yet.
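One way to see the gap on a given card is a quick matmul micro-benchmark with PyTorch (a sketch outside of ExLlama itself; matrix size and iteration count are arbitrary). On Pascal cards like the GTX 1080, the FP16 number should come out dramatically lower:

```python
# Rough micro-benchmark: compare FP32 vs FP16 matmul throughput on the local GPU.
import time

import torch

def bench(dtype, n=4096, iters=20):
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(iters):
        _ = a @ b
    torch.cuda.synchronize()
    elapsed = time.time() - t0
    # Each n x n matmul costs roughly 2 * n^3 FLOPs.
    return 2 * n**3 * iters / elapsed / 1e12

print(f"FP32: {bench(torch.float32):.2f} TFLOPS")
print(f"FP16: {bench(torch.float16):.2f} TFLOPS")
```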
Ok, thanks for your feedback. It was more for my understanding of the architecture. I was not aware that my GPU is so 'old' ;-) - no worries, all is fine.
Besides FP32, I thought the Pascal series has fast INT8 support. Some sources say it's 4x the FP32 rate.
Hi,

I tried to use `exllamav2` with Mistral 7B Instruct instead of my `llama-cpp-python` test implementation. `exllamav2` works, but the performance is very slow compared to `llama-cpp-python`.

To me this looks like the GPU is totally ignored? I have CUDA installed and run the code in a `nvidia/cuda` Docker container.

This is how my test code looks (it is from the examples directory):
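(A sketch along the lines of the repository's basic inference example; the model directory and prompt below are placeholders.)

```python
# Sketch following the exllamav2 basic inference example.
import time

from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Cache,
    ExLlamaV2Config,
    ExLlamaV2Tokenizer,
)
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

model_directory = "/models/mistral-7b-instruct-exl2"  # placeholder

config = ExLlamaV2Config()
config.model_dir = model_directory
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.85
settings.top_k = 50
settings.top_p = 0.8

prompt = "Write a short poem about open source software."  # placeholder

generator.warmup()
t0 = time.time()
output = generator.generate_simple(prompt, settings, 150)  # 150 new tokens
print(output)
print(f"Generated in {time.time() - t0:.2f} s")
```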
Is it necessary to activate the GPU somehow?
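A quick way to check is whether PyTorch inside the container sees the GPU at all, since exllamav2 runs on whatever CUDA device PyTorch reports (a minimal sketch):

```python
# Sanity check: verify that PyTorch inside the container can see the GPU.
# If this prints False, exllamav2 cannot run on the GPU either.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("Capability:", torch.cuda.get_device_capability(0))  # GTX 1080 -> (6, 1)
```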