turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

will it work with Nvidia P40 24GB on Linux? #27

Open waan1 opened 1 year ago

waan1 commented 1 year ago

I'm developing an AI assistant for fiction writers. As the OpenAI API gets pretty expensive with all the inference tricks needed, I'm looking for a good local alternative for most of the inference, saving GPT-4 just for polishing final results. exllama looks pretty interesting, but I'm getting a compilation error. Even though, in addition to being a fiction writer, I'm a software developer, I'm far from being an AI expert. Would it be correct to assume from the lines below that the P40 is not supported currently? -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__

Maybe it was a silly try, but self.weight = tensors[key].half() did not work.

If the P40 will not work with exllama, could somebody advise whether oobabooga/GPTQ-for-LLaMa would work? If not CUDA, maybe there are good options for an i9-13900K with 128GB of DDR5?

The full traceback:

python test_benchmark_inference.py -d /home/igorm/ai-assistant/agent-city/llm/models/Wizard-Vicuna-13B-Uncensored-GPTQ -p -ppl

Traceback (most recent call last):
  File "/home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1900, in _run_ninja_build
    subprocess.run(
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/test_benchmark_inference.py", line 1, in <module>
    from model import ExLlama, ExLlamaCache, ExLlamaConfig
  File "/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/model.py", line 5, in <module>
    import cuda_ext
  File "/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/cuda_ext.py", line 14, in <module>
    exllama_ext = load(
  File "/home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1283, in load
    return _jit_compile(
  File "/home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1508, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1623, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1916, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'exllama_ext':
[1/3] /opt/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/TH -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/THC -isystem /opt/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_61,code=compute_61 -gencode=arch=compute_61,code=sm_61 --compiler-options '-fPIC' -std=c++17 -c /home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/q4v2_matmul.cu -o q4v2_matmul.cuda.o
FAILED: q4v2_matmul.cuda.o
/opt/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/TH -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/THC -isystem /opt/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_61,code=compute_61 -gencode=arch=compute_61,code=sm_61 --compiler-options '-fPIC' -std=c++17 -c /home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/q4v2_matmul.cu -o q4v2_matmul.cuda.o
/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/../cuda_compat.cuh(48): error: cannot overload functions distinguished by return type alone
  void atomicAdd(half2* address, half2 val) { atomicAdd_half2(address, val); }
       ^

1 error detected in the compilation of "/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/q4v2_matmul.cu".
[2/3] /opt/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/TH -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/THC -isystem /opt/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_61,code=compute_61 -gencode=arch=compute_61,code=sm_61 --compiler-options '-fPIC' -std=c++17 -c /home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/half_matmul.cu -o half_matmul.cuda.o
FAILED: half_matmul.cuda.o
/opt/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/TH -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/THC -isystem /opt/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_61,code=compute_61 -gencode=arch=compute_61,code=sm_61 --compiler-options '-fPIC' -std=c++17 -c /home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/half_matmul.cu -o half_matmul.cuda.o
/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/../cuda_compat.cuh(48): error: cannot overload functions distinguished by return type alone
  void atomicAdd(half2* address, half2 val) { atomicAdd_half2(address, val); }
       ^

1 error detected in the compilation of "/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/half_matmul.cu".
ninja: build stopped: subcommand failed.

turboderp commented 1 year ago

Maybe it was a silly try, but self.weight = tensors[key].half() did not work.

That would just cast the packed q4 weight tensors to half types without actually dequantizing them first. So that definitely wouldn't work.

I'm planning to do a lot more work on support for the P40 specifically. It's a very attractive card for the obvious reasons, if it can be made to perform well. I don't have one to develop on, but the error message you're getting suggests that CUDA is providing its own atomicAdd operation for half2 types while the extension also tries to provide one for compatibility. Since the built-in version returns the old half2 value and the one in cuda_compat.cuh returns void, the two declarations differ only in their return type, which isn't a valid overload.

Try commenting out the following line in exllama_ext/cuda_func/cuda_compat.cuh:

__device__ __forceinline__ void atomicAdd(half2* address, half2 val) { atomicAdd_half2(address, val); }

You may also have to delete the directory ~/.cache/torch_extensions/pyxxx_cuxxx/exllama_ext, since editing .h and .cuh files won't always trigger a rebuild.

waan1 commented 1 year ago

Your fix resolved the reported compile issue; thank you for the quick and easy fix. But now there is a runtime issue. I'm not sure if it is a show-stopper for now. Would it be of any value if I create a new issue?

python test_benchmark_inference.py -d /home/igorm/ai-assistant/agent-city/llm/models/Wizard-Vicuna-13B-Uncensored-GPTQ -p -ppl
 -- Loading model
 -- Tokenizer: /home/igorm/ai-assistant/agent-city/llm/models/Wizard-Vicuna-13B-Uncensored-GPTQ/tokenizer.model
 -- Model config: /home/igorm/ai-assistant/agent-city/llm/models/Wizard-Vicuna-13B-Uncensored-GPTQ/config.json
 -- Model: /home/igorm/ai-assistant/agent-city/llm/models/Wizard-Vicuna-13B-Uncensored-GPTQ/Wizard-Vicuna-13B-Uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors
 -- Sequence length: 2048
 -- Options: ['attention: switched', 'matmul: switched', 'mlp: switched', 'perf', 'perplexity']
Traceback (most recent call last):
  File "/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/test_benchmark_inference.py", line 168, in <module>
    wrapper = timer("Load model", lambda: ModelWrapper(args))
  File "/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/test_benchmark_inference.py", line 72, in timer
    ret = func()
  File "/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/test_benchmark_inference.py", line 168, in <lambda>
    wrapper = timer("Load model", lambda: ModelWrapper(args))
  File "/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/test_benchmark_inference.py", line 51, in __init__
    self.tokenizer = ExLlamaTokenizer(self.tokenizer_model_path)
  File "/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/tokenizer.py", line 10, in __init__
    self.tokenizer = SentencePieceProcessor(model_file = self.path)
  File "/home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/sentencepiece/__init__.py", line 447, in Init
    self.Load(model_file=model_file, model_proto=model_proto)
  File "/home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/sentencepiece/__init__.py", line 905, in Load
    return self.LoadFromFile(model_file)
  File "/home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/sentencepiece/__init__.py", line 310, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: src/sentencepiece_processor.cc(1101) [model_proto->ParseFromArray(serialized.data(), serialized.size())]

turboderp commented 1 year ago

That's a new one. An internal error in SentencePiece would suggest either a corrupted tokenizer.model or perhaps the wrong version of SentencePiece installed. I'm using 0.1.97, if that helps.

waan1 commented 1 year ago

I installed your sources in a fresh virtualenv using your scripts, and I've got version '0.1.99'. Do you suggest uninstalling it and installing 0.1.97?

waan1 commented 1 year ago

Vicuna was downloaded from HF. I'll try a smaller model just in case.

turboderp commented 1 year ago

I can't think of anything else at the moment, really. That, or try a different model, or try downloading the tokenizer.model file again.

waan1 commented 1 year ago

You were right again :) tokenizer.model had been downloaded before I installed git-lfs. I downloaded it again and it works now. Thank you very much for the help! And now we know that exllama is compatible with the P40. It looks pretty slow, though. I will play with it.

waan1 commented 1 year ago

Tested the chatbot. One performance core of the CPU (CPU3) is at 100% (i9-13900K), the other 23 cores are idle, and the P40 is at 100%. It took only 9.2GB of VRAM out of 24GB. I'm unclear how both CPU and GPU could be saturated at the same time. Prompt: "what is the capital of (country)", tried 10 times with different countries. Responses took from 16 to 25 seconds. Sometimes it was just one word, sometimes it was "the capital of (country) is (city)".

turboderp commented 1 year ago

I'm unclear of how both CPU and GPU could be saturated at the same time.

PyTorch waits in a busy loop whenever it synchronizes a CUDA stream, as far as I can tell. With a 13900K the CPU should be easily able to keep up with the P40, since my 12900K can keep up with a 4090. So while there are clearly still CPU bottlenecks that people with slower CPUs are running into (working on that), the CPU usage you're seeing is "normal."

As for the speed on the P40, that obviously needs some work. But all the CUDA stuff is being rewritten as we speak, and there will be a bunch more tuning options soon, with probably some alternative code paths that should be better suited for older GPUs.

waan1 commented 1 year ago

I see. I will be watching your git for updates and will try again when it has something for older GPUs.

turboderp commented 1 year ago

Having read up on it a bit, good performance on the P40 might be a ways off, unfortunately. Apparently its FP16 performance is 1/64 of its FP32 performance; I guess it's emulated in the driver or something. So I don't know how much I can do to address that, other than providing an alternative FP32 path, which would limit the context length somewhat. And it's a big rewrite.
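
For what it's worth, here is a minimal sketch of what such an FP32 path could look like for a single matrix-vector product. This is an illustration only, with made-up names; it is not exllama's actual kernel.

// Hypothetical illustration (not exllama's kernel): weights and activations stay
// FP16 in memory, but every value is widened with __half2float before any math,
// so the arithmetic runs at the card's FP32 rate. If the activations and cache
// are then also kept in FP32 end to end, that is where the extra memory cost
// (and the reduced context length) comes from.
#include <cuda_fp16.h>

__global__ void matvec_fp32_path(const half* __restrict__ w,   // [rows, cols], row-major
                                 const half* __restrict__ x,   // [cols]
                                 float*      __restrict__ y,   // [rows]
                                 int rows, int cols)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;

    const half* wr = w + (size_t)row * cols;
    float acc = 0.0f;                                           // FP32 accumulator
    for (int c = 0; c < cols; ++c)
        acc += __half2float(wr[c]) * __half2float(x[c]);        // widen, then multiply-add in FP32

    y[row] = acc;
}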

Ph0rk0z commented 1 year ago

Just tested, and perf isn't good: 1.x it/s with no context. Maybe there is a way to just stop doing the matmul in FP16?

For reference this is how autoGPTQ does it in float https://github.com/PanQiWei/AutoGPTQ/blob/main/auto_gptq/nn_modules/qlinear_old.py

I run it with use_cuda_fp16=false.

turboderp commented 1 year ago

Yep, it converts everything to FP32 on the fly. It's hard to get to 160 tokens/second that way, and hard to run a 30B model at full context length when the state takes up twice as much space. But I have some ideas to try out, once I find a convenient way to test on a Pascal card.

Ph0rk0z commented 1 year ago

I've got 2 of them so if you need anything tested I can run it.

I wish I was getting 160 t/s, but for some reason I'm not on the 3090s. I think that's a whole separate issue. Not sure what's causing it, whether it's my Xeon v4 / PCIe 3 system or something up with my environment. I'm only getting 27 it/s on the 7B, so it has to be something.

waan1 commented 1 year ago

Having read up on it a bit, good performance on P40 might be a ways off, unfortunately. Apparently its FP16 performance is 1/64 of its FP32 performance.

Tesla P40: INT8 47.0 TOPS, FP32 11.8 TFLOPS. Is it possible to use INT8 instead? https://downloads.dell.com/manuals/all-products/esuprt_solutions_int/esuprt_solutions_int_solutions_resources/high-computing-solution-resources_white-papers13_en-us.pdf

Some people somehow achieve roughly 4080-level performance on a P40? https://www.reddit.com/r/LocalLLaMA/comments/13n8bqh/my_results_using_a_tesla_p40/

jterry333 commented 1 year ago

I've got an extra P40 I can send you if you think it'll help you crank up the performance on these things

jmoney7823956789378 commented 1 year ago

I could do the same with an MI25 or MI60, if that was something you wanted.

Ph0rk0z commented 1 year ago

int8

Bitsandbytes perf for the P40 is not good either, about 1/2 speed as well.

Ph0rk0z commented 1 year ago

Just tested with the half2 overload removed (if I did it right), and speed definitely went up on the P6000. This is the 7B, though.

Output generated in 30.07 seconds (3.33 tokens/s, 100 tokens, context 18, seed 1652524018)

Any reason not to just have

//__device__ __forceinline__ void atomicAdd(half2* address, half2 val) { atomicAdd_half2(address, val); }

in cuda_compat within the repo itself? Does it do something required for Volta?

turboderp commented 1 year ago

I think I'd need to know for sure exactly when half2 support is provided by CUDA and when it isn't, because there's still a half2 path that needs to compile even if it's never called. Unless it's #ifdef'd out, but then what are the conditions for that, exactly?

ardfork commented 1 year ago

In that case it's because you check for __CUDA_ARCH__ < 700 for both the half and the half2 atomicAdd overloads, when the half2 one should be __CUDA_ARCH__ < 600.

From https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomicadd:

The 32-bit __half2 floating-point version of atomicAdd() is only supported by devices of compute capability 6.x and higher. The atomicity of the __half2 or __nv_bfloat162 add operation is guaranteed separately for each of the two __half or __nv_bfloat16 elements; the entire __half2 or __nv_bfloat162 is not guaranteed to be atomic as a single 32-bit access. The 16-bit __half floating-point version of atomicAdd() is only supported by devices of compute capability 7.x and higher.
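
A sketch of how the guards could be split accordingly. The atomicAdd_half / atomicAdd_half2 names are assumed to be the repo's software fallbacks in cuda_compat.cuh (the half2 one appears in the error output above); exact naming may differ.

// Hypothetical split of the compat overloads, following the compute-capability
// table quoted above; atomicAdd_half / atomicAdd_half2 are the assumed
// CAS-based fallbacks defined elsewhere in the header.
#if defined(__CUDA_ARCH__)

#if __CUDA_ARCH__ < 700
// native atomicAdd(__half*) only exists on compute capability 7.x and up
__device__ __forceinline__ void atomicAdd(half* address, half val) { atomicAdd_half(address, val); }
#endif

#if __CUDA_ARCH__ < 600
// native atomicAdd(__half2*) only exists on compute capability 6.x and up
__device__ __forceinline__ void atomicAdd(half2* address, half2 val) { atomicAdd_half2(address, val); }
#endif

#endif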

Ph0rk0z commented 1 year ago

Pascal is compute 6.1. Not sure how Maxwell fares on this repo; I don't think anyone has tried it yet. Pascal doesn't have an atomicAdd for half, though, unless you write the function for it.


#ifdef __CUDA_ARCH__
#if __CUDA_ARCH__ < 700 && __CUDA_ARCH__ >= 600
// compute capability 6.x (Pascal): atomicAdd(half2*) is native, atomicAdd(half*) is not
// adapted from https://github.com/torch/cutorch/blob/master/lib/THC/THCAtomics.cuh
// (assumes c10::Half from the torch headers is available, as elsewhere in the extension)
__device__ __forceinline__ void atomicAddHalf(__half* address, c10::Half val) {
    // operate on the aligned 32-bit word that contains the 16-bit target
    unsigned int *address_as_ui = reinterpret_cast<unsigned int *>(reinterpret_cast<char *>(address) - (reinterpret_cast<size_t>(address) & 2));
    unsigned int old = *address_as_ui;
    unsigned int assumed;

    do {
        assumed = old;
        // extract the relevant 16 bits, add in half precision, then pack them back
        c10::Half hsum;
        hsum.x = reinterpret_cast<size_t>(address) & 2 ? (old >> 16) : (old & 0xffff);
        hsum = hsum + val;
        old = reinterpret_cast<size_t>(address) & 2
                 ? (old & 0xffff) | (hsum.x << 16)
                 : (old & 0xffff0000) | hsum.x;
        old = atomicCAS(address_as_ui, assumed, old);

    // Note: uses integer comparison to avoid hang in case of NaN (since NaN != NaN)
    } while (assumed != old);
}
#endif
#endif

lhl commented 1 year ago

I dug out an old 1080 Ti (Pascal) the other day to do some comparisons vs a Radeon VII (GCN 5.1 gfx906) and can confirm that @ardfork's #ifdef change worked for the 1080 Ti.

(I saw there's some ongoing work on hipifying for ROCm; happy to run a test. The old Radeon card runs llama.cpp with both CLBlast and hipBLAS, so I'll be curious to see how exllama compares.)

Ph0rk0z commented 1 year ago

half2-HPEC2017.pdf

Supposedly there is a way to pack two half ops into a single FP32-rate operation and gain a speedup, but I'm not sure if that is accomplished only on the P100 or if it also works on the P40. Who knows where that paper's repo went by now, since it's so old, or whether their numbers still hold up. A minimal sketch of the idea follows.
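
For context, the packing the paper describes is what CUDA exposes as half2: two FP16 values sit in one 32-bit register and a single instruction operates on both lanes. Illustrative sketch only, not exllama code:

// half2 packs two FP16 values into 32 bits; __hfma2 does a fused multiply-add
// on both lanes in one instruction. On the P100 this runs at twice the FP32
// rate; on the P40, FP16 instructions are the slow path regardless.
#include <cuda_fp16.h>

__global__ void axpy_half2(half2 alpha, const half2* __restrict__ x, half2* __restrict__ y, int n2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2)
        y[i] = __hfma2(alpha, x[i], y[i]);   // two multiply-adds per instruction
}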

waan1 commented 1 year ago

Half precision would still require 2 bytes per weight, limiting model selection to 3B or 7B; 13B would not fit into 24GB, right? Ideally, packing 8 weights x 4 bits into one FP32 value would increase performance by 8 times minus overhead and allow up to 40B GPTQ models, but I guess that is a lot of work. The P40 does 4 times more INT8 operations than FP32 (47 TOPS vs. 12 TFLOPS). Would it be easier to pack two weights into one INT8? Still a lot of work, though.
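
As far as I understand, the 4-bit GPTQ weights stay packed eight to a 32-bit word in VRAM either way and are unpacked on the fly inside the kernel, so the FP16-vs-FP32 question is mostly about the activations and the arithmetic, not the stored weights. A simplified sketch of that unpacking step (scale/zero handling reduced to one scalar each; not the repo's exact kernel):

// Simplified sketch of the dequantization step in a GPTQ-style kernel: eight
// 4-bit weights come out of one 32-bit word and are scaled to float.
#include <cstdint>

__device__ __forceinline__ void unpack_q4(uint32_t packed, float scale, float zero, float out[8])
{
    for (int i = 0; i < 8; ++i)
    {
        uint32_t q = (packed >> (4 * i)) & 0x0F;   // i-th 4-bit value
        out[i] = (float(q) - zero) * scale;        // dequantize
    }
}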

Ph0rk0z commented 1 year ago

I thought about INT8 as well, but INT8 is missing a hardware matrix multiply.

drgnfr6 commented 1 year ago

@Ph0rk0z That doesn't seem accurate as far as I can tell, though maybe I'm misunderstanding. If so, feel free to brush me off; I'm struggling to trace this back to experience I last used years ago. Below is one of the references I came across when digging, when I had some free time a few weeks ago. I started to implement something to show how we could use it, but frankly got lost about a quarter of the way in. It's been over a decade since I wrote C code in earnest, and it wasn't heavily math-based. I did get far enough to compare the speed of the matmul operation when using it, and it's definitely faster than FP16 or FP32, roughly 4 times faster than equivalent FP32 operations if my timings bore out correctly.

Basically, in CUDA land at least, it seems to boil down to leveraging dp4a (or dp2a, if we have any int16 as input) calls. These appear to be very fast on the P40 (I have 2 of the cards in my current test bed). I'm presuming this might be some standard function call, as I didn't see it in the CUDA-specific documentation; the __dp4a function is also referenced in Intel and AMD documentation, though I didn't dig far enough into that side to say how exactly.

This helps quite a bit regarding use of the dp4a and dp2a functions: https://developer.nvidia.com/blog/mixed-precision-programming-cuda-8/ I'm not sure why no one uses the call in llama.cpp or exllama or similar; it seems to be perfectly functional, compiles under CUDA toolkit 12.2, and is quite fast on P40s (I'd guess on other cards as well, given NVIDIA's specs for int-based ops), but I also couldn't find it in the official docs for the CUDA math API here either: https://docs.nvidia.com/cuda/cuda-math-api/modules.html
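
To make that concrete, here is a minimal self-contained sketch (illustrative only, made-up names; not exllama or llama.cpp code). __dp4a multiplies four packed int8 values from each operand and adds the four products into a 32-bit accumulator in a single instruction, available from compute capability 6.1 (so the P40) onward.

// Minimal __dp4a demo: dot product of two int8 vectors with int32 accumulation.
// Build with something like: nvcc -arch=sm_61 dp4a_demo.cu
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>

__global__ void dot_int8(const int8_t* a, const int8_t* b, int n4, int* out)
{
    // reinterpret as 32-bit words, each holding four packed int8 values
    const int* a4 = reinterpret_cast<const int*>(a);
    const int* b4 = reinterpret_cast<const int*>(b);

    int acc = 0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n4; i += gridDim.x * blockDim.x)
        acc = __dp4a(a4[i], b4[i], acc);     // 4 multiplies + 4 adds per instruction

    atomicAdd(out, acc);                     // crude reduction, fine for a demo
}

int main()
{
    const int n = 4096;                      // multiple of 4
    int8_t *a, *b; int *out;
    cudaMallocManaged(&a, n);
    cudaMallocManaged(&b, n);
    cudaMallocManaged(&out, sizeof(int));
    for (int i = 0; i < n; ++i) { a[i] = 1; b[i] = 2; }
    *out = 0;

    dot_int8<<<8, 128>>>(a, b, n / 4, out);
    cudaDeviceSynchronize();
    printf("dot = %d (expected %d)\n", *out, n * 2);

    cudaFree(a); cudaFree(b); cudaFree(out);
    return 0;
}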

Ph0rk0z commented 1 year ago

Sounds like you got further than me. I am pretty rusty on the math here. This computes the dot product, but what about the matrix product? I thought they were different, and also used in stuff like bitsandbytes for some of the operations.

Extrapolating from this guy's post: https://forums.developer.nvidia.com/t/dp4a-instruction-usage-in-pascal-architecture-gpus/53309/6

We would need to write a matmul function using dp4a, like what was done with atomicAdd, and then see whether it's faster or not. But then everything else in exllama would also have to be changed from the FP16 it currently uses to FP32, because all FP16 ops are slow.

So that original problem remains.

turboderp commented 1 year ago

The FP16 problem remains, but INT8 would present problems of its own. It's an integer type, after all, not a drop-in replacement for floats.