turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Floating point exception when context length > chunk_size #503

Open cikkle opened 2 weeks ago

cikkle commented 2 weeks ago

I'm using exllama through tabbyAPI and I've been getting a floating point exception when continuing existing chats for the past week or two. After experimenting I noticed this only happened around the 2K mark, and I found that tabbyAPI's default config sets chunk_size to 2048, so I tried uncommenting it and setting it to 4096. Sure enough, I now get the exception past 4096 context instead.

I've tried using different models, playing with the cache mode, context size, and gpu split parameters, deleting and re-cloning the repos, and reinstalling the Python dependencies from scratch in case something else factors into this, but nothing seems to affect it.
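
For anyone trying to reproduce this outside of tabbyAPI, something along these lines should exercise the same code path. It's only a rough sketch, not something I've run as-is: the model path and prompt length are placeholders, and I'm assuming tabbyAPI's chunk_size corresponds to exllamav2's max_input_len.

# Rough repro sketch: process a prompt longer than max_input_len (tabbyAPI's chunk_size)
# so that prompt ingestion has to split it across chunk boundaries.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config("/home/o0/ai/models/text/command-r-v01-35B-exl2")  # placeholder path
config.max_seq_len = 8192
config.max_input_len = 2048            # same role as tabbyAPI's chunk_size (assumption)
config.max_attention_size = 2048 ** 2

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)            # same autosplit loading as in the log above
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()

prompt = "word " * 3000                # roughly 3000 tokens, enough to cross the 2048 boundary
print(generator.generate_simple(prompt, settings, 100))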

TabbyAPI console output (command-r with 8192 max_seq_len and the default 2048 chunk size):

o0@hades:~/ai/tabbyAPI$ python3 main.py 
INFO:     ExllamaV2 version: 0.1.5
INFO:     Your API key is:
INFO:     Your admin key is:
INFO:     
INFO:     If these keys get compromised, make sure to delete api_tokens.yml and restart the server. Have fun!
INFO:     Generation logging is disabled
WARNING:  An unsupported GPU is found in this configuration. Switching to compatibility mode. 
WARNING:  This disables parallel batching and features that rely on it (ex. CFG). 
WARNING:  To disable compatability mode, all GPUs must be ampere (30 series) or newer. AMD GPUs are not supported.
INFO:     Attempting to load a prompt template if present.
INFO:     Using template "default" for chat completions.
INFO:     Loading model: /home/o0/ai/models/text/command-r-v01-35B-exl2
INFO:     Loading with autosplit
/home/o0/.local/lib/python3.10/site-packages/torch/nn/attention/bias.py:205: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered 
internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:505.)
  return scaled_dot_product_attention(
Loading model modules ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 43/43 0:00:00
INFO:     Model successfully loaded.
INFO:     Developer documentation: http://0.0.0.0:5000/redoc
INFO:     Completions: http://0.0.0.0:5000/v1/completions
INFO:     Chat completions: http://0.0.0.0:5000/v1/chat/completions
INFO:     Started server process [48169]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:5000 (Press CTRL+C to quit)
INFO:     127.0.0.1:36074 - "GET /v1/model/list HTTP/1.1" 200
INFO:     127.0.0.1:36082 - "GET /v1/model HTTP/1.1" 200
INFO:     127.0.0.1:33344 - "POST /v1/completions HTTP/1.1" 200
INFO:     Metrics: 340 tokens generated in 15.52 seconds (Queue: 0.0 s, Process: 0 cached tokens and 1413 new tokens at 1435.41 T/s, Generate: 23.4 T/s, 
Context: 1413 tokens) 
INFO:     127.0.0.1:59204 - "POST /v1/completions HTTP/1.1" 200
INFO:     Metrics: 170 tokens generated in 7.11 seconds (Queue: 0.0 s, Process: 1547 cached tokens and 3 new tokens at 33.56 T/s, Generate: 24.23 T/s, 
Context: 1550 tokens) 
INFO:     127.0.0.1:60814 - "POST /v1/completions HTTP/1.1" 200
INFO:     Metrics: 200 tokens generated in 8.45 seconds (Queue: 0.0 s, Process: 1717 cached tokens and 3 new tokens at 34.59 T/s, Generate: 23.92 T/s, 
Context: 1720 tokens) 
INFO:     127.0.0.1:37692 - "POST /v1/completions HTTP/1.1" 200
INFO:     127.0.0.1:41732 - "POST /v1/completions HTTP/1.1" 200
INFO:     Metrics: 147 tokens generated in 6.25 seconds (Queue: 0.0 s, Process: 1919 cached tokens and 1 new tokens at 7898.88 T/s, Generate: 23.51 T/s, 
Context: 1920 tokens) 
INFO:     127.0.0.1:49484 - "POST /v1/completions HTTP/1.1" 200
Floating point exception
o0@hades:~/ai/tabbyAPI$ 

At the time of writing this I'm using the latest commits of tabbyAPI and exllama (exllama installed from source), on Ubuntu 22.04.4 LTS, Python 3.10.12, ROCm 6.1.2 (though for what it's worth, I've encountered someone having the same issue with a pair of 3090s).

turboderp commented 2 weeks ago

Which version were you on before updating? Since there's no stack trace, it would help if it could be narrowed down to the latest updates.

Also, since I guess Torch SDPA is the likely culprit, what version of PyTorch are you using? Is it a nightly build?

cikkle commented 2 weeks ago

I can't give a good answer about the exllama version, unfortunately; I tend to pull and reinstall it nearly every other time I restart tabbyAPI. I just tried reverting to a commit of tabbyAPI that was on 0.1.0 and still encountered the same problem.

Nightly pytorch doesn't really work at all for me in any case, so I'm on the stable rocm version, installed via the command given on their site:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.0
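
For reference, this is a quick way to check which Torch build is actually installed and which SDPA backends it reports as enabled (just a generic diagnostic snippet, nothing tabbyAPI-specific):

# Report the installed Torch build and which scaled-dot-product-attention backends are enabled
import torch

print("torch:", torch.__version__)    # wheel version, e.g. x.y.z+rocm6.0
print("hip:", torch.version.hip)      # ROCm/HIP version the wheel was built against
print("flash sdp:", torch.backends.cuda.flash_sdp_enabled())
print("mem-efficient sdp:", torch.backends.cuda.mem_efficient_sdp_enabled())
print("math sdp:", torch.backends.cuda.math_sdp_enabled())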

turboderp commented 2 weeks ago

When you reverted to the earlier Tabby, did it also actually revert to exllamav2==0.1.0 or did it keep the installed 0.1.5?

cikkle commented 2 weeks ago

I uninstalled my existing version of exllama beforehand; the console logging reported 0.1.0.

turboderp commented 2 weeks ago

Okay, I've committed a potential fix to the dev branch. Are you able to test it?

cikkle commented 2 weeks ago

The dev branch specifically isn't building for me with pip install .; I can't tell from the error where it's getting stuck.

      [29/43] /opt/rocm/bin/hipcc  -I/home/o0/.local/lib/python3.10/site-packages/torch/include -I/home/o0/.local/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/o0/.local/lib/python3.10/site-packages/torch/include/TH -I/home/o0/.local/lib/python3.10/site-packages/torch/include/THC -I/home/o0/.local/lib/python3.10/site-packages/torch/include/THH -I/opt/rocm/include -I/usr/include/python3.10 -c -c /home/o0/ai/exllamav2/exllamav2/exllamav2_ext/hip/comp_units/unit_exl2_2b.hip -o /home/o0/ai/exllamav2/build/temp.linux-x86_64-3.10/exllamav2/exllamav2_ext/hip/comp_units/unit_exl2_2b.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -lineinfo -O3 -DHIPBLAS_USE_HIP_HALF -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=exllamav2_ext -D_GLIBCXX_USE_CXX11_ABI=0 --offload-arch=gfx900 --offload-arch=gfx906 --offload-arch=gfx908 --offload-arch=gfx90a --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx942 -fno-gpu-rdc -std=c++17
      clang: warning: -lineinfo: 'linker' input unused [-Wunused-command-line-argument]
      ninja: build stopped: subcommand failed.
      Traceback (most recent call last):
        File "/home/o0/.local/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2107, in _run_ninja_build
          subprocess.run(
        File "/usr/lib/python3.10/subprocess.py", line 526, in run
          raise CalledProcessError(retcode, process.args,
      subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

      The above exception was the direct cause of the following exception:

      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/home/o0/ai/exllamav2/setup.py", line 92, in <module>
          setup(
        File "/usr/lib/python3/dist-packages/setuptools/__init__.py", line 153, in setup
          return distutils.core.setup(**attrs)
        File "/usr/lib/python3.10/distutils/core.py", line 148, in setup
          dist.run_commands()
        File "/usr/lib/python3.10/distutils/dist.py", line 966, in run_commands
          self.run_command(cmd)
        File "/usr/lib/python3.10/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/usr/lib/python3/dist-packages/setuptools/command/install.py", line 68, in run
          return orig.install.run(self)
        File "/usr/lib/python3.10/distutils/command/install.py", line 619, in run
          self.run_command('build')
        File "/usr/lib/python3.10/distutils/cmd.py", line 313, in run_command
          self.distribution.run_command(command)
        File "/usr/lib/python3.10/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/usr/lib/python3.10/distutils/command/build.py", line 135, in run
          self.run_command(cmd_name)
        File "/usr/lib/python3.10/distutils/cmd.py", line 313, in run_command
          self.distribution.run_command(command)
        File "/usr/lib/python3.10/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/usr/lib/python3/dist-packages/setuptools/command/build_ext.py", line 79, in run
          _build_ext.run(self)
        File "/usr/lib/python3.10/distutils/command/build_ext.py", line 340, in run
          self.build_extensions()
        File "/home/o0/.local/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 870, in build_extensions
          build_ext.build_extensions(self)
        File "/usr/lib/python3.10/distutils/command/build_ext.py", line 449, in build_extensions
          self._build_extensions_serial()
        File "/usr/lib/python3.10/distutils/command/build_ext.py", line 474, in _build_extensions_serial
          self.build_extension(ext)
        File "/usr/lib/python3/dist-packages/setuptools/command/build_ext.py", line 202, in build_extension
          _build_ext.build_extension(self, ext)
        File "/usr/lib/python3.10/distutils/command/build_ext.py", line 529, in build_extension
          objects = self.compiler.compile(sources,
        File "/home/o0/.local/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 683, in unix_wrap_ninja_compile
          _write_ninja_file_and_compile_objects(
        File "/home/o0/.local/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1783, in _write_ninja_file_and_compile_objects
          _run_ninja_build(
        File "/home/o0/.local/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2123, in _run_ninja_build
          raise RuntimeError(message) from e
      RuntimeError: Error compiling objects for extension
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> exllamav2

note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.

Full output: exllamav2_build.log

turboderp commented 1 week ago

Apparently I used a couple of intrinsics not supported by HIP. I pushed a commit with fallback definitions, so it should compile again.