turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

87225fe "Optimize kernel batch performance" breaks some chat queries #290

Closed (bjj closed this issue 5 months ago)

bjj commented 5 months ago

I'm using exllamav2 for the first time. I built from source today (minus the ~3 commits that break Windows builds). I saw some strange behavior and bisected it to the commit in the title.

Expected:

(exllamav2) PS C:\dev\exllamav2> python .\examples\chat.py -m E:\exl2-llm-models\LoneStriker\Nous-Capybara-34B-4.65bpw-h6-exl2 --length 512 --mode nous -sp " " -pt
 -- Model: E:\exl2-llm-models\LoneStriker\Nous-Capybara-34B-4.65bpw-h6-exl2
 -- Options: ['length: 512', 'rope_scale 1.0', 'rope_alpha 1.0']
 -- Loading model...
 -- Loading tokenizer...
 -- Prompt format: nous
 -- System prompt:

User: Sally, a girl, has three brothers. Each brother has two sisters. How many sisters does Sally have?

Sally has one sister. Each brother has two sisters, which means Sally and the other sister.

(Response: 23 tokens, 20.78 tokens/second)

However, with revisions past the one in the title, that query returns various nonsense (or even crashes). Often it produces no output.

Prior to bisecting, I tried to figure out what caused the problem. It's at least partially related to length. A question like "What is darker than pink?" would get an answer. "What is darker than pink? What is darker than pink?" would also get an answer. But "What is darker than pink? What is darker than pink? What is darker than pink?" would get nonsense. However, changing the trailing punctuation to "..." would get an answer.

It's not affected by context length. It happens across multiple quantizations made by different people under the same conditions.
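
The length dependence described above could be probed systematically rather than by hand. The following is a minimal sketch, not part of exllamav2: `repeated_prompt` and `first_failing_count` are hypothetical helper names, and `generate` is an assumed stand-in for whatever function sends a prompt to the model and returns its response text. It scans linearly instead of bisecting, because the report suggests the failure may not be monotonic in length (changing the trailing punctuation alone restores an answer).

```python
from typing import Callable, Optional


def repeated_prompt(question: str, count: int, sep: str = " ") -> str:
    """Return `question` repeated `count` times, joined by `sep`."""
    return sep.join([question] * count)


def first_failing_count(
    question: str,
    generate: Callable[[str], str],
    max_count: int = 32,
) -> Optional[int]:
    """Return the smallest repetition count whose response is empty, or None.

    `generate` is a hypothetical callable (prompt -> response text); it is an
    assumption of this sketch, not an exllamav2 API.
    """
    for count in range(1, max_count + 1):
        response = generate(repeated_prompt(question, count))
        if not response.strip():
            return count
    return None
```

Running this with the same question against a known-good and a suspect build would show whether the failing repetition count moves between them.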

(for your convenience, the chat prompt:

class PromptFormat_nous(PromptFormat):
    description = "Nous Research"

    def __init__(self):
        super().__init__()
        pass

    def default_system_prompt(self):
        return \
            f"""Perform the task to the best of your ability."""

    def first_prompt(self):
        return \
            """<|system_prompt|>\n\n""" + \
            """USER:\n""" + \
            """<|user_prompt|>\n\n""" + \
            """ASSISTANT:\n"""

    def subs_prompt(self):
        return \
            """USER:\n""" + \
            """<|user_prompt|>\n\n""" + \
            """ASSISTANT:\n"""

    def stop_conditions(self, tokenizer):
        return \
            [tokenizer.eos_token_id,
             """</s>""",
             ]

    def encoding_options(self):
        return False, False, True

    def print_extra_newline(self):
        return True

)

bjj commented 5 months ago

The bisect might be a red herring. The behavior of the specific query definitely changes before/after that point, but even before that point you can run into issues just by making longer queries. "Why is the sky blue?" is fine. Start repeating it over and over and you'll hit a point where there's no output. The same queries work fine in llama.cpp with a GGUF quant.

EDIT: Just to clarify, the following is with --amnesia:

User: Why is the sky blue?

The color of the sky appears blue to the human eye due to a phenomenon called Rayleigh scattering. As sunlight reaches Earth's atmosphere, it is scattered, or redirected, in all directions by the sunlight particles, or photons, colliding with air molecules and other particles. Blue light is scattered more than other wavelengths because it travels at higher frequencies and shorter distances. This scattering results in the natural color of the sky appearing blue to the human eye.

User: Why is the sky blue?Why is the sky blue

The color of the sky appears blue due to a phenomenon called Rayleigh scattering. As sunlight reaches Earth's atmosphere, it is scattered, or redirected, in all directions by the gases and particles in the air. Blue light is scattered more than other colors because it travels in all directions and at right angles. This causes the sky to appear blue during clear days.

User: Why is the sky blue? Why is the sky blue? Why is the sky blue?

The color of the sky appears blue due to a phenomenon called Rayleigh scattering. As sunlight passes through the Earth's atmosphere, it interacts with nitrogen and oxygen molecules, causing shorter-wavelength light (blue and violet) to scatter more than longer-wavelength light (red and orange). This scattered blue light reaches our eyes after being scattered in all directions, making the sky appear blue during clear weather conditions.

User: Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue?

It doesn't matter how often you ask "why", the answer will always be the same. The color of the sky appears blue due to a phenomenon called Rayleigh scattering. As sunlight reaches Earth's atmosphere, it is scattered, or redirected, in all directions by the gases and particles in the air. Blue light is scattered more than other colors because it travels in all directions (short wavelengths) and it reaches our eyes with enough intensity to be seen, which makes the sky appear blue during the day.

User: Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue?

User: Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue?

User: Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue?

User:

turboderp commented 5 months ago

I'm having a hard time reproducing this.

(venv) [bb@bbc exllamav2]$ python examples/chat.py -m /mnt/str/models/nous-capybara-34b-exl2/4.65bpw/ -mode nous -gs auto -l 512 -sp " " --amnesia
 -- Model: /mnt/str/models/nous-capybara-34b-exl2/4.65bpw/
 -- Options: ['gpu_split: auto', 'length: 512']
 -- Loading tokenizer...
 -- Loading model...
 -- Prompt format: nous
 -- System prompt:

User: Why is the sky blue? 

The sky appears blue due to a phenomenon called Rayleigh scattering. As sunlight passes through the Earth's atmosphere, it interacts with air molecules, water droplets, and dust particles. The short-wavelength blue light is scattered more effectively than the longer-wavelength red light, due to the size of these particles in the atmosphere. As a result, the blue light is scattered in all directions, making the sky appear blue during the day.

User: Why is the sky blue? Why is the sky blue? Why is the sky blue? 

The sky appears blue due to a phenomenon called Rayleigh scattering. As sunlight passes through the Earth's atmosphere, it interacts with air molecules, water droplets, and dust particles. The short-wavelength blue light is scattered more effectively than the longer-wavelength red light, due to the size of these particles in the atmosphere. As a result, the blue light is scattered in all directions, making the sky appear blue during the day.

User: Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? 

The sky appears blue due to a phenomenon called Rayleigh scattering. As sunlight passes through Earth's atmosphere, it interacts with air molecules and water droplets, causing the short-wavelength blue light to scatter more effectively than the longer-wavelength red light. This scattered blue light is then redirected in all directions, making the sky appear blue during the day.

User: 

Or for your first example, with the most recent commit:

(venv) [bb@bbc exllamav2]$ python examples/chat.py -m /mnt/str/models/nous-capybara-34b-exl2/4.65bpw/ -mode nous -gs auto -l 512 -sp " "
 -- Model: /mnt/str/models/nous-capybara-34b-exl2/4.65bpw/
 -- Options: ['gpu_split: auto', 'length: 512']
 -- Loading tokenizer...
 -- Loading model...
 -- Prompt format: nous
 -- System prompt:

User: Sally, a girl, has three brothers. Each brother has two sisters. How many sisters does Sally have?

Sally has only one sister. Each brother has two sisters, which means Sally and the other sister.

I tried it with this quant which is a little old and likely uses the old calibration method, but I'd be interested in what other models exhibit the same behavior.

bjj commented 5 months ago

I had already tried other LoneStriker variants, so I also tried lucyknada/Nous-Capybara-34B-exl2-4bpw to get someone else's quant. It behaved the same as in the original bug report. Happy to test against any exl2 you want to suggest. I tried to find another similar 30b one to test but didn't see anything obvious. I grabbed LoneStriker/Orca-2-13b-8.0bpw-h8-exl2 but it doesn't load.

BUT! for the blue sky the results are interesting:

 -- Model: E:\exl2-llm-models\lucyknada\Nous-Capybara-34B-exl2-4bpw\
 -- Options: ['length: 7000']
 -- Loading model...
 -- Loading tokenizer...
 -- Prompt format: nous
 -- System prompt:

Perform the task to the best of your ability.

User: Why is the sky blue?

The sky appears blue due to a phenomenon called Rayleigh scattering. As sunlight reaches Earth's atmosphere, it is scattered, or redirected, in all directions by the gases and particles in the air. Blue light is scattered more than other colors because it travels in smaller, shorter waves. This scattered blue light is what we see when we look up at the sky, giving it its characteristic blue hue.

User:  Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue?

The sky appears blue due to a phenomenon called Rayleigh scattering. As sunlight travels through Earth's atmosphere, it interacts with air molecules, water droplets, and dust particles. These particles are much smaller than the wavelengths of visible light, causing the shorter-wavelength colors (green, yellow, orange, and red) to be scattered more effectively than the longer-wavelength colors (blue and violet). Since our eyes are most sensitive to the color blue, the sky appears blue during a clear day.

User:  Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue?

scattering. The short-wavelength blue light is scattered more than longer wavelengths, such as red light. This scattered blue light reaches our line of sight, it spreads out and disperses into various colors. Blue light is scattered in all directions, while other colors are not scattered as much. As a result, blue light is scattered to our eyes from all directions, making the sunlight's wavelength, causing the blue light to scatter more effectively. The scattered blue light reaches our eyes, making the sky appear blue during the day.

User:  Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue? Why is the sky blue?

particles, causing the blue light to spread out and reach our eyes, making the sky appear blue.

It's like the output gets truncated relative to the length of the input.
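
One way to flag this pattern automatically across many runs is a crude check for responses that are empty or begin mid-sentence. This is only a sketch under stated assumptions: `looks_cut_at_start` is a hypothetical helper, and the heuristic assumes English chat answers that normally open with a capital letter, a digit, or an opening quote or bracket. It will misfire on answers that legitimately start in lowercase.

```python
def looks_cut_at_start(response: str) -> bool:
    """Heuristic: True if the response is empty or appears to begin mid-sentence."""
    text = response.lstrip()
    if not text:
        return True
    first = text[0]
    # A complete answer normally opens with a capital letter, a digit,
    # or an opening quote/bracket; anything else is treated as suspicious.
    return not (first.isupper() or first.isdigit() or first in "\"'([")
```

Applied to the transcripts above, the responses beginning "scattering. The short-wavelength..." and "particles, causing..." would be flagged, while "The sky appears blue..." would not.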

Happy to run any test you want.

turboderp commented 5 months ago

That does indeed look like truncation, though I'm not sure what the reason would be. What version of Torch are you using, and what GPU?
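
A short script can gather these details in one go. The sketch below uses only standard PyTorch attributes (`torch.__version__`, `torch.version.cuda`, `torch.cuda.is_available`, `torch.cuda.device_count`, `torch.cuda.get_device_name`); the function name `environment_report` is made up for illustration. Note that `torch.version.cuda` reports the CUDA version PyTorch was built against, which can differ from the toolkit on the system path.

```python
import platform


def environment_report() -> dict:
    """Collect the Python, PyTorch, CUDA, and GPU details relevant to a bug report."""
    report = {"python": platform.python_version(), "os": platform.system()}
    try:
        import torch
    except ImportError:
        report["torch"] = None  # PyTorch is not installed in this environment
        return report
    report["torch"] = torch.__version__
    report["torch_cuda"] = torch.version.cuda  # CUDA version PyTorch was built against
    if torch.cuda.is_available():
        report["gpus"] = [
            torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())
        ]
    else:
        report["gpus"] = []
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```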

bjj commented 5 months ago

(exllamav2) PS C:\dev\exllamav2> conda list -n exllamav2
# packages in environment at C:\Users\benja\miniconda3\envs\exllamav2:
#
# Name                    Version                   Build  Channel
blas                      1.0                         mkl
bottleneck                1.3.5           py311h5bb9823_0
bzip2                     1.0.8                he774522_0
ca-certificates           2023.12.12           haa95532_0
cramjam                   2.7.0           py311h005caf5_0
cuda-cccl                 12.3.101                      0    nvidia
cuda-cudart               12.1.105                      0    nvidia
cuda-cudart-dev           12.1.105                      0    nvidia
cuda-cupti                12.1.105                      0    nvidia
cuda-libraries            12.1.0                        0    nvidia
cuda-libraries-dev        12.1.0                        0    nvidia
cuda-nvrtc                12.1.105                      0    nvidia
cuda-nvrtc-dev            12.1.105                      0    nvidia
cuda-nvtx                 12.1.105                      0    nvidia
cuda-opencl               12.3.101                      0    nvidia
cuda-opencl-dev           12.3.101                      0    nvidia
cuda-profiler-api         12.3.101                      0    nvidia
cuda-runtime              12.1.0                        0    nvidia
cudatoolkit               11.8.0               hd77b12b_0
fastparquet               2023.8.0        py311hd7041d2_0
filelock                  3.13.1          py311haa95532_0
fsspec                    2023.10.0       py311haa95532_0
gmpy2                     2.1.2           py311h7f96b67_0
intel-openmp              2023.1.0         h59b6b97_46320
jinja2                    3.1.2           py311haa95532_0
libcublas                 12.1.0.26                     0    nvidia
libcublas-dev             12.1.0.26                     0    nvidia
libcufft                  11.0.2.4                      0    nvidia
libcufft-dev              11.0.2.4                      0    nvidia
libcurand                 10.3.4.107                    0    nvidia
libcurand-dev             10.3.4.107                    0    nvidia
libcusolver               11.4.4.55                     0    nvidia
libcusolver-dev           11.4.4.55                     0    nvidia
libcusparse               12.0.2.55                     0    nvidia
libcusparse-dev           12.0.2.55                     0    nvidia
libffi                    3.4.4                hd77b12b_0
libnpp                    12.0.2.50                     0    nvidia
libnpp-dev                12.0.2.50                     0    nvidia
libnvjitlink              12.1.105                      0    nvidia
libnvjitlink-dev          12.1.105                      0    nvidia
libnvjpeg                 12.1.1.14                     0    nvidia
libnvjpeg-dev             12.1.1.14                     0    nvidia
libuv                     1.44.2               h2bbff1b_0
m2w64-gcc-libgfortran     5.3.0                         6
m2w64-gcc-libs            5.3.0                         7
m2w64-gcc-libs-core       5.3.0                         7
m2w64-gmp                 6.1.0                         2
m2w64-libwinpthread-git   5.0.0.4634.697f757               2
markupsafe                2.1.3           py311h2bbff1b_0
mkl                       2023.1.0         h6b88ed4_46358
mkl-service               2.4.0           py311h2bbff1b_1
mkl_fft                   1.3.8           py311h2bbff1b_0
mkl_random                1.2.4           py311h59b6b97_0
mpc                       1.1.0                h7edee0f_1
mpfr                      4.0.2                h62dcd97_1
mpir                      3.0.0                hec2e145_1
mpmath                    1.3.0           py311haa95532_0
msys2-conda-epoch         20160418                      1
networkx                  3.1             py311haa95532_0
ninja                     1.10.2               haa95532_5
ninja-base                1.10.2               h6d14046_5
numexpr                   2.8.7           py311h1fcbade_0
numpy                     1.26.3          py311hdab7c0b_0
numpy-base                1.26.3          py311hd01c5d8_0
openssl                   3.0.12               h2bbff1b_0
packaging                 23.1            py311haa95532_0
pandas                    2.1.4           py311hf62ec03_0
pip                       23.3.1          py311haa95532_0
pygments                  2.15.1          py311haa95532_1
python                    3.11.7               he1021f5_0
python-dateutil           2.8.2              pyhd3eb1b0_0
python-tzdata             2023.3             pyhd3eb1b0_0
pytorch                   2.1.2           py3.11_cuda12.1_cudnn8_0    pytorch
pytorch-cuda              12.1                 hde6ce7c_5    pytorch
pytorch-mutex             1.0                        cuda    pytorch
pytz                      2023.3.post1    py311haa95532_0
pyyaml                    6.0.1           py311h2bbff1b_0
regex                     2023.10.3       py311h2bbff1b_0
safetensors               0.4.0           py311hcbdf901_0
sentencepiece             0.1.99          py311h59b6b97_0
setuptools                68.2.2          py311haa95532_0
six                       1.16.0             pyhd3eb1b0_1
sqlite                    3.41.2               h2bbff1b_0
sympy                     1.12            py311haa95532_0
tbb                       2021.8.0             h59b6b97_0
tk                        8.6.12               h2bbff1b_0
typing_extensions         4.9.0           py311haa95532_1
tzdata                    2023d                h04d1e81_0
vc                        14.2                 h21ff451_1
vs2015_runtime            14.27.29016          h5e58377_2
websockets                10.4            py311h2bbff1b_1
wheel                     0.41.2          py311haa95532_0
xz                        5.4.5                h8cc25b3_0
yaml                      0.2.5                he774522_0
zlib                      1.2.13               h8cc25b3_0

It's a 3090 Ti. For a smaller model I could test on a 4080, and in a week or so I'll have a 3090.

I have CUDA 11.8 and 12.3 installed, but 11.8 is in my path. The NVIDIA driver is current as of today.

bjj commented 5 months ago

I found my issue with loading Orca 2 13b (user error; my added_tokens.json file was not the right file).

It exhibits the truncation error as well:

(exllamav2) PS C:\dev\exllamav2> python .\examples\chat.py -m E:\exl2-llm-models\LoneStriker\Orca-2-13b-8.0bpw-h8-exl2\ --length 512 --mode chatml -sp "Perform the task to the best of your ability." --amnesia
 -- Model: E:\exl2-llm-models\LoneStriker\Orca-2-13b-8.0bpw-h8-exl2\
 -- Options: ['length: 512']
 -- Loading model...
 -- Loading tokenizer...
 -- Prompt format: chatml
 -- System prompt:

Perform the task to the best of your ability.

User: Why is the sky blue?

The sky appears blue because of a process called Rayleigh scattering. When sunlight enters Earth's atmosphere, it encounters tiny molecules of nitrogen, oxygen, and other gases. These molecules are much smaller than the wavelengths of visible light, so they can't absorb the light directly. Instead, they scatter it in all directions.

Blue light has a shorter wavelength and higher frequency than other colors in the visible spectrum. As a result, it is scattered more readily by the gas molecules. This means that more blue light reaches our eyes from all directions, making the sky appear blue to an observer on Earth.

User: Why is the sky blue?Why is the sky blue?Why is the sky blue?Why is the sky blue?Why is the sky blue?Why is the sky blue?

scattering of light. Sunlight is composed of many colors, which
, it interacts with air molecules and smaller particles, it collides with with various sizes of particles, such as dust, water droplets with molecules of nitrogen, oxygen, and other particles. These particles are much smaller than the wavelengths of visible light, so they scatter the light in many directions. Shorter wavelengths, like blue and violet, are scattered more than longer wavelengths, like red and orange. Since our eyes are more sensitive to blue light, we perceive the sky as blue.

User: Why is the sky blue?Why is the sky blue?Why is the sky blue?Why is the sky blue?Why is the sky blue?Why is the sky blue?Why is the sky blue?Why is the sky blue?Why is the sky blue?Why is the sky blue?Why is the sky blue?Why is the sky blue?Why is the sky blue?

atmosphere, the shorter wavelengths, such as oxygen and nitrogen. Blue light, which is

light,
.

different
waves
different
colors of
different
collide with
smaller
encounters
with
molecules like
molecules
molecules
,
including
with
different
molecules
mole
mole
dust
such
such
like
such
such as
molecules
and
aerosols. The
blue
wavelengths
are scattered more than other colors because they has smaller wavelengths and higher frequencies. This is why the sky appears blue to our eyes.

bjj commented 5 months ago

I built my initial environment starting with conda install "pytorch-cuda>=11.8", which resulted in the versions in the reply above (12.1).

I just built a new environment starting with conda install "pytorch-cuda=11.8", and that version stack seems to work fine (new package list below):

# Name                    Version                   Build  Channel
blas                      1.0                         mkl
bottleneck                1.3.5           py311h5bb9823_0
bzip2                     1.0.8                he774522_0
ca-certificates           2023.12.12           haa95532_0
cramjam                   2.7.0           py311h005caf5_0
cuda-cccl                 12.3.101                      0    nvidia
cuda-cudart               11.8.89                       0    nvidia
cuda-cudart-dev           11.8.89                       0    nvidia
cuda-cupti                11.8.87                       0    nvidia
cuda-libraries            11.8.0                        0    nvidia
cuda-libraries-dev        11.8.0                        0    nvidia
cuda-nvrtc                11.8.89                       0    nvidia
cuda-nvrtc-dev            11.8.89                       0    nvidia
cuda-nvtx                 11.8.86                       0    nvidia
cuda-profiler-api         12.3.101                      0    nvidia
cuda-runtime              11.8.0                        0    nvidia
fastparquet               2023.8.0        py311hd7041d2_0
filelock                  3.13.1          py311haa95532_0
fsspec                    2023.10.0       py311haa95532_0
gmpy2                     2.1.2           py311h7f96b67_0
intel-openmp              2023.1.0         h59b6b97_46320
jinja2                    3.1.2           py311haa95532_0
libcublas                 11.11.3.6                     0    nvidia
libcublas-dev             11.11.3.6                     0    nvidia
libcufft                  10.9.0.58                     0    nvidia
libcufft-dev              10.9.0.58                     0    nvidia
libcurand                 10.3.4.107                    0    nvidia
libcusolver               11.4.1.48                     0    nvidia
libcusolver-dev           11.4.1.48                     0    nvidia
libcusparse               11.7.5.86                     0    nvidia
libcusparse-dev           11.7.5.86                     0    nvidia
libffi                    3.4.4                hd77b12b_0
libnpp                    11.8.0.86                     0    nvidia
libnpp-dev                11.8.0.86                     0    nvidia
libnvjpeg                 11.9.0.86                     0    nvidia
libuv                     1.44.2               h2bbff1b_0
m2w64-gcc-libgfortran     5.3.0                         6
m2w64-gcc-libs            5.3.0                         7
m2w64-gcc-libs-core       5.3.0                         7
m2w64-gmp                 6.1.0                         2
m2w64-libwinpthread-git   5.0.0.4634.697f757               2
markupsafe                2.1.3           py311h2bbff1b_0
mkl                       2023.1.0         h6b88ed4_46358
mkl-service               2.4.0           py311h2bbff1b_1
mkl_fft                   1.3.8           py311h2bbff1b_0
mkl_random                1.2.4           py311h59b6b97_0
mpc                       1.1.0                h7edee0f_1
mpfr                      4.0.2                h62dcd97_1
mpir                      3.0.0                hec2e145_1
mpmath                    1.3.0           py311haa95532_0
msys2-conda-epoch         20160418                      1
networkx                  3.1             py311haa95532_0
ninja                     1.10.2               haa95532_5
ninja-base                1.10.2               h6d14046_5
numexpr                   2.8.7           py311h1fcbade_0
numpy                     1.26.3          py311hdab7c0b_0
numpy-base                1.26.3          py311hd01c5d8_0
openssl                   3.0.12               h2bbff1b_0
packaging                 23.1            py311haa95532_0
pandas                    2.1.4           py311hf62ec03_0
pip                       23.3.1          py311haa95532_0
pygments                  2.15.1          py311haa95532_1
python                    3.11.7               he1021f5_0
python-dateutil           2.8.2              pyhd3eb1b0_0
python-tzdata             2023.3             pyhd3eb1b0_0
pytorch                   2.1.2           py3.11_cuda11.8_cudnn8_0    pytorch
pytorch-cuda              11.8                 h24eeafa_5    pytorch
pytorch-mutex             1.0                        cuda    pytorch
pytz                      2023.3.post1    py311haa95532_0
pyyaml                    6.0.1           py311h2bbff1b_0
regex                     2023.10.3       py311h2bbff1b_0
safetensors               0.4.0           py311hcbdf901_0
sentencepiece             0.1.99          py311h59b6b97_0
setuptools                68.2.2          py311haa95532_0
sqlite                    3.41.2               h2bbff1b_0
sympy                     1.12            py311haa95532_0
tbb                       2021.8.0             h59b6b97_0
tk                        8.6.12               h2bbff1b_0
typing_extensions         4.9.0           py311haa95532_1
tzdata                    2023d                h04d1e81_0
vc                        14.2                 h21ff451_1
vs2015_runtime            14.27.29016          h5e58377_2
websockets                10.4            py311h2bbff1b_1
wheel                     0.41.2          py311haa95532_0
xz                        5.4.5                h8cc25b3_0
yaml                      0.2.5                he774522_0
zlib                      1.2.13               h8cc25b3_0

bjj commented 5 months ago

I revisited this because I was seeing a big perf difference between Linux and Windows and wanted to rule out the pytorch version.

The failing env with pytorch=2.1.2+pytorch-cuda=12.1 still fails, but I built another one on pytorch=2.2.0+pytorch-cuda=12.1 and it works. It's slightly faster on Windows than pytorch=2.1.2+pytorch-cuda=11.8. Still 20%+ behind Linux, though.
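
When comparing a failing and a working environment like this, diffing the two `conda list` dumps is quicker than reading them side by side. The following is a minimal sketch with hypothetical helper names (`parse_conda_list`, `diff_environments`); it assumes the whitespace-separated name/version/build/channel layout shown in the lists above and compares only the version column, so it would not catch a case where the version is identical but the build string differs (as with pytorch 2.1.2 built for cuda12.1 versus cuda11.8).

```python
def parse_conda_list(text: str) -> dict:
    """Map package name -> version from `conda list` output, skipping '#' lines."""
    packages = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[0].startswith("#"):
            continue
        packages[fields[0]] = fields[1]
    return packages


def diff_environments(old: str, new: str) -> dict:
    """Return {name: (old_version, new_version)} for packages that differ.

    A version of None means the package is absent from that environment.
    """
    a, b = parse_conda_list(old), parse_conda_list(new)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(a.keys() | b.keys())
        if a.get(name) != b.get(name)
    }
```

Feeding it the two package lists from this thread would surface the CUDA runtime and library packages that moved between 12.1 and 11.8, plus packages such as cudatoolkit that appear in only one list.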

I'll close this because it's clearly an environment issue.