Slightly better result:
$ python examples/streaming.py
Successfully preprocessed all matching files.
Loading model: /home/llama/models/Marcoroni-7b-GPTQ/
Our story begins in the Scottish town of Auchtermuchty, where once.շ, this is a. I will be able to find it.A, as an alO2, the world and 107898, we’ll. 33.0460616,023507 12
J 394
O53190540.N663139
9.H4-T, A537475
0553,9
44720964-1749761 1991, 186003599503034209295:13,199 (2 5)73007130 4
014096033446
5040 6.P181519B-4
79.W.4066
472732. 538145895345 1143
4371.4588
Prompt processed in 0.02 seconds, 15 tokens, 905.85 tokens/second
Response generated in 4.10 seconds, 250 tokens, 61.04 tokens/second
Can you provide the exact model used? It works correctly on my machine:
python test_inference.py -m models/Llama-2-13B-chat-GPTQ -p "Once upon a time,"
Successfully preprocessed all matching files.
-- Model: models/Llama-2-13B-chat-GPTQ
-- Options: ['rope_scale 1.0', 'rope_alpha 1.0']
-- Loading model...
-- Loading tokenizer...
-- Warmup...
-- Generating (greedy sampling)...
Once upon a time, there was a young woman named Sophia who lived in a small village nestled in the rolling hills of Tuscany. Sophia was a beautiful and kind-hearted person, loved by all who knew her. She spent her days tending to her family's vineyard, and at night, she would sit by the fire and dream of adventure.
One evening, as she was gazing into the flames, a strange and handsome stranger appeared before her. He introduced himself as Leo, a traveler from a far-off land who had been wandering the countryside for many years
Prompt processed in 0.06 seconds, 5 tokens, 90.38 tokens/second
Response generated in 3.55 seconds, 128 tokens, 36.07 tokens/second
@ardfork Which ROCm version? Which GPU?
6_0-bpw-h6 https://huggingface.co/latimar/Synthia-13B-exl2
gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/Synthia-13B-GPTQ
gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/Marcoroni-7b-GPTQ
I have ROCm 5.6.0 on my host and the same nightly PyTorch version as you. I'm using a 6700 XT with the gfx1030 ISA.
I only tested Marcoroni-7b-GPTQ, similar to your last test:
python examples/streaming.py
Successfully preprocessed all matching files.
Loading model: models/Marcoroni-7b-GPTQ
Our story begins in the Scottish town of Auchtermuchty, where once upon a time there lived an old lady named Jeanie.
One day, as she was walking through her garden, she noticed that all the flowers were withering and dying. She knew it must be because they hadn't been watered properly. So, she decided to fetch some water from the well nearby.
As she walked towards the well, she heard a strange noise coming from inside it. Curious, she peeked into the well and saw a beautiful golden fish swimming around. The golden fish asked if she had any spare water for him, since he was so thirsty.
Jeanie was amazed by this extraordinary creature! She agreed to give him some water, but only if he promised to bring her good luck. The golden fish nodded his head eagerly, promising to do just that.
The next morning, when everyone woke up in Auchtermuchty, they discovered that their lives had changed for the better. The crops grew much faster than usual, and people became healthier and happier. It seemed like the golden fish kept its promise, bringing them all good fortune.
In return, the grateful villagers built a special pond near Jeanie's house
Prompt processed in 0.01 seconds, 15 tokens, 1003.77 tokens/second
Response generated in 4.63 seconds, 250 tokens, 53.95 tokens/second
So, it seems that either something is wrong when running on gfx906 or with your ROCm install.
Edit: I also verified that it works correctly with docker.io/rocm/dev-ubuntu-22.04:5.6.1-complete and the latest ROCm nightly torch. Maybe try with a container to isolate a potential ROCm installation problem.
with your ROCm install.
GPTQ works on exllama v1. Transformers version works too.
Are you using the same venv, particularly the same pytorch version?
I have also heard that gfx906 can sometimes have problems at certain optimization levels. Can you try replacing -O3 with -O2, -O1 and -O0 in exllamav2/ext.py? I think you only need to replace the one in extra_cuda_cflags, but replace both just to be sure.
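For reference, a minimal sketch of the kind of JIT-build call ext.py makes (assuming it uses torch.utils.cpp_extension.load; the source list and other arguments here are placeholders, not the real ones). Changing the optimization level is just a matter of editing the two flag lists:

```python
from torch.utils.cpp_extension import load

# Sketch only: the real exllamav2/ext.py lists many sources and extra flags.
exllamav2_ext = load(
    name="exllamav2_ext",
    sources=["exllamav2_ext/ext.cpp"],   # placeholder source list
    extra_cflags=["-O2"],                # was ["-O3"]
    extra_cuda_cflags=["-O2"],           # was ["-O3"]; this is the one that matters
    verbose=True,
)
```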
Replacing -O3 with -O2 or -O1 doesn't have any effect; the output is still garbage. -O0 causes a memory access fault (reason: page not present or supervisor privilege) during warmup.
How did you compile exllamav2 to work with ROCm? Or did you just have these installed in advance, before installing exllamav2?
pytorch-lightning 1.9.4
pytorch-triton-rocm 2.1.0+34f8189eae
torch 2.2.0.dev20230912+rocm5.6
torchaudio 2.2.0.dev20230912+rocm5.6
torchdiffeq 0.2.3
torchmetrics 1.1.2
torchsde 0.2.5
torchvision 0.17.0.dev20230912+rocm5.6
Are you using the same venv, particularly the same pytorch version?
No venv. PyTorch is installed system-wide; other packages are installed per-user.
Can you try replacing -O3 by -O2, -O1 and -O0 in exllamav2/ext.py?
No noticeable effect with -O1 and -O0. Anyway, my GPU works fine with Stable Diffusion and exllama v1.
How did you compile exllamav2 to work with ROCm? In advance before installing exllamav2?
Yes. Just pip3 install --force --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm5.6
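As a quick sanity check that the nightly ROCm build is the one actually being picked up, something like this (plain PyTorch, nothing exllama-specific) can help:

```python
import torch

# On a ROCm build of PyTorch, torch.version.hip is set (it is None on CUDA
# builds) and the usual torch.cuda.* API is routed through HIP.
print("torch:", torch.__version__)
print("HIP:", torch.version.hip)
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```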
Since you are not the only one with a gfx906 that has this problem, there is probably something wrong with ROCm on that GPU: either some emitted code is incorrect or a builtin function is broken.
I'm curious if it has something to do with one of the half2 functions. Can you try this exllama v1 PR: https://github.com/turboderp/exllama/pull/146? If it produces gibberish, we can at least isolate the problem to one of the half2 functions.
Can you try this exllama v1 PR: https://github.com/turboderp/exllama/pull/146?
Same error in https://github.com/PanQiWei/AutoGPTQ/issues/340:
I'm curious if it has something to do with one of the half2 functions.
I don't know. I have AUTOMATIC1111/stable-diffusion-webui installed, and the wiki says: "For many AMD GPUs, you must add --precision full --no-half or --upcast-sampling arguments to avoid NaN errors or crashing."
For me it works both with --precision full --no-half and without those args, but without the args I get roughly half the RAM consumption.
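(The ~2x figure is simply fp32 vs fp16 weights; a rough back-of-the-envelope check in plain PyTorch, with a made-up parameter count for illustration:)

```python
import torch

n_params = 7_000_000_000  # illustrative 7B-parameter model
gib = 2**30
print("fp16:", n_params * torch.finfo(torch.float16).bits // 8 / gib, "GiB")
print("fp32:", n_params * torch.finfo(torch.float32).bits // 8 / gib, "GiB")
# fp32 weights take exactly twice the memory of fp16 weights.
```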
Same error
What do you mean, same error? That's the first time you've posted that one.
That error happens because the PR is based on an older exllama version, which is broken on recent PyTorch. Apply the PR to the current exllama master or backport this patch: https://github.com/turboderp/exllama/commit/3dff8feee545734717cc61d5b1e2422f0a1085ca.
For me it works both with --precision full --no-half and without those args, but without the args I get roughly half the RAM consumption.
Exllama always uses half precision. Exllama v1 can use both half2 and half; half2 was broken (without the PR I linked), so it always fell back to half. Exllama v2 supports only half2 (I saw some code for half support, but I don't think it's complete).
What do you mean, same error? That's the first time you've posted that one.
Updated post. I got a little confused responding to different discussions on a similar topic at the same time.
That error happens because the PR is based on an older exllama version, which is broken on recent PyTorch. Apply the PR to the current exllama master or backport this patch: https://github.com/turboderp/exllama/commit/3dff8feee545734717cc61d5b1e2422f0a1085ca.
In the commit: "...this is fixed in ROCm 5.6."
So I skipped the merge/rebase and just used https://github.com/turboderp/exllama/pull/146 with extra_cflags = ["-O3 -DHIPBLAS_USE_HIP_HALF=1"] in cuda_ext.py, as recommended in https://github.com/PanQiWei/AutoGPTQ/issues/340.
Can you try this exllama v1 PR: https://github.com/turboderp/exllama/pull/146?
Works, no garbage: PYTORCH_ROCM_ARCH=gfx906 python example_chatbot.py -d ~/models/Synthia-13B-GPTQ -un "Jipok" -p prompt_chatbort.txt
In commit:
...this is fixed in ROCm 5.6.
That's just an added comment; the important part was #define hipblasHgemm __compat_hipblasHgemm.
extra_cflags = ["-O3 -DHIPBLAS_USE_HIP_HALF=1"] in cuda_ext.py, as recommended in https://github.com/PanQiWei/AutoGPTQ/issues/340.
Interesting, I didn't know about the HIPBLAS_USE_HIP_HALF option. It looks like it's quite recent (added in 5.5.0), so I don't think I should use it yet since some distros are still on 5.4.0, but it's good to know for future projects.
To get back to your issue, I'm not really sure how to troubleshoot it. You could try running with AMD_LOG_LEVEL=2 and see if you get any warnings or errors other than the "Cannot find the function Cijk_..." ones from rocBLAS, but I doubt you will see anything.
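For reference, a minimal sketch of setting that log level from Python instead of the shell; AMD_LOG_LEVEL is read when the HIP runtime initializes, so it has to be set before torch is imported:

```python
import os

# Must be set before the HIP runtime starts, i.e. before importing torch.
os.environ["AMD_LOG_LEVEL"] = "2"

import torch  # noqa: E402  (import deliberately after env setup)
# ... load the model and generate as usual; HIP warnings and errors now
# appear on stderr.
```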
You could try running with AMD_LOG_LEVEL=2
It spams things like:
:1:hip_code_object.cpp :606 : 9053530252 us: 17097: [tid:0x7f1383233740] Cannot find the function: Cijk_Ailk_Bljk_HHS_BH_MT64x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR1_PLR1_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT4_4_TLDS0_USFGRO0_VAW2_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8
:1:hip_module.cpp :83 : 9053530256 us: 17097: [tid:0x7f1383233740] Cannot find the function: Cijk_Ailk_Bljk_HHS_BH_MT64x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR1_PLR1_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT4_4_TLDS0_USFGRO0_VAW2_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 for module: 0xc05d09d0
Hi, @Jipok. I also encountered the same garbage output with V2, on gfx90a and ROCm 5.6.0. In addition, exllama V1 runs perfectly in my environment. Have you found a solution?
Have you found a solution?
No. But I didn’t really look for it due to lack of free time.
gfx90a
Do I understand correctly that this is a server GPU, similar to my MI50? In that case, perhaps the problem occurs only on server cards (although ROCm usually has problems with desktop cards).
Do I understand correctly that this is a server GPU similar to my MI50?
Yes, my device is an MI210.
I can confirm the issue still exists today, even with the pre-built ROCm wheels now available.
I just tested the 2.5bpw Llama2 chat 7B model on my AMD hardware (with llama mode) and it responded like this:
Each code block was a new chat. It responded very "conservatively" and almost always denied the request. However, it also seemed to read the input prompt wrong (look at the last chat).
I am not sure if this is a ROCm issue, nor whether it is bpw related. When I am home I will compare it to CUDA inference and download all the bpw variants on AMD and NVIDIA (right now I have a super limited internet connection; I just wanted to document this for now).
It becomes a lot more cooperative with a custom prompt like -sp "Just answer the questions."
Still, even with the default prompt those responses are worse than what I've been seeing with 2.3 bits. It's possible there's something wrong with the 2.5 bit quant. I'll have to look into it later.
I use exllama on NVIDIA systems 99% of the time myself; I just wanted to look into AMD reliability. I got a better connection here and tested the 4bpw model:
Just wanted to update that the 4.0bpw works extremely well on AMD hardware, so I apologize for my previous reports. (The 2.4bpw of the 13B CodeLlama instruct seemed to have the same issues as the low-bpw Llama 7B chat version.)
Your issue has nothing to do with the original one.
The problem is that, for some reason, exllama v2 produces gibberish on gfx906 (MI50, MI60, Radeon VII, Radeon Pro VII) and gfx90a (MI210, MI250, MI250X). This issue is a duplicate of #15 but provides a bit more information.
If AMD GPUs give different output or perplexity results than NVIDIA, you should open a separate issue.
Mistral consistently produces something resembling a sentence at the beginning, but it's still a problem.
PYTORCH_ROCM_ARCH=gfx906 python examples/streaming.py
Successfully preprocessed all matching files.
Loading model: /home/llama/models/Mistral-7B-instruct-exl2/
Our story begins in the Scottish town of Auchtermuchty, where once and the other country. The people that have been a strong place of the 587.
23106672048027907470060421520484374847202000751849470410082 and1522240048498241281101208000005877081058949181280489917012987870808210048907045700002804801520118770079220090099529122997730000000008982801900008007008029379690810001177
Prompt processed in 0.04 seconds, 15 tokens, 390.09 tokens/second
Response generated in 5.75 seconds, 250 tokens, 43.50 tokens/second
@Jipok I cannot reproduce the issue on MI210 + ROCm 5.6/5.7.
Specifically, I'm running https://github.com/PanQiWei/AutoGPTQ/blob/bbbc3c334a35d52d830a5299534d3edfdb268a57/tests/test_q4.py#L374. It uses only a subset of the exllama v2 kernels, so that may be the reason why.
@fxmarty Hi fxmarty, I just checked a Radeon VII + ROCm 5.6 with the latest exllamav2 code. The issue still exists, so maybe it's something else in the exllamav2 kernels?
Changing the warp layout in exllamav2/exllamav2_ext/cuda/rms_norm.cu as in https://github.com/turboderp/exllamav2/pull/137, from (32,32) to (16,64), solves this problem. I can now get correct output on ROCm 5.6, gfx90a.
(exllama) root@pzl043-mi210:~/exllama/exllamav2_231031# python test_inference.py -m ../../models/LLaMA-2-13B-chat-GPTQ/ -p "Hello,"
Successfully preprocessed all matching files.
-- Model: ../../models/LLaMA-2-13B-chat-GPTQ/
-- Options: ['rope_scale 1.0', 'rope_alpha 1.0']
-- Loading model...
-- Loading tokenizer...
-- Warmup...
-- Generating...
Hello, I think we're going to have a great time. So, what do you say? Are you ready to take this journey with me?"
I smiled and nodded eagerly, feeling my heart race with excitement. "Yes, I'm definitely ready! Let's do this!"
We high-fived and headed out the door, ready to start our adventure. As we walked down the street, I couldn't help but feel grateful for this new chapter in my life. I was finally taking control of my own happiness, and it felt amazing.
Later that night, as we
-- Response generated in 2.09 seconds, 128 tokens, 61.38 tokens/second (includes prompt eval.)
I can confirm this issue is related to warp size. I tested with many ROCm devices (gfx900, gfx906, gfx908, gfx1030), and the issue occurs on any device with a warp size of 64.
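A quick way to check whether a given machine falls into that category is to read the wavefront size straight out of rocminfo (a small sketch; it assumes rocminfo is on PATH, with field names as in the dump at the bottom of this issue):

```python
import subprocess

# Parse "Wavefront Size: 64(0x40)" lines from rocminfo to see whether the
# visible agents are wave64 devices (gfx906/gfx90a report 64 here).
out = subprocess.run(["rocminfo"], capture_output=True, text=True).stdout
sizes = [line.split(":", 1)[1].strip()
         for line in out.splitlines() if "Wavefront Size" in line]
print("Wavefront sizes:", sizes)
```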
I tested my AMD MI25 and MI60 with Mistral-7B-Instruct-v0.2-GPTQ, and the problem still exists. I also tried using exllamav1, and it works perfectly.
It certainly is fixed for me on those devices as of 5a61d6e821e2522081f2781cccbc27810e4ff62e. Did you compile from recent sources?
I tried to install it from the latest source:
git clone https://github.com/turboderp/exllamav2
cd exllamav2
python setup.py install --user
And received:
python3 test_inference.py -m /home/dev/test/models/TheBloke_Mistral-7B-Instruct-v0.2-GPTQ -p "hello"
-- Model: /home/dev/test/models/TheBloke_Mistral-7B-Instruct-v0.2-GPTQ
-- Options: []
-- Loading model...
-- Loading tokenizer...
-- Warmup...
-- Generating...
hello@ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇
-- Response generated in 5.62 seconds, 128 tokens, 22.79 tokens/second (includes prompt eval.)
Might be something that only shows up with Mistral; could you try a Llama (v2) based model to narrow it down?
I have tried TheBloke_Dolphin-Llama2-7B-GPTQ, TheBloke_WizardLM-7B-uncensored-GPTQ, and TheBloke_Mistral-7B-Instruct-v0.2-GPTQ. Almost identical results: just a bunch of "⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇". Dolphin-Llama2 starts with some text, but after several words starts printing "⁇ ⁇ ⁇". It may be a problem with how exllamav2 was built. I will try on another server with an MI60 and give feedback.
What ROCm version are you using, and are the MI25 and the MI60 in the same system?
@ZanMax Could you try with an FP16 model? Like this one which should fit easily on the MI60 and probably the MI25 too.
If that still fails, I think it narrows the problem down to the RMS norm kernel, which you could verify by swapping the forward and forward_torch functions in rmsnorm.py.
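For anyone wanting to compare against a plain PyTorch path, this is roughly what an RMSNorm reference computation looks like (a sketch only; the actual forward_torch may differ in details such as the eps value or dtype handling):

```python
import torch

def rms_norm_reference(x: torch.Tensor, weight: torch.Tensor,
                       eps: float = 1e-6) -> torch.Tensor:
    # Normalize by the root-mean-square over the hidden dimension, then
    # scale by the learned weight. Computed in fp32 for numerical stability.
    x32 = x.float()
    rms = torch.rsqrt(x32.pow(2).mean(dim=-1, keepdim=True) + eps)
    return (x32 * rms).to(x.dtype) * weight
```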
I tried Llama-2-7B-Chat-fp16, and it works perfectly on both the MI25 and the MI60.
That means the normalization kernel works, at least. It's hard to say what else might be acting up. You can try disabling flash-attn with -nfa; other than that, I don't know. I'll try to write a little diagnosis script later that tests all the custom kernels and functions one by one. In the meantime, perhaps you could try:
python model_diff.py -ma <fp16_model> -mb <gptq_model> -ed <some_parquet_file>
This will run both models in parallel and compare the hidden states between them. While the difference is likely going to explode, it would be helpful to know exactly when that happens, i.e. whether it's after an attention or MLP block.
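The comparison itself is conceptually simple; something along these lines per layer (plain torch, not the actual model_diff.py internals, and the capture of per-layer states is left to whatever loader is used):

```python
import torch

def report_divergence(states_a, states_b):
    # states_a / states_b: per-layer hidden-state tensors captured from the
    # fp16 and quantized model for the same prompt. Prints where the two
    # runs start to diverge.
    for i, (a, b) in enumerate(zip(states_a, states_b)):
        a, b = a.float(), b.float()
        max_abs = (a - b).abs().max().item()
        rel = ((a - b).norm() / (b.norm() + 1e-6)).item()
        print(f"layer {i:3d}: max_abs={max_abs:.3e}  rel={rel:.3e}")
```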
@turboderp I want to share my MI60 for debugging. How can I contact you?
@turboderp @ZanMax I have a dual MI60 system and am having the exact same issue with the question marks. Same thing when manually updating exllamav2 and using it inside text-generation-webui as well.
I have a 7900XTX on order now. Once it arrives I can start doing some ROCm testing myself. Give it a few more days.
Cool, sounds good. I can set you up with an SSH key for my dual MI60 machine too, if you want. It more or less stays on 24/7.
@turboderp You might want a wave64 device too (i.e. an old consumer card, or anything from the MI25 to the MI300). I would be willing to provide an MI25 free of charge if you are in Europe.
Was this ever investigated any further?
I'm going to close this issue here since it seems to be stale, but please open a new issue if there continue to be problems with wave64 cards.
GPTQ models work with exllama v1.
GPU: AMD Instinct MI50
Name in OS: AMD ATI Radeon VII
Arch: gfx906
rocminfo
```
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE
...
*******
Agent 2
*******
  Name:                    gfx906
  Uuid:                    GPU-6f9a60e1732c7315
  Marketing Name:          AMD Radeon VII
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
    L2:                      8192(0x2000) KB
  Chip ID:                 26287(0x66af)
  ASIC Revision:           1(0x1)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   1801
  BDFID:                   1280
  Internal Node ID:        1
  Compute Unit:            60
  SIMDs per CU:            4
  Shader Engines:          4
  Shader Arrs. per Eng.:   1
  WatchPts on Addr. Ranges:4
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          64(0x40)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        40(0x28)
  Max Work-item Per CU:    2560(0xa00)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    16760832(0xffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*** Done ***
```