oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.

Intel Arc thread #3761

Closed by oobabooga 2 months ago

oobabooga commented 1 year ago

This thread is dedicated to discussing the setup of the webui on Intel Arc GPUs.

You are welcome to ask questions as well as share your experiences, tips, and insights to make the process easier for all Intel Arc users.

thejacer commented 7 months ago

Edited above, sorry.

On Sun, Feb 4, 2024 at 12:32 PM Kristle Chester wrote:

That's good news.

Have you edited run_arc.sh to use your OpenCL values?

Instead of


export GGML_OPENCL_PLATFORM=2
export GGML_OPENCL_DEVICE=A770

It needs to be something like


export GGML_OPENCL_PLATFORM=0
export GGML_OPENCL_DEVICE=Intel(R)

thejacer wrote:

Platform Name                                   Intel(R) OpenCL Graphics
  Number of devices                             1
  Device Name                                   Intel(R) Graphics [0x56a0]
  Device Vendor                                 Intel(R) Corporation
  Device Version                                OpenCL 3.0 NEO
  Driver Version                                23.35.27191.42
  Device Type                                   GPU
  Max compute units                             512
  Max clock frequency                           2400MHz
  Feature capabilities (Intel)                  DP4A, DPAS
  Global memory size                            16704737280 (15.56GiB)
  (full clinfo output trimmed)

clinfo definitely sees my GPU and correctly reports 16 GB of VRAM.


kcyarn commented 7 months ago

Edited above, sorry.

I now have oobabooga llama.cpp (gguf only) working in WSL2 Ubuntu 22.04. This uses the older CLBlast backend; the newer ones are really nice, but I went with what I was familiar with. I've added everything to the wsl_scripts folder at oobabooga_intel_arc (https://github.com/kcyarn/oobabooga_intel_arc).

Given the complexity on the WSL side, Docker might be the best direction for this one.

Here's a screenshot showing it using the GPU with WSL2 on Windows 11. You may need the Insiders build on Windows 10. [Screenshot 2024-02-07 013213: https://github.com/oobabooga/text-generation-webui/assets/349172/23ff6268-42d6-4f0f-820d-c105de7f6f00]
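For anyone verifying their own WSL2 setup before launching the webui, a quick check along these lines should show the card (the /dev/dxg path is what WSL2 GPU passthrough normally exposes; this is a generic check, not part of the scripts above):

# WSL2 exposes the GPU as /dev/dxg rather than /dev/dri/card*
ls -l /dev/dxg
# The OpenCL runtime should list the Arc device once the Intel compute packages are installed
clinfo -l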

thejacer commented 7 months ago

Amazing. I’ll run this tonight. If there’s anything you want me to test please let me know.


thejacer commented 7 months ago


All of this installed new packages:

sudo add-apt-repository ppa:oibaf/graphics-drivers
sudo apt update
sudo apt upgrade -y

sudo apt install -y vainfo

sudo apt install -y mesa-va-drivers

Run:
vainfo --display drm --device /dev/dri/card0
The output is the UHD Graphics 770.

One of the simpler ways to test whether it's using the GPU. It also grabs a lot of the dependencies needed later.
sudo apt install -y ffmpeg
sudo apt install -y gstreamer1.0-plugins-bad gstreamer1.0-tools gstreamer1.0-vaapi
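For a quick sanity check that VA-API is actually wired up (a generic check, not specific to this setup), something like this should work once ffmpeg and the gstreamer packages above are installed:

# "vaapi" should appear in ffmpeg's list of hardware accelerators
ffmpeg -hide_banner -hwaccels
# The render node path may differ on your system
vainfo --display drm --device /dev/dri/renderD128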

This was new:

sudo usermod -a -G video ${USER}

And this installed new packages:

sudo apt-get install x11-apps -y

I also added all of those lines, which were all missing, to the bottom of my .bashrc. With those changes I can see new information for my CPU and integrated graphics when running clinfo, I can see my GPU when running vainfo, and my renderer string is now my A770 when I run glxinfo | grep OpenGL. I still can't see my GPU when I run intel_gpu_top, though. All of this resulted in a .gguf loading onto my GPU(!) without rebuilding CLBlast, using the text-gen UI I tried setting up days ago. It still used about 80% of my CPU and only ~20% of my GPU during inference, though. About to rebuild CLBlast and see how it goes.
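For reference, the checks mentioned above boil down to roughly the following (intel_gpu_top comes from the intel-gpu-tools package; treat that package name as an assumption for Ubuntu):

# OpenCL platforms and devices; the A770 should be listed
clinfo -l
# VA-API info for the card
vainfo --display drm --device /dev/dri/card0
# The OpenGL renderer string should name the A770
glxinfo | grep "OpenGL renderer"
# Live GPU utilization; needs root and the intel-gpu-tools package
sudo intel_gpu_top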

No change after rebuilding CLBlast and llama.cpp. I might have messed this part up, though; I got lost in the comments on that block. I'll keep trying.
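For reference, rebuilding llama-cpp-python against CLBlast amounts to roughly the following (the same CMake flags as in install_arch.sh further down; --force-reinstall just makes sure the old CPU wheel is replaced):

export CMAKE_ARGS="-DLLAMA_CLBLAST=ON"
export FORCE_CMAKE=1
pip install --force-reinstall --no-cache-dir llama-cpp-python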

DDXDB commented 6 months ago

Draft guide for running Oobabooga on Intel Arc

This needs more eyes and testers before it's worth considering a submission to the main repository.

Install notes

Although editing conda's OpenCL vendor file is a workable option, switching to a standard python3 install with venv improved tokens/second by roughly 71% across all tested models. It also avoids potential problems between older conda libraries and the cutting-edge libraries Intel Arc needs. For now, skipping conda and its CDTs appears to be the most reliable option.

Working model loaders

  • llama.cpp
  • transformers

The latest Intel Extension for Transformers added INT4 inference support for Arc. Hugging Face transformers committed XPU support for the Trainer in September 2023. If any other model loaders use transformers, they may run without much effort. (They may also need a fairly substantial fork. In that case, adding a BigDL model loader is probably a better use of energy. That's just my opinion. My BigDL experiments are still in Jupyter notebooks, but it has been a good experience on both Intel GPUs and CPUs.)

Note: the loaders are hardcoded in modules/loaders.py. Without refactoring them into something more modular, or [shudder] monkey patching, we just have to remember which ones work on our individual systems. Making this more modular and customizable for different combinations of CPUs and GPUs is a much broader effort than getting things working on Intel Arc. It would also need significant support and commitment from the community.

Tested models

  • transformers

    • Llama-2-7B-Chat-HF
    • mistralai_Mistral-7B-Instruct-v0.2
  • llama.cpp

    • llama-2-7b-chat.Q5_K_M.gguf
    • mistral-7b-instruct-v0.2.Q5_K_M.gguf

Not tested

  • Most models
  • Training
  • Parameters
  • Extensions
  • Regular use beyond "does it load and run a few simple prompts"

Note: coqui_tts, silero_tts, whisper_stt, superbooga, and superboogav2 all break the install. Their requirements can be installed without any dependencies, with the missing pieces picked up during debugging. TTS in particular upgrades torch to the wrong version for the Intel extension.
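If you want one of those extensions anyway, an untested workaround following the note above is to install its requirements without dependencies so the Intel torch build is left alone, for example:

# Example for silero_tts; --no-deps keeps pip from replacing the Intel torch build
pip install --no-deps -r extensions/silero_tts/requirements.txt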

Installation instructions

The last two items are just standard things I do with a fresh install or a new graphics card. They may no longer be required. If you already have them installed, check for updates. Intel kicked off 2024 with a lot of updates.

Test machine details

  • Ubuntu 23.10
  • 6.5.0.14.16 generic Linux kernel
  • i7-13700K CPU (driving the display)
  • Intel Arc A770 (not used for display)

Bash scripts

Below are two bash scripts: install_arch.sh and run_arch.sh. They need to be saved or symlinked into the text-generation-webui directory.

Getting started

  1. Download or clone a fresh copy of Oobabooga.
  2. Save the scripts below into text-generation-webui. They should sit in the same folder as one_click.py, cmd_linux.sh, etc.
  3. Make them executable, then run the install script:
    cd text-generation-webui
    chmod +x install_arch.sh run_arch.sh
    ./install_arch.sh
  4. Check your hardware information with clinfo:
    clinfo -l
  5. In run_arc.sh, find GGML_OPENCL_PLATFORM and change it to your platform number, then change GGML_OPENCL_DEVICE to your device name and save the file (see the example just after this list).
  6. Start the server with run_arch.sh. It will use any flags you have saved in CMD_FLAGS.txt, and you can also pass flags to the script, e.g. --listen --api.
    ./run_arch.sh
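To illustrate step 5, clinfo -l output looks roughly like the sketch below; the platform index and device string are examples (taken from an A770 system earlier in this thread), so use whatever your own machine reports:

clinfo -l
# Platform #0: Intel(R) OpenCL Graphics
#  `-- Device #0: Intel(R) Graphics [0x56a0]
# ...which would map to the following in run_arc.sh:
# export GGML_OPENCL_PLATFORM=0
# export GGML_OPENCL_DEVICE=Intel(R)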

Both scripts below are also uploaded to GitHub. This is just a starting point and changes are welcome. Once it's right in bash, we can decide whether to integrate it with oobabooga's start_linux.sh, the requirements files, and one_click.py.

install_arch.sh

#!/bin/bash

# Check if the virtual environment already exists
if [[ ! -d "venv" ]]; then
    # Create the virtual environment
    python -m venv venv
fi

# Activate the virtual environment
source venv/bin/activate

# Intel extension for transformers recently added Arc support.
# See https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/neural_chat/docs/notebooks/build_chatbot_on_xpu.ipynb for additional notes on the dependencies.
# Working model loaders:
#  - llama.cpp
#  - transformers

pip install intel-extension-for-transformers

# Install xpu intel pytorch, not cpu.

pip install torch==2.1.0a0 torchvision==0.16.0a0 torchaudio==2.1.0a0 intel-extension-for-pytorch==2.1.10+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
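# Optional sanity check (a suggested addition, not part of the original guide): confirm the XPU
# build of PyTorch can see the Arc card before continuing. torch.xpu is provided by
# intel-extension-for-pytorch; if this prints False, revisit the drivers / oneAPI environment.
# python -c "import torch, intel_extension_for_pytorch; print(torch.xpu.is_available())"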

# Installing these from requirements_cpu_only.txt causes dependency conflicts with the Intel PyTorch build.

# Install a few of the dependencies for the below.
pip install coloredlogs datasets sentencepiece

pip install --no-deps peft==0.7.* optimum==1.16.* optimum-intel accelerate==0.25.*

# Skip llama-cpp-python and the packages already installed above without their deps.

grep -v -e peft -e optimum -e accelerate -e llama-cpp-python requirements_cpu_only.txt > temp_requirements.txt

pip install -r temp_requirements.txt

# Install the cpuinfo dependency installed by one_click
pip install py-cpuinfo==9.0.0

# Use the correct cmake args for llama-cpp
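# Note (an assumption, not from the original guide): building against CLBlast requires the CLBlast
# and OpenCL development packages on the system, e.g.:
#   sudo apt install -y libclblast-dev ocl-icd-opencl-dev
# Without them, CMake will fail to find CLBlast and the build will error out.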

export CMAKE_ARGS="-DLLAMA_CLBLAST=ON"
export FORCE_CMAKE=1

pip install --no-cache-dir llama-cpp-python

cd extensions

extensions=()  # Create an empty array to store folder names

# List of extensions to exclude.
# Exclude coqui_tts because it causes torch dependency issues with Intel GPUs.
# whisper_stt and silero_tts both force pytorch updates through transitive dependencies. They may be usable without installing their dependencies.
exclude_extensions=(coqui_tts silero_tts whisper_stt superbooga superboogav2)

for folder in */; do
    extensions+=("$folder")
done

echo "${extensions[*]}"

install_extensions=()

for ext in "${extensions[@]}"; do
    should_exclude=false

    for exclude_ext in "${exclude_extensions[@]}"; do
        if [[ "$ext" == *"$exclude_ext"* ]]; then
            should_exclude=true
            break
        fi
    done

    if [ "$should_exclude" = false ]; then
        install_extensions+=("$ext")
    fi
done

# Print the install_extensions
# echo "${install_extensions[@]}"

for extension in "${install_extensions[@]}"; do
    cd "$extension"
    echo -e "\n\n$extension\n\n"
    # Install dependencies from requirements.txt
    if [ -e "requirements.txt" ]; then
        echo "Installing requirements in $dir"
        pip install -r requirements.txt
    else
        echo "No requirements.txt found in $dir"
    fi
    cd ..
done
# Leave the extension directory.
cd ..

# Delete the temp_requirements.txt file.

rm temp_requirements.txt

run_arch.sh

#!/bin/bash
# Uncomment if oneapi is not in your .bashrc
# source /opt/intel/oneapi/setvars.sh
# Activate virtual environment built with install_arc.sh. (Not conda!)
source venv/bin/activate

# Change these values to match your card in clinfo -l
# Needed by llama.cpp

export GGML_OPENCL_PLATFORM=2
export GGML_OPENCL_DEVICE=A770

# Use sudo intel_gpu_top to view your card.

# Capture command-line arguments
flags_from_cmdline="$@"

# Read flags from CMD_FLAGS.txt
flags_from_file=$(grep -v '^#' CMD_FLAGS.txt | grep -v '^$')
# Combine flags from both sources
all_flags="$flags_from_file $flags_from_cmdline"

# Run the Python script with the combined flags
python server.py $all_flags

Is there a native Windows solution? The existing main start_windows didn't work for me; llama.cpp runs on the CPU, and the program keeps reminding me that I don't have CUDA.

kcyarn commented 5 months ago

Not that I'm aware of. Theoretically, it's possible to install native Windows python and the Intel drivers and then use the Linux install without Anaconda shell scripts as a guide to install and run using pip. It depends on whether the Intel drivers support Windows for the necessary libraries and whether there are wheels. If you want to give it a go, I'd start with llama.cpp. If you can get it running natively on the Windows side, move on to llama-cpp-python. Once you have that running (I used a jupyter notebook when I was troubleshooting this), then you have the foundation for oobabooga.

The WSL2 solutions work, but they're really slow. I suspect WSL needs a major kernel update. It flies in Ubuntu, which is my daily driver.

DDXDB commented 5 months ago

I tried compiling llama-cpp-python with SYCL and swapping it in for the webui's llama-cpp-python, but it didn't work.
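For reference, a SYCL build of llama-cpp-python generally looks something like the sketch below; the flag names follow llama.cpp's SYCL documentation from early 2024 and are an assumption here rather than a verified recipe:

# Requires the oneAPI Base Toolkit; icx/icpx are the Intel compilers it provides
source /opt/intel/oneapi/setvars.sh
CMAKE_ARGS="-DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx" \
  FORCE_CMAKE=1 pip install --force-reinstall --no-cache-dir llama-cpp-python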

github-actions[bot] commented 2 months ago

This issue has been closed due to inactivity for 2 months. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.