fdm-git opened this issue 8 months ago
If you use 4x 4060 Ti, inference will likely be slower because of the communication overhead between the GPUs (I have not run a benchmark, this is based on theory). If you have a single 24GB GPU, it is better to use that one.
In the backend, OpenChat uses PyTorch, vLLM and Ray, so if you can configure those underlying libraries for AMD GPUs (ROCm), you should in theory be able to run OpenChat models on AMD hardware, similar to how they are now supported by ollama (thanks to all the hard work done by llama.cpp to support AMD GPUs, since ollama is a wrapper around that library).
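As a rough sanity check (a sketch, not something I have verified with OpenChat itself), you can first confirm that a ROCm build of PyTorch works on your card before worrying about vLLM and Ray; the rocm5.7 wheel index below is only an example, match it to your ROCm release:
# Install a ROCm build of PyTorch (pick the index URL that matches your ROCm version)
pip3 install torch --index-url https://download.pytorch.org/whl/rocm5.7
# On ROCm builds torch.version.hip is set and torch.cuda.is_available() reports the AMD GPU
python3 -c "import torch; print(torch.version.hip, torch.cuda.is_available())"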
You can still run OpenChat models with llama.cpp on an AMD GPU using the following guide:
sudo apt-get update
sudo apt-get upgrade
sudo mkdir --parents --mode=0755 /etc/apt/keyrings
wget https://repo.radeon.com/rocm/rocm.gpg.key -O - | \
gpg --dearmor | sudo tee /etc/apt/keyrings/rocm.gpg > /dev/null
# Kernel driver repository for jammy
sudo tee /etc/apt/sources.list.d/amdgpu.list <<'EOF'
deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/amdgpu/5.7.1/ubuntu jammy main
EOF
# ROCm repository for jammy
sudo tee /etc/apt/sources.list.d/rocm.list <<'EOF'
deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/debian jammy main
EOF
# Prefer packages from the rocm repository over system packages
echo -e 'Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600' | sudo tee /etc/apt/preferences.d/rocm-pin-600
sudo apt-get update
sudo apt-get install amdgpu-dkms
sudo apt-get install rocm-hip-libraries
sudo reboot
sudo apt-get install rocm-dev
sudo apt-get install rocm-hip-runtime-dev rocm-hip-sdk
sudo apt-get install rocm-libs
sudo apt-get install rocminfo
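Now run rocminfo to confirm ROCm can see the card; it should list both the CPU and the GPU as HSA agents, as in the output below:
rocminfo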
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES
==========
HSA Agents
==========
*******
Agent 1
*******
Name: AMD Ryzen 5 2600X Six-Core Processor
Uuid: CPU-XX
Marketing Name: AMD Ryzen 5 2600X Six-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3600
BDFID: 0
Internal Node ID: 0
Compute Unit: 12
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 32792028(0x1f45ddc) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 32792028(0x1f45ddc) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 32792028(0x1f45ddc) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 2
*******
Name: gfx1031
Uuid: GPU-XX
Marketing Name: AMD Radeon RX 6750 XT
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 3072(0xc00) KB
L3: 98304(0x18000) KB
Chip ID: 29663(0x73df)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2880
BDFID: 1536
Internal Node ID: 1
Compute Unit: 40
SIMDs per CU: 2
Shader Engines: 2
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 115
SDMA engine uCode:: 80
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 12566528(0xbfc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS:
Size: 12566528(0xbfc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1031
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
sudo usermod -a -G render yourusername
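ROCm's install docs also recommend membership in the video group, and group changes only take effect after you log out and back in (or reboot):
sudo usermod -a -G video yourusername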
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Build with HIP/ROCm support (the Makefile flag is LLAMA_HIPBLAS)
make clean && LLAMA_HIPBLAS=1 HIP_VISIBLE_DEVICES=1 make -j
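# gfx1031 (RX 6750 XT) is not an officially supported ROCm target, so override the ISA to gfx1030 (same RDNA2 family)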
export HSA_OVERRIDE_GFX_VERSION=10.3.0
# Download a GGUF model file from Hugging Face into the models directory
https://huggingface.co/TheBloke/openchat-3.5-0106-GGUF
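For example, to fetch one of the quantized files from that repo (the exact filename depends on which quantization you choose; Q4_K_M is shown here as an assumption), then pass the downloaded .gguf path to the -m flag:
wget -P models https://huggingface.co/TheBloke/openchat-3.5-0106-GGUF/resolve/main/openchat-3.5-0106.Q4_K_M.gguf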
export HSA_OVERRIDE_GFX_VERSION=10.3.0 && export HIP_VISIBLE_DEVICES=1 && sudo ./main -ngl 50 -c 8000 -m models/openchat-3.5-0106-GGUF -p "What are large language models explain with examples and write a sample script using llama.cpp to run openchat model for inference?"
system_info: n_threads = 6 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000, top_k = 40, tfs_z = 1.000, top_p = 0.950, typical_p = 1.000, temp = 0.800, mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 8000, n_batch = 512, n_predict = -1, n_keep = 0
Large language models (LLMs) are artificial intelligence models that have been trained to understand and generate human-like text. They are called "large" because they typically consist of millions or even billions of parameters, which enable them to learn complex patterns and generate more accurate and coherent responses. Examples of popular LLMs include OpenAI's GPT-3, GPT-4, and Google's BERT.
llama.cpp
Please note that llama.cpp is not a real library, but I'll provide you with a general outline to help you understand how to load and run a model for inference.
First, you'll need to include the necessary libraries and declare the required functions:
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <map>
// Declare the functions
std::map<std::string, std::string> parse_yaml(const std::string& file_path);
std::vector<std::string> tokenize(const std::string& text);
std::string generate_response(const std::vector<std::string>& tokens);
Next, you can define the main function to load the model and perform inference:
int main() {
    // Load the model parameters from a YAML file
    std::map<std::string, std::string> model_params = parse_yaml("model_params.yaml");

    // Load the OpenChat model
    // Note: You'll need to replace this with the actual code to load your model
    std::string model_path = model_params["model_path"];
    std::string model = load_model(model_path);

    // Get the input text from the user
    std::string input_text;
    std::cout << "Enter your text: ";
    std::getline(std::cin, input_text);

    // Tokenize the input text
    std::vector<std::string> tokens = tokenize(input_text);

    // Generate a response using the OpenChat model
    std::string response = generate_response(tokens);

    // Print the response
    std::cout << "Response: " << response << std::endl;

    return 0;
}
In this example, the parse_yaml, tokenize, and generate_response functions are placeholders. You'll need to replace them with the appropriate code to parse a YAML file, tokenize the input text, and generate a response using the OpenChat model.
Please note that this is just a high-level outline, and you'll need to adapt it to your specific use case and model. If you're working with a specific library or framework, refer to their documentation for detailed instructions on how to load and run a model for inference.
# If you get following error
"hipErrorNoBinaryForGpu: Unable to find code object for all current devices!"
# Try following
sudo tee /etc/apt/sources.list.d/rocm.list <<'EOF'
deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/debian jammy main
EOF
sudo amdgpu-install --rocmrelease=5.7.0 --usecase=rocm,hip --no-dkms
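After installing the matching ROCm release, a clean rebuild of llama.cpp is a sensible next step (my assumption, since the HIP code objects are produced at build time):
cd llama.cpp
make clean && LLAMA_HIPBLAS=1 make -j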
wow thanks for your detailed reply! Appreciated!
@vikrantrathore Thanks for your detailed answer! BTW, to use the provided openchat server with tensor parallel over multiple GPUs, you can set the tensor parallel argument, e.g.
# N is the number of tensor parallel GPUs
python -m ochat.serving.openai_api_server --model openchat/openchat-3.5-0106 --engine-use-ray --worker-use-ray --tensor-parallel-size N
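Once the server is running, you can sanity-check it with a plain OpenAI-style request; the port (18888) and model name below follow the OpenChat README defaults, so adjust them if your setup differs:
curl http://localhost:18888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "openchat_3.5", "messages": [{"role": "user", "content": "Hello"}]}'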
Hi there and first of all thanks for this great tool!
I was wondering if you could provide any feedback on a single RTX 4090 24GB vs 4x 4060 Ti 16GB. In the end, the tensor core count of the 4x 4060 Ti stack will roughly match the 4090's, and the stack will have 64GB of VRAM in total instead of 24GB. I can't tell whether the memory bandwidth of the 4x 4060 Ti stack will be a bottleneck compared to a single 4090.
One last thing, will AMD GPUs be supported one day?
Thanks in advance!