meta-llama / llama-stack

Composable building blocks to build Llama Apps

Enable distribution/ollama for ROCm #363

Open · alexhegit opened this issue 3 weeks ago

alexhegit commented 3 weeks ago

🚀 The feature, motivation and pitch

Ollama provides the Docker image ollama/ollama:rocm, which supports AMD ROCm. I wish distribution/ollama could support AMD ROCm in the same way that https://github.com/meta-llama/llama-stack/tree/main/distributions/ollama/gpu does for NVIDIA GPUs.
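
To make the ask concrete, the proposed layout simply mirrors the existing gpu distribution (the rocm/ directory and file names below are the ones used in my patch):

distributions/ollama/
├── gpu/        # existing, NVIDIA
└── rocm/       # proposed here
    ├── compose.yaml
    └── run.yaml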

I have a patch in my fork, but it does not pass the test.

Please guide me on how to fix it so the PR can be merged.

Alternatives

No response

Additional context

Here are the details of the patch and the test.

Way 1: Using docker compose

Step 1: Create compose.yaml for rocm

Here is the patch https://github.com/alexhegit/llama-stack-rocm/commit/37f2b07c5102351c671b6ae0e8cd85ab4853e661

I created rocm/compose.yaml, using ollama/ollama:rocm in place of ollama/ollama, by referring to https://github.com/meta-llama/llama-stack/blob/main/distributions/ollama/gpu/compose.yaml.

And I reused https://github.com/meta-llama/llama-stack/blob/main/distributions/ollama/gpu/run.yaml as rocm/run.yaml.
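
For context, the rocm/compose.yaml in the patch looks roughly like the sketch below. This is a simplified reconstruction, not the exact file: the ROCm device mappings mirror the docker run flags used in Way 2, the entrypoint matches the process shown later by docker compose top, and the actual commit may use slightly different keys.

services:
  ollama:
    image: ollama/ollama:rocm
    network_mode: host              # hence the "Published ports are discarded" warnings below
    devices:
      - /dev/kfd
      - /dev/dri
    group_add:
      - video
    volumes:
      - ollama:/root/.ollama
  llamastack:
    depends_on:
      - ollama
    image: llamastack/distribution-ollama
    network_mode: host
    volumes:
      - ~/.llama:/root/.llama
      - ./run.yaml:/root/llamastack-run-ollama.yaml
    entrypoint: python -m llama_stack.distribution.server.server --yaml_config /root/llamastack-run-ollama.yaml
volumes:
  ollama: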

Run and Test

Step 1: docker compose up with ollama/ollama:rocm and llamastack/distribution-ollama

$ cd [llama-stack]/distributions/ollama/rocm

$ docker compose up
[+] Running 4/3
 ✔ Container rocm-ollama-1                                               Created                                                                                                                          0.3s
 ! ollama Published ports are discarded when using host network mode                                                                                                                                      0.0s
 ✔ Container rocm-llamastack-1                                           Created                                                                                                                          0.3s
 ! llamastack Published ports are discarded when using host network mode                                                                                                                                  0.0s
Attaching to llamastack-1, ollama-1
ollama-1      | 2024/11/04 17:48:33 routes.go:1158: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES:0 HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
ollama-1      | time=2024-11-04T17:48:33.998Z level=INFO source=images.go:754 msg="total blobs: 9"
ollama-1      | time=2024-11-04T17:48:33.998Z level=INFO source=images.go:761 msg="total unused blobs removed: 0"
ollama-1      | time=2024-11-04T17:48:33.999Z level=INFO source=routes.go:1205 msg="Listening on [::]:11434 (version 0.3.14)"
ollama-1      | time=2024-11-04T17:48:33.999Z level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 rocm_v60102]"
ollama-1      | time=2024-11-04T17:48:33.999Z level=INFO source=gpu.go:221 msg="looking for compatible GPUs"
ollama-1      | time=2024-11-04T17:48:34.003Z level=INFO source=amd_linux.go:383 msg="amdgpu is supported" gpu=0 gpu_type=gfx1100
ollama-1      | time=2024-11-04T17:48:34.004Z level=INFO source=types.go:123 msg="inference compute" id=0 library=rocm variant="" compute=gfx1100 driver=6.7 name=1002:7448 total="45.0 GiB" available="44.5 GiB"
llamastack-1  | /usr/local/lib/python3.10/site-packages/pydantic/_internal/_fields.py:172: UserWarning: Field name "schema" in "JsonResponseFormat" shadows an attribute in parent "BaseModel"
llamastack-1  |   warnings.warn(
ollama-1      | [GIN] 2024/11/04 - 17:49:34 | 200 |     120.776µs |       127.0.0.1 | GET      "/api/ps"
ollama-1      | [GIN] 2024/11/04 - 17:49:34 | 200 |       8.436µs |       127.0.0.1 | GET      "/api/ps"
llamastack-1  | INFO:     Started server process [1]
llamastack-1  | INFO:     Waiting for application startup.
llamastack-1  | INFO:     Application startup complete.
llamastack-1  | INFO:     Uvicorn running on http://['::', '0.0.0.0']:5000 (Press CTRL+C to quit)

Step 2: verify the ollama server

$ ollama run llama3.1:8b-instruct-fp16
>>> who are amd?
AMD stands for Advanced Micro Devices, Inc. They are an American multinational semiconductor company that designs, manufactures, and sells microprocessors, motherboard
chipsets, embedded systems, graphics processing units (GPUs), flash memory devices, and other semiconductor products.
...

It works fine.
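
Since llama-stack talks to Ollama over HTTP, the server can also be checked from the host via Ollama's REST API (port 11434, as in the log above); /api/tags should list llama3.1:8b-instruct-fp16 among the local models:

$ curl http://localhost:11434/api/tags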

Step 3: client test (failed)

$ python -m llama_stack.apis.inference.client localhost 5000
User>hello world, write me a 2 sentence poem about the moon
{"error": {"message": "400: Invalid value: `Llama3.1-8B-Instruct` not registered. Make sure there is an Inference provider serving this model."}}

Step 4: Check the containers

$ docker compose top
rocm-llamastack-1
UID    PID       PPID      C    STIME   TTY   TIME       CMD
root   3934159   3934137   0    01:48   ?     00:00:02   python -m llama_stack.distribution.server.server --yaml_config /root/llamastack-run-ollama.yaml

rocm-ollama-1
UID    PID       PPID      C    STIME   TTY   TIME       CMD
root   3934105   3934078   0    01:48   ?     00:00:00   /bin/ollama serve                                                                                                           
root   3935109   3934105   11   01:50   ?     00:00:22   /usr/lib/ollama/runners/rocm_v60102/ollama_llama_server --model /root/.ollama/models/blobs/sha256-09cd6813dc2e73d9db9345123ee1b3385bb7cee88a46f13dc37bc3d5e96ba3a4 --ctx-size 8192 --batch-size 512 --embedding --n-gpu-layers 33 --threads 12 --parallel 4 --port 35363

Way 2: step by step

Step 1: Start the LLM inference server

# Start ollama/ollama:rocm
docker run -d --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --shm-size 8G -v ollama:/root/.ollama -p 11434:11434 --name ollama-rocm ollama/ollama:rocm

# Run the LLM with ollama (works fine with the CLI test)
ollama run llama3.1:8b-instruct-fp16

# Start the llamastack/distribution-ollama
docker run --network host -it -p 5000:5000 -v ~/.llama:/root/.llama -v ./gpu/run.yaml:/root/llamastack-run-ollama.yaml llamastack/distribution-ollama --yaml_config /root/llamastack-run-ollama.yaml

This mounts https://github.com/meta-llama/llama-stack/blob/main/distributions/ollama/gpu/run.yaml into the llamastack/distribution-ollama container.
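
As a sanity check that the config really lands at the path passed via --yaml_config, the mounted file can be printed by overriding the entrypoint (assuming cat is available in the image):

$ docker run --rm -v ./gpu/run.yaml:/root/llamastack-run-ollama.yaml --entrypoint cat llamastack/distribution-ollama /root/llamastack-run-ollama.yaml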

Step 2: Run the client test

python -m llama_stack.apis.inference.client localhost 5000

It failed; here is the client test log:

$ python -m llama_stack.apis.inference.client localhost 5000
User>hello world, write me a 2 sentence poem about the moon
{"error": {"message": "400: Invalid value: `Llama3.1-8B-Instruct` not registered. Make sure there is an Inference provider serving this model."}}

Log of llamastack/distribution-ollama

$ docker run --network host -it --rm -p 5000:5000 -v ~/.llama:/root/.llama -v ./gpu/run.yaml:/root/llamastack-run-ollama.yaml llamastack/distribution-ollama --yaml_config /root/llamastack-run-ollama.yaml
WARNING: Published ports are discarded when using host network mode
/usr/local/lib/python3.10/site-packages/pydantic/_internal/_fields.py:172: UserWarning: Field name "schema" in "JsonResponseFormat" shadows an attribute in parent "BaseModel"
  warnings.warn(
Resolved 12 providers
 inner-inference => ollama0
 models => __routing_table__
 inference => __autorouted__
 inner-safety => meta0
 inner-memory => meta0
 shields => __routing_table__
 safety => __autorouted__
 memory_banks => __routing_table__
 memory => __autorouted__
 agents => meta0
 telemetry => meta0
 inspect => __builtin__

Initializing Ollama, checking connectivity to server...
Serving API safety
 POST /safety/run_shield
Serving API memory
 POST /memory/insert
 POST /memory/query
Serving API inspect
 GET /health
 GET /providers/list
 GET /routes/list
Serving API models
 GET /models/get
 GET /models/list
 POST /models/register
Serving API shields
 GET /shields/get
 GET /shields/list
 POST /shields/register
Serving API agents
 POST /agents/create
 POST /agents/session/create
 POST /agents/turn/create
 POST /agents/delete
 POST /agents/session/delete
 POST /agents/session/get
 POST /agents/step/get
 POST /agents/turn/get
Serving API inference
 POST /inference/chat_completion
 POST /inference/completion
 POST /inference/embeddings
Serving API memory_banks
 GET /memory_banks/get
 GET /memory_banks/list
 POST /memory_banks/register

Listening on ['::', '0.0.0.0']:5000
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://['::', '0.0.0.0']:5000 (Press CTRL+C to quit)
INFO:     127.0.0.1:48294 - "POST /inference/chat_completion HTTP/1.1" 200 OK
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/server/server.py", line 206, in sse_generator
    async for item in await event_gen:
  File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/routers/routers.py", line 99, in chat_completion
    provider = self.routing_table.get_provider_impl(model)
  File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/routers/routing_tables.py", line 131, in get_provider_impl
    raise ValueError(
ValueError: `Llama3.1-8B-Instruct` not registered. Make sure there is an Inference provider serving this model.
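
The traceback points at the routing table, so the gap seems to be between what Ollama reports as loaded and what the stack registered at startup. A quick way to compare both sides while the model is loaded (ports and routes as shown in the logs above):

$ curl http://localhost:11434/api/ps       # Ollama side: should show llama3.1:8b-instruct-fp16
$ curl http://localhost:5000/models/list   # llama-stack side: should include Llama3.1-8B-Instruct

If /api/ps shows the model but /models/list does not, the registration step is the part to dig into.
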
HabebNawatha commented 3 weeks ago

Hello @alexhegit, I used to have the same problem; I'm not sure if it's supposed to work that way, but whenever I run the model with Ollama first for a few minutes and then run it through the Llama Stack distribution, it works for me. Try running the model with Ollama first as a warm-up, and then start your server. Let me know if this helps!

alexhegit commented 3 weeks ago

> Hello @alexhegit, I used to have the same problem; I'm not sure if it's supposed to work that way, but whenever I run the model with Ollama first for a few minutes and then run it through the Llama Stack distribution, it works for me. Try running the model with Ollama first as a warm-up, and then start your server. Let me know if this helps!

Hi @HabebNawatha, yes. In Way 2 (step by step), I ran `ollama run llama3.1:8b-instruct-fp16` as a warm-up before starting the distribution/ollama server, and I confirmed the Ollama server was ready. BTW, which GPU are you using, NVIDIA or AMD? Way 2 works fine with an NVIDIA GPU on my side, but I hit this issue when trying to enable an AMD ROCm GPU.

HabebNawatha commented 3 weeks ago

@alexhegit Hey! I'm actually using a MacBook Pro with the M2 chip, running it locally, and every time I warm up the model before running the client script it works. But I still think it should not work that way; it should be ready whenever the client script is called.