ollama / ollama

Get up and running with Llama 3.1, Mistral, Gemma 2, and other large language models.
https://ollama.com
MIT License

Problem Serving Custom LLAMA3 Using Google Cloud Run #6702

Open Oluwafemi-Jegede opened 1 week ago

Oluwafemi-Jegede commented 1 week ago

What is the issue?

I can run a custom LLAMA3 model locally using this Dockerfile:

FROM ollama/ollama:latest

COPY custom_llama.txt /App/custom_llama.txt

WORKDIR /App

RUN ollama serve & sleep 5 && ollama create ai-agent -f custom_llama.txt && ollama run ai-agent

EXPOSE 11434

However, when I deploy to GCP Cloud Run, I don't see any model running: $URL/api/tags returns {"models":[]}, even though the homepage says "Ollama is running".

FYI: Custom model is LLAMA3:8B

OS

Docker

GPU

No response

CPU

No response

Ollama version

LLAMA3

rick-github commented 1 week ago

What's in custom_llama.txt?

Oluwafemi-Jegede commented 1 week ago

What's in custom_llama.txt? @rick-github

FROM llama3:8b

PARAMETER temperature 0.8
PARAMETER top_k 30
PARAMETER top_p 0.7

PARAMETER stop <|start_header_id|>
PARAMETER stop <|end_header_id|>
PARAMETER stop <|eot_id|>
PARAMETER stop <|reserved_special_token

TEMPLATE """
{{ if .System }}<|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>
{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>
{{ .Response }}<|eot_id|>
"""

SYSTEM You are a bot that helps infer ........

rick-github commented 1 week ago

Where is the llama3:8b model located?

Oluwafemi-Jegede commented 1 week ago

I am not sure I understand what you mean, but shouldn't FROM ollama/ollama:latest in the Dockerfile already resolve that?

rick-github commented 1 week ago

FROM ollama/ollama:latest just pulls the program, not any models. If you want to create a new model, you need to pull the model you want to base your custom one on: ollama pull llama3:8b.

Oluwafemi-Jegede commented 1 week ago

@rick-github Okay, thanks. So the Dockerfile should look like this?

FROM ollama/ollama:latest

COPY custom_llama.txt /App/custom_llama.txt

WORKDIR /App

RUN ollama serve & sleep 5 && ollama pull llama3:8b && ollama create ai-agent -f custom_llama.txt && ollama run ai-agent

EXPOSE 11434

I'm also curious how it runs locally without my having to start the ollama server in the background.

rick-github commented 1 week ago

The RUN commands you have there only run during the container build process. The container automatically starts the ollama server when it's instantiated, so when running locally it's just ready. The final ollama run ai-agent is unnecessary.
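For example (the my-agent tag here is just a placeholder for whatever you name the image), the API answers as soon as the container is up, with no manual ollama serve:

$ docker run -d -p 11434:11434 --name my-agent my-agent
$ curl http://localhost:11434/api/tags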

rick-github commented 1 week ago

Note that the way you are doing this, every time you build the container ollama will re-pull the model, which can be slow, error-prone, and hard on your bandwidth budget. It may be better to pull the model to your workspace just once, and then COPY the model into the container during the build process.
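A sketch of that approach (the ./models directory in the build context is my own assumption; it's a copy of your local model store, ~/.ollama/models, which already contains llama3:8b):

FROM ollama/ollama:latest

COPY custom_llama.txt /App/custom_llama.txt

WORKDIR /App

# reuse the pre-pulled blobs instead of pulling on every build
COPY models /root/.ollama/models

# only the create step still needs the temporary server at build time
RUN ollama serve & sleep 5 && ollama create ai-agent -f custom_llama.txt

EXPOSE 11434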

Oluwafemi-Jegede commented 1 week ago

@rick-github Yeah, thanks for the suggestion; I'll try COPY to reduce the overhead. I tried the Dockerfile below, and I still cannot see any model on Cloud Run after adding the pull command:

FROM ollama/ollama:latest

COPY custom_llama.txt /App/custom_llama.txt

WORKDIR /App

RUN ollama serve & sleep 5 && ollama pull llama3:8b && ollama create ai-agent -f custom_llama.txt

EXPOSE 11434

$URL/api/tags => {"models":[]}

rick-github commented 1 week ago

Worked locally for me. I don't have a GCP account so can't test cloud run. Do you get any logs from the GCP attempt?

Build:

$  docker build -f Dockerfile -t 6702 --progress plain .
...
#8 0.180 2024/09/08 18:12:46 routes.go:1123: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
...
pulling manifest
#8 81.61 pulling 6a0746a1ec1a... 100% ▕████████████████▏ 4.7 GB
...
#8 81.61 success
...
#8 81.66 transferring model data
#8 81.66 using existing layer sha256:6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
...
#8 81.66 success
...
#9 exporting layers 14.9s done
#9 writing image sha256:9c602d2c645c0ced9f6010250c0f7876d771c5634e799cfdcc6c335ed55fc4d6 done
#9 naming to docker.io/library/6702 done
#9 DONE 14.9s

Run:

$ docker run -d --name 6702 6702
4faf0f6f995e88003c274200f743a50615f4146f15e0965cdbed306e89f3c04a
$ docker exec -it 6702 bash
root@4faf0f6f995e:/App# ollama list
NAME            ID              SIZE    MODIFIED
ai-agent:latest 3f2762d3ecf4    4.7 GB  7 minutes ago
llama3:8b       365c0bd3c000    4.7 GB  7 minutes ago
root@4faf0f6f995e:/App# ollama run ai-agent:latest hello
Hello! I'm a bot designed to help infer information from text-based input. I can assist with tasks such as answering questions, summarizing content, and generating ideas. What would you like to talk about
or ask?

root@4faf0f6f995e:/App#

Oluwafemi-Jegede commented 1 week ago

Yeah, same here: it works perfectly locally for me, but when I move to the cloud it just shows "Ollama is running".

rick-github commented 1 week ago

Are you running it in a VM instance in the cloud, or just the container with gcloud compute instances create-with-container?

Oluwafemi-Jegede commented 1 week ago

I am using Google Cloud Run, which is more like a managed container service for running workloads in the cloud, with no direct access to the VMs or Compute Engine instances.

Oluwafemi-Jegede commented 1 week ago

Not ideal, nor my plan, but I created the model with two API requests:

$URL/api/pull => to pull llama3:8b
$URL/api/create (with the content of the model file) => to create the bot model

However, it would be nice to just run the container image, which contains all the config, and have it ready to serve.
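In case it helps anyone else, the two requests were roughly the following (the JSON field names are taken from the Ollama API docs as I understand them, so double-check against your version; the modelfile string is just the content of custom_llama.txt, abbreviated here):

$ curl $URL/api/pull -d '{"model": "llama3:8b"}'
$ curl $URL/api/create -d '{"model": "ai-agent", "modelfile": "FROM llama3:8b\n..."}'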

rick-github commented 1 week ago

It's because the cloud-built container ends up with OLLAMA_MODELS=/home/.ollama/models, while the locally built container uses OLLAMA_MODELS=/root/.ollama/models. I'm not sure why; I assume the build or run process in GCP sets some environment variable (maybe HOME) that results in a different path for ollama state. I don't know enough about GCP to fix this the right way, but a workaround is to set HOME in the Dockerfile:

--- Dockerfile.orig 2024-09-08 23:42:50.799039526 +0200
+++ Dockerfile  2024-09-08 23:34:44.897002700 +0200
@@ -5,6 +5,6 @@

 WORKDIR /App

-RUN ollama serve & sleep 5 && ollama pull llama3:8b && ollama create ai-agent -f custom_llama.txt
+RUN HOME=/home ollama serve & sleep 5 && ollama pull llama3:8b && ollama create ai-agent -f custom_llama.txt

 EXPOSE 11434
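(An alternative I haven't tried, so treat it as an assumption rather than a tested fix: pin the model path explicitly so the build-time RUN and the runtime server agree no matter what HOME ends up being.)

FROM ollama/ollama:latest

# fixed model path, used both by the build-time RUN and by the server at runtime
ENV OLLAMA_MODELS=/root/.ollama/models

COPY custom_llama.txt /App/custom_llama.txt

WORKDIR /App

RUN ollama serve & sleep 5 && ollama pull llama3:8b && ollama create ai-agent -f custom_llama.txt

EXPOSE 11434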

Build and deploy, and when the container starts it will see the models:

$ OLLAMA_HOST=https://test-123412341234.us-west1.run.app:443 ollama list
NAME            ID              SIZE    MODIFIED       
ai-agent:latest 3f2762d3ecf4    4.7 GB  18 minutes ago  
llama3:8b       365c0bd3c000    4.7 GB  18 minutes ago  
$ OLLAMA_HOST=https://test-123412341234.us-west1.run.app:443 ollama run ai-agent hello
Hello! I'm a bot that helps infer the meaning of text. You can provide me with some text, and I'll do my best to understand its meaning and provide you with relevant information or insights.

What would you like to talk about? Do you have any specific topics in mind, or would you like me to suggest some prompts to get us started?

Oluwafemi-Jegede commented 1 week ago

Setting HOME fixed the issue. It will be interesting to know whether the same issue is observed with similar products from other cloud platforms (Azure, AWS) or whether it's just GCP.

Thanks @rick-github