microsoft / TaskWeaver

A code-first agent framework for seamlessly planning and executing data analytics tasks.
https://microsoft.github.io/TaskWeaver/

ggml cuda errors with ollama llms #387

Closed. Dorozhko-Anton closed this issue 4 months ago.

Dorozhko-Anton commented 4 months ago

**Describe the bug**
When I run any query with Ollama and the all-in-one docker of TaskWeaver, I get CUDA and ggml errors that I don't understand.

**To Reproduce**
Steps to reproduce the behavior:

  1. Start the service in the all-in-one docker with Ollama running in a separate container (I can curl Ollama from inside the all-in-one docker without issues; see the connectivity check sketched after the traceback)
  2. Type the user query "xxx"
  3. See error
    
    File "/app/taskweaver/role/translator.py", line 85, in raw_text_to_post
    for type_str, value, is_end in parser_stream:
    File "/app/taskweaver/role/translator.py", line 273, in parse_llm_output_stream_v2
    for ev in parser:
    File "/app/taskweaver/utils/json_parser.py", line 389, in parse_json_stream
    for chunk in itertools.chain(token_stream, [None]):
    File "/app/taskweaver/role/translator.py", line 57, in stream_filter
    for c in s:
    File "/app/taskweaver/planner/planner.py", line 286, in stream_filter
    for c in s:
    File "/app/taskweaver/llm/__init__.py", line 284, in _stream_smoother
    raise llm_source_error  # type:ignore
    File "/app/taskweaver/llm/__init__.py", line 232, in base_stream_puller
    for msg in stream:
    File "/app/taskweaver/llm/ollama.py", line 116, in _chat_completion
    raise Exception(
    Exception: Failed to get completion with error: an unknown error was encountered while running the model CUDA error: unspecified launch failure
    current device: 0, in function ggml_cuda_op_mul_mat at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:1606
    cudaGetLastError()
    GGML_ASSERT: /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:100: !"CUDA error"
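
For reference, the connectivity check from step 1 can also be scripted. This is a minimal sketch, assuming the standard Ollama REST API (`GET /api/tags`, `POST /api/generate`) on the same `http://<IP>:11434` endpoint used in the configuration below:

```python
# Minimal Ollama connectivity check from inside the all-in-one container.
# Assumes the standard Ollama REST API; replace <IP> with the Ollama host.
import requests

OLLAMA_BASE = "http://<IP>:11434"

# List the models the server has pulled.
tags = requests.get(f"{OLLAMA_BASE}/api/tags", timeout=10)
tags.raise_for_status()
print([m["name"] for m in tags.json().get("models", [])])

# Run a small non-streaming generation to confirm inference works end to end.
resp = requests.post(
    f"{OLLAMA_BASE}/api/generate",
    json={"model": "phi3:medium", "prompt": "Say hello.", "stream": False},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```

If this check succeeds but TaskWeaver still fails, the difference is likely in the request payload TaskWeaver sends, which is what the maintainer suggests inspecting below.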


**Expected behavior**
TaskWeaver should interact with Ollama without problems and should not require CUDA for inference or code execution.
Or:
Make the error messages more precise.

**Screenshots**

**Environment Information (please complete the following information):**
 - docker all-in-one: latest or 0.2-ws
 - LLM that you're using: Ollama's llama3:8b, phi3:medium

**Additional context**
I don't understand why the error is related to LLM inference and involves CUDA and ggml calls when I am using the Ollama inference server.

liqul commented 4 months ago

I don't think the CUDA error is triggered by TaskWeaver, because TaskWeaver only calls the endpoint via its API. Could you share your LLM configuration for TaskWeaver?

Dorozhko-Anton commented 4 months ago

@liqul here is the configuration I use:

    docker run --gpus=all -it \
      -e LLM_API_BASE="http://<IP>:11434" \
      -e LLM_API_KEY="ARBITRARY_STRING" \
      -e LLM_API_TYPE="ollama" \
      -e LLM_MODEL="phi3:medium" \
      -p 48000:8000 \
      --entrypoint bash \
      taskweavercontainers/taskweaver-all-in-one:0.2-ws

    /app/entrypoint_chainlit.sh

    # or define the env vars directly in the container
    export LLM_API_BASE="http://<IP>:11434"
    export LLM_API_KEY="ARBITRARY_STRING"
    export LLM_API_TYPE="ollama"
    export LLM_MODEL="llama3:8b"
    /app/entrypoint_chainlit.sh
    # or
    # python -m taskweaver -p ./project/

The Ollama models are accessible at "http://<IP>:11434" under the names given in LLM_MODEL, both inside and outside the all-in-one container.
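
As a side note, a file-based equivalent of those environment variables might look like the sketch below, following TaskWeaver's `taskweaver_config.json` convention where each `LLM_*` env var maps to a lowercase dotted key; treat the exact key names as an assumption to verify against the docs.

```python
# Sketch: write an (assumed) equivalent of the env-var setup into the
# project's taskweaver_config.json. Key names are assumed, not verified here.
import json

config = {
    "llm.api_base": "http://<IP>:11434",  # same Ollama endpoint as LLM_API_BASE
    "llm.api_key": "ARBITRARY_STRING",    # Ollama ignores the key, but a value is still expected
    "llm.api_type": "ollama",
    "llm.model": "phi3:medium",
}

with open("./project/taskweaver_config.json", "w") as f:
    json.dump(config, f, indent=2)
```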

liqul commented 4 months ago

I don't have a local environment to reproduce this issue, so it is hard for me to debug.

The only thing I can think of is TaskWeaver's request payload, though it is hard to see the correlation. You can tell from the error message that it comes from the server side:

 File "/app/taskweaver/llm/ollama.py", line 116, in _chat_completion
    raise Exception(
Exception: Failed to get completion with error: an unknown error was encountered while running the model CUDA error: unspecified launch failure
  current device: 0, in function ggml_cuda_op_mul_mat at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:1606
  cudaGetLastError()

The prompt of the planner can be found at project/workspace/sessions/<session_id>/planner_prompt_xxxx.json.
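
To isolate whether the payload itself triggers the crash, one could replay that saved prompt against the Ollama server directly, bypassing TaskWeaver. A minimal sketch, assuming the planner prompt file is a JSON list of {"role", "content"} messages and that the server exposes the standard /api/chat endpoint:

```python
# Sketch: replay a saved planner prompt against Ollama's /api/chat endpoint.
# Assumes the prompt file is a JSON list of {"role": ..., "content": ...} messages.
import json
import requests

OLLAMA_BASE = "http://<IP>:11434"
PROMPT_FILE = "project/workspace/sessions/<session_id>/planner_prompt_xxxx.json"

with open(PROMPT_FILE) as f:
    messages = json.load(f)

resp = requests.post(
    f"{OLLAMA_BASE}/api/chat",
    json={"model": "phi3:medium", "messages": messages, "stream": False},
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```

If the same CUDA error appears here, the problem is on the Ollama or driver side rather than in TaskWeaver.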

Dorozhko-Anton commented 4 months ago

@liqul It was an issue with Ollama on a V100 GPU.

I had to use Ollama 0.2.4, which has fixes for V100 GPUs.

Thanks. Closing the issue.
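
For anyone hitting the same error, a quick way to confirm which Ollama version the server is actually running is to query it directly. A small sketch, assuming the documented GET /api/version endpoint:

```python
# Sketch: confirm the Ollama server version before retrying TaskWeaver.
# Assumes Ollama's documented GET /api/version endpoint.
import requests

resp = requests.get("http://<IP>:11434/api/version", timeout=10)
resp.raise_for_status()
print("Ollama server version:", resp.json()["version"])  # expect >= 0.2.4 for the V100 fix
```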