cesinsingapore opened 5 months ago
It's working fine with another model.
Same behaviour for me.
Temporary workaround: `GPU_LAYERS: 0`
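A minimal sketch of where that goes, assuming `GPU_LAYERS` corresponds to LocalAI's per-model `gpu_layers` option (zero layers offloaded, i.e. CPU-only inference):

```yaml
# sketch: keep all layers on the CPU as a temporary workaround
# (assumes GPU_LAYERS maps to the per-model gpu_layers option)
name: qwen2-7b-instruct
gpu_layers: 0
parameters:
  model: Qwen2-7B-Instruct-Q4_K_M.gguf
```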
Further info:
https://huggingface.co/bartowski/Qwen2-7B-Instruct-GGUF/discussions/1
"You can also enable flash attention for llamacpp which should be able to work around the issue"
Is flash attention already enabled in the current Docker images?
I'm using docker-compose with the latest LocalAI image directly:
version: "3.9" services: api: image: localai/localai:latest-aio-gpu-nvidia-cuda-12 healthcheck: test: ["CMD", "curl", "-f", "http://localhost:8080/readyz"] interval: 1m timeout: 20m retries: 5 ports:
volumes:
deploy: resources: reservations: devices:
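(Side note: the healthcheck above polls `/readyz`; you can run the same check by hand to confirm the API is up before testing the model:)

```sh
# manual version of the compose healthcheck above
curl -f http://localhost:8080/readyz
```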
Have you set `flash_attention: true` in your model YAML file?
Do you mean like this? It still generates broken output after I restarted.
```
root@a4681b4b3146:/build/models# cat qwen2-7b-instruct.yaml
context_size: 4096
f16: true
mmap: true
name: qwen2-7b-instruct
flash_attention: true
parameters:
  model: Qwen2-7B-Instruct-Q4_K_M.gguf
stopwords:
- <|im_end|>
template:
  chat: |
    {{.Input -}}
    <|im_start|>assistant
  chat_message: |
    <|im_start|>{{ .RoleName }}
    {{ if .FunctionCall -}}
    Function call:
    {{ else if eq .RoleName "tool" -}}
    Function response:
    {{ end -}}
    {{ if .Content -}}
    {{.Content }}
    {{ end -}}
    {{ if .FunctionCall -}}
    {{toJson .FunctionCall}}
    {{ end -}}<|im_end|>
  completion: |
    {{.Input}}
  function: |
    <|im_start|>system
    You are a function calling AI model. You are provided with functions to execute. You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions. Here are the available tools:
    {{range .Functions}}
    {'type': 'function', 'function': {'name': '{{.Name}}', 'description': '{{.Description}}', 'parameters': {{toJson .Parameters}} }}
    {{end}}
    For each function call return a json object with function name and arguments
    <|im_end|>
    {{.Input -}}
    <|im_start|>assistant
```
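(For testing, a minimal sketch of a smoke-test request against LocalAI's OpenAI-compatible endpoint, assuming the port 8080 mapping from the compose file above:)

```sh
# smoke test: ask the model for a short reply and inspect the output
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2-7b-instruct",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'
```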
Yes, like that, and it works for me.
@cesinsingapore did you solve your problem?
Nope, it's not solved. The AI replies still don't make sense.