mudler / LocalAI

:robot: The free, Open Source alternative to OpenAI, Claude and others. Self-hosted and local-first. Drop-in replacement for OpenAI, running on consumer-grade hardware. No GPU required. Runs gguf, transformers, diffusers and many other model architectures. Features: text, audio, video, and image generation, voice cloning, and distributed inference.
https://localai.io
MIT License

Loop in answer #969

Closed Noooste closed 1 year ago

Noooste commented 1 year ago

LocalAI version: latest (v1.24.1-38-g9e5fb29, per the logs below)

Environment, CPU architecture, OS, and Version:
OS: Debian GNU/Linux 11 (bullseye)
CPU architecture: x86_64

Linux euw1 5.10.0-21-amd64 #1 SMP Debian 5.10.162-1 (2023-01-21) x86_64 GNU/Linux

Describe the bug

The model's answer is repeated in a loop: the same sentence is generated over and over until the output is cut off.

To Reproduce

First, apply the model with:

    curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
      "url": "https://raw.githubusercontent.com/go-skynet/model-gallery/main/mpt-7b-chat.yaml"
    }'
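
Once the gallery job finishes downloading, you can confirm the model is registered before querying it. A minimal check against the standard OpenAI-compatible model-listing endpoint (same $LOCALAI base URL as above):

    curl $LOCALAI/v1/models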

Then send a chat request with this Python code:

    import requests

    url = "http://localhost:8080/v1/chat/completions"

    resp = requests.post(url, json={
        "model": "ggml-mpt-7b-chat.bin",
        "messages": [{"role": "user", "content": "How are you ?"}],
        "temperature": 0.1,
    })

    print(resp.json()["choices"][0]["message"]["content"])
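
For a quick programmatic view of the loop (a hypothetical check, not part of the original report), you can count how often the reply's first line recurs:

    # Hypothetical helper: the buggy reply repeats its first line dozens of times.
    content = resp.json()["choices"][0]["message"]["content"]
    lines = content.splitlines()
    if lines:
        print(f"{lines.count(lines[0])} of {len(lines)} lines repeat the first line")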

Expected behavior

The answer should be generated once, with no repetitions.

Logs

~/LocalAI# ./local-ai --threads 8 --address localhost:8080 --debug
10:02PM DBG no galleries to load
10:02PM INF Starting LocalAI using 8 threads, with models path: /root/LocalAI/models
10:02PM INF LocalAI version: v1.24.1-38-g9e5fb29 (9e5fb2996582bc45e5a9cbe6f8668e7f1557c15a)
10:02PM DBG Model: mpt-7b-chat (config: {PredictionOptions:{Model:ggml-mpt-7b-chat.bin Language: N:0 TopP:0.7 TopK:80 Temperature:0.2 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:mpt-7b-chat F16:true Threads:0 Debug:false Roles:map[assistant:Assistant: system:System: user:User:] Embeddings:false Backend:gpt4all-mpt TemplateConfig:{Chat:mpt-chat ChatMessage: Completion:mpt-completion Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:1024 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0}})
10:02PM DBG Extracting backend assets files to /tmp/localai/backend_data
┌───────────────────────────────────────────────────┐
│                   Fiber v2.48.0                   │
│               http://127.0.0.1:8080               │
│                                                   │
│ Handlers ............ 59  Processes ........... 1 │
│ Prefork ....... Disabled  PID ........... 2925529 │
└───────────────────────────────────────────────────┘
10:02PM DBG Request received:
10:02PM DBG Configuration read: &{PredictionOptions:{Model:ggml-mpt-7b-chat.bin Language: N:0 TopP:0.7 TopK:80 Temperature:0.1 Maxtokens:512 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name: F16:false Threads:8 Debug:true Roles:map[] Embeddings:false Backend: TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:512 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0}}
10:02PM DBG Parameters: &{PredictionOptions:{Model:ggml-mpt-7b-chat.bin Language: N:0 TopP:0.7 TopK:80 Temperature:0.1 Maxtokens:512 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name: F16:false Threads:8 Debug:true Roles:map[] Embeddings:false Backend: TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:512 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0}}
10:02PM DBG Prompt (before templating): how are you ?
10:02PM DBG Template failed loading: failed loading a template for ggml-mpt-7b-chat.bin
10:02PM DBG Prompt (after templating): how are you ?
10:02PM DBG Loading model 'ggml-mpt-7b-chat.bin' greedly from all the available backends: llama, llama-stable, gpt4all, falcon, gptneox, bert-embeddings, falcon-ggml, gptj, gpt2, dolly, mpt, replit, starcoder, bloomz, rwkv, whisper, stablediffusion, piper
10:02PM DBG [llama] Attempting to load
10:02PM DBG Loading model llama from ggml-mpt-7b-chat.bin
10:02PM DBG Loading model in memory from file: /root/LocalAI/models/ggml-mpt-7b-chat.bin
10:02PM DBG Loading GRPC Model llama: {backendString:llama model:ggml-mpt-7b-chat.bin threads:8 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc000382000 externalBackends:map[] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false}
10:02PM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama
10:02PM DBG GRPC Service for ggml-mpt-7b-chat.bin will be running at: '127.0.0.1:37345'
10:02PM DBG GRPC Service state dir: /tmp/go-processmanager2291379687
10:02PM DBG GRPC Service Started
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:37345: connect: connection refused"
10:02PM DBG GRPC(ggml-mpt-7b-chat.bin-127.0.0.1:37345): stderr 2023/08/27 22:02:50 gRPC Server listening at 127.0.0.1:37345
10:02PM DBG GRPC Service Ready
10:02PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:} sizeCache:0 unknownFields:[] Model:ggml-mpt-7b-chat.bin ContextSize:512 Seed:0 NBatch:512 F16Memory:false MLock:false MMap:false VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:0 MainGPU: TensorSplit: Threads:8 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/root/LocalAI/models/ggml-mpt-7b-chat.bin Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 Tokenizer: LoraBase: LoraAdapter: NoMulMatQ:false}
10:02PM DBG GRPC(ggml-mpt-7b-chat.bin-127.0.0.1:37345): stderr create_gpt_params: loading model /root/LocalAI/models/ggml-mpt-7b-chat.bin
10:02PM DBG GRPC(ggml-mpt-7b-chat.bin-127.0.0.1:37345): stderr gguf_init_from_file: invalid magic number 67676d6d
10:02PM DBG GRPC(ggml-mpt-7b-chat.bin-127.0.0.1:37345): stderr error loading model: llama_model_loader: failed to load model from /root/LocalAI/models/ggml-mpt-7b-chat.bin
10:02PM DBG GRPC(ggml-mpt-7b-chat.bin-127.0.0.1:37345): stderr
10:02PM DBG GRPC(ggml-mpt-7b-chat.bin-127.0.0.1:37345): stderr llama_load_model_from_file: failed to load model
10:02PM DBG GRPC(ggml-mpt-7b-chat.bin-127.0.0.1:37345): stderr llama_init_from_gpt_params: error: failed to load model '/root/LocalAI/models/ggml-mpt-7b-chat.bin'
10:02PM DBG GRPC(ggml-mpt-7b-chat.bin-127.0.0.1:37345): stderr load_binding_model: error: unable to load model
10:02PM DBG [llama] Fails: could not load model: rpc error: code = Unknown desc = failed loading model
10:02PM DBG [llama-stable] Attempting to load
10:02PM DBG Loading model llama-stable from ggml-mpt-7b-chat.bin
10:02PM DBG Loading model in memory from file: /root/LocalAI/models/ggml-mpt-7b-chat.bin
10:02PM DBG Loading GRPC Model llama-stable: {backendString:llama-stable model:ggml-mpt-7b-chat.bin threads:8 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc000382000 externalBackends:map[] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false}
10:02PM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama-stable
10:02PM DBG GRPC Service for ggml-mpt-7b-chat.bin will be running at: '127.0.0.1:38191'
10:02PM DBG GRPC Service state dir: /tmp/go-processmanager972832918
10:02PM DBG GRPC Service Started
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:38191: connect: connection refused"
10:02PM DBG GRPC(ggml-mpt-7b-chat.bin-127.0.0.1:38191): stderr 2023/08/27 22:02:52 gRPC Server listening at 127.0.0.1:38191
10:02PM DBG GRPC Service Ready
10:02PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:} sizeCache:0 unknownFields:[] Model:ggml-mpt-7b-chat.bin ContextSize:512 Seed:0 NBatch:512 F16Memory:false MLock:false MMap:false VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:0 MainGPU: TensorSplit: Threads:8 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/root/LocalAI/models/ggml-mpt-7b-chat.bin Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 Tokenizer: LoraBase: LoraAdapter: NoMulMatQ:false}
10:02PM DBG GRPC(ggml-mpt-7b-chat.bin-127.0.0.1:38191): stderr create_gpt_params: loading model /root/LocalAI/models/ggml-mpt-7b-chat.bin
10:02PM DBG GRPC(ggml-mpt-7b-chat.bin-127.0.0.1:38191): stderr llama.cpp: loading model from /root/LocalAI/models/ggml-mpt-7b-chat.bin
10:02PM DBG GRPC(ggml-mpt-7b-chat.bin-127.0.0.1:38191): stderr error loading model: unknown (magic, version) combination: 67676d6d, 0000c500; is this really a GGML file?
10:02PM DBG GRPC(ggml-mpt-7b-chat.bin-127.0.0.1:38191): stderr llama_load_model_from_file: failed to load model
10:02PM DBG GRPC(ggml-mpt-7b-chat.bin-127.0.0.1:38191): stderr llama_init_from_gpt_params: error: failed to load model '/root/LocalAI/models/ggml-mpt-7b-chat.bin'
10:02PM DBG GRPC(ggml-mpt-7b-chat.bin-127.0.0.1:38191): stderr load_binding_model: error: unable to load model
10:02PM DBG [llama-stable] Fails: could not load model: rpc error: code = Unknown desc = failed loading model
10:02PM DBG [gpt4all] Attempting to load
10:02PM DBG Loading model gpt4all from ggml-mpt-7b-chat.bin
10:02PM DBG Loading model in memory from file: /root/LocalAI/models/ggml-mpt-7b-chat.bin
10:02PM DBG Loading GRPC Model gpt4all: {backendString:gpt4all model:ggml-mpt-7b-chat.bin threads:8 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc000382000 externalBackends:map[] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false}
10:02PM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/gpt4all
10:02PM DBG GRPC Service for ggml-mpt-7b-chat.bin will be running at: '127.0.0.1:44127'
10:02PM DBG GRPC Service state dir: /tmp/go-processmanager3390470068
10:02PM DBG GRPC Service Started
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:44127: connect: connection refused"
10:02PM DBG GRPC(ggml-mpt-7b-chat.bin-127.0.0.1:44127): stderr 2023/08/27 22:02:54 gRPC Server listening at 127.0.0.1:44127
10:02PM DBG GRPC Service Ready
10:02PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:} sizeCache:0 unknownFields:[] Model:ggml-mpt-7b-chat.bin ContextSize:512 Seed:0 NBatch:512 F16Memory:false MLock:false MMap:false VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:0 MainGPU: TensorSplit: Threads:8 LibrarySearchPath:/tmp/localai/backend_data/backend-assets/gpt4all RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/root/LocalAI/models/ggml-mpt-7b-chat.bin Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 Tokenizer: LoraBase: LoraAdapter: NoMulMatQ:false}
10:03PM DBG GRPC(ggml-mpt-7b-chat.bin-127.0.0.1:44127): stdout mpt_model_load: loading model from '/root/LocalAI/models/ggml-mpt-7b-chat.bin' - please wait ...
10:03PM DBG GRPC(ggml-mpt-7b-chat.bin-127.0.0.1:44127): stdout mpt_model_load: n_vocab = 50432
10:03PM DBG GRPC(ggml-mpt-7b-chat.bin-127.0.0.1:44127): stdout mpt_model_load: n_ctx = 2048
10:03PM DBG GRPC(ggml-mpt-7b-chat.bin-127.0.0.1:44127): stdout mpt_model_load: n_embd = 4096
10:03PM DBG GRPC(ggml-mpt-7b-chat.bin-127.0.0.1:44127): stdout mpt_model_load: n_head = 32
10:03PM DBG GRPC(ggml-mpt-7b-chat.bin-127.0.0.1:44127): stdout mpt_model_load: n_layer = 32
10:03PM DBG GRPC(ggml-mpt-7b-chat.bin-127.0.0.1:44127): stdout mpt_model_load: alibi_bias_max = 8.000000
10:03PM DBG GRPC(ggml-mpt-7b-chat.bin-127.0.0.1:44127): stdout mpt_model_load: clip_qkv = 0.000000
10:03PM DBG GRPC(ggml-mpt-7b-chat.bin-127.0.0.1:44127): stdout mpt_model_load: ftype = 2
10:03PM DBG GRPC(ggml-mpt-7b-chat.bin-127.0.0.1:44127): stdout mpt_model_load: ggml ctx size = 5653.09 MB
10:03PM DBG GRPC(ggml-mpt-7b-chat.bin-127.0.0.1:44127): stdout mpt_model_load: kv self size = 1024.00 MB
10:03PM DBG GRPC(ggml-mpt-7b-chat.bin-127.0.0.1:44127): stdout mpt_model_load: ........................ done
10:03PM DBG GRPC(ggml-mpt-7b-chat.bin-127.0.0.1:44127): stdout mpt_model_load: model size = 4629.02 MB / num tensors = 194
10:03PM DBG [gpt4all] Loads OK
10:05PM DBG Response: {"object":"chat.completion","model":"ggml-mpt-7b-chat.bin","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"I am doing well, thank-you. How are you?\nI am doing well, thank-you. How are you?\nI am doing well, thank-you. How are you?\nI am doing well, thank-youso much. How are you?\nI am doing well, thank-youso much. How are you?\nI am doing well, thank-youso much. How are you?\nI am doing well, thank-youso much. How are you?\nI am doing well, thank-youso much. How are you?\nI am doing well, thank-youso much. How are you?\nI am doing well, thank-youso much. How are you?\nI am doing well, thank-youso much. How are you?\nI am doing well, thank-youso much. How are you?\nI am doing well, thank-youso much. How are you?\nI am doing well, thank-youso much. How are you?\nI am doing well, thank-youso much. How are you?\nI am doing well, thank-youso much. How are you?\nI am doing well, thank-youso much. How are you?\nI am doing well, thank-youso much. How are you?\nI am doing well, thank-youso much. How are you?\nI am doing well, thank-youso much. How are you?\nI am doing well, thank-youso much. How are you?\nI am doing well, thank-youso much. How are you?\nI am doing well, thank-youso much. How are you?\nI am doing well, thank-youso much. How are you?\nI am doing well, thank-youso much. How are you?\nI am doing well, thank-youso much. How are you?\nI am doing well, thank-youso much. How are you?\nI am doing well, thank-youso much. How are you?\nI am doing well, thank-youso"}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
Aisuko commented 1 year ago

Hi, @Noooste. Thanks for your feedback. I don't believe this is a bug in LocalAI; I have had the same experience while using Copilot. Here are some potential reasons:

  1. The model may not have enough context to generate new content, so it falls back to repeating a previous answer.
  2. The model may generate "hallucinations", i.e. nonsensical responses.
  3. The model may simply produce repeated text.

In my experience this always occurred with short prompts. For example, given "Prefix sum technique is", it would show lots of repeated sentences starting with "Prefix sum".
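
As a possible workaround rather than a fix, tightening the sampling parameters sometimes curbs the looping. Below is a minimal sketch of the same request with anti-repetition settings; "stop" and "max_tokens" are standard OpenAI fields, while "repeat_penalty" is a llama.cpp-style extension that the RepeatPenalty field in the config dump above suggests LocalAI accepts, so treat the exact field names as assumptions:

    import requests

    url = "http://localhost:8080/v1/chat/completions"

    resp = requests.post(url, json={
        "model": "ggml-mpt-7b-chat.bin",
        "messages": [{"role": "user", "content": "How are you ?"}],
        "temperature": 0.1,
        "max_tokens": 64,        # cap runaway generations
        "stop": ["\nUser:"],     # cut off when the model starts a new turn
        "repeat_penalty": 1.2,   # assumption: llama.cpp-style repetition penalty
    })
    print(resp.json()["choices"][0]["message"]["content"])

It may also be worth checking the "Template failed loading" message in your logs, since a missing chat template can leave the model without its usual stop sequence.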

I will close this issue. If you still hit this kind of issue, please feel free to reopen it anytime.