OpenVINO A770 segfaults after some number of tokens

richiejp commented 2 months ago

CC @fakezeta

LocalAI version:

quay.io/go-skynet/local-ai@sha256:4e4e427433285b056f32bfaa313ec0e75aeacb5b5c8c273953f9d2242fb55a60 This is still the version without the AUTO GPU changes. I'll try updating when I get a chance.

Environment, CPU architecture, OS, and Version:

Same as #2208, but using just the Arc dGPU

Describe the bug

libopenvino_intel_gpu_plugin.so segfaults during inference. It seems to happen once the number of tokens produced exceeds some amount, because it tends to fail in the same place, although it sometimes succeeds. I don't know how many tokens are being produced, or whether it is related to the context size.

To Reproduce

Ask it to summarize the output of, e.g., lscpu, or to explain 50 lines of a Makefile.

Expected behavior

Not to segfault.

Logs

From the kernel log

[ +34.787117] python[66644]: segfault at 1e ip 00007e9adf577063 sp 00007e98aa7ea280 error 4 in libopenvino_intel_gpu_plugin.so[7e9ade9d6000+de4000] likely on CPU 4 (core 8, socket 0)
[  +0.000010] Code: ff e8 81 2c c7 ff 48 8b b5 38 ce ff ff 4c 89 ff 80 8d 37 ce ff ff 80 e8 6b 2c c7 ff 48 8b 85 c0 da ff ff 80 8d 3f ce ff ff 80 <80> 38 00 0f 85 fc 0c 00 00 48 8b 85 f8 da ff ff 80 38 00 74 67 48
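
The fields in the segfault line can be decoded mechanically. What follows is a minimal sketch, assuming the usual x86 Linux message layout and page-fault error-code bits (bit 0: page was present, bit 1: write access, bit 2: user mode). The name `decode_segfault` and the regular expression are made up for this example and are not part of any tool used in this report.

```python
import re

# Kernel line copied from the log above (timestamp prefix dropped).
LINE = ("python[66644]: segfault at 1e ip 00007e9adf577063 sp 00007e98aa7ea280 "
        "error 4 in libopenvino_intel_gpu_plugin.so[7e9ade9d6000+de4000]")

# Assumed layout:
# "segfault at <addr> ip <ip> sp <sp> error <code> in <lib>[<base>+<size>]"
PATTERN = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+) in (?P<lib>[^\[]+)"
    r"\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)

def decode_segfault(line):
    """Return the faulting address and the decoded error-code bits."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not a segfault line in the assumed format")
    err = int(m["err"], 16)
    return {
        "lib": m["lib"],
        "fault_addr": int(m["addr"], 16),
        "not_present": not (err & 0x1),  # bit 0 clear: no page mapped there
        "write": bool(err & 0x2),        # bit 1 clear: the access was a read
        "user_mode": bool(err & 0x4),    # bit 2 set: fault came from user space
    }

print(decode_segfault(LINE))
```

For the line above this reports a user-mode read of an unmapped page at address 0x1e, which is consistent with dereferencing a near-null pointer.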

LocalAI log after a previous crash, which is why it is restarting the process:

DBG Request received: {"model":"openvino-llama-3-8b-instruct-ov-int8","language":"","n":0,"top_p":null,"top_k":null,"temperature":null,"max_tokens":null,"echo":false,"batch":0,"ignore_eos":false,"repeat_pe
nalty":0,"n_keep":0,"frequency_penalty":0,"presence_penalty":0,"tfz":null,"typical_p":null,"seed":null,"negative_prompt":"","rope_freq_base":0,"rope_freq_scale":0,"negative_prompt_scale":0,"use_fast_tokenizer":fa
lse,"clip_skip":0,"tokenizer":"","file":"","response_format":{},"size":"","prompt":null,"instruction":"","input":null,"stop":null,"messages":[{"role":"user","content":"Explain the following Makefile, ignore comme
nts such as the license: # Copyright 2019 The Skaffold Authors\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You ma
y obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on
an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\nGOPATH ?=
$(shell go env GOPATH)\nGOBIN ?= $(or $(shell go env GOBIN),$(GOPATH)/bin)\nGOOS ?= $(shell go env GOOS)\nGOARCH ?= $(shell go env GOARCH)\nBUILD_DIR ?= ./out\nORG = github.com/GoogleContainerTools\nPROJECT = ska
ffold\nREPOPATH ?= $(ORG)/$(PROJECT)\nRELEASE_BUCKET ?= $(PROJECT)\nGSC_BUILD_PATH ?= gs://$(RELEASE_BUCKET)/builds/$(COMMIT)\nGSC_BUILD_LATEST ?= gs://$(RELEASE_BUCKET)/builds/latest\nGSC_LTS_BUILD_PATH ?= gs://
$(RELEASE_BUCKET)/lts/builds/$(COMMIT)\nGSC_LTS_BUILD_LATEST ?= gs://$(RELEASE_BUCKET)/lts/builds/latest\nGSC_LTS_RELEASE_PATH ?= gs://$(RELEASE_BUCKET)/lts/releases/$(VERSION)\nGSC_LTS_RELEASE_LATEST ?= gs://$(R
ELEASE_BUCKET)/lts/releases/latest\nGSC_RELEASE_PATH ?= gs://$(RELEASE_BUCKET)/releases/$(VERSION)\nGSC_RELEASE_LATEST ?= gs://$(RELEASE_BUCKET)/releases/latest\n\nGCP_ONLY ?= false\nGCP_PROJECT ?= k8s-skaffold\n
GKE_CLUSTER_NAME ?= integration-tests\nGKE_ZONE ?= us-central1-a\n\nSUPPORTED_PLATFORMS = linux-amd64 darwin-amd64 windows-amd64.exe linux-arm64 darwin-arm64\nBUILD_PACKAGE = $(REPOPATH)/v2/cmd/skaffold\n\nSKAFFO
LD_TEST_PACKAGES = ./pkg/skaffold/... ./cmd/... ./hack/... ./pkg/webhook/...\nGO_FILES = $(shell find . -type f -name '*.go' -not -path \"./pkg/diag/*\")\n\nVERSION_PACKAGE = $(REPOPATH)/v2/pkg/skaffold/version\n
COMMIT = $(shell git rev-parse HEAD)\n\nifeq \"$(strip $(VERSION))\" \"\"\n\toverride VERSION = $(shell git describe --always --tags --dirty)\nendif\n\nDATE_FMT = +%Y-%m-%dT%H:%M:%SZ"}],"functions":null,"function
_call":null,"stream":false,"mode":0,"step":0,"grammar":"","grammar_json_functions":null,"backend":"","model_base_name":""}
9:36AM DBG Configuration read: &{PredictionOptions:{Model:fakezeta/llama-3-8b-instruct-ov-int8 Language: N:0 TopP:0xc0003cfea0 TopK:0xc0003cfea8 Temperature:0xc0003cfec0 Maxtokens:0xc0003cff00 Echo:false Batch:0
IgnoreEOS:false RepeatPenalty:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc0003cfee8 TypicalP:0xc0003cfee0 Seed:0xc0003cff18 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTok
enizer:false ClipSkip:0 Tokenizer:} Name:openvino-llama-3-8b-instruct-ov-int8 F16:0xc0003cfe88 Threads:0xc0003cfe80 Debug:0xc0008e6260 Roles:map[] Embeddings:false Backend:transformers TemplateConfig:{Chat: ChatM
essage: Completion: Edit: Functions: UseTokenizerTemplate:true} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionNa
me: NoActionDescriptionName: ParallelCalls:false NoGrammar:false ResponseRegex:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCa
cheRO:false MirostatETA:0xc0003cfed8 MirostatTAU:0xc0003cfed0 Mirostat:0xc0003cfec8 NGPULayers:0xc0003cff08 MMap:0xc0003cff10 MMlock:0xc0003cff11 LowVRAM:0xc0003cff11 Grammar: StopWords:[<|eot_id|> <|end_of_text|
>] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc0003cf948 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false E
nforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 MMProj: RopeScaling: ModelType:OVModelForCausalLM YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device:
Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 Attempt
sSleepTime:0} VallE:{AudioPath:} CUDA:false DownloadFiles:[] Description: Usage:}
9:36AM DBG Parameters: &{PredictionOptions:{Model:fakezeta/llama-3-8b-instruct-ov-int8 Language: N:0 TopP:0xc0003cfea0 TopK:0xc0003cfea8 Temperature:0xc0003cfec0 Maxtokens:0xc0003cff00 Echo:false Batch:0 IgnoreEO
S:false RepeatPenalty:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc0003cfee8 TypicalP:0xc0003cfee0 Seed:0xc0003cff18 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:f
alse ClipSkip:0 Tokenizer:} Name:openvino-llama-3-8b-instruct-ov-int8 F16:0xc0003cfe88 Threads:0xc0003cfe80 Debug:0xc0008e6260 Roles:map[] Embeddings:false Backend:transformers TemplateConfig:{Chat: ChatMessage:
Completion: Edit: Functions: UseTokenizerTemplate:true} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoAc
tionDescriptionName: ParallelCalls:false NoGrammar:false ResponseRegex:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:fa
lse MirostatETA:0xc0003cfed8 MirostatTAU:0xc0003cfed0 Mirostat:0xc0003cfec8 NGPULayers:0xc0003cff08 MMap:0xc0003cff10 MMlock:0xc0003cff11 LowVRAM:0xc0003cff11 Grammar: StopWords:[<|eot_id|> <|end_of_text|>] Cutst
rings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc0003cf948 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEa
ger:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 MMProj: RopeScaling: ModelType:OVModelForCausalLM YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:f
alse UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTi
me:0} VallE:{AudioPath:} CUDA:false DownloadFiles:[] Description: Usage:}
9:36AM INF Loading model 'fakezeta/llama-3-8b-instruct-ov-int8' with backend transformers
9:36AM DBG Model already loaded in memory: fakezeta/llama-3-8b-instruct-ov-int8
9:36AM WRN GRPC Model not responding: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:33975: connect: connection refused"
9:36AM WRN Deleting the process in order to recreate it
9:36AM DBG GRPC Process is not responding: fakezeta/llama-3-8b-instruct-ov-int8
9:36AM ERR error stopping process error="process does not exist" process=fakezeta/llama-3-8b-instruct-ov-int8
9:36AM DBG Loading model in memory from file: /build/models/fakezeta/llama-3-8b-instruct-ov-int8
9:36AM DBG Loading Model fakezeta/llama-3-8b-instruct-ov-int8 with gRPC (file: /build/models/fakezeta/llama-3-8b-instruct-ov-int8) (backend: transformers): {backendString:transformers model:fakezeta/llama-3-8b-in
struct-ov-int8 threads:4 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc000824000 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.s
h coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/buil
d/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh parler-tts:/build/backend/python/parler-tts/run.sh petals:/build/backend/python/petals/run.sh rerankers:/build/backend/python/
rerankers/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run
.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false parallelRequests:false}
9:36AM DBG Loading external backend: /build/backend/python/transformers/run.sh
9:36AM DBG Loading GRPC Process: /build/backend/python/transformers/run.sh
9:36AM DBG GRPC Service for fakezeta/llama-3-8b-instruct-ov-int8 will be running at: '127.0.0.1:43285'
9:36AM DBG GRPC Service state dir: /tmp/go-processmanager3465528766
9:36AM DBG GRPC Service Started
9:36AM DBG GRPC(fakezeta/llama-3-8b-instruct-ov-int8-127.0.0.1:43285): stderr /usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and wil
l be removed in v5 of Transformers. Use `HF_HOME` instead.
9:36AM DBG GRPC(fakezeta/llama-3-8b-instruct-ov-int8-127.0.0.1:43285): stderr   warnings.warn(
9:36AM DBG GRPC(fakezeta/llama-3-8b-instruct-ov-int8-127.0.0.1:43285): stderr Server started. Listening on: 127.0.0.1:43285
9:36AM DBG GRPC Service Ready
9:36AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:fakezeta/llama-3-8b-instruct-ov-int8 ContextSize:
8192 Seed:1378654422 NBatch:512 F16Memory:false MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:99999999 MainGPU: TensorSplit: Threads:4 LibrarySearchPath: RopeFreqBase:
0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/build/models/fakezeta/llama-3-8b-instruct-ov-int8 Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0
IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false Enfor
ceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type:OVModelForCausalLM}
9:36AM DBG GRPC(fakezeta/llama-3-8b-instruct-ov-int8-127.0.0.1:43285): stdout INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, onnx, openvino
9:36AM DBG GRPC(fakezeta/llama-3-8b-instruct-ov-int8-127.0.0.1:43285): stderr /usr/local/lib/python3.10/dist-packages/transformers/utils/import_utils.py:521: FutureWarning: `is_torch_tpu_available` is deprecated
and will be removed in 4.41.0. Please use the `is_torch_xla_available` instead.
9:36AM DBG GRPC(fakezeta/llama-3-8b-instruct-ov-int8-127.0.0.1:43285): stderr   warnings.warn(
9:36AM DBG GRPC(fakezeta/llama-3-8b-instruct-ov-int8-127.0.0.1:43285): stderr Compiling the model to GPU ...
9:36AM DBG GRPC(fakezeta/llama-3-8b-instruct-ov-int8-127.0.0.1:43285): stderr Setting OpenVINO CACHE_DIR to /build/models/models--fakezeta--llama-3-8b-instruct-ov-int8/snapshots/77bf2b84f5921a7e7b049aab2ead3fedad
786494/model_cache
9:36AM DBG GRPC(fakezeta/llama-3-8b-instruct-ov-int8-127.0.0.1:43285): stderr Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
9:37AM DBG GRPC(fakezeta/llama-3-8b-instruct-ov-int8-127.0.0.1:43285): stderr /build/backend/python/transformers/run.sh: line 20: 11352 Segmentation fault      (core dumped) python $DIR/transformers_server.py $@

Additional context

Similar requests succeed on iGPU.

localai-bot commented 2 months ago

It appears that your model loading process encountered an issue while trying to run on the CPU. The error message indicates a segmentation fault, which is a type of error that typically results from an attempt to access memory that doesn't belong to the process.

Possible reasons for this error could be hardware issues, conflicts between different processes for the CPU's resources, or incompatibilities between the software and the CPU architecture.

To troubleshoot the issue, you might want to try the following steps:

Verify the CPU is functioning correctly by running some basic tests.
Ensure there are no other resource-intensive processes running concurrently that could conflict with the model loading process.
Check the compatibility between the software and the CPU architecture. Make sure they are compatible.
If the error persists, you might want to consider using a different server or upgrading the hardware to ensure it meets the software requirements for running the model.

If you need further assistance, provide more details about your system'

fakezeta commented 2 months ago

Hi @richiejp, thanks again for your time.

This seems to be an upstream (OpenVINO) error. The kernel log points to an access to a memory region that is not allowed, such as a null or invalid pointer. The instruction at de4000 in libopenvino_intel_gpu_plugin.so is de4000: 48 8b 78 08 mov 0x8(%rax),%rdi. That loads a value from the memory address in %rax plus 8 into %rdi.

All of the above is just to say that, sadly, I have no clue, because this is an error in the underlying operating-system-level plugin, which LocalAI has no visibility into.

May I ask you to test OpenVINO inference directly, outside of LocalAI? If you don't know how to do that, I have a quick and dirty gradio test app in the qa_gradio.py file (Python requirements in requirements.txt). It's not meant for distribution; I made it just to see whether it was worth investing time in OpenVINO, but it could be useful for understanding whether this is LocalAI related or not.

Another check is to make sure you have the latest Arc driver from Intel.
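
For readers without access to that script, a direct test could look roughly like the following. This is not the qa_gradio.py app mentioned above. It is a minimal sketch that assumes the optimum-intel and transformers packages are installed and that the model loads through `OVModelForCausalLM` (the ModelType shown in the LocalAI log). The function names `build_messages` and `run_direct`, the default prompt, and the `max_new_tokens` value are illustrative assumptions.

```python
def build_messages(text):
    """Wrap the input in the single-turn chat shape seen in the request log."""
    return [{"role": "user", "content": text}]

def run_direct(model_id="fakezeta/llama-3-8b-instruct-ov-int8", device="GPU",
               text="Explain what a Makefile is.", max_new_tokens=512):
    """Load the OpenVINO model on the given device and generate one reply."""
    # Imported lazily so the helper above works without the heavy dependencies.
    from optimum.intel import OVModelForCausalLM
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # device="GPU" asks OpenVINO to compile the model for the GPU plugin.
    model = OVModelForCausalLM.from_pretrained(model_id, device=device)

    # Render the chat template to text first, then tokenize it.
    prompt = tokenizer.apply_chat_template(
        build_messages(text), tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Decode only the newly generated tokens.
    new_tokens = output[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `run_direct(text=...)` with a long input, such as the Makefile from the request above, would exercise the same model and device without LocalAI's gRPC layer in between. If it also crashes, the problem is unlikely to be in LocalAI itself.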

richiejp commented 2 months ago

OK, I'll see if I can get it running outside LocalAI. As for drivers, I had some issues installing the out-of-tree driver, so that could take a while; otherwise I have to wait for a kernel update from Ubuntu/Dell.

fakezeta commented 2 months ago

I'd like to understand whether it's LocalAI specific or not, so that, if it isn't, we can open an issue upstream.

richiejp commented 2 months ago

Yup, it also core dumped:

segfault at 1e ip 0000783592b24063 sp 00007833977ec280 error 4 in libopenvino_intel_gpu_plugin.so[783591f83000+de4000] likely on CPU 6 (core 12, socket 0)
[  +0.000010] Code: ff e8 81 2c c7 ff 48 8b b5 38 ce ff ff 4c 89 ff 80 8d 37 ce ff ff 80 e8 6b 2c c7 ff 48 8b 85 c0 da ff ff 80 8d 3f ce ff ff 80 <80> 38 00 0f 85 fc 0c 00 00 48 8b 85 f8 da ff ff 80 38 00 74 67 48

fakezeta commented 2 months ago

Thank you. It's segfaulting at the same instruction address de4000, so this is definitely an upstream bug or an environment issue. Can you open an issue at https://github.com/openvinotoolkit/openvino and link it here so I can follow?
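
One way to compare the two crashes is to subtract the mapping base from the instruction pointer in each kernel line, since the absolute addresses differ between runs. The sketch below assumes the bracketed field has the usual `[base+size]` form, in which case the value after `+` is the size of the mapping rather than an instruction address. The labels and the name `library_offset` are made up for this example.

```python
# (ip, mapping base) pairs copied from the two kernel log lines in this thread.
CRASHES = {
    "inside LocalAI": (0x00007e9adf577063, 0x00007e9ade9d6000),
    "outside LocalAI": (0x0000783592b24063, 0x0000783591f83000),
}

def library_offset(ip, base):
    """Offset of the faulting instruction from the start of the mapping."""
    return ip - base

offsets = {name: library_offset(ip, base) for name, (ip, base) in CRASHES.items()}
for name, off in offsets.items():
    print(f"{name}: {off:#x}")
```

Both lines give 0xba1063, so the two crashes hit the same instruction in the library. The offset is relative to the start of the mapping, so it may need adjusting before it is looked up in the file on disk.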

I'm sorry for not being of more help 😞 but I don't have the resources to investigate more.

mudler / LocalAI

OpenVINO A770 segfaults after some number of tokens #2219