Open 35develr opened 7 months ago
Only the gRPC process for the embedding model is affected; when I run the embedding on another machine without CUDA and the main model on the local machine with CUDA, memory management is fine.
Embedding config:
name: paraphrase-multilingual-mpnet-base-v2
backend: sentencetransformers
embeddings: true
parameters:
  model: paraphrase-multilingual-mpnet-base-v2
Same issue here. I'm running localai/localai:v2.12.4-aio-gpu-nvidia-cuda-11
I first try
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "gpt-4-vision-preview",
"messages": [{"role": "user", "content": [{"type": "text", "text": "What is in the image?"}, {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}}]}],
"temperature": 0.9
}'
Which works, and then another model
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
"model": "gpt-4",
"prompt": "Why is the sky blue? Short and concise answer",
"temperature": 0.1, "top_p": 0.1
}'
Which fails.
Running them separately after a `docker restart localai`
always works, just not one after the other. So I always run out of memory when using different models.
Relevant section of the Docker log:
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr ggml_backend_cuda_buffer_type_alloc_buffer: allocating 5563.66 MiB on device 0: cudaMalloc failed: out of memory
@35develr I think I found a solution. After some digging into the #1341 watchdog implementation, I found another issue that mentions a flag for "one single active backend": https://github.com/mudler/LocalAI/issues/909
Solution: https://github.com/mudler/LocalAI/pull/925
SINGLE_ACTIVE_BACKEND=true
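Conceptually, this flag makes the model loader stop every other backend before loading a new one, so only one model holds VRAM at a time. A minimal sketch of that policy (illustrative only, not LocalAI's actual code; `ModelLoader` and `start_backend` are hypothetical names):

```python
# Illustrative sketch of a "single active backend" policy: before
# loading a new model, stop every other backend so at most one
# process occupies VRAM at any time.
class ModelLoader:
    def __init__(self, single_active_backend=True):
        self.single_active_backend = single_active_backend
        self.backends = {}  # model name -> backend process handle

    def load(self, name, start_backend):
        if self.single_active_backend:
            # Mirrors the "Stopping all backends except '<name>'" log lines.
            for other in [n for n in self.backends if n != name]:
                self.backends.pop(other).stop()
        if name not in self.backends:
            self.backends[name] = start_backend(name)
        return self.backends[name]
```

With this policy, the second curl request above would first tear down the vision model's process before loading `gpt-4`, avoiding the `cudaMalloc` failure.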
docker-compose.yml
version: "3.9"
services:
api:
image: localai/localai:v2.12.4-aio-gpu-nvidia-cuda-11
container_name: localai
# For a specific version:
# image: localai/localai:v2.12.4-aio-cpu
# For Nvidia GPUs uncomment one of the following (cuda11 or cuda12):
# Find out which version: `nvcc --version` (be aware, `nvidia-smi` only gives you max compatibility, it is
# not the nvidia container toolkit version installed)
# image: localai/localai:v2.12.4-aio-gpu-nvidia-cuda-11
# image: localai/localai:v2.12.4-aio-gpu-nvidia-cuda-12
# image: localai/localai:latest-aio-gpu-nvidia-cuda-11
# image: localai/localai:latest-aio-gpu-nvidia-cuda-12
healthcheck:
test: [ "CMD", "curl", "-f", "http://localhost:8080/readyz" ]
interval: 1m
timeout: 20m
retries: 5
ports:
- 8080:8080
environment:
- DEBUG=true
- SINGLE_ACTIVE_BACKEND=true
- PARALLEL_REQUESTS=false
- WATCHDOG_IDLE=true
- WATCHDOG_BUSY=true
- WATCHDOG_IDLE_TIMEOUT=5m
- WATCHDOG_BUSY_TIMEOUT=5m
#- GALLERIES: '[{"name":"model-gallery", "url":"github:go-skynet/model-gallery/index.yaml"}, {"url": "github:go-skynet/model-gallery/huggingface.yaml","name":"huggingface"}]'
volumes:
- ./models:/build/models:cached
#- ./images:/tmp
# uncomment the following piece if running with Nvidia GPUs
restart: unless-stopped
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [ gpu ]
EDIT: I also added the WATCHDOG env vars; these clear all VRAM from time to time.
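The watchdog behaves roughly as follows: backends that have been idle (or stuck busy) longer than their configured timeout are stopped, freeing their VRAM. A simplified sketch under that assumption (not LocalAI's implementation; `reap_backends` is a hypothetical name):

```python
import time

# Simplified sketch of an idle/busy watchdog: return the names of
# backends whose idle or busy time exceeds its timeout, so the
# caller can stop them and reclaim VRAM.
def reap_backends(backends, idle_timeout=300.0, busy_timeout=300.0, now=None):
    """backends: dict of name -> {'busy': bool, 'since': timestamp}.
    Timeouts default to 5 minutes, matching WATCHDOG_*_TIMEOUT=5m."""
    now = time.time() if now is None else now
    to_stop = []
    for name, state in backends.items():
        timeout = busy_timeout if state["busy"] else idle_timeout
        if now - state["since"] > timeout:
            to_stop.append(name)
    return to_stop
```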
Thanks for the workaround, but I would still want a better strategy: e.g. when VRAM can't hold a new model, kill all currently idle processes.
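That eviction strategy could look roughly like this (a hypothetical sketch; `try_load`, `OutOfVRAM`, and `stop` stand in for whatever the backend manager actually exposes):

```python
# Hypothetical sketch of "evict idle backends when a new model does
# not fit in VRAM": on an out-of-memory failure, stop every idle
# backend and retry the load once.
class OutOfVRAM(Exception):
    pass

def load_with_eviction(name, try_load, idle_backends):
    """try_load(name) raises OutOfVRAM when VRAM is exhausted;
    idle_backends is a list of objects with a .stop() method."""
    try:
        return try_load(name)
    except OutOfVRAM:
        for backend in idle_backends:
            backend.stop()  # frees that process's VRAM
        idle_backends.clear()
        return try_load(name)  # retry now that memory is free
```

Unlike SINGLE_ACTIVE_BACKEND, this would keep multiple models resident while they fit, and only evict on demand.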
LocalAI version: master-cublas-cuda12-ffmpeg (from 19.02.2024)
Environment, CPU architecture, OS, and Version: Ubuntu 22.04.3 LTS, LocalAI running in Docker, CUDA12
Config: docker-compose.yaml:
gpt-4.yaml:
Describe the bug: For each chat request a new gRPC process is started but never unloaded when the request completes, so CUDA memory keeps filling up until a CUDA out-of-memory error occurs.
nvitop:
To Reproduce
Expected behavior: Stop the gRPC process after completion and free CUDA memory, or reuse the existing gRPC process.
Logs
10:52AM DBG Stopping all backends except 'paraphrase-multilingual-mpnet-base-v2'
10:52AM DBG Parameter Config: &{PredictionOptions:{Model:paraphrase-multilingual-mpnet-base-v2 Language: N:0 TopP:0 TopK:0 Temperature:0 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:paraphrase-multilingual-mpnet-base-v2 F16:false Threads:4 Debug:true Roles:map[] Embeddings:true Backend:sentencetransformers TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[us der Versicherungsbranche.] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: MMProj: RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false DownloadFiles:[] Description: Usage:}
10:52AM DBG [single-backend] Stopping mistral-7b-instruct-v0.2.Q5_K_M.gguf
10:52AM INF Loading model 'paraphrase-multilingual-mpnet-base-v2' with backend sentencetransformers
10:52AM DBG Loading model in memory from file: /models/paraphrase-multilingual-mpnet-base-v2
10:52AM DBG Loading Model paraphrase-multilingual-mpnet-base-v2 with gRPC (file: /models/paraphrase-multilingual-mpnet-base-v2) (backend: sentencetransformers): {backendString:sentencetransformers model:paraphrase-multilingual-mpnet-base-v2 threads:4 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc000640000 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh petals:/build/backend/python/petals/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:true parallelRequests:false}
10:52AM DBG Loading external backend: /build/backend/python/sentencetransformers/run.sh
10:52AM DBG Loading GRPC Process: /build/backend/python/sentencetransformers/run.sh
10:52AM DBG GRPC Service for paraphrase-multilingual-mpnet-base-v2 will be running at: '127.0.0.1:38613'
10:52AM DBG GRPC Service state dir: /tmp/go-processmanager1753323238
10:52AM DBG GRPC Service Started
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr Server started. Listening on: 127.0.0.1:38613
10:52AM DBG GRPC Service Ready
10:52AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:} sizeCache:0 unknownFields:[] Model:paraphrase-multilingual-mpnet-base-v2 ContextSize:0 Seed:0 NBatch:512 F16Memory:false MLock:false MMap:false VocabOnly:false LowVRAM:false Embeddings:true NUMA:false NGPULayers:0 MainGPU: TensorSplit: Threads:4 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/paraphrase-multilingual-mpnet-base-v2 Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type:}
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr /opt/conda/envs/transformers/lib/python3.11/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr return self.fget.get(instance, owner)()
10:52AM DBG Stopping all backends except 'paraphrase-multilingual-mpnet-base-v2'
10:52AM DBG Model already loaded in memory: paraphrase-multilingual-mpnet-base-v2
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr Calculated embeddings for: Gebe 20 Beispiele für Arbeitgeberpositionierung a
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr Traceback (most recent call last):
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/grpc/_server.py", line 552, in _call_behavior
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr response_or_iterator = behavior(argument, context)
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr ^^^^^^^^^^^^^^^^^^^^^^^^^^^
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/build/backend/python/sentencetransformers/sentencetransformers.py", line 80, in Embedding
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr sentence_embeddings = self.model.encode(request.Embeddings)
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/sentence_transformers/SentenceTransformer.py", line 153, in encode
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr self.to(device)
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1160, in to
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr return self._apply(convert)
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr ^^^^^^^^^^^^^^^^^^^^
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 810, in _apply
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr module._apply(fn)
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 810, in _apply
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr module._apply(fn)
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 810, in _apply
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr module._apply(fn)
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr [Previous line repeated 4 more times]
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 833, in _apply
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr param_applied = fn(param)
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr ^^^^^^^^^
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1158, in convert
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacty of 5.93 GiB of which 17.62 MiB is free. Process 112775 has 1.55 GiB memory in use. Process 165903 has 1.55 GiB memory in use. Process 170069 has 1.55 GiB memory in use. Process 175275 has 1.25 GiB memory in use. Of the allocated memory 878.23 MiB is allocated by PyTorch, and 17.77 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[172.18.0.1]:57394 500 - POST /v1/embeddings
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr Calculated embeddings for: us der Versicherungsbranche.
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr Traceback (most recent call last):
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/grpc/_server.py", line 552, in _call_behavior
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr response_or_iterator = behavior(argument, context)
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr ^^^^^^^^^^^^^^^^^^^^^^^^^^^
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/build/backend/python/sentencetransformers/sentencetransformers.py", line 80, in Embedding
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr sentence_embeddings = self.model.encode(request.Embeddings)
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/sentence_transformers/SentenceTransformer.py", line 153, in encode
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr self.to(device)
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1160, in to
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr return self._apply(convert)
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr ^^^^^^^^^^^^^^^^^^^^
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 810, in _apply
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr module._apply(fn)
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 810, in _apply
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr module._apply(fn)
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 810, in _apply
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr module._apply(fn)
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr [Previous line repeated 4 more times]
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 833, in _apply
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr param_applied = fn(param)
[172.18.0.1]:57408 500 - POST /v1/embeddings
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr ^^^^^^^^^
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1158, in convert
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacty of 5.93 GiB of which 17.62 MiB is free. Process 112775 has 1.55 GiB memory in use. Process 165903 has 1.55 GiB memory in use. Process 170069 has 1.55 GiB memory in use. Process 175275 has 1.25 GiB memory in use. Of the allocated memory 878.23 MiB is allocated by PyTorch, and 17.77 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF