perk11 / large-model-proxy

Large Model Proxy is designed to make it easy to run multiple resource-heavy Large Models (LMs) on the same machine with a limited amount of VRAM and other resources. It listens on a dedicated port for each proxied LM, making them always available to the clients connecting to these ports.
GNU General Public License v2.0

wrap runningServices map in mutex #6

Closed: lun-4 closed this 3 months ago

lun-4 commented 3 months ago

since the crashes from #5 come from access to the runningServices map itself, rather than from modification of the RunningService struct, the way to "fix" this is to wrap all access to the service map in a mutex.
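
roughly the shape of the change (a simplified sketch with hypothetical accessor names, not the exact diff in this PR):

```go
package proxy

import "sync"

// RunningService stands in for the real struct in this repo;
// its fields here are placeholders.
type RunningService struct {
	pid int
}

// Guard the map itself with one mutex: every read, write, and delete
// of runningServices goes through these accessors, so concurrent
// connections can no longer corrupt the map.
var (
	runningServicesMutex sync.Mutex
	runningServices      = make(map[string]RunningService)
)

func getRunningService(name string) (RunningService, bool) {
	runningServicesMutex.Lock()
	defer runningServicesMutex.Unlock()
	service, ok := runningServices[name]
	return service, ok
}

func setRunningService(name string, service RunningService) {
	runningServicesMutex.Lock()
	defer runningServicesMutex.Unlock()
	runningServices[name] = service
}

func deleteRunningService(name string) {
	runningServicesMutex.Lock()
	defer runningServicesMutex.Unlock()
	delete(runningServices, name)
}
```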

this is a draft PR, as I'm running this patch right now with my scripts; if they work out over the next 12 hours, I'll call it ready.

closes #5

lun-4 commented 3 months ago

turns out my batch jobs finished early; I thought I'd have more data to churn through. everything worked out without any issues though!

perk11 commented 3 months ago

Thanks for the code. Something isn't working for me:

2024/08/06 00:04:14 [gemma] New client connection received [::1]:8081->[::1]:36684
2024/08/06 00:04:14 [gemma] Reserving VRAM-GPU-1: 22400, RAM: 3000
2024/08/06 00:04:14 [gemma] Starting "/home/perk11/LLM/llama.cpp/llama-server -m /home/perk11/LLM/gemma.gguf -c 8192 -ngl 100 --port 18081", log file: logs/gemma.log, workdir: 
2024/08/06 00:04:17 [gemma] Opened service connection 127.0.0.1:32898->127.0.0.1:18081
2024/08/06 00:04:18 [automatic1111] New client connection received 127.0.0.1:7860->127.0.0.1:48416
2024/08/06 00:04:18 [automatic1111] Reserving VRAM-GPU-1: 7200, RAM: 30000
2024/08/06 00:04:18 [automatic1111] Not enough VRAM-GPU-1 to start. Total: 23900, In use: 22400, Required: 7200
2024/08/06 00:04:54 [flux] New client connection received [::1]:8094->[::1]:58432

Here I opened a connection to automatic1111 while the connection to gemma was already open. Then the client closed the connection to gemma, but this never showed up in the logs, and I don't see any logs of attempts to start automatic1111 like I saw before.
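
These logs would be consistent with the mutex being held across the whole resource wait, so the goroutine handling the gemma disconnect can never acquire it to free the VRAM. A hypothetical illustration of that pattern (not the actual code in this PR; the helpers are stand-ins):

```go
package proxy

import (
	"sync"
	"time"
)

var runningServicesMutex sync.Mutex

// Hypothetical stand-ins for the real resource accounting and
// process management in this repo:
func enoughResourcesFor(name string) bool { return false }
func startService(name string)            {}
func releaseResources(name string)        {}

// If the wait loop holds the mutex the whole time it sleeps, no other
// goroutine can touch the shared state until the service starts.
func startServiceWhenResourcesFree(name string) {
	runningServicesMutex.Lock()
	defer runningServicesMutex.Unlock() // held across every sleep below

	for !enoughResourcesFor(name) {
		time.Sleep(time.Second) // sleeping with the lock held
	}
	startService(name)
}

// The disconnect handler then blocks forever on the same mutex, which
// would explain why closing the gemma connection never shows up in the
// logs and automatic1111 never gets its resources.
func onClientDisconnect(name string) {
	runningServicesMutex.Lock()
	defer runningServicesMutex.Unlock()
	releaseResources(name)
}
```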

perk11 commented 3 months ago

Here are the logs for the same process on the main branch:

2024/08/06 00:12:40 [gemma] New client connection received [::1]:8081->[::1]:59704
2024/08/06 00:12:40 [gemma] Reserving RAM: 3000, VRAM-GPU-1: 22400
2024/08/06 00:12:40 [gemma] Starting "/home/perk11/LLM/llama.cpp/llama-server -m /home/perk11/LLM/gemma.gguf -c 8192 -ngl 100 --port 18081", log file: logs/gemma.log, workdir: 
2024/08/06 00:12:41 [automatic1111] New client connection received 127.0.0.1:7860->127.0.0.1:33648
2024/08/06 00:12:41 [automatic1111] Reserving VRAM-GPU-1: 7200, RAM: 30000
2024/08/06 00:12:41 [automatic1111] Not enough VRAM-GPU-1 to start. Total: 23900, In use: 22400, Required: 7200
2024/08/06 00:12:41 [automatic1111] Failed to find a service to stop, checking again in 1 second
2024/08/06 00:12:42 [gemma] Opened service connection 127.0.0.1:32986->127.0.0.1:18081
2024/08/06 00:12:42 [automatic1111] Failed to find a service to stop, checking again in 1 second
2024/08/06 00:12:43 [automatic1111] Failed to find a service to stop, checking again in 1 second
2024/08/06 00:12:44 [flux] New client connection received [::1]:8094->[::1]:38596
2024/08/06 00:12:44 [flux] Reserving VRAM-GPU-1: 0, RAM: 0
2024/08/06 00:12:44 [flux] Starting "/usr/bin/python /home/perk11/LLM/viktor89/inference-servers/flux/main.py", log file: logs/flux.log, workdir: 
2024/08/06 00:12:44 [automatic1111] Failed to find a service to stop, checking again in 1 second
2024/08/06 00:12:45 [automatic1111] Failed to find a service to stop, checking again in 1 second
2024/08/06 00:12:46 [automatic1111] Failed to find a service to stop, checking again in 1 second
2024/08/06 00:12:46 [flux] Opened service connection 127.0.0.1:47534->127.0.0.1:18094
2024/08/06 00:12:46 [comfyui] New client connection received 127.0.0.1:8188->127.0.0.1:37708
2024/08/06 00:12:46 [comfyui] Reserving VRAM-GPU-1: 23900, RAM: 35072
2024/08/06 00:12:46 [comfyui] Not enough VRAM-GPU-1 to start. Total: 23900, In use: 22400, Required: 23900
2024/08/06 00:12:46 [comfyui] Failed to find a service to stop, checking again in 1 second
2024/08/06 00:12:47 [automatic1111] Failed to find a service to stop, checking again in 1 second
2024/08/06 00:12:47 [comfyui] Failed to find a service to stop, checking again in 1 second
2024/08/06 00:12:48 [automatic1111] Failed to find a service to stop, checking again in 1 second
2024/08/06 00:12:48 [comfyui] Failed to find a service to stop, checking again in 1 second
2024/08/06 00:12:49 [gemma] Closing service connection 127.0.0.1:32986->127.0.0.1:18081
2024/08/06 00:12:49 [gemma] Closing client connection [::1]:8081->[::1]:59704 on port 8081
2024/08/06 00:12:49 [gemma] Stopping service to free resources for automatic1111
2024/08/06 00:12:49 [gemma] Sending SIGTERM to service process: 2445743
2024/08/06 00:12:49 [comfyui] Failed to find a service to stop, checking again in 1 second
2024/08/06 00:12:49 [gemma] Done stopping pid 2445743
2024/08/06 00:12:49 [automatic1111] Starting "/home/perk11/LLM/stable-diffusion-webui/webui.sh --port 17860", log file: logs/automatic1111.log, workdir: /home/perk11/LLM/stable-diffusion-webui

lun-4 commented 3 months ago

while reviewing with fresh eyes, I found a possible deadlock on service startup. can you take a look with your workloads?
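
the fix follows the usual pattern of not sleeping while holding the lock: take it only for the check, release it before waiting, so the disconnect handler can run between iterations. a sketch using the same hypothetical helpers as above:

```go
package proxy

import (
	"sync"
	"time"
)

var runningServicesMutex sync.Mutex

// Same hypothetical helpers as in the earlier sketch:
func enoughResourcesFor(name string) bool { return true }
func startService(name string)            {}

// Re-acquire the mutex on each iteration and release it before
// sleeping, so other goroutines can free resources in between checks.
func startServiceWhenResourcesFree(name string) {
	for {
		runningServicesMutex.Lock()
		if enoughResourcesFor(name) {
			startService(name) // still under the lock, so the reservation stays atomic
			runningServicesMutex.Unlock()
			return
		}
		runningServicesMutex.Unlock()
		time.Sleep(time.Second) // lock released while waiting
	}
}
```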

perk11 commented 3 months ago

This resolved the issue I had.

I still encountered an issue with stopping the proxy via Ctrl+C, but I'm not sure if it's caused by this code or is a pre-existing issue. Going to merge this now; I will focus on tests next and then try to fix all the concurrency bugs.
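
For the tests, a race-detector smoke test is probably the first step. Something along these lines (a hypothetical sketch using the accessor names from the earlier sketch, in the same package), run with `go test -race`:

```go
package proxy

import (
	"sync"
	"testing"
)

// Hammer the guarded map from many goroutines at once. With the mutex
// in place this passes cleanly under `go test -race`; without it, the
// race detector flags the concurrent map access immediately.
func TestRunningServicesConcurrentAccess(t *testing.T) {
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(3)
		go func() {
			defer wg.Done()
			setRunningService("gemma", RunningService{pid: 1})
		}()
		go func() {
			defer wg.Done()
			getRunningService("gemma")
		}()
		go func() {
			defer wg.Done()
			deleteRunningService("gemma")
		}()
	}
	wg.Wait()
}
```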

perk11 commented 3 months ago

Thank you for all the work on this fix!