golemfactory / ya-runtime-ai

1 stars 1 forks source link

Automatic unrecoverable state after disabling GPU #99

Open pwalski opened 2 months ago

pwalski commented 2 months ago

Disabling GPU followed by sdapi/v1/txt2img request makes Automatic runtime to enter an unrecoverable state. (without deliberately disabling a GPU this can be reproduced e.g. by hosting runtime on gaming laptop and changing its power policy) Runtime logs:

[2024-04-17T10:10:32.636+0200 DEBUG ya_gsb_http_proxy::gsb_to_http] Calling http://localhost:7861/sdapi/v1/txt2img
[2024-04-17T10:10:32.636+0200 DEBUG reqwest::connect] starting new connection: http://localhost:7861/
[2024-04-17T10:10:32.636+0200 DEBUG hyper::client::connect::dns] resolving host="localhost"
[2024-04-17T10:10:32.637+0200 DEBUG hyper::client::connect::http] connecting to [::1]:7861
[2024-04-17T10:10:32.776+0200 DEBUG ya_runtime_ai] Successfully sent activity usage message
[2024-04-17T10:10:32.945+0200 DEBUG hyper::client::connect::http] connecting to 127.0.0.1:7861
[2024-04-17T10:10:32.946+0200 DEBUG hyper::client::connect::http] connected to 127.0.0.1:7861
[2024-04-17T10:10:32.946+0200 DEBUG hyper::proto::h1::io] flushed 466 bytes
[2024-04-17T10:10:33.241+0200 DEBUG reqwest::connect] starting new connection: http://localhost:7861/
[2024-04-17T10:10:33.241+0200 DEBUG hyper::client::connect::dns] resolving host="localhost"
[2024-04-17T10:10:33.242+0200 DEBUG hyper::client::connect::http] connecting to [::1]:7861
[2024-04-17T10:10:33.552+0200 DEBUG hyper::client::connect::http] connecting to 127.0.0.1:7861
[2024-04-17T10:10:33.553+0200 DEBUG hyper::client::connect::http] connected to 127.0.0.1:7861
[2024-04-17T10:10:33.553+0200 DEBUG hyper::proto::h1::io] flushed 53 bytes
[2024-04-17T10:10:33.555+0200 DEBUG hyper::proto::h1::io] parsed 5 headers
[2024-04-17T10:10:33.555+0200 DEBUG hyper::proto::h1::conn] incoming body is content-length (22 bytes)
[2024-04-17T10:10:33.787+0200 DEBUG ya_runtime_ai] Successfully sent activity usage message
[2024-04-17T10:10:33.913+0200 DEBUG ya_runtime_ai::process::automatic::monitor] > *** API error: POST: http://localhost:7861/sdapi/v1/txt2img {'error': 'RuntimeError', 'detail': '', 'body': '', 'errors': 'CUDA error: unknown error\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n'}
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >     Traceback (most recent call last):
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >       File "D:\Code\gamerhash-facade\modules\plugins\ya-automatic-ai\automatic\system\python\lib\site-packages\anyio\streams\memory.py", line 98, in receive
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >         return self.receive_nowait()
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >       File "D:\Code\gamerhash-facade\modules\plugins\ya-automatic-ai\automatic\system\python\lib\site-packages\anyio\streams\memory.py", line 93, in receive_nowait
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >         raise WouldBlock
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >     anyio.WouldBlock
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] > 
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >     During handling of the above exception, another exception occurred:
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] > 
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >     Traceback (most recent call last):
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >       File "D:\Code\gamerhash-facade\modules\plugins\ya-automatic-ai\automatic\system\python\lib\site-packages\starlette\middleware\base.py", line 78, in call_next
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >         message = await recv_stream.receive()
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >       File "D:\Code\gamerhash-facade\modules\plugins\ya-automatic-ai\automatic\system\python\lib\site-packages\anyio\streams\memory.py", line 118, in receive
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >         raise EndOfStream
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >     anyio.EndOfStream
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] > 
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >     During handling of the above exception, another exception occurred:
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] > 
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >     Traceback (most recent call last):
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >       File "D:\Code\gamerhash-facade\modules\plugins\ya-automatic-ai\automatic\webui\modules\api\api.py", line 187, in exception_handling
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >         return await call_next(request)
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >       File "D:\Code\gamerhash-facade\modules\plugins\ya-automatic-ai\automatic\system\python\lib\site-packages\starlette\middleware\base.py", line 84, in call_next
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >         raise app_exc
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] > INFO:     127.0.0.1:57355 - "POST /sdapi/v1/txt2img HTTP/1.1" 500 Internal Server Error
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >       File "D:\Code\gamerhash-facade\modules\plugins\ya-automatic-ai\automatic\system\python\lib\site-packages\starlette\middleware\base.py", line 70, in coro
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >         await self.app(scope, receive_or_disconnect, send_no_error)
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >       File "D:\Code\gamerhash-facade\modules\plugins\ya-automatic-ai\automatic\system\python\lib\site-packages\starlette\middleware\base.py", line 108, in __call__
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >         response = await self.dispatch_func(request, call_next)
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >       File "D:\Code\gamerhash-facade\modules\plugins\ya-automatic-ai\automatic\webui\modules\api\api.py", line 151, in log_and_time
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >         res: Response = await call_next(req)
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >       File "D:\Code\gamerhash-facade\modules\plugins\ya-automatic-ai\automatic\system\python\lib\site-packages\starlette\middleware\base.py", line 84, in call_next
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >         raise app_exc
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >       File "D:\Code\gamerhash-facade\modules\plugins\ya-automatic-ai\automatic\system\python\lib\site-packages\starlette\middleware\base.py", line 70, in coro
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >         await self.app(scope, receive_or_disconnect, send_no_error)
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >       File "D:\Code\gamerhash-facade\modules\plugins\ya-automatic-ai\automatic\system\python\lib\site-packages\starlette\middleware\cors.py", line 84, in __call__
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >         await self.app(scope, receive, send)
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >       File "D:\Code\gamerhash-facade\modules\plugins\ya-automatic-ai\automatic\system\python\lib\site-packages\starlette\middleware\gzip.py", line 26, in __call__
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >         await self.app(scope, receive, send)
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >       File "D:\Code\gamerhash-facade\modules\plugins\ya-automatic-ai\automatic\system\python\lib\site-packages\starlette\middleware\exceptions.py", line 79, in __call__
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >         raise exc
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >       File "D:\Code\gamerhash-facade\modules\plugins\ya-automatic-ai\automatic\system\python\lib\site-packages\starlette\middleware\exceptions.py", line 68, in __call__
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >         await self.app(scope, receive, sender)
[2024-04-17T10:10:33.915+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >       File "D:\Code\gamerhash-facade\modules\plugins\ya-automatic-ai\automatic\system\python\lib\site-packages\fastapi\middleware\asyncexitstack.py", line 21, in __call__
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >         raise e
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >       File "D:\Code\gamerhash-facade\modules\plugins\ya-automatic-ai\automatic\system\python\lib\site-packages\fastapi\middleware\asyncexitstack.py", line 18, in __call__
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >         await self.app(scope, receive, send)
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >       File "D:\Code\gamerhash-facade\modules\plugins\ya-automatic-ai\automatic\system\python\lib\site-packages\starlette\routing.py", line 718, in __call__
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >         await route.handle(scope, receive, send)
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >       File "D:\Code\gamerhash-facade\modules\plugins\ya-automatic-ai\automatic\system\python\lib\site-packages\starlette\routing.py", line 276, in handle
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >         await self.app(scope, receive, send)
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >       File "D:\Code\gamerhash-facade\modules\plugins\ya-automatic-ai\automatic\system\python\lib\site-packages\starlette\routing.py", line 66, in app
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >         response = await func(request)
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >       File "D:\Code\gamerhash-facade\modules\plugins\ya-automatic-ai\automatic\system\python\lib\site-packages\fastapi\routing.py", line 237, in app
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >         raw_response = await run_endpoint_function(
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >       File "D:\Code\gamerhash-facade\modules\plugins\ya-automatic-ai\automatic\system\python\lib\site-packages\fastapi\routing.py", line 165, in run_endpoint_function
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >         return await run_in_threadpool(dependant.call, **values)
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >       File "D:\Code\gamerhash-facade\modules\plugins\ya-automatic-ai\automatic\system\python\lib\site-packages\starlette\concurrency.py", line 41, in run_in_threadpool
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >         return await anyio.to_thread.run_sync(func, *args)
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >       File "D:\Code\gamerhash-facade\modules\plugins\ya-automatic-ai\automatic\system\python\lib\site-packages\anyio\to_thread.py", line 33, in run_sync
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >         return await get_asynclib().run_sync_in_worker_thread(
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >       File "D:\Code\gamerhash-facade\modules\plugins\ya-automatic-ai\automatic\system\python\lib\site-packages\anyio\_backends\_asyncio.py", line 877, in run_sync_in_worker_thread
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >         return await future
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >       File "D:\Code\gamerhash-facade\modules\plugins\ya-automatic-ai\automatic\system\python\lib\site-packages\anyio\_backends\_asyncio.py", line 807, in run
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >         result = context.run(func, *args)
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >       File "D:\Code\gamerhash-facade\modules\plugins\ya-automatic-ai\automatic\webui\modules\api\api.py", line 380, in text2imgapi
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >         processed = process_images(p)
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >       File "D:\Code\gamerhash-facade\modules\plugins\ya-automatic-ai\automatic\webui\modules\processing.py", line 732, in process_images
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >         res = process_images_inner(p)
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >       File "D:\Code\gamerhash-facade\modules\plugins\ya-automatic-ai\automatic\webui\modules\processing.py", line 856, in process_images_inner
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >         p.setup_conds()
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >       File "D:\Code\gamerhash-facade\modules\plugins\ya-automatic-ai\automatic\webui\modules\processing.py", line 1309, in setup_conds
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >         super().setup_conds()
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >       File "D:\Code\gamerhash-facade\modules\plugins\ya-automatic-ai\automatic\webui\modules\processing.py", line 470, in setup_conds
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >         self.c = self.get_conds_with_caching(prompt_parser.get_multicond_learned_conditioning, prompts, total_steps, [self.cached_c], self.extra_network_data)
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >       File "D:\Code\gamerhash-facade\modules\plugins\ya-automatic-ai\automatic\webui\modules\processing.py", line 455, in get_conds_with_caching
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >         cache[1] = function(shared.sd_model, required_prompts, steps, hires_steps, shared.opts.use_old_scheduling)
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >       File "D:\Code\gamerhash-facade\modules\plugins\ya-automatic-ai\automatic\webui\modules\prompt_parser.py", line 262, in get_multicond_learned_conditioning
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >         learned_conditioning = get_learned_conditioning(model, prompt_flat_list, steps, hires_steps, use_old_scheduling)
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >       File "D:\Code\gamerhash-facade\modules\plugins\ya-automatic-ai\automatic\webui\modules\prompt_parser.py", line 189, in get_learned_conditioning
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >         conds = model.get_learned_conditioning(texts)
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >       File "D:\Code\gamerhash-facade\modules\plugins\ya-automatic-ai\automatic\webui\repositories\stable-diffusion-stability-ai\ldm\models\diffusion\ddpm.py", line 669, in get_learned_conditioning
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >         c = self.cond_stage_model(c)
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >       File "D:\Code\gamerhash-facade\modules\plugins\ya-automatic-ai\automatic\system\python\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >         return forward_call(*args, **kwargs)
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >       File "D:\Code\gamerhash-facade\modules\plugins\ya-automatic-ai\automatic\webui\modules\sd_hijack_clip.py", line 234, in forward
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >         z = self.process_tokens(tokens, multipliers)
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >       File "D:\Code\gamerhash-facade\modules\plugins\ya-automatic-ai\automatic\webui\modules\sd_hijack_clip.py", line 265, in process_tokens
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >         tokens = torch.asarray(remade_batch_tokens).to(devices.device)
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >     RuntimeError: CUDA error: unknown error
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >     CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >     For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] >     Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] > 
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] > 
[2024-04-17T10:10:33.916+0200 DEBUG ya_runtime_ai::process::automatic::monitor] > ---
[2024-04-17T10:10:33.916+0200 DEBUG hyper::proto::h1::io] parsed 4 headers
[2024-04-17T10:10:33.916+0200 DEBUG hyper::proto::h1::conn] incoming body is content-length (331 bytes)
[2024-04-17T10:10:33.916+0200 DEBUG hyper::proto::h1::conn] incoming body completed
[2024-04-17T10:10:33.916+0200 DEBUG hyper::client::pool] pooling idle connection for ("http", localhost:7861)
[2024-04-17T10:10:33.916+0200 INFO  ya_gsb_http_proxy::gsb_to_http] sending GsbEvent nr 0

Activity does not get stopped. Re-enabling GPU does not resolve the issue - API response does not change after enabling it.

We could monitor for log message RuntimeError: CUDA error: unknown error and end activity when it happens.

The problem is Automatic breaks up silently whenever GPU turns off, but any error log we could track appears only after an API request.

Originally posted by @pwalski in #79