nod-ai / SHARK-Studio

SHARK Studio -- Web UI for SHARK+IREE High Performance Machine Learning Distribution
Apache License 2.0

6600XT, resource exhausted error across multiple builds and multiple methods (img2img, inpainting) #1306

Closed: SourStrips closed this issue 1 year ago

SourStrips commented 1 year ago

Recently started getting this error; regular txt2img works fine. I did update drivers recently as well, due to a Windows update mishap.

```
WARNING: [Loader Message] Code 0 : windows_read_data_files_in_registry: Registry lookup failed to get layer manifest files.
  0%|          | 0/1 [00:05<?, ?it/s]
Traceback (most recent call last):
  File "gradio\routes.py", line 401, in run_predict
  File "gradio\blocks.py", line 1302, in process_api
  File "gradio\blocks.py", line 1025, in call_function
  File "anyio\to_thread.py", line 31, in run_sync
  File "anyio\_backends\_asyncio.py", line 937, in run_sync_in_worker_thread
  File "anyio\_backends\_asyncio.py", line 867, in run
  File "ui\img2img_ui.py", line 231, in img2img_inf
  File "apps\stable_diffusion\src\pipelines\pipeline_shark_stable_diffusion_stencil.py", line 265, in generate_images
  File "apps\stable_diffusion\src\pipelines\pipeline_shark_stable_diffusion_utils.py", line 172, in decode_latents
  File "shark\shark_inference.py", line 138, in __call__
  File "shark\shark_runner.py", line 93, in run
  File "shark\iree_utils\compile_utils.py", line 385, in get_results
  File "iree\runtime\function.py", line 130, in __call__
  File "iree\runtime\function.py", line 154, in _invoke
RuntimeError: Error invoking function: D:\a\SHARK-Runtime\SHARK-Runtime\c\runtime\src\iree\hal\drivers\vulkan\native_semaphore.cc:155: RESOURCE_EXHAUSTED; overflowed timeline semaphore max value; while invoking native function hal.fence.await; while calling import;
[ 1] native hal.fence.await:0 -
[ 0] bytecode module@0:32986 -
```
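Note: the failure surfaces as an opaque `RuntimeError` from deep inside the IREE Vulkan HAL rather than an explicit out-of-memory message. A minimal sketch of how a caller could translate it into something actionable (`invoke_with_oom_hint` and `invoke` are hypothetical stand-ins, not SHARK's actual error handling):

```python
# Minimal sketch, not SHARK code: wrap an IREE module invocation so the
# RESOURCE_EXHAUSTED RuntimeError raised from hal.fence.await is
# reported as a likely out-of-memory condition instead of a raw trace.

def invoke_with_oom_hint(invoke, *args):
    # `invoke` is a hypothetical stand-in for the compiled-module call
    # that fails inside decode_latents above.
    try:
        return invoke(*args)
    except RuntimeError as e:
        if "RESOURCE_EXHAUSTED" in str(e):
            raise RuntimeError(
                "Device resources exhausted (likely VRAM); try a "
                "smaller resolution or the low-VRAM option."
            ) from e
        raise
```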

nirvedhmeshram commented 1 year ago

I am facing this on an RTX 4000, which is an 8GB GPU just like the 6600XT. Interestingly, at the time of failure I see it using almost all the memory (8015MiB / 8192MiB), so this might be an OOM. Any ideas why it's happening now? It used to work in older builds. @powderluv, do you or anyone else know? I also tried the low VRAM option, but it didn't do anything.

nirvedhmeshram commented 1 year ago

Yes, I can confirm this issue goes away with smaller sizes like 384x384. The default is 512x512 and I was using that, so it's an out-of-memory (OOM) error; we could work on handling this error better in the future. @SourStrips Closing the issue, but feel free to reopen if needed.
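For intuition on why 384x384 fits where 512x512 does not: activation memory in the UNet and VAE decoder scales roughly with pixel count, and 512x512 has about 1.78x the pixels of 384x384. A rough sketch (assumes the usual Stable Diffusion layout of 4-channel latents at 1/8 spatial resolution and fp16 tensors; real peak usage during decode is far above these raw tensor sizes):

```python
# Back-of-the-envelope tensor sizes for SD latents/images at the two
# resolutions mentioned above. Decoder activation memory scales roughly
# with H*W, which is why 384x384 can fit where 512x512 does not.

def latent_bytes(h, w, channels=4, dtype_bytes=2):  # fp16 latents
    return (h // 8) * (w // 8) * channels * dtype_bytes

def image_bytes(h, w, channels=3, dtype_bytes=2):  # fp16 decoded image
    return h * w * channels * dtype_bytes

for h, w in [(512, 512), (384, 384)]:
    print(f"{h}x{w}: latents={latent_bytes(h, w) / 1024:.0f} KiB, "
          f"decoded image={image_bytes(h, w) / 1024:.0f} KiB, "
          f"pixel ratio vs 384x384={h * w / (384 * 384):.2f}x")
```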

SourStrips commented 1 year ago

@nirvedhmeshram is there a fix in the works? Before, I was rendering hundreds of images a day at 768x512 or 512x768; it suddenly stopped working. Are you saying that the error is on my end, from my card?

powderluv commented 1 year ago

Let's re-open and track the IR changes. If it worked in the past, we should be able to get back to that state at least.

nirvedhmeshram commented 1 year ago

@SourStrips can you check now? After https://github.com/nod-ai/SHARK/pull/1339 landed, it is working on my GPU.

SourStrips commented 1 year ago

Yes, it works now, thank you! I did have to remove some arguments to make it work reliably, though. I'm going to slowly add them back in to see which one is causing the error again.

SourStrips commented 1 year ago

FYI, it seems that `--device_allocator=caching` is the main culprit causing this error now that it's fixed. Without this argument, it works perfectly.
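For anyone reproducing this, the flag is passed on the command line when launching the app. A sketch of the two configurations being compared (the entry-point path below is an assumption about the repo layout, not a verified invocation; adjust to your checkout):

```python
# Sketch only: launching the SHARK web UI with and without the caching
# allocator. The entry-point path is an assumption, not verified.
import subprocess

# Default launch (no device allocator) -- the configuration that works
# reliably per this thread.
default_cmd = ["python", "apps/stable_diffusion/web/index.py"]

# Opt-in caching allocator -- faster it/s, but the culprit here.
caching_cmd = default_cmd + ["--device_allocator=caching"]

subprocess.run(default_cmd)  # swap in caching_cmd to reproduce the error
```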

nirvedhmeshram commented 1 year ago

Thanks for pointing this out. Curious: based on looking at the code, the default behavior is to not specify any device allocator, so only advanced users trying command-line arguments are likely to use this. Is that correct?
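A sketch of the opt-in behavior being described here (this mirrors the thread's description of the default, not the literal SHARK argument parser):

```python
import argparse

parser = argparse.ArgumentParser()
# Per the discussion above, no allocator wrapper is applied unless the
# user explicitly opts in, so the default is None rather than "caching".
parser.add_argument(
    "--device_allocator",
    type=str,
    default=None,
    help="Optional device allocator wrapper, e.g. 'caching'.",
)

args = parser.parse_args(["--device_allocator=caching"])
assert args.device_allocator == "caching"
```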

SourStrips commented 1 year ago

Probably regular users as well; I have seen it mentioned in the Discord, since it gives a generous boost to it/s. It was working before without doing anything extra except adding the argument. I wonder what happened.

nirvedhmeshram commented 1 year ago

I see. Would you mind filing a separate issue with the error message for that so we can track it? Unless it's giving the exact same error as this one with the flag, in which case we can reopen this issue.

SourStrips commented 1 year ago

I will create a new ticket with a description so it's clearer what is going on.

SourStrips commented 1 year ago

Well, SHARK just wants to make me look like a dummy. I tried to get the error to copy and paste for you, but now it has decided to start working ¯\\_(ツ)_/¯

nirvedhmeshram commented 1 year ago

I can believe the caching allocator has intermittent issues; we will have to think about how to make it reliable. Thanks for pointing it out anyway.
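One possible direction for making it reliable, sketched with hypothetical names (`make_pipeline` is not a real SHARK API): opt in to the caching allocator for speed, but retry without it when the device reports RESOURCE_EXHAUSTED.

```python
# Hypothetical mitigation sketch, not SHARK code: try the caching
# allocator first for its it/s boost, then fall back to the default
# (no allocator wrapper) on RESOURCE_EXHAUSTED.

def generate_with_fallback(make_pipeline, prompt):
    # make_pipeline(allocator) is a hypothetical factory; allocator=None
    # mirrors the default behavior described earlier in this thread.
    for allocator in ("caching", None):
        try:
            return make_pipeline(allocator).generate(prompt)
        except RuntimeError as e:
            if allocator is not None and "RESOURCE_EXHAUSTED" in str(e):
                continue  # retry without the caching allocator
            raise
```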