mrhan1993 / Fooocus-API

FastAPI powered API for Fooocus
GNU General Public License v3.0

No CUDA GPUs are available. #212

Closed jareddr closed 7 months ago

jareddr commented 7 months ago

Hi there, sorry if this problem is purely on my end, but I'm stuck and don't know where to turn.

I wanted to push my own version of this to Replicate so I could add a different resolution to the list of supported resolutions.

I pulled the repo, changed cog.yaml to push to my Replicate account instead of yours, and then added my new resolution to the list of supported resolutions.

I can successfully run cog build, but when I cog push it to Replicate or run cog predict locally on my machine, I get this "No CUDA GPUs are available" error.

^^^^^^^^^^^^^^^^^^
File "/src/repositories/Fooocus/ldm_patched/modules/model_management.py", line 87, in get_torch_device
return torch.device(torch.cuda.current_device())
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/torch/cuda/__init__.py", line 769, in current_device
_lazy_init()
File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/torch/cuda/__init__.py", line 298, in _lazy_init
torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available
Traceback (most recent call last):
  File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/cog/server/runner.py", line 317, in setup
    for event in worker.setup():
  File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/cog/server/worker.py", line 126, in _wait
    raise FatalWorkerException(raise_on_error + ": " + done.error_detail)
cog.server.exceptions.FatalWorkerException: Predictor errored during setup: No CUDA GPUs are available

I've tried a couple of variations on my computer, once in WSL2 while running Windows. I know I have NVIDIA Docker support installed because I have another Docker container running with GPU support. I also tried on regular Linux on my local machine with the NVIDIA drivers, CUDA toolkit, and NVIDIA Docker support installed, and I get the same error.

The most confusing thing is that I get this same error when I push to Replicate. The above log is from my model trying to boot up after being successfully pushed to Replicate. This makes me think there is something wrong with my configuration rather than the drivers on my computer.

Apologies again if this is unrelated and thanks for your time!

jareddr commented 7 months ago

I'm investigating this more today, but I'm still stuck.

I've been looking at these lines in main.py

    class Args(object):
        host = '127.0.0.1'
        port = 8888
        base_url = None
        sync_repo = None
        disable_image_log = False
        skip_pip = False
        preload_pipeline = False
        queue_size = 100
        queue_history = 0
        preset = None
        webhook_url = None
        persistent = False
        always_gpu = False
        all_in_fp16 = False
        gpu_device_id = None,
        apikey = None

and

    if args.gpu_device_id is not None:
        os.environ['CUDA_VISIBLE_DEVICES'] = str(args.gpu_device_id)
        print("Set device to:", args.gpu_device_id)

There is a comma after None in gpu_device_id = None, which makes the value a one-element tuple (None,) instead of None. The condition if args.gpu_device_id is not None: therefore evaluates to true, and os.environ['CUDA_VISIBLE_DEVICES'] gets set to the string "(None,)".

This causes the 'No CUDA GPUs are available' error I was seeing before.
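
As a minimal standalone sketch of the pitfall (not the actual main.py code, just the same logic in isolation):

    import os

    # The trailing comma makes the value a one-element tuple, not None.
    gpu_device_id = None,
    print(type(gpu_device_id))        # <class 'tuple'>
    print(gpu_device_id is not None)  # True, so the guard below passes

    if gpu_device_id is not None:
        # str((None,)) == "(None,)", which matches no real device index,
        # so CUDA sees zero usable GPUs and torch later raises
        # "No CUDA GPUs are available" at init time.
        os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_device_id)

    print(os.environ['CUDA_VISIBLE_DEVICES'])  # (None,)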

If I remove the comma after None, my model builds and boots correctly on Replicate. However, something still seems off. The task runs for >100s, which makes me think it's running on the CPU. I'm not sure how to verify that yet, but I'll keep digging.
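
One quick way to check whether work is actually landing on the GPU, as a minimal sketch assuming torch is importable in the same environment:

    import torch

    # Confirm torch can see the GPU at all.
    print(torch.cuda.is_available())      # expect True
    print(torch.cuda.get_device_name(0))  # e.g. NVIDIA GeForce RTX 4090

    # Run a small matmul on the GPU and confirm memory actually gets allocated there.
    x = torch.randn(2048, 2048, device="cuda")
    y = x @ x
    torch.cuda.synchronize()
    print(f"{torch.cuda.memory_allocated() / 1024**2:.0f} MiB allocated on the GPU")

Watching nvidia-smi while a prediction runs is another way to tell whether the load is on the GPU or the CPU.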

jareddr commented 7 months ago

I'm testing this locally now with the following commands:

    cog build
    cog predict -i prompt='a cat'

If I don't remove the trailing comma as mentioned above, I just get the "No CUDA GPUs are available" error. With the change, it seems like the task queue never actually does any work. My GPU and CPU are both at 0% load, and the logs just keep saying it's waiting on the task queue, more or less forever:

[Task Queue] Already waiting for 5091.4 seconds, job_id=f5f0be66-c65c-4da0-a2cc-e8dd4b450905
[Task Queue] Already waiting for 5101.4 seconds, job_id=f5f0be66-c65c-4da0-a2cc-e8dd4b450905
[Task Queue] Already waiting for 5111.4 seconds, job_id=f5f0be66-c65c-4da0-a2cc-e8dd4b450905
[Task Queue] Already waiting for 5121.4 seconds, job_id=f5f0be66-c65c-4da0-a2cc-e8dd4b450905
[Task Queue] Already waiting for 5131.5 seconds, job_id=f5f0be66-c65c-4da0-a2cc-e8dd4b450905

It seems like torch is correctly locating my GPU, as I can also see in the logs:

Preload pipeline
Total VRAM 24564 MB, total RAM 31666 MB
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 4090 : native
VAE dtype: torch.bfloat16
Using pytorch cross attention
Refiner unloaded.
model_type EPS
UNet ADM Dimension 2816
Using pytorch attention in VAE
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
Using pytorch attention in VAE
jyoung105 commented 7 months ago

> I'm investigating this more today, but I'm still stuck. […] If I remove the comma after None my model will build and boot correctly on replicate. However, something still seems off. […]

Wow, I was struggling with the same thing for 3 days, and my solution was almost the same. Let's discuss here and solve the issue.

jyoung105 commented 7 months ago

@jareddr I think the issue is related to sys.argv. Check this code, too.

jareddr commented 7 months ago

Just check out the cog-realistic-preset branch and everything works perfectly.

There are bugs and missing code in the main branch.

jyoung105 commented 7 months ago

I think there was some issue with sys.argv, and maybe the router too. Because of how gradio and the launch code are connected, I think the issue might be in there.

jareddr commented 7 months ago

I don't know anything about gradio. I was purely using the cog system with cog build and cog predict.

konieshadow commented 7 months ago

@jareddr Do you have multiple GPUs on your local machine? The task often gets stuck because of an unexpected exception. You can check the log from just before it got stuck.

jareddr commented 7 months ago

@konieshadow No, the problem was that I was using cog predict on the main branch. The predict.py code doesn't actually start a queue worker or call the process function; it puts a task in the queue and then waits for it to finish, but nothing ever processes it.

Everything is working perfectly on the cog-realistic-preset branch.

It might be beneficial for others finding your lovely repo to merge some of those cog-based changes into main so they don't run into this confusion.

jareddr commented 7 months ago

For example, here is the relevant snippet from predict.py on main:

        async_task = worker_queue.add_task(TaskType.text_2_img, {'params': params.__dict__, 'require_base64': False})
        if async_task is None:
            print("[Task Queue] The task queue has reached limit")
            raise Exception(
                f"The task queue has reached limit."
            )
        results = blocking_get_task_result(async_task.job_id)

and here it is on the cog-realistic-preset branch:

        queue_task = task_queue.add_task(TaskType.text_2_img, {'params': params.__dict__, 'require_base64': False})
        if queue_task is None:
            print("[Task Queue] The task queue has reached limit")
            raise Exception(
                f"The task queue has reached limit."
            )
        results = process_generate(queue_task, params)

        output_paths: List[Path] = []
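
To make the difference concrete, here is a self-contained sketch in plain Python (not the repo's actual worker_queue / process_generate API) of why enqueueing a task and then blocking on its result hangs when nothing ever consumes the queue:

    import queue
    import threading
    import time

    task_queue = queue.Queue()
    results = {}

    def blocking_get_result(job_id, poll=1.0):
        # Mirrors the "Already waiting for N seconds" loop: it only polls for
        # a finished result, it never processes the task itself.
        waited = 0.0
        while job_id not in results:
            time.sleep(poll)
            waited += poll
            print(f"[Task Queue] Already waiting for {waited:.1f} seconds, job_id={job_id}")
        return results[job_id]

    def worker():
        # Without a consumer like this (or a direct call to the processing
        # function), the polling loop above spins forever.
        while True:
            job_id = task_queue.get()
            results[job_id] = f"done: {job_id}"
            task_queue.task_done()

    threading.Thread(target=worker, daemon=True).start()  # comment this out to reproduce the hang
    task_queue.put("f5f0be66")
    print(blocking_get_result("f5f0be66"))

With the worker thread commented out, the script just keeps printing the same waiting messages, which matches the behaviour of cog predict on main.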

The other issue on main is line 352 of main.py https://github.com/konieshadow/Fooocus-API/blob/main/main.py#L352

 gpu_device_id = None,

The comma after None creates a tuple (None,), and when the code later checks whether gpu_device_id is not None, the check passes because the value is a tuple rather than None. The cog-predict-preset branch does not have this issue either.
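
For reference, a condensed sketch of how the corrected default and guard behave once the comma is removed (simplified down from the full Args class):

    import os

    class Args(object):
        gpu_device_id = None  # no trailing comma: the default really is None

    args = Args()

    if args.gpu_device_id is not None:  # now skipped when no device id is supplied
        os.environ['CUDA_VISIBLE_DEVICES'] = str(args.gpu_device_id)
        print("Set device to:", args.gpu_device_id)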