Closed jareddr closed 7 months ago
I'm investigating this more today, but I'm still stuck.
I've been looking at these lines in main.py
```python
class Args(object):
    host = '127.0.0.1'
    port = 8888
    base_url = None
    sync_repo = None
    disable_image_log = False
    skip_pip = False
    preload_pipeline = False
    queue_size = 100
    queue_history = 0
    preset = None
    webhook_url = None
    persistent = False
    always_gpu = False
    all_in_fp16 = False
    gpu_device_id = None,
    apikey = None
```
and
```python
if args.gpu_device_id is not None:
    os.environ['CUDA_VISIBLE_DEVICES'] = str(args.gpu_device_id)
    print("Set device to:", args.gpu_device_id)
```
There is a comma after `None` in `gpu_device_id = None,`, which makes the default value the one-element tuple `(None,)` instead of `None`. That allows the condition `if args.gpu_device_id is not None:` to evaluate to true, which then goes on to set `os.environ['CUDA_VISIBLE_DEVICES']` to the string `(None,)`. This causes the 'No CUDA GPUs' error I was seeing before.
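As a sanity check of that diagnosis, the trailing comma really is enough to break the guard; a minimal sketch, independent of the repo's code:

```python
import os

# A trailing comma turns a plain assignment into a one-element tuple.
gpu_device_id = None,

print(type(gpu_device_id))        # <class 'tuple'>
print(gpu_device_id is not None)  # True, because (None,) is not None
print(str(gpu_device_id))         # (None,)

# So the guard passes and CUDA_VISIBLE_DEVICES becomes the literal string
# '(None,)', which matches no device and hides every GPU from torch.
if gpu_device_id is not None:
    os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_device_id)
print(os.environ['CUDA_VISIBLE_DEVICES'])  # (None,)
```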
If I remove the comma after `None`, my model will build and boot correctly on replicate. However, something still seems off. The task runs for >100s, which makes me think it's running on CPU? I'm not sure how to verify that yet, but I'll keep digging.
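One way to verify where inference is actually running (a minimal sketch, assuming torch is importable inside the container; `model` is just a placeholder name for whatever pipeline object is loaded):

```python
import torch

# Can this process see a CUDA device at all?
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    # Non-zero allocated memory during a run is a good hint the weights are on the GPU.
    print("Allocated MB:", torch.cuda.memory_allocated(0) / 1024**2)

# For a specific model object (placeholder name), check where its weights live:
# next(model.parameters()).device  # should be 'cuda:0', not 'cpu'
```

Watching `nvidia-smi` on the host while the prediction runs would show the same thing.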
I'm testing this locally now with the following commands:
```
cog build
cog predict -i prompt='a cat'
```
If I don't make the `None,` change I mentioned above, I just get the `No CUDA GPUs available` error. With the change, it seems like the task queue isn't actually doing any work, ever. My GPU and CPU are both at 0 load and the logs just say it's waiting on the task queue forever-ish.
```
[Task Queue] Already waiting for 5091.4 seconds, job_id=f5f0be66-c65c-4da0-a2cc-e8dd4b450905
[Task Queue] Already waiting for 5101.4 seconds, job_id=f5f0be66-c65c-4da0-a2cc-e8dd4b450905
[Task Queue] Already waiting for 5111.4 seconds, job_id=f5f0be66-c65c-4da0-a2cc-e8dd4b450905
[Task Queue] Already waiting for 5121.4 seconds, job_id=f5f0be66-c65c-4da0-a2cc-e8dd4b450905
[Task Queue] Already waiting for 5131.5 seconds, job_id=f5f0be66-c65c-4da0-a2cc-e8dd4b450905
```
It seems like torch is correctly locating my GPU, as I also see this in the logs:
```
Preload pipeline
Total VRAM 24564 MB, total RAM 31666 MB
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 4090 : native
VAE dtype: torch.bfloat16
Using pytorch cross attention
Refiner unloaded.
model_type EPS
UNet ADM Dimension 2816
Using pytorch attention in VAE
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
Using pytorch attention in VAE
```
Wow, I was struggling with the same thing for 3 days, and my solution was almost the same, too. Let's talk in here to solve the issue.
@jareddr I think the issue is about `sys.argv`. Check this code, too.
just check out the cog realistic branch and everything works perfectly.
There are bugs and missing code in the main branch.
I think there was some issue with `sys.argv`, and maybe the router too? Because of the connection between gradio and launch, I think there might be an issue in there.
I don't know anything about gradio. I was purely using the cog system with `cog build` and `cog predict`.
@jareddr Do you have multiple GPUs on your local machine? The task often gets stuck because of an unexpected exception. You can check the log from just before it got stuck.
@konieshadow No, the problem was that I was using `cog predict` on the `main` branch. The `predict.py` code doesn't actually start a queue worker or call the process function; it puts a task in the queue and then waits for it to be done, but never processes it. Everything is working perfectly on the `cog-realistic-preset` branch.
It might be beneficial for others finding your lovely repo to merge some of those cog-based changes into main so they don't experience this confusion.
For example, here is the relevant snippet from predict.py on main:
```python
async_task = worker_queue.add_task(TaskType.text_2_img, {'params': params.__dict__, 'require_base64': False})
if async_task is None:
    print("[Task Queue] The task queue has reached limit")
    raise Exception(
        f"The task queue has reached limit."
    )
results = blocking_get_task_result(async_task.job_id)
```
and here it is on the `cog-realistic-preset` branch:
```python
queue_task = task_queue.add_task(TaskType.text_2_img, {'params': params.__dict__, 'require_base64': False})
if queue_task is None:
    print("[Task Queue] The task queue has reached limit")
    raise Exception(
        f"The task queue has reached limit."
    )
results = process_generate(queue_task, params)
output_paths: List[Path] = []
```
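To make the difference concrete: on main, the prediction path only enqueues the task and then blocks on its result, and nothing ever consumes the queue, so the call hangs. A minimal, hypothetical illustration of that failure mode (not the repo's actual code; all names here are stand-ins):

```python
import queue
import time

task_queue = queue.Queue()
results = {}

def generate(params):
    # Placeholder for the real image-generation step.
    return f"image for {params}"

def predict_on_main(job_id, params):
    task_queue.put((job_id, params))   # the task is enqueued...
    while job_id not in results:       # ...but no worker ever drains the queue,
        time.sleep(10)                 # so this loop waits forever
        print(f"[Task Queue] Already waiting, job_id={job_id}")

def predict_on_cog_branch(job_id, params):
    # The working branch processes the task inline instead of waiting on a worker.
    results[job_id] = generate(params)
    return results[job_id]
```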
The other issue on main is line 352 of main.py (https://github.com/konieshadow/Fooocus-API/blob/main/main.py#L352): `gpu_device_id = None,`. The comma after `None` creates the tuple `(None,)`, and later the code checks whether `gpu_device_id` is not `None`, which it isn't, because it's a tuple. The `cog-predict-preset` branch does not have this issue either.
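For anyone else hitting this, the fix is just dropping the trailing comma so the guard behaves as intended; a condensed sketch of the corrected lines, based on the snippets quoted above:

```python
import os

class Args(object):
    # Same default as main.py, but without the trailing comma,
    # so the value is None rather than the tuple (None,).
    gpu_device_id = None

args = Args()

if args.gpu_device_id is not None:
    # Only runs when a real device id was supplied, e.g. 0 or 1.
    os.environ['CUDA_VISIBLE_DEVICES'] = str(args.gpu_device_id)
    print("Set device to:", args.gpu_device_id)
```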
Hi there, sorry if this problem is purely on my end, but I'm stuck and don't know where to turn.
I wanted to push my own version of this to replicate because I wanted to add a different resolution to the list of possible resolutions.
I pulled the repo and just changed the cog.yaml to push to my replicate account instead of yours and then added my new resolution to the list of supported resolutions.
I can successfully run `cog build`, but when I `cog push` it to replicate or run `cog predict` locally on my machine I get this `No CUDA GPUs are available` error. I've tried a couple of variations on my computer, once in WSL2 while running Windows. I know I have the Nvidia docker support installed because I have another docker container running with GPU support. I also tried in regular linux on my local computer with all the nvidia drivers, cuda toolkit, and nvidia docker stuff installed, and I get the same error.
The most confusing thing is that I get this same error when I push to replicate. The above log is from my model trying to boot up after successfully being pushed to replicate. This part makes me think there is something wrong with my configuration rather than drivers on my computer.
Apologies again if this is unrelated and thanks for your time!