Issue with multiple CUDA GPUs (lightning\fabric\utilities\distributed.py:244)

Blueblade11 commented 1 year ago

I'm using two different GPUs (3070 Ti and 4070 Ti) installed in my desktop. Specs: Ryzen 5950X, 32 GB RAM, Windows 11, conda, lightning==2.0.2, so-vits-svc-fork==3.11.0

It seems there's still an issue with multiple CUDA-enabled GPUs when running svc train -t.

I'm running into this error:

[02:33:54] INFO     [02:33:54] Loaded checkpoint 'logs\44k\G_7500.pth' (iteration 7500)                 utils.py:259
           INFO     [02:33:54] Loaded checkpoint 'logs\44k\D_7500.pth' (iteration 7500)                 utils.py:259
INFO: MASTER_ADDR, MASTER_PORT: 127.0.0.1, 53717
           INFO     [02:33:54] MASTER_ADDR, MASTER_PORT: 127.0.0.1, 53717              distributed.py:244
INFO: Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
           INFO     [02:33:54] Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2              distributed.py:245
INFO: torch_distributed_backend: gloo
           INFO     [02:33:54] torch_distributed_backend: gloo                                                        distributed.py:246
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:53717 (system error: 10049 - The requested address is not valid in its context.).
s:\VS Code\so-vits-svc-fork\.conda\python.exe: Error while finding module specification for '__main__' (ValueError: __main__.__spec__ is None)

I added a few lines for debugging at lightning\fabric\utilities\distributed.py:244:

    log.info(f"MASTER_ADDR, MASTER_PORT: {os.environ['MASTER_ADDR']}, {os.environ['MASTER_PORT']}")
    log.info(f"Initializing distributed: GLOBAL_RANK: {global_rank}, MEMBER: {global_rank + 1}/{world_size}")
    log.info(f"torch_distributed_backend: {torch_distributed_backend}")
    torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
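The log above shows the rendezvous address set to 127.0.0.1 while the client socket tries to reach kubernetes.docker.internal, so it may help to see how the address resolves on the machine in both directions. The following is only a diagnostic sketch, not part of Lightning: `describe_rendezvous` is a hypothetical helper, and the fallback values for `MASTER_ADDR` and `MASTER_PORT` are simply copied from the log.

```python
import os
import socket


def describe_rendezvous(addr: str, port: int) -> dict:
    """Collect how the rendezvous address resolves on this machine."""
    info = {"addr": addr, "port": port}
    # Forward lookup: which IP addresses does the name map to?
    try:
        results = socket.getaddrinfo(addr, port, type=socket.SOCK_STREAM)
        info["resolved"] = sorted({r[4][0] for r in results})
    except socket.gaierror as exc:
        info["resolved"] = []
        info["forward_error"] = str(exc)
    # Reverse lookup: which hostname does the OS report for the address?
    try:
        info["reverse_name"] = socket.gethostbyaddr(addr)[0]
    except OSError as exc:  # covers socket.herror and socket.gaierror
        info["reverse_name"] = None
        info["reverse_error"] = str(exc)
    return info


if __name__ == "__main__":
    # Fallbacks are the values seen in the log; they are assumptions.
    addr = os.environ.get("MASTER_ADDR", "127.0.0.1")
    port = int(os.environ.get("MASTER_PORT", "53717"))
    for key, value in describe_rendezvous(addr, port).items():
        print(f"{key}: {value}")
```

If `reverse_name` comes back as kubernetes.docker.internal, that would at least suggest where the hostname in the socket warning originates, though it would not by itself explain error 10049.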

More info: nvidia-smi.txt, requirements.txt, log.txt

34j commented 1 year ago

Do you have any solutions?

Blueblade11 commented 1 year ago

> Do you have any solutions?

I've been studying the Fabric docs. To be honest, I'm very much a novice, but I'm definitely eager to take a fair shot at figuring this out.

I'll update if I have anything notable.

So far, all I've really done is swap hardware around and configure network settings. I noticed the address from the socket error above shows up in the hosts file on Windows, but I have no idea how it relates 😅
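To check which names the hosts file attaches to an address, a small parser can list the entries. This is a rough sketch: `parse_hosts` and `names_for` are hypothetical helpers, and the path below is the usual Windows location, assumed rather than confirmed for this machine.

```python
from pathlib import Path

# Usual location of the hosts file on Windows (assumed standard install).
WINDOWS_HOSTS = Path(r"C:\Windows\System32\drivers\etc\hosts")


def parse_hosts(text: str) -> dict[str, list[str]]:
    """Map each IP address in hosts-file text to the names listed for it."""
    mapping: dict[str, list[str]] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        ip, *names = line.split()
        mapping.setdefault(ip, []).extend(names)
    return mapping


def names_for(text: str, ip: str) -> list[str]:
    """Return the hostnames the hosts-file text assigns to one IP address."""
    return parse_hosts(text).get(ip, [])


if __name__ == "__main__":
    if WINDOWS_HOSTS.exists():
        content = WINDOWS_HOSTS.read_text(encoding="utf-8", errors="replace")
        for ip, names in parse_hosts(content).items():
            print(ip, "->", ", ".join(names))
    else:
        print(f"{WINDOWS_HOSTS} not found on this system")
```

If 127.0.0.1 is listed with kubernetes.docker.internal, the loopback address is being given that name locally, which would be consistent with the hostname in the warning. Whether that mapping is what causes the failed connection is still an open question.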

voicepaw / so-vits-svc-fork

Issue with multiple CUDA GPUs (lightning\fabric\utilities\distributed.py:244) #481