meta-llama / llama3

The official Meta Llama 3 GitHub site

The client socket has failed to connect to [Maxim]:12355 (system error: 10049 - The requested address is not valid in its context.). #205

Closed nightsSeeker closed 3 days ago

nightsSeeker commented 3 weeks ago

I am trying to use the example repo to get an initial output from the 70B-Instruct Meta model. However, I am stuck on what seems to be a PyTorch issue. I isolated it down to this piece of code:

```python
if not torch.distributed.is_initialized():
    torch.distributed.init_process_group(
        backend='gloo',
        init_method='tcp://localhost:12355',
        rank=torch.cuda.device_count(),  # <-- this line hangs and causes the error below
        world_size=8,
    )
if not model_parallel_is_initialized():
    if model_parallel_size is None:
        model_parallel_size = int(os.environ.get("WORLD_SIZE", 8))
    initialize_model_parallel(model_parallel_size)
```

Error:

```
[W socket.cpp:697] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:18355 (system error: 10049 - The requested address is not valid in its context.).
```

I have CUDA 12.1 and the latest PyTorch installed. I am on Windows, hence the backend change to gloo. I have tried it on my other machines with the same issue. I disconnected the internet and it still persists. Eventually, I tried it on a friend's machine nearby and he also faced the same issue.
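For comparison, here is a minimal sketch of how the rendezvous is usually set up. It assumes torchrun-style `RANK`/`WORLD_SIZE`/`MASTER_ADDR`/`MASTER_PORT` environment variables (an assumption, not from this thread); the key point is that `rank` must be this process's own index in `0..world_size-1`, whereas `torch.cuda.device_count()` gives every process the same (and likely out-of-range) value, so the rendezvous can never complete:

```python
import os
import torch.distributed as dist

def init_distributed():
    # Each spawned process must pass its OWN rank (0..world_size-1).
    # torchrun sets RANK and WORLD_SIZE for each worker; defaults here
    # make the sketch runnable as a single process.
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    if not dist.is_initialized():
        dist.init_process_group(
            backend="gloo",        # gloo works on Windows; nccl does not
            init_method="env://",  # reads MASTER_ADDR / MASTER_PORT from env
            rank=rank,
            world_size=world_size,
        )
```

With `init_method="env://"` there is no hard-coded `tcp://localhost:12355` address for Windows name resolution to mangle.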

subramen commented 2 weeks ago

The error is probably related to the `init_method` arg you have passed... why are you passing that in?

Ensure your machine has 8 GPUs, as that is a requirement for running 70B with this repo. If not, you can use HF to load the 70B model.
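If going the HF route, something like the following sketch is the usual pattern (the model id and `device_map` choice are assumptions, not from this thread; the repo is gated and requires approved access on the Hub):

```python
def load_llama3_70b():
    """Sketch: load Llama 3 70B Instruct via Hugging Face Transformers.

    device_map="auto" lets accelerate shard the weights across whatever
    GPUs are available (offloading the remainder to CPU), so a strict
    8-GPU layout is not required the way it is with the reference repo.
    """
    # Imports kept local so the sketch can be pasted without pulling in
    # transformers at module import time.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Meta-Llama-3-70B-Instruct"  # assumed gated HF repo id
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype="auto",   # use the checkpoint's native dtype (bf16)
        device_map="auto",    # shard across available devices
    )
    return tokenizer, model
```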

Running on Windows is possible with gloo; please take a look at https://github.com/meta-llama/llama3/issues/127#issuecomment-2075800144 for how they did it.