microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

LSB_AFFINITY_HOSTFILE could not be found #6581

Open fabiosanger opened 1 day ago

fabiosanger commented 1 day ago

I am using the following mpirun command:

# Run the DeepSpeed job
mpirun \
    -H \$MPI_HOST_STRING \
    --bind-to socket \
    --map-by slot \
    --display-allocation \
    --display-map \
    deepspeed \
    -H $HOSTFILE_PATH \
    --launcher openmpi \
    --no_ssh \
    --master_addr \${MASTER_ADDR} \
    --master_port=\${MASTER_PORT} \
    src/dna_mlm/runner.py

MPI_HOST_STRING and the hostfile:

MPI_HOST_STRING=farm22-gpu0103:2,farm22-gpu0104:2

# hostfile ($HOSTFILE_PATH)
farm22-gpu0103 slots=2
farm22-gpu0104 slots=2
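The two encode the same hosts and slots; below is a tiny sketch of a hypothetical helper (not anything in the repo) that derives the -H string from the hostfile, with "hostfile.txt" standing in for $HOSTFILE_PATH:

def host_string_from_hostfile(path="hostfile.txt"):
    """Turn 'host slots=N' lines into the comma-separated 'host:N' form."""
    entries = []
    with open(path) as f:
        for line in f:
            line = line.split("#", 1)[0].strip()   # ignore comments and blank lines
            if not line:
                continue
            host, _, slots = line.partition("slots=")
            entries.append(f"{host.strip()}:{slots.strip() or 1}")
    return ",".join(entries)

# "farm22-gpu0103 slots=2" / "farm22-gpu0104 slots=2" -> "farm22-gpu0103:2,farm22-gpu0104:2"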

The mapping reported by --display-map looks correct:

======================== JOB MAP ========================

Data for JOB prterun-farm22-gpu0103-226592@1 offset 0
Total slots allocated 128
Mapping policy: BYSLOT:NOOVERSUBSCRIBE  Ranking policy: SLOT
Binding policy: PACKAGE  Cpu set: N/A  PPR: N/A  Cpus-per-rank: N/A  Cpu Type: CORE

Data for node: farm22-gpu0103  Num slots: 2  Max slots: 0  Num procs: 2
    Process jobid: prterun-farm22-gpu0103-226592@1  App: 0  Process rank: 0  Bound: package[0][core:0-15]
    Process jobid: prterun-farm22-gpu0103-226592@1  App: 0  Process rank: 1  Bound: package[0][core:0-15]

Data for node: farm22-gpu0104  Num slots: 2  Max slots: 0  Num procs: 2
    Process jobid: prterun-farm22-gpu0103-226592@1  App: 0  Process rank: 2  Bound: package[0][core:0-15]
    Process jobid: prterun-farm22-gpu0103-226592@1  App: 0  Process rank: 3  Bound: package[0][core:0-15]

=============================================================

but the job aborts with the following error:


--------------------------------------------------------------------------
The rankfile that was used claimed that a host was either not allocated or
oversubscribed its slots. Please review your rank-slot assignments and your
host allocation to ensure a proper match. Also, some systems may require
using full hostnames, such as "host1.example.com" (instead of just plain "host1").

  Host: farm22-gpu0103
--------------------------------------------------------------------------
The rankfile that was used claimed that a host was either not allocated or
oversubscribed its slots. Please review your rank-slot assignments and your
host allocation to ensure a proper match. Also, some systems may require
using full hostnames, such as "host1.example.com" (instead of just plain "host1").

  Host: farm22-gpu0103
--------------------------------------------------------------------------
The affinity file provided in LSB_AFFINITY_HOSTFILE could not be found:

  File: /tmp/1727431443.3130.hostAffinityFile

We cannot continue.
--------------------------------------------------------------------------
The affinity file provided in LSB_AFFINITY_HOSTFILE could not be found:

  File: /tmp/1727431443.3130.hostAffinityFile

We cannot continue.
--------------------------------------------------------------------------
prterun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

  Process name: [prterun-farm22-gpu0103-226592@1,0]
  Exit code: 213

I tried to unset and re-export LSB_AFFINITY_HOSTFILE, but it has not made a difference.

fabiosanger commented 1 day ago

I managed to get it to work by using unset LSB_AFFINITY_HOSTFILE,

but I now get the following error:

[W socket.cpp:464] [c10d] The server socket has failed to bind to [::]:29700 (errno: 98 - Address already in use).
[W socket.cpp:464] [c10d] The server socket has failed to bind to 0.0.0.0:29700 (errno: 98 - Address already in use).
[E socket.cpp:500] [c10d] The server socket has failed to listen on any local network address.

Do I need to use the actual machine IP, or would the alias be fine?
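As a quick local check (just a sketch, with the host and port taken from the logs above, not part of the runner), something like this confirms whether the alias resolves and whether anything is already bound to the rendezvous port:

import socket

MASTER_ADDR = "farm22-gpu0104"   # an alias is fine as long as every node resolves it
MASTER_PORT = 29700

print("resolves to:", socket.gethostbyname(MASTER_ADDR))

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    try:
        s.bind(("", MASTER_PORT))
        print(f"port {MASTER_PORT} is free on this node")
    except OSError as exc:       # errno 98: something is already listening here
        print(f"port {MASTER_PORT} unavailable: {exc}")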

fabiosanger commented 10 hours ago

I am not sure I have done the right thing. I added the following function:

import os

import torch
import torch.distributed as dist
from loguru import logger  # assumed: the "DEBUG | ..." log format below is loguru's default


def initialize_distributed(backend='nccl'):
    if dist.is_initialized():
        return

    # Pick up rank information from whichever launcher started this process
    if "LOCAL_RANK" in os.environ:
        # Environment variables set by torch.distributed.launch or torchrun
        LOCAL_RANK = int(os.environ["LOCAL_RANK"])
        WORLD_SIZE = int(os.environ["WORLD_SIZE"])
        WORLD_RANK = int(os.environ["RANK"])
    elif "OMPI_COMM_WORLD_LOCAL_RANK" in os.environ:
        # Environment variables set by mpirun
        LOCAL_RANK = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])
        WORLD_SIZE = int(os.environ["OMPI_COMM_WORLD_SIZE"])
        WORLD_RANK = int(os.environ["OMPI_COMM_WORLD_RANK"])
    else:
        raise RuntimeError("No torchrun or Open MPI rank variables found in the environment")

    # Ensure each local rank drives its own GPU
    if backend == 'nccl' and torch.cuda.is_available():
        torch.cuda.set_device(LOCAL_RANK)

    master_addr = os.environ.get('MASTER_ADDR', 'localhost')
    master_port = os.environ.get('MASTER_PORT', '29500')

    # Log the initialization message
    logger.debug(f"Initializing process group with MASTER_ADDR={master_addr}, "
                 f"MASTER_PORT={master_port}, RANK={WORLD_RANK}, WORLD_SIZE={WORLD_SIZE}")

    dist.init_process_group(
        backend=backend,  # 'nccl' for GPU training
        init_method=f"tcp://{master_addr}:{master_port}",
        rank=WORLD_RANK,
        world_size=WORLD_SIZE,
    )
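The function gets called once per process from the entry point; a simplified sketch of that wiring (not the actual contents of runner.py, which does more than this):

# Usage sketch: each process started by mpirun sets up the process group
# once, then torch.distributed collectives are available.
if __name__ == "__main__":
    initialize_distributed(backend="nccl")

    rank = dist.get_rank()
    world_size = dist.get_world_size()
    logger.info(f"rank {rank}/{world_size} joined the process group")

    dist.barrier()                  # cheap collective to confirm connectivity
    dist.destroy_process_group()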

and now I get the following stack trace:

File "/software/isg/users/fg12/envs/.dna-mlm/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 174, in _create_c10d_store return TCPStore( torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29700 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:29700 (errno: 98 - Address already in use). 2024-09-27 21:11:34.765 | DEBUG | main:initialize_distributed:93 - Initializing process group with MASTER_ADDR=farm22-gpu0104, MASTER_PORT=29700, RANK=3, WORLD_SIZE=4 2024-09-27 21:11:34.769 | DEBUG | main:initialize_distributed:93 - Initializing process group with MASTER_ADDR=farm22-gpu0104, MASTER_PORT=29700, RANK=2, WORLD_SIZE=4 2024-09-27 21:11:34.794 | DEBUG | main:initialize_distributed:93 - Initializing process group with MASTER_ADDR=farm22-gpu0104, MASTER_PORT=29700, RANK=2, WORLD_SIZE=4 2024-09-27 21:11:34.797 | DEBUG | main:initialize_distributed:93 - Initializing process group with MASTER_ADDR=farm22-gpu0104, MASTER_PORT=29700, RANK=2, WORLD_SIZE=4 2024-09-27 21:11:34.837 | DEBUG | main:initialize_distributed:93 - Initializing process group with MASTER_ADDR=farm22-gpu0104, MASTER_PORT=29700, RANK=3, WORLD_SIZE=4 2024-09-27 21:11:34.906 | DEBUG | main:initialize_distributed:93 - Initializing process group with MASTER_ADDR=farm22-gpu0104, MASTER_PORT=29700, RANK=2, WORLD_SIZE=4