ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io

Ray Train: incorrect local_rank calculated for every worker #45799

Open prannaykhtech opened 1 month ago

prannaykhtech commented 1 month ago

What happened + What you expected to happen

Logs show incorrect assignment of ranks:

(TorchTrainer pid=16518) Started distributed worker processes:
(TorchTrainer pid=16518) - (ip=10.114.131.76, pid=16729) world_rank=0, local_rank=0, node_rank=0
(TorchTrainer pid=16518) - (ip=10.114.131.76, pid=16730) world_rank=1, local_rank=1, node_rank=0
(TorchTrainer pid=16518) - (ip=10.114.131.76, pid=16731) world_rank=2, local_rank=2, node_rank=0
(TorchTrainer pid=16518) - (ip=10.114.131.76, pid=16732) world_rank=3, local_rank=3, node_rank=0
(TorchTrainer pid=16518) - (ip=10.114.131.76, pid=16733) world_rank=4, local_rank=4, node_rank=0
(TorchTrainer pid=16518) - (ip=10.114.131.76, pid=16734) world_rank=5, local_rank=5, node_rank=0
(TorchTrainer pid=16518) - (ip=10.114.131.76, pid=16735) world_rank=6, local_rank=6, node_rank=0
(TorchTrainer pid=16518) - (ip=10.114.131.76, pid=16736) world_rank=7, local_rank=7, node_rank=0
(TorchTrainer pid=16518) - (ip=10.114.141.217, pid=429) world_rank=8, local_rank=0, node_rank=1
(TorchTrainer pid=16518) - (ip=10.114.129.98, pid=387) world_rank=9, local_rank=0, node_rank=2
(TorchTrainer pid=16518) - (ip=10.114.131.89, pid=429) world_rank=10, local_rank=0, node_rank=3
(TorchTrainer pid=16518) - (ip=10.114.141.218, pid=415) world_rank=11, local_rank=0, node_rank=4
(TorchTrainer pid=16518) - (ip=10.114.129.91, pid=428) world_rank=12, local_rank=0, node_rank=5
(TorchTrainer pid=16518) - (ip=10.114.130.86, pid=429) world_rank=13, local_rank=0, node_rank=6
(TorchTrainer pid=16518) - (ip=10.114.129.96, pid=429) world_rank=14, local_rank=0, node_rank=7
(TorchTrainer pid=16518) - (ip=10.114.131.93, pid=416) world_rank=15, local_rank=0, node_rank=8
(TorchTrainer pid=16518) - (ip=10.114.129.95, pid=429) world_rank=16, local_rank=0, node_rank=9
(TorchTrainer pid=16518) - (ip=10.114.130.85, pid=429) world_rank=17, local_rank=0, node_rank=10
(TorchTrainer pid=16518) - (ip=10.114.141.212, pid=429) world_rank=18, local_rank=0, node_rank=11
(TorchTrainer pid=16518) - (ip=10.114.131.91, pid=428) world_rank=19, local_rank=0, node_rank=12
(TorchTrainer pid=16518) - (ip=10.114.141.213, pid=429) world_rank=20, local_rank=0, node_rank=13
(TorchTrainer pid=16518) - (ip=10.114.141.216, pid=415) world_rank=21, local_rank=0, node_rank=14
(TorchTrainer pid=16518) - (ip=10.114.131.90, pid=429) world_rank=22, local_rank=0, node_rank=15
(TorchTrainer pid=16518) - (ip=10.114.131.96, pid=415) world_rank=23, local_rank=0, node_rank=16
(TorchTrainer pid=16518) - (ip=10.114.129.97, pid=429) world_rank=24, local_rank=0, node_rank=17
(TorchTrainer pid=16518) - (ip=10.114.129.90, pid=429) world_rank=25, local_rank=0, node_rank=18
(TorchTrainer pid=16518) - (ip=10.114.131.97, pid=372) world_rank=26, local_rank=0, node_rank=19
(TorchTrainer pid=16518) - (ip=10.114.129.93, pid=429) world_rank=27, local_rank=0, node_rank=20
(TorchTrainer pid=16518) - (ip=10.114.131.95, pid=429) world_rank=28, local_rank=0, node_rank=21
(TorchTrainer pid=16518) - (ip=10.114.129.94, pid=429) world_rank=29, local_rank=0, node_rank=22
(TorchTrainer pid=16518) - (ip=10.114.141.214, pid=429) world_rank=30, local_rank=0, node_rank=23
(TorchTrainer pid=16518) - (ip=10.114.141.215, pid=415) world_rank=31, local_rank=0, node_rank=24

I would have expected node_rank to be in the range [0, 3] and local_rank in the range [0, 7], since I have 8 Ray workers per machine (32 workers across 4 machines).
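
For illustration, this is the mapping I expected, assuming the 32 workers get packed 8 per node onto 4 nodes (plain arithmetic, not Ray's actual assignment code):

    # Expected rank layout for 32 workers packed 8 per node (illustration only)
    NUM_WORKERS, WORKERS_PER_NODE = 32, 8

    for world_rank in range(NUM_WORKERS):
        node_rank = world_rank // WORKERS_PER_NODE   # expected range [0, 3]
        local_rank = world_rank % WORKERS_PER_NODE   # expected range [0, 7]
        print(f"world_rank={world_rank}, local_rank={local_rank}, node_rank={node_rank}")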

Versions / Dependencies

None

Reproduction script

    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    # worker_fn (the per-worker training loop) and training_config are defined elsewhere.
    trainer = TorchTrainer(
        worker_fn,
        train_loop_config=training_config,
        scaling_config=ScalingConfig(num_workers=32, use_gpu=True, resources_per_worker={"cpu": 18}),
    )
    trainer.fit()

Tough to provide more than this without the env.
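
If it helps, I can also log what each worker sees from inside the training loop. A minimal sketch, assuming Ray 2.x's ray.train.get_context() API (worker_fn here is just a placeholder for the real training function):

    from ray import train

    def worker_fn(config):
        # Log the rank assignment as seen by this worker
        ctx = train.get_context()
        print(
            f"world_rank={ctx.get_world_rank()}, "
            f"local_rank={ctx.get_local_rank()}, "
            f"node_rank={ctx.get_node_rank()}, "
            f"local_world_size={ctx.get_local_world_size()}"
        )
        # ... actual training loop goes here ...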

Issue Severity

High

justinvyu commented 1 week ago

@prannaykhtech All of the workers starting from 10.114.141.217 (world_rank=8 onward) have different IPs. Are the node IPs correct here? If so, then the rank assignment seems to make sense: each of those workers is the only one on its node, so it gets local_rank=0 and its own node_rank.

Could you provide some more information about the cluster configuration? What node types are available?
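
For reference, one quick way to dump what each node in the cluster advertises is to query ray.nodes(); running `ray status` on the head node gives a similar summary. A minimal sketch:

    import ray

    ray.init(address="auto")  # attach to the running cluster

    # Print each live node's IP and its advertised resources (CPU, GPU, memory, custom)
    for node in ray.nodes():
        if node["Alive"]:
            print(node["NodeManagerAddress"], node["Resources"])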