ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Ray Train: error in local ranks calculated for every worker #45799

Open prannaykhtech opened 5 months ago

prannaykhtech commented 5 months ago

What happened + What you expected to happen

The logs show an incorrect assignment of ranks:

(TorchTrainer pid=16518) Started distributed worker processes:
(TorchTrainer pid=16518) - (ip=10.114.131.76, pid=16729) world_rank=0, local_rank=0, node_rank=0
(TorchTrainer pid=16518) - (ip=10.114.131.76, pid=16730) world_rank=1, local_rank=1, node_rank=0
(TorchTrainer pid=16518) - (ip=10.114.131.76, pid=16731) world_rank=2, local_rank=2, node_rank=0
(TorchTrainer pid=16518) - (ip=10.114.131.76, pid=16732) world_rank=3, local_rank=3, node_rank=0
(TorchTrainer pid=16518) - (ip=10.114.131.76, pid=16733) world_rank=4, local_rank=4, node_rank=0
(TorchTrainer pid=16518) - (ip=10.114.131.76, pid=16734) world_rank=5, local_rank=5, node_rank=0
(TorchTrainer pid=16518) - (ip=10.114.131.76, pid=16735) world_rank=6, local_rank=6, node_rank=0
(TorchTrainer pid=16518) - (ip=10.114.131.76, pid=16736) world_rank=7, local_rank=7, node_rank=0
(TorchTrainer pid=16518) - (ip=10.114.141.217, pid=429) world_rank=8, local_rank=0, node_rank=1
(TorchTrainer pid=16518) - (ip=10.114.129.98, pid=387) world_rank=9, local_rank=0, node_rank=2
(TorchTrainer pid=16518) - (ip=10.114.131.89, pid=429) world_rank=10, local_rank=0, node_rank=3
(TorchTrainer pid=16518) - (ip=10.114.141.218, pid=415) world_rank=11, local_rank=0, node_rank=4
(TorchTrainer pid=16518) - (ip=10.114.129.91, pid=428) world_rank=12, local_rank=0, node_rank=5
(TorchTrainer pid=16518) - (ip=10.114.130.86, pid=429) world_rank=13, local_rank=0, node_rank=6
(TorchTrainer pid=16518) - (ip=10.114.129.96, pid=429) world_rank=14, local_rank=0, node_rank=7
(TorchTrainer pid=16518) - (ip=10.114.131.93, pid=416) world_rank=15, local_rank=0, node_rank=8
(TorchTrainer pid=16518) - (ip=10.114.129.95, pid=429) world_rank=16, local_rank=0, node_rank=9
(TorchTrainer pid=16518) - (ip=10.114.130.85, pid=429) world_rank=17, local_rank=0, node_rank=10
(TorchTrainer pid=16518) - (ip=10.114.141.212, pid=429) world_rank=18, local_rank=0, node_rank=11
(TorchTrainer pid=16518) - (ip=10.114.131.91, pid=428) world_rank=19, local_rank=0, node_rank=12
(TorchTrainer pid=16518) - (ip=10.114.141.213, pid=429) world_rank=20, local_rank=0, node_rank=13
(TorchTrainer pid=16518) - (ip=10.114.141.216, pid=415) world_rank=21, local_rank=0, node_rank=14
(TorchTrainer pid=16518) - (ip=10.114.131.90, pid=429) world_rank=22, local_rank=0, node_rank=15
(TorchTrainer pid=16518) - (ip=10.114.131.96, pid=415) world_rank=23, local_rank=0, node_rank=16
(TorchTrainer pid=16518) - (ip=10.114.129.97, pid=429) world_rank=24, local_rank=0, node_rank=17
(TorchTrainer pid=16518) - (ip=10.114.129.90, pid=429) world_rank=25, local_rank=0, node_rank=18
(TorchTrainer pid=16518) - (ip=10.114.131.97, pid=372) world_rank=26, local_rank=0, node_rank=19
(TorchTrainer pid=16518) - (ip=10.114.129.93, pid=429) world_rank=27, local_rank=0, node_rank=20
(TorchTrainer pid=16518) - (ip=10.114.131.95, pid=429) world_rank=28, local_rank=0, node_rank=21
(TorchTrainer pid=16518) - (ip=10.114.129.94, pid=429) world_rank=29, local_rank=0, node_rank=22
(TorchTrainer pid=16518) - (ip=10.114.141.214, pid=429) world_rank=30, local_rank=0, node_rank=23
(TorchTrainer pid=16518) - (ip=10.114.141.215, pid=415) world_rank=31, local_rank=0, node_rank=24

I would have expected node_rank to be in the range [0, 3] and local_rank to be in the range [0, 7], since I have 8 Ray workers per machine (32 workers across 4 machines).
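For reference, the ranks above are printed by the trainer at startup; they can also be checked from inside the training function via `ray.train.get_context()` (a minimal sketch, assuming a Ray Train 2.x API):

    import ray.train

    def worker_fn(config):
        # Each Ray Train worker can query its own ranks from the train context.
        ctx = ray.train.get_context()
        print(
            f"world_rank={ctx.get_world_rank()}, "
            f"local_rank={ctx.get_local_rank()}, "
            f"node_rank={ctx.get_node_rank()}, "
            f"local_world_size={ctx.get_local_world_size()}"
        )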

Versions / Dependencies

None

Reproduction script

    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    # worker_fn and training_config are defined elsewhere in my setup.
    trainer = TorchTrainer(
        worker_fn,
        train_loop_config=training_config,
        scaling_config=ScalingConfig(num_workers=32, use_gpu=True, resources_per_worker={"cpu": 18}),
    )

It's tough to provide more than this without the environment.
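For context, 8 workers can only share a node if that node has enough resources for their combined requests (here 8 x 18 = 144 CPUs plus 8 GPUs). Below is a minimal sketch of a scaling config where 8 workers per node would fit, assuming nodes with 8 GPUs and at least 64 CPUs; the CPU value is a placeholder, not a confirmed fix:

    from ray.train import ScalingConfig

    # Illustrative sketch only: 8 workers can share one node only if the node
    # has >= 8 GPUs and >= 8 * resources_per_worker["CPU"] CPUs available.
    # Ray's built-in resource keys are capitalized ("CPU", "GPU"); a lowercase
    # "cpu" is treated as a custom resource.
    scaling_config = ScalingConfig(
        num_workers=32,
        use_gpu=True,
        resources_per_worker={"CPU": 8},  # placeholder value that fits 8 workers on a 64-CPU node
        placement_strategy="PACK",        # the default: pack workers onto as few nodes as possible
    )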

Issue Severity

High

justinvyu commented 4 months ago

@prannaykhtech All of the workers from 10.114.141.217 onward have different IPs. Are the node IPs correct here? If so, the rank assignment seems to make sense.

Could you provide some more information about the cluster configuration? What node types are available?
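For example, a minimal sketch (assuming driver access to the running cluster) of how to list each node's IP and resource capacity, which would show whether the workers really landed on 25 different nodes; `ray status` on the head node gives a similar summary:

    import ray

    ray.init(address="auto")  # connect to the existing cluster

    # Print every alive node's IP and its resource capacities (CPU, GPU, ...).
    for node in ray.nodes():
        if node["Alive"]:
            print(node["NodeManagerAddress"], node["Resources"])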