[RFC] Support running in different modes.

jovany-wang commented 1 year ago

I'm proposing that support running rayfed job in single-controller mode.

I'd like to propose 2 options on how we startup the single-controller cluster and how we connect to the cluster and run our jobs.

option 1

Add a new cli toolkit to start the cluster, it just wrapper the ray cli toolkit, for example:

A. running single-controller mode

> rayfed start --head --mode=single-controller --party=ALICE  # node1, listening on 1.2.3.4:5555
> rayfed start --address="1.2.3.4:5555" --party=ALICE  # node2, connecting to the node1
> rayfed start --address="1.2.3.4:5555" --party=BOB  # node3, connecting to the node1
> rayfed start --address="1.2.3.4:5555" --party=BOB  # node4, connecting to the node1

And then, the job could be run in single controller mode automatically:

# main.py
fed.init(address="1.2.3.4:5555", xxx)
# Nothing need to be changed in this job script.

B. running multiple-controller mode

> rayfed start --head --mode=multiple-controller --party=ALICE  # node1, listening on 1.2.3.4:5555
> rayfed start --address="1.2.3.4:5555" --party=ALICE  # node2, connecting to the node1
> rayfed start --head --mode=multiple-controller --party=BOB  # node3, listening on 5.6.7.8:6666
> rayfed start --address="5.6.7.8:6666" --party=BOB  # node4, connecting to the node3

And then, you run the following script in 2 clusters:

# main.py
fed.init(address="1.2.3.4:5555", xxx)
# nothing need to be changed in this job script.

# in node2
> python main.py --party=ALICE

# in node3
> python main.py --party=BOB

option 2

No need to add a new toolkit, but we should tell users that add some extra arguments when starting up the Ray cluster. For example,

A. running single-controller mode

> ray start --head --resources={"_PARTY_ALICE", 9999}  # node1, listening on 1.2.3.4:5555
> ray start --address="1.2.3.4:5555" --resources={"_PARTY_ALICE", 9999}  # node2, connecting to the node1
> ray start --address="1.2.3.4:5555" --resources={"_PARTY_BOB", 9999}  # node3, connecting to the node1
> ray start --address="1.2.3.4:5555" --resources={"_PARTY_BOB", 9999}  # node4, connecting to the node1

And then, add the extra mode info when fed.init():

# main.py
fed.init(address="1.2.3.4:5555", mode="single-controller", xxx)
# Nothing need to be changed in this job script.

A. running multiple-controller mode

> ray start --head # node1, listening on 1.2.3.4:5555
> ray start --address="1.2.3.4:5555" # node2, connecting to the node1
> ray start --head # node3, listening on 5.6.7.8:6666
> ray start --address="5.6.7.8:6666" # node4, connecting to the node3

And then, add the extra mode info when fed.init()(And we could ignore it if we provide a default value):

# main.py
fed.init(address="1.2.3.4:5555", mode="multiple-controller", xxx)
# Nothing need to be changed in this job script.

jovany-wang commented 1 year ago

@ray-project/rayfed-dev CC

jovany-wang commented 1 year ago

Excepting single-controller and multi-controller mode, we also need to support simulation mode for launching thousands of ends.

jovany-wang commented 1 year ago

For SIMULATION mode, we might run actors in different parties in one process(one Ray worker process). For SINGLE_CONTROLLER mode, actors in different parties should be run in different Ray nodes. For MULTI_CONTROLLER mode, actors in different parties should be run in different Ray clusters.

ray-project / rayfed

[RFC] Support running in different modes. #29

option 1

option 2