ray-project / rayfed

A multiple parties joint, distributed execution engine based on Ray, to help build your own federated learning frameworks in minutes.
https://rayfed.readthedocs.io
Apache License 2.0

[RFC] support a global barrier to sync the process in all parties. #66

Open zhaocaibei123 opened 1 year ago

zhaocaibei123 commented 1 year ago

Currently, it is hard to coordinate execution across parties because their workloads are asymmetrical. I propose a global barrier, global_sync, that blocks until every party has reached it.

o1 = f.party("ALICE").remote()
o2 = g.party("BOB").remote()
global_sync()
h.party("ALICE").remote()  # `h` is invoked only after global_sync has been called in all parties.
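
To make the intended semantics concrete, here is a minimal local sketch that does not use RayFed at all. It simulates two parties as threads with asymmetrical workloads and uses `threading.Barrier` as a stand-in for the proposed `global_sync`. The helper names (`run_party`, `simulate`) and the sleep durations are hypothetical and chosen only for illustration; a real cross-party barrier would need network messaging between parties.

```python
import threading
import time

PARTIES = ["ALICE", "BOB"]

# Stand-in for the proposed global_sync(): a barrier shared by all parties.
# Assumption: in RayFed this would be a cross-party protocol, not a local object.
barrier = threading.Barrier(len(PARTIES))

events = []
events_lock = threading.Lock()


def record(message):
    """Append an event to the shared log in a thread-safe way."""
    with events_lock:
        events.append(message)


def run_party(name, workload_seconds):
    """Simulate one party: do uneven work, wait at the barrier, then continue."""
    time.sleep(workload_seconds)   # asymmetrical workload before the barrier
    record(f"{name}:before")
    barrier.wait()                 # plays the role of global_sync()
    record(f"{name}:after")        # runs only once every party has arrived


def simulate():
    """Run all parties once and return the ordered event log."""
    events.clear()
    workloads = [0.01, 0.1]        # ALICE finishes early, BOB finishes late
    threads = [
        threading.Thread(target=run_party, args=(name, seconds))
        for name, seconds in zip(PARTIES, workloads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return list(events)


if __name__ == "__main__":
    print(simulate())
```

Whatever the workloads are, every `before` event is logged ahead of any `after` event, which is the ordering guarantee the barrier is meant to provide.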

jovany-wang commented 1 year ago

CC @ray-project/rayfed-dev for more discussion of the risks.

jovany-wang commented 1 year ago

Hi @zhaocaibei123, are you willing to contribute this feature?

jovany-wang commented 1 year ago

It seems that we should also support infinite retries in fed.init(), for example:

fed.init(infinite_retry=True)
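
For discussion, this is a generic sketch of what "infinite retry" could mean: keep attempting a connection until it succeeds, with capped exponential backoff. It is not RayFed code. `infinite_retry` is only the parameter name suggested above, and `retry_forever` and `make_flaky_connect` are hypothetical helpers invented for this illustration.

```python
import time


def retry_forever(connect, delay_seconds=0.01, max_delay_seconds=1.0):
    """Call connect() until it succeeds; return (result, number_of_attempts).

    The wait between attempts doubles after each failure, up to max_delay_seconds.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            result = connect()
            return result, attempts
        except ConnectionError:
            time.sleep(delay_seconds)
            delay_seconds = min(delay_seconds * 2, max_delay_seconds)


def make_flaky_connect(failures):
    """Build a fake connect() that fails `failures` times, then succeeds."""
    state = {"remaining": failures}

    def connect():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise ConnectionError("peer not ready")
        return "connected"

    return connect


if __name__ == "__main__":
    result, attempts = retry_forever(make_flaky_connect(3))
    print(result, attempts)
```

One risk worth noting for the discussion: with no retry limit, a party whose peer never starts will block forever, so a barrier built on top of this would also never release.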

jovany-wang commented 1 year ago

I don't have a concrete proposal yet.