ray-project / rayfed

A multiple parties joint, distributed execution engine based on Ray, to help build your own federated learning frameworks in minutes.
https://rayfed.readthedocs.io
Apache License 2.0
92 stars 21 forks

Provide a forced shutdown option. #206

Closed zhouaihui closed 10 months ago

zhouaihui commented 10 months ago

What happened

Demo code.

import fed

@fed.remote
def error_fun():
    raise Exception('By design')

@fed.remote
def random():
    return [1, 2]

@fed.remote
def foo(a):
    print(a)

try:
    # Alice hits an error, broadcasts it to bob, and then exits.
    a = error_fun.party('alice').remote()
    b = foo.party('bob').remote(a)

    # Alice never executes the following code.
    data = random.party('alice').remote()
    c = foo.party('bob').remote(data)
    # Bob is going to send `c` to alice, but alice will never send `data`
    # to bob because alice has already exited.
    foo.party('alice').remote(c)

    # Bob gets the error here.
    fed.get(b)

finally:
    # Bob blocks here because its data sending hangs.
    fed.shutdown()

A possible solution

Add a forced option to fed.shutdown, e.g. fed.shutdown(forced=False). If forced==True, shut down anyway and ignore any pending data sending.
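To make the proposal concrete, here is a minimal sketch of what a forced flag could look like. It is not rayfed code: PendingSends and this shutdown function are hypothetical stand-ins, and the timeout parameter is an assumption added only so the demo does not block forever.

```python
import threading


class PendingSends:
    """Hypothetical tracker for outbound sends that have not completed yet."""

    def __init__(self):
        self._count = 0
        self._cond = threading.Condition()

    def started(self):
        with self._cond:
            self._count += 1

    def finished(self):
        with self._cond:
            self._count -= 1
            self._cond.notify_all()

    def wait_drained(self, timeout=None):
        # True once no sends are pending, False if the timeout expires first.
        with self._cond:
            return self._cond.wait_for(lambda: self._count == 0, timeout=timeout)


def shutdown(pending, forced=False, timeout=None):
    """Illustrative only: this is not the real fed.shutdown()."""
    if forced:
        # Skip the wait entirely, even if sends are still pending.
        return "forced"
    if pending.wait_drained(timeout):
        return "drained"
    return "timed_out"


if __name__ == "__main__":
    pending = PendingSends()
    pending.started()  # a send that never completes, like bob's in the demo
    print(shutdown(pending, forced=False, timeout=0.1))
    print(shutdown(pending, forced=True))
```

With forced=False and no timeout, the call would block on the stuck send just as bob does in the demo above; forced=True returns immediately.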

jovany-wang commented 10 months ago

fed.shutdown(forced=False) or fed.force_shutdown()?

zhouaihui commented 10 months ago

Could we hide the forced behavior instead? Once bob receives an error from alice, the job is already meaningless. So bob can mark the job as failed when it receives an error from alice, and then skip waiting for data sending during shutdown.

zhouaihui commented 10 months ago

Receiving an error from a peer and hitting an error in our own data sending are actually the same case. We should not wait for data sending if an error occurred, whether it came from alice itself or from bob, since the job is already meaningless.
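A rough sketch of the hidden variant, assuming a per-party failure flag that both error paths set. JobStatus, on_peer_error, on_local_send_error and this shutdown are hypothetical names for illustration, not rayfed APIs.

```python
class JobStatus:
    """Hypothetical per-party job status; not part of the rayfed API."""

    def __init__(self):
        self.failed = False
        self.failure_source = None

    def mark_failed(self, source):
        # Keep only the first failure; later errors do not change the outcome.
        if not self.failed:
            self.failed = True
            self.failure_source = source


def on_peer_error(status):
    # Called when an error object arrives from another party.
    status.mark_failed("peer")


def on_local_send_error(status):
    # Called when this party's own data sending fails.
    status.mark_failed("self")


def shutdown(status, wait_for_sending):
    """Wait for pending sends only when the job is still healthy.

    Returns True if the wait was performed, False if it was skipped.
    """
    if status.failed:
        return False
    wait_for_sending()
    return True
```

Both handlers go through the same mark_failed path, so shutdown does not need to know where the error came from and the caller never passes a forced flag.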

jovany-wang commented 10 months ago

Receiving an error from a peer and hitting an error in our own data sending are actually the same case. We should not wait for data sending if an error occurred, whether it came from alice itself or from bob, since the job is already meaningless.

+1