oap-project / raydp

RayDP provides simple APIs for running Spark on Ray and integrating Spark with AI libraries.
Apache License 2.0

Rename Spark master actor resource config and apply to both actors #313

Closed pang-wu closed 1 year ago

pang-wu commented 1 year ago

PR https://github.com/oap-project/raydp/pull/287 applied the Spark master resource config only to the Spark master actor. The other actor that makes up the driver infrastructure, RayAppMaster, does not receive this config. That actor can therefore be scheduled on a worker node and killed along with the worker, leaving the Spark job stuck with the following error:

23/03/02 01:55:38 ERROR TransportClient: Failed to send RPC RPC 5816546953779296894 to /10.191.56.127:45851: io.netty.channel.StacklessClosedChannelException
io.netty.channel.StacklessClosedChannelException
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object, ChannelPromise)(Unknown Source)
23/03/02 01:55:38 WARN ExecutorAllocationManager: Unable to reach the cluster manager to request more executors!
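
To illustrate the idea of applying one resource config to both driver-side actors, here is a minimal sketch in plain Python. The config prefix, the `head_node` custom resource, and the helper names are hypothetical and chosen only for this example; they are not taken from RayDP's code. In Ray, a mapping like the one produced here would typically be passed to an actor through `.options(num_cpus=..., resources={...})`.

```python
# Hypothetical config prefix, used only for illustration; the key name
# RayDP actually uses is not stated in this thread.
MASTER_RESOURCE_PREFIX = "spark.ray.master.actor.resource."


def master_actor_resources(configs, prefix=MASTER_RESOURCE_PREFIX):
    """Collect resource requirements from config keys that start with prefix."""
    return {
        key[len(prefix):]: float(value)
        for key, value in configs.items()
        if key.startswith(prefix)
    }


def driver_actor_options(configs):
    """Return the same resource requirements for both driver-side actors."""
    resources = master_actor_resources(configs)
    # Each actor gets its own copy of the same requirements, so neither
    # can be placed on a node that lacks them.
    return {
        "spark_master": dict(resources),
        "RayAppMaster": dict(resources),
    }


configs = {
    MASTER_RESOURCE_PREFIX + "CPU": "0",
    # Assumed custom resource that only the head node advertises.
    MASTER_RESOURCE_PREFIX + "head_node": "1",
    # Unrelated Spark setting; ignored by the helper.
    "spark.executor.cores": "2",
}
options = driver_actor_options(configs)
print(options)
```

Because both actors request a resource that only the head node offers, the scheduler cannot place either of them on a worker node that may later be removed.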

This PR renames the Spark master actor resource config and applies it to both actors.

kira-lin commented 1 year ago

LGTM