PR https://github.com/oap-project/raydp/pull/287 only set the Spark master resource on the Spark master actor. The other actor that makes up the driver infrastructure, `RayAppMaster`, does not receive the config, so it can be scheduled on a worker node and killed along with that worker, leaving the Spark job stuck with the following error:
```
23/03/02 01:55:38 ERROR TransportClient: Failed to send RPC RPC 5816546953779296894 to /10.191.56.127:45851: io.netty.channel.StacklessClosedChannelException
io.netty.channel.StacklessClosedChannelException
	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object, ChannelPromise)(Unknown Source)
23/03/02 01:55:38 WARN ExecutorAllocationManager: Unable to reach the cluster manager to request more executors!
```
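For background, Ray pins an actor to specific nodes when the actor requests a custom resource that only those nodes advertise; this is the mechanism the resource config relies on. A minimal sketch, assuming a `spark_master` resource registered on the head node via `ray start --resources='{"spark_master": 1}'` (the resource name and the `AppMasterStub` class are illustrative, not RayDP code):

```python
import ray

ray.init(address="auto")

@ray.remote
class AppMasterStub:
    """Stand-in for a driver-side actor such as RayAppMaster."""
    def ping(self) -> str:
        return "alive"

# Requesting the custom resource keeps the actor off worker nodes,
# so it cannot be killed when a worker node goes away.
master = AppMasterStub.options(resources={"spark_master": 1}).remote()
print(ray.get(master.ping.remote()))
```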
This PR makes the following changes:
- Renames the config `spark.ray.raydp_spark_master.resource.*` to `spark.ray.raydp_spark_master.actor.resource.*`, which is a bit less confusing now that we have two actors.
- Sets the custom resources on both driver actors (see the first sketch after this list).
- Fixes a bug in the raydp jar packaging in `setup.py`.
- Fixes a bug in `startUpAppMaster` where `sparkProps` might be a Python dict of type `Dict[str, Any]`, i.e. with non-string values (see the second sketch after this list).
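To illustrate the renamed config, a minimal sketch of passing it through `raydp.init_spark` (the `head` resource name and its value are assumptions, not defaults):

```python
import ray
import raydp

ray.init(address="auto")

# With this PR, the actor resource config applies to both driver actors
# (the Spark master actor and RayAppMaster), not just the former.
spark = raydp.init_spark(
    app_name="example",
    num_executors=2,
    executor_cores=2,
    executor_memory="4GB",
    configs={"spark.ray.raydp_spark_master.actor.resource.head": "1"},
)
```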
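On the `sparkProps` fix, a hedged sketch of the kind of normalization involved before the properties cross into the JVM (the helper name is hypothetical; the actual fix lives in the RayDP sources):

```python
from typing import Any, Dict

def normalize_spark_props(props: Dict[str, Any]) -> Dict[str, str]:
    """Spark conf values must reach the JVM as strings, so coerce
    ints, floats, bools, etc. before handing them to the app master."""
    return {key: str(value) for key, value in props.items()}

# A non-string value such as 2 would otherwise fail on the Java side.
assert normalize_spark_props({"spark.executor.cores": 2}) == {
    "spark.executor.cores": "2"
}
```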