Closed jiafuzha closed 1 year ago
@carsonwang please help review.
@kira-lin I just updated the description.
I feel like the way we set options (configurations) for different processes is very messy now. The logic is not clear and is scattered around our code.
We should write down how configurations are spread to our processes. For example, we first take input from users and set configurations for our JVM (which runs RayAppMaster) and the Spark driver. The Spark executors' configuration is set by RayAppMaster; we should document how it does so, etc.
Does the Spark driver need these configs? If not, can we separate them from the native Spark ones in `init_spark`?
I'll add more docs for these configs. The Spark driver needs some of them. For users, `init_spark` is the only entry point to set configs.
I just addressed all the comments. Please review again.
@kira-lin one of the tests failed with `RuntimeError: [enforce fail at /Users/runner/work/pytorch/pytorch/pytorch/third_party/gloo/gloo/transport/uv/device.cc:153] rp != nullptr. Unable to find address for: Mac-1679480349858.local`.
Have you seen a similar issue before?
> For user, "init_spark" is the only entry for them to set config.

Yes, I wonder if we can have two parameters for this function: one for native Spark configs and one for ours.
> Did you see similar issue before?

No. Another mac test passed. Maybe it's some problem with the GitHub CI.
> For user, "init_spark" is the only entry for them to set config. Yes, I wonder if we can have two parameters for this function, one for native spark config, one for ours.

There is a subtlety here: some configs need to be prefixed with "spark.", otherwise they will be filtered out by Spark and thus cannot be propagated to the Spark JVMs.
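To illustrate the filtering constraint, here is a minimal, purely hypothetical sketch of splitting a flat config dict (the helper name and logic are assumptions for illustration, not RayDP's actual implementation):

```python
def split_configs(configs):
    """Split a flat config dict into Spark-native and other parts.

    Keys prefixed with "spark." survive Spark's own config filtering
    and are propagated to the Spark JVMs; un-prefixed keys would be
    dropped by Spark, so they must be handled separately (or prefixed
    before the handoff).
    """
    spark_conf = {k: v for k, v in configs.items() if k.startswith("spark.")}
    other_conf = {k: v for k, v in configs.items() if not k.startswith("spark.")}
    return spark_conf, other_conf
```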
> Did you see similar issue before? No. Another mac test has passed. Maybe it's some problem with github CI.

OK, I assume there is no issue in our code then.
@kira-lin Besides the comment below, do you have any other concerns about this PR?

> Yes, I wonder if we can have two parameters for this function, one for native spark config, one for ours.

Since it is an API change, @carsonwang, what is your take on changing the `init_spark()` API?
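To make the proposed two-parameter API concrete, a hypothetical merge helper might look like this (the name, signature, and internal prefixing are assumptions for discussion, not the actual RayDP API):

```python
def merge_confs(spark_confs=None, raydp_confs=None):
    """Merge native Spark configs with RayDP-specific ones.

    Hypothetical helper: RayDP-specific keys get a "spark." prefix
    internally, so that Spark does not filter them out before they
    can be propagated to the Spark JVMs.
    """
    merged = dict(spark_confs or {})
    for key, value in (raydp_confs or {}).items():
        merged[key if key.startswith("spark.") else "spark." + key] = value
    return merged
```

With such a split, users could pass native Spark options and RayDP options separately, and the prefixing subtlety would be hidden inside the library.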
LGTM. One last question: what will happen if `fault_tolerant_mode` is set to True? In that case, the Spark driver will also be connected to Ray.
> LGTM. One last question: what will happen if fault_tolerant_mode is set to True? In that case, spark driver will also be connected to Ray.

Let me check.
I added the line below in the `connectToRay` method after `Ray.init()`, since Ray initializes logging in its own way instead of Spark's:

```scala
SparkContext.getOrCreate().setLogLevel("WARN")
```

It restores the driver's log level to WARN. Without the line above, I got the additional output below with `fault_tolerant_mode=True`.
```
2023-03-27 06:34:35,970 INFO SecurityManager [Thread-4]: Changing view acls to: jiafu
2023-03-27 06:34:35,970 INFO SecurityManager [Thread-4]: Changing modify acls to: jiafu
2023-03-27 06:34:35,970 INFO SecurityManager [Thread-4]: Changing view acls groups to:
2023-03-27 06:34:35,971 INFO SecurityManager [Thread-4]: Changing modify acls groups to:
2023-03-27 06:34:35,971 INFO SecurityManager [Thread-4]: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(jiafu); groups with view permissions: Set(); users with modify permissions: Set(jiafu); groups with modify permissions: Set()
2023-03-27 06:34:35,987 INFO Utils [Thread-4]: Successfully started service 'RAY_RPC_ENV' on port 43805.
2023-03-27 06:34:39,026 INFO CoarseGrainedSchedulerBackend$DriverEndpoint [dispatcher-CoarseGrainedScheduler]: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.239.34.11:50312) with ID 0, ResourceProfileId 0
2023-03-27 06:34:39,028 INFO CoarseGrainedSchedulerBackend$DriverEndpoint [dispatcher-CoarseGrainedScheduler]: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.239.34.11:50302) with ID 1, ResourceProfileId 0
2023-03-27 06:34:39,088 INFO BlockManagerMasterEndpoint [dispatcher-BlockManagerMaster]: Registering block manager 10.239.34.11:44525 with 2.1 GiB RAM, BlockManagerId(0, 10.239.34.11, 44525, None)
2023-03-27 06:34:39,091 INFO BlockManagerMasterEndpoint [dispatcher-BlockManagerMaster]: Registering block manager 10.239.34.11:35787 with 2.1 GiB RAM, BlockManagerId(1, 10.239.34.11, 35787, None)
```
thanks.
LGTM. Thanks
Basic Idea: