oap-project / raydp

RayDP provides simple APIs for running Spark on Ray and integrating Spark with AI libraries.
Apache License 2.0

[raydp-317] reconcile slf4j and log4j versions between spark and ray #318

Closed jiafuzha closed 1 year ago

jiafuzha commented 1 year ago

Basic Idea:

  1. Use an agent to capture the annoying SLF4J warning message so that it doesn't show up in the shell.
  2. Use our own StaticLoggerBinder to choose which underlying log4j framework SLF4J binds to.
  3. Apply a divide-and-conquer strategy: different bindings for the Spark driver and for the Spark executors inside Ray workers, depending on the actual Spark and Ray versions.
jiafuzha commented 1 year ago

@carsonwang please help review.

jiafuzha commented 1 year ago

@kira-lin I just updated the description.

jiafuzha commented 1 year ago

> I feel like our way to set options (configurations) for different processes is very messy now. The logic is not clear, and it is scattered around the code.
>
> We could write down how configurations are spread to our processes. For example, we first take input from users and set configurations for our JVM (which runs RayAppMaster) and for the Spark driver. The Spark executors' configuration is set by RayAppMaster; how does it do that, and so on.
>
> Does the Spark driver need these configs? If not, can we separate them from the native Spark ones in init_spark?

I'll add more docs for these configs. The Spark driver needs some of them. For users, init_spark is the only entry point to set configs.
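To make that flow concrete, here is a minimal usage sketch (assuming init_spark accepts a configs dict as in the current API; the second key below is a hypothetical placeholder, not a real RayDP option):

```python
import ray
import raydp

ray.init()

# Configuration flow described above: the user passes everything through
# init_spark; RayDP applies it to its own JVM (RayAppMaster) and the Spark
# driver, and RayAppMaster forwards the relevant entries to the executors
# running inside Ray workers.
spark = raydp.init_spark(
    app_name="raydp_config_demo",
    num_executors=2,
    executor_cores=2,
    executor_memory="2GB",
    configs={
        # Native Spark config, consumed by the driver/executors as usual.
        "spark.sql.shuffle.partitions": "8",
        # RayDP/logging-related option; this key name is a hypothetical
        # placeholder used only for illustration.
        "spark.raydp.example.logging.option": "value",
    },
)
```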

jiafuzha commented 1 year ago

I just addressed all comments. Please help review again.

jiafuzha commented 1 year ago

@kira-lin one of the tests failed with "RuntimeError: [enforce fail at /Users/runner/work/pytorch/pytorch/pytorch/third_party/gloo/gloo/transport/uv/device.cc:153] rp != nullptr. Unable to find address for: Mac-1679480349858.local".

Did you see similar issue before?

kira-lin commented 1 year ago

For user, "init_spark" is the only entry for them to set config. Yes, I wonder if we can have two parameters for this function, one for native spark config, one for ours.

kira-lin commented 1 year ago

> Did you see similar issue before?

No. Another macOS test has passed. Maybe it's some problem with the GitHub CI.

jiafuzha commented 1 year ago

For user, "init_spark" is the only entry for them to set config. Yes, I wonder if we can have two parameters for this function, one for native spark config, one for ours.

It has subtlety here since some configs need to be prefixed with "spark." otherwise they'll be filtered out by spark and thus cannot be propagated in spark JVMs.
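To illustrate the filtering (a sketch only; both keys below are hypothetical and serve purely to contrast a prefixed option with an unprefixed one):

```python
# Only keys that start with "spark." survive Spark's config handling and reach
# the executor JVMs inside the Ray workers; anything else is dropped before
# propagation, which is why RayDP-specific options also carry the prefix.
configs = {
    "spark.raydp.example.option": "kept",     # hypothetical key, propagated
    "raydp.example.option": "filtered out",   # hypothetical key, dropped by Spark
}
```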

jiafuzha commented 1 year ago

> No. Another macOS test has passed. Maybe it's some problem with the GitHub CI.

OK, I assume there is no issue in our code then.

jiafuzha commented 1 year ago

@kira-lin Besides the comment below, do you have any other concerns about this PR?

> Yes, I wonder if we can have two parameters for this function: one for native Spark configs and one for ours.

Since this is an API change, @carsonwang, what are your thoughts on changing the init_spark() function?

kira-lin commented 1 year ago

LGTM. One last question: what will happen if fault_tolerant_mode is set to True? In that case, the Spark driver will also be connected to Ray.

jiafuzha commented 1 year ago

> One last question: what will happen if fault_tolerant_mode is set to True? In that case, the Spark driver will also be connected to Ray.

Let me check.

jiafuzha commented 1 year ago

> One last question: what will happen if fault_tolerant_mode is set to True? In that case, the Spark driver will also be connected to Ray.

I added the line below in the connectToRay method, after Ray.init(), since Ray initializes logging in its own way instead of Spark's:

`SparkContext.getOrCreate().setLogLevel("WARN")`

It restores the driver's log level to WARN. Without this line, I got the additional output below with fault_tolerant_mode=True:

2023-03-27 06:34:35,970 INFO SecurityManager [Thread-4]: Changing view acls to: jiafu
2023-03-27 06:34:35,970 INFO SecurityManager [Thread-4]: Changing modify acls to: jiafu
2023-03-27 06:34:35,970 INFO SecurityManager [Thread-4]: Changing view acls groups to:
2023-03-27 06:34:35,971 INFO SecurityManager [Thread-4]: Changing modify acls groups to:
2023-03-27 06:34:35,971 INFO SecurityManager [Thread-4]: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(jiafu); groups with view permissions: Set(); users with modify permissions: Set(jiafu); groups with modify permissions: Set()
2023-03-27 06:34:35,987 INFO Utils [Thread-4]: Successfully started service 'RAY_RPC_ENV' on port 43805.

2023-03-27 06:34:39,026 INFO CoarseGrainedSchedulerBackend$DriverEndpoint [dispatcher-CoarseGrainedScheduler]: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.239.34.11:50312) with ID 0, ResourceProfileId 0
2023-03-27 06:34:39,028 INFO CoarseGrainedSchedulerBackend$DriverEndpoint [dispatcher-CoarseGrainedScheduler]: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.239.34.11:50302) with ID 1, ResourceProfileId 0
2023-03-27 06:34:39,088 INFO BlockManagerMasterEndpoint [dispatcher-BlockManagerMaster]: Registering block manager 10.239.34.11:44525 with 2.1 GiB RAM, BlockManagerId(0, 10.239.34.11, 44525, None)
2023-03-27 06:34:39,091 INFO BlockManagerMasterEndpoint [dispatcher-BlockManagerMaster]: Registering block manager 10.239.34.11:35787 with 2.1 GiB RAM, BlockManagerId(1, 10.239.34.11, 35787, None)
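For context, a minimal usage sketch of the scenario in question (assuming fault_tolerant_mode is passed through init_spark, as discussed above):

```python
import ray
import raydp

ray.init()

# With fault_tolerant_mode=True the Spark driver also connects to Ray, so Ray
# re-initializes logging; the fix above restores the driver's log level to
# WARN right after that connection is made.
spark = raydp.init_spark(
    app_name="fault_tolerant_demo",
    num_executors=2,
    executor_cores=2,
    executor_memory="2GB",
    fault_tolerant_mode=True,
)
```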

thanks.

kira-lin commented 1 year ago

LGTM. Thanks