Closed: SjeYinTeoIntel closed this issue 11 months ago
Hi @SjeYinTeoIntel,
We haven't supported Dynamic Allocation for Spark yet; support for it is still being planned. For now, please refer to this known issue (https://bigdl.readthedocs.io/en/latest/doc/Orca/Overview/known_issues.html#spark-dynamic-allocation) for details and a workaround.
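For reference, the `BarrierJobRunWithDynamicAllocationException` in the log spells out the workaround itself: set the Spark conf `spark.dynamicAllocation.enabled` to `false` when submitting the job. A minimal sketch (the application file name and any other flags are placeholders, not taken from this thread):

```shell
# Turn dynamic allocation off so Orca's barrier-mode Ray launch can be scheduled.
spark-submit \
  --conf spark.dynamicAllocation.enabled=false \
  your_orca_app.py
```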
Best, Cengguang
Hi CengGuang, Noted and thanks!
Regards, sjeyin
From: Cengguang Zhang, Sent: Monday, September 11, 2023 9:56 AM, Subject: Re: [intel-analytics/BigDL] DynamicAllocation parameter not supported. (Issue #8934)
Issue: DynamicAllocation parameter not supported.

```
23-09-08 04:44:06 [Thread-4] WARN Engine$:470 - Engine.init: spark.driver.extraJavaOptions should be -Dlog4j2.info, but it is -Dcom.amazonaws.services.s3.enableV4=true. For details please check https://bigdl-project.github.io/master/#APIGuide/Engine/
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.Sample
BigDLBasePickler registering: bigdl.dllib.utils.common Sample
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.EvaluatedResult
BigDLBasePickler registering: bigdl.dllib.utils.common EvaluatedResult
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.JTensor
BigDLBasePickler registering: bigdl.dllib.utils.common JTensor
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.JActivity
BigDLBasePickler registering: bigdl.dllib.utils.common JActivity
################### debug session after orca function
Launching Ray on cluster with Spark barrier mode
23/09/08 04:44:09 WARN DAGScheduler: Creating new stage failed due to exception - job: 2
org.apache.spark.scheduler.BarrierJobRunWithDynamicAllocationException: [SPARK-24942]: Barrier execution mode does not support dynamic resource allocation for now. You can disable dynamic resource allocation by setting Spark conf "spark.dynamicAllocation.enabled" to "false".
	at org.apache.spark.scheduler.DAGScheduler.checkBarrierStageWithDynamicAllocation(DAGScheduler.scala:500)
	at org.apache.spark.scheduler.DAGScheduler.createResultStage(DAGScheduler.scala:588)
	at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:1196)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2592)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2584)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2573)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
2023-09-08 04:44:09,320 - DataTransformation - MainThread - ERROR - Exception in processing job: 164_AutoProphet-final_autoprophet_Y576X6P
Exception: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.scheduler.BarrierJobRunWithDynamicAllocationException: [SPARK-24942]: Barrier execution mode does not support dynamic resource allocation for now. You can disable dynamic resource allocation by setting Spark conf "spark.dynamicAllocation.enabled" to "false".
	at org.apache.spark.scheduler.DAGScheduler.checkBarrierStageWithDynamicAllocation(DAGScheduler.scala:500)
	at org.apache.spark.scheduler.DAGScheduler.createResultStage(DAGScheduler.scala:588)
	at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:1196)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2592)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2584)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2573)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2214)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2235)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2254)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2279)
	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
	at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:180)
	at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Unknown Source)
Traceback (most recent call last):
  File "/opt/easydata-app/python/operation/data_transformation.py", line 1253, in run
  File "/opt/easydata-app/python/transformation_analytics/ml_model_train_test.py", line 1089, in AutoProphet_Forecaster
  File "/usr/local/lib/python3.9/dist-packages/bigdl/chronos/autots/model/auto_prophet.py", line 112, in __init__
    self.auto_est = AutoEstimator(model_builder=model_builder,
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/automl/auto_estimator.py", line 53, in __init__
    self.searcher = SearchEngineFactory.create_engine(
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/automl/search/__init__.py", line 25, in create_engine
    return RayTuneSearchEngine(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/automl/search/ray_tune/ray_tune_search_engine.py", line 53, in __init__
    self.remote_dir = remote_dir or RayTuneSearchEngine.get_default_remote_dir(name)
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/automl/search/ray_tune/ray_tune_search_engine.py", line 60, in get_default_remote_dir
    ray_ctx = OrcaRayContext.get()
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/ray/raycontext.py", line 103, in get
    ray_ctx.init()
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/ray/raycontext.py", line 77, in init
    results = self._ray_on_spark_context.init(driver_cores=driver_cores)
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/ray/ray_on_spark_context.py", line 600, in init
    redis_address = self._start_cluster()
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/ray/ray_on_spark_context.py", line 628, in _start_cluster
    process_infos = ray_rdd.barrier().mapPartitions(
  File "/opt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 950, in collect
    sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/opt/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/opt/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.scheduler.BarrierJobRunWithDynamicAllocationException: [SPARK-24942]: Barrier execution mode does not support dynamic resource allocation for now. You can disable dynamic resource allocation by setting Spark conf "spark.dynamicAllocation.enabled" to "false".
	at org.apache.spark.scheduler.DAGScheduler.checkBarrierStageWithDynamicAllocation(DAGScheduler.scala:500)
	at org.apache.spark.scheduler.DAGScheduler.createResultStage(DAGScheduler.scala:588)
	at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:1196)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2592)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2584)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2573)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2214)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2235)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2254)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2279)
	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
	at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:180)
	at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Unknown Source)
```
```
2023-09-08 04:44:09,323 - DataTransformation - MainThread - INFO - Spark Session is stopped
23/09/08 04:44:09 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed.
Stopping ray_orca context
```
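Programmatically, the same workaround can be applied when creating the Orca context. A hedged sketch, not an official recipe: the `cluster_mode` value and the extra shuffle-tracking key are assumptions here, so verify both against your Orca and Spark versions before relying on them.

```python
# Spark properties that turn dynamic allocation off.
# Spark conf values are passed as strings.
dyn_alloc_off = {
    "spark.dynamicAllocation.enabled": "false",
    "spark.dynamicAllocation.shuffleTracking.enabled": "false",
}

# Hypothetical usage (requires bigdl-orca to be installed):
# from bigdl.orca import init_orca_context
# sc = init_orca_context(cluster_mode="k8s-client", conf=dyn_alloc_off)
```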