intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

DynamicAllocation parameter not supported. #8934

Closed SjeYinTeoIntel closed 11 months ago

SjeYinTeoIntel commented 1 year ago

Issue: DynamicAllocation parameter not supported.

```
23-09-08 04:44:06 [Thread-4] WARN Engine$:470 - Engine.init: spark.driver.extraJavaOptions should be -Dlog4j2.info, but it is -Dcom.amazonaws.services.s3.enableV4=true. For details please check https://bigdl-project.github.io/master/#APIGuide/Engine/
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.Sample
BigDLBasePickler registering: bigdl.dllib.utils.common Sample
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.EvaluatedResult
BigDLBasePickler registering: bigdl.dllib.utils.common EvaluatedResult
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.JTensor
BigDLBasePickler registering: bigdl.dllib.utils.common JTensor
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.JActivity
BigDLBasePickler registering: bigdl.dllib.utils.common JActivity
###################debug session after orca function
Launching Ray on cluster with Spark barrier mode
23/09/08 04:44:09 WARN DAGScheduler: Creating new stage failed due to exception - job: 2
org.apache.spark.scheduler.BarrierJobRunWithDynamicAllocationException: [SPARK-24942]: Barrier execution mode does not support dynamic resource allocation for now. You can disable dynamic resource allocation by setting Spark conf "spark.dynamicAllocation.enabled" to "false".
	at org.apache.spark.scheduler.DAGScheduler.checkBarrierStageWithDynamicAllocation(DAGScheduler.scala:500)
	at org.apache.spark.scheduler.DAGScheduler.createResultStage(DAGScheduler.scala:588)
	at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:1196)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2592)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2584)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2573)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
2023-09-08 04:44:09,320 - DataTransformation - MainThread - ERROR - Exception in processing job: 164_AutoProphet-final_autoprophet_Y576X6P
Exception: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.scheduler.BarrierJobRunWithDynamicAllocationException: [SPARK-24942]: Barrier execution mode does not support dynamic resource allocation for now. You can disable dynamic resource allocation by setting Spark conf "spark.dynamicAllocation.enabled" to "false".
	at org.apache.spark.scheduler.DAGScheduler.checkBarrierStageWithDynamicAllocation(DAGScheduler.scala:500)
	at org.apache.spark.scheduler.DAGScheduler.createResultStage(DAGScheduler.scala:588)
	at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:1196)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2592)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2584)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2573)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2214)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2235)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2254)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2279)
	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
	at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:180)
	at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Unknown Source)
Traceback (most recent call last):
  File "/opt/easydata-app/python/operation/data_transformation.py", line 1253, in run
  File "/opt/easydata-app/python/transformation_analytics/ml_model_train_test.py", line 1089, in AutoProphet_Forecaster
  File "/usr/local/lib/python3.9/dist-packages/bigdl/chronos/autots/model/auto_prophet.py", line 112, in __init__
    self.auto_est = AutoEstimator(model_builder=model_builder,
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/automl/auto_estimator.py", line 53, in __init__
    self.searcher = SearchEngineFactory.create_engine(
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/automl/search/__init__.py", line 25, in create_engine
    return RayTuneSearchEngine(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/automl/search/ray_tune/ray_tune_search_engine.py", line 53, in __init__
    self.remote_dir = remote_dir or RayTuneSearchEngine.get_default_remote_dir(name)
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/automl/search/ray_tune/ray_tune_search_engine.py", line 60, in get_default_remote_dir
    ray_ctx = OrcaRayContext.get()
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/ray/raycontext.py", line 103, in get
    ray_ctx.init()
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/ray/raycontext.py", line 77, in init
    results = self._ray_on_spark_context.init(driver_cores=driver_cores)
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/ray/ray_on_spark_context.py", line 600, in init
    redis_address = self._start_cluster()
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/ray/ray_on_spark_context.py", line 628, in _start_cluster
    process_infos = ray_rdd.barrier().mapPartitions(
  File "/opt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 950, in collect
    sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/opt/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/opt/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.scheduler.BarrierJobRunWithDynamicAllocationException: [SPARK-24942]: Barrier execution mode does not support dynamic resource allocation for now. You can disable dynamic resource allocation by setting Spark conf "spark.dynamicAllocation.enabled" to "false".
	at org.apache.spark.scheduler.DAGScheduler.checkBarrierStageWithDynamicAllocation(DAGScheduler.scala:500)
	at org.apache.spark.scheduler.DAGScheduler.createResultStage(DAGScheduler.scala:588)
	at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:1196)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2592)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2584)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2573)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2214)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2235)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2254)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2279)
	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
	at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:180)
	at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Unknown Source)
```


```
2023-09-08 04:44:09,323 - DataTransformation - MainThread - INFO - Spark Session is stopped
INFO:DataTransformation:Spark Session is stopped
23/09/08 04:44:09 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed.
Stopping ray_orca context
```

lalalapotter commented 1 year ago

Hi @SjeYinTeoIntel,

We haven't supported Spark Dynamic Allocation yet; support for it is still being planned. For now, please refer to this known issue (https://bigdl.readthedocs.io/en/latest/doc/Orca/Overview/known_issues.html#spark-dynamic-allocation) for details and a workaround.
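For reference, the workaround is the one stated in the Spark error message itself: disable dynamic allocation and pin the executor count when submitting the job. A minimal sketch (the application file name, executor count, and any other cluster-specific options here are placeholders, not part of the original report):

```shell
# Orca starts Ray via a barrier-mode stage (ray_rdd.barrier().mapPartitions(...)),
# and Spark rejects barrier stages when dynamic allocation is on (SPARK-24942).
# Turning it off and fixing the number of executors avoids the exception.
spark-submit \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executor.instances=2 \
  your_app.py
```

The same two settings can equally be placed in `spark-defaults.conf` or set on the `SparkConf` before the session is created; what matters is that `spark.dynamicAllocation.enabled` is `false` before any barrier stage runs.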

Best, Cengguang

SjeYinTeoIntel commented 1 year ago

Hi Cengguang,

Noted, and thanks!

Regards, sjeyin
