intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc
Apache License 2.0

BigDL AutoProphet function not working when executing spark-submit to an AWS EKS cluster #8895

Closed: SjeYinTeoIntel closed this issue 1 year ago

SjeYinTeoIntel commented 1 year ago

Facing an issue when running the code below:

```python
from bigdl.orca import init_orca_context
from bigdl.chronos.autots.model.auto_prophet import AutoProphet

init_orca_context(cluster_mode="local", cores=1)
forecaster = AutoProphet(optimizer=opt, loss=los)  # opt and los are defined elsewhere in our pipeline (not shown)
```


Logs:

```
23/09/05 10:45:09 WARN SparkContext: Using an existing SparkContext; some configuration may not take effect.
23-09-05 10:45:09 [Thread-4] INFO Engine$:461 - Find existing spark context. Checking the spark conf...
23-09-05 10:45:09 [Thread-4] WARN Engine$:467 - Engine.init: Can not find spark.shuffle.reduceLocality.enabled. For details please check https://bigdl-project.github.io/master/#APIGuide/Engine/
23-09-05 10:45:09 [Thread-4] WARN Engine$:467 - Engine.init: Can not find spark.shuffle.blockTransferService. For details please check https://bigdl-project.github.io/master/#APIGuide/Engine/
23-09-05 10:45:09 [Thread-4] WARN Engine$:467 - Engine.init: Can not find spark.scheduler.minRegisteredResourcesRatio. For details please check https://bigdl-project.github.io/master/#APIGuide/Engine/
23-09-05 10:45:09 [Thread-4] WARN Engine$:467 - Engine.init: Can not find spark.scheduler.maxRegisteredResourcesWaitingTime. For details please check https://bigdl-project.github.io/master/#APIGuide/Engine/
23-09-05 10:45:09 [Thread-4] WARN Engine$:467 - Engine.init: Can not find spark.speculation. For details please check https://bigdl-project.github.io/master/#APIGuide/Engine/
23-09-05 10:45:09 [Thread-4] WARN Engine$:467 - Engine.init: Can not find spark.serializer. For details please check https://bigdl-project.github.io/master/#APIGuide/Engine/
23-09-05 10:45:09 [Thread-4] WARN Engine$:470 - Engine.init: spark.driver.extraJavaOptions should be -Dlog4j2.info, but it is -Dcom.amazonaws.services.s3.enableV4=true. For details please check https://bigdl-project.github.io/master/#APIGuide/Engine/
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.Sample
BigDLBasePickler registering: bigdl.dllib.utils.common Sample
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.EvaluatedResult
BigDLBasePickler registering: bigdl.dllib.utils.common EvaluatedResult
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.JTensor
BigDLBasePickler registering: bigdl.dllib.utils.common JTensor
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.JActivity
BigDLBasePickler registering: bigdl.dllib.utils.common JActivity
Launching Ray on cluster with Spark barrier mode
Start to launch ray driver
Executing command: ray start --address 172.31.184.254:30106 --num-cpus 0 --node-ip-address 172.31.211.226
2023-09-05 10:45:18,705 INFO scripts.py:904 -- Local node IP: 172.31.211.226
2023-09-05 10:45:18,844 SUCC scripts.py:916 -- --------------------
2023-09-05 10:45:18,844 SUCC scripts.py:917 -- Ray runtime started.
2023-09-05 10:45:18,844 SUCC scripts.py:918 -- --------------------
2023-09-05 10:45:18,844 INFO scripts.py:920 -- To terminate the Ray runtime, run
2023-09-05 10:45:18,844 INFO scripts.py:921 --   ray stop
```

```
2023-09-05 10:45:18,721 WARNING services.py:1832 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=0.94gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
[2023-09-05 10:45:18,842 I 205 205] global_state_accessor.cc:356: This node has an IP address of 172.31.211.226, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
```
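Side note on the object-store warning above: since this runs on EKS rather than plain Docker, the `--shm-size` advice roughly translates to backing /dev/shm with a memory-based emptyDir on the Spark pods. Below is a sketch of the Spark-on-Kubernetes properties involved; the volume name `ray-shm` and the `2g` limit are assumptions, and they would be passed however the Spark configs are normally supplied (e.g. `spark-submit --conf`).

```python
# Sketch only: Spark-on-K8s properties that mount a memory-backed emptyDir at /dev/shm
# so Ray's object store does not fall back to /tmp. The volume name "ray-shm" and the
# "2g" size limit are assumptions; tune them to the pod's memory budget.
shm_conf = {
    "spark.kubernetes.executor.volumes.emptyDir.ray-shm.mount.path": "/dev/shm",
    "spark.kubernetes.executor.volumes.emptyDir.ray-shm.mount.readOnly": "false",
    "spark.kubernetes.executor.volumes.emptyDir.ray-shm.options.medium": "Memory",
    "spark.kubernetes.executor.volumes.emptyDir.ray-shm.options.sizeLimit": "2g",
    # The driver pod can be configured analogously via the
    # spark.kubernetes.driver.volumes.emptyDir.* keys.
}
```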

File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/ray/ray_daemon.py", line 26 logging.info(f"Stopping pgid {pgid} by ray_daemon.") ^ SyntaxError: invalid syntax 2023-09-05 10:45:19,974 INFO worker.py:1431 -- Connecting to existing Ray cluster at address: 172.31.184.254:30106... 2023-09-05 10:45:19,987 INFO worker.py:1621 -- Connected to Ray cluster. RayContext(dashboard_url='', python_version='3.9.2', ray_version='2.6.3', ray_commit='8a434b4ee7cd48e60fa1531315d39901fac5d79e', protocol_version=None)

```
/bin/sh: line 1: hadoop: command not found
/bin/sh: line 1: hadoop: command not found

ERROR:bigdl.dllib.utils.log4Error:

****Usage Error****
/bin/sh: line 1: hadoop: command not found
/bin/sh: line 1: hadoop: command not found

ERROR:bigdl.dllib.utils.log4Error:

****Call Stack***
2023-09-05 10:45:20,019 - DataTransformation - MainThread - ERROR - Exception in processing job: 164_autoprophet_autoprophet_FEJKVP6
Exception: /bin/sh: line 1: hadoop: command not found
/bin/sh: line 1: hadoop: command not found
Traceback (most recent call last):
  File "/opt/easydata-app/python/operation/data_transformation.py", line 1236, in run
  File "/opt/easydata-app/python/transformation_analytics/ml_model_train_test.py", line 1088, in AutoProphet_Forecaster
  File "/usr/local/lib/python3.9/dist-packages/bigdl/chronos/autots/model/auto_prophet.py", line 112, in __init__
    self.auto_est = AutoEstimator(model_builder=model_builder,
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/automl/auto_estimator.py", line 53, in __init__
    self.searcher = SearchEngineFactory.create_engine(
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/automl/search/__init__.py", line 25, in create_engine
    return RayTuneSearchEngine(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/automl/search/ray_tune/ray_tune_search_engine.py", line 53, in __init__
    self.remote_dir = remote_dir or RayTuneSearchEngine.get_default_remote_dir(name)
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/automl/search/ray_tune/ray_tune_search_engine.py", line 65, in get_default_remote_dir
    process(command=f"hadoop fs -mkdir -p {default_remote_dir}; "
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/automl/search/utils.py", line 39, in process
    invalidInputError(False, err)
  File "/usr/local/lib/python3.9/dist-packages/bigdl/dllib/utils/log4Error.py", line 33, in invalidInputError
    raise RuntimeError(errMsg)
RuntimeError: /bin/sh: line 1: hadoop: command not found
/bin/sh: line 1: hadoop: command not found
```

```
2023-09-05 10:45:20,020 - DataTransformation - MainThread - INFO - Spark Session is stopped
INFO:DataTransformation:Spark Session is stopped
23/09/05 10:45:20 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed.
```

sgwhat commented 1 year ago

The reason for this error is that the Ray driver was not running within the current pod; as a result, bigdl.orca.automl attempts to call the hadoop command. We are currently validating a solution for this issue 🙂.
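For illustration, here is a minimal sketch (not the actual BigDL source; the function name, the HDFS path, and the `running_locally` flag are assumptions) of the behaviour the traceback points at: when the Ray driver is not considered local, the RayTune search engine tries to create a default remote directory with the hadoop CLI, which is missing from this image.

```python
# Illustrative sketch only (not the real BigDL implementation). It mirrors the call in
# ray_tune_search_engine.py shown in the traceback: when the Ray driver is not running
# locally, a default remote directory is created via the hadoop CLI.
import subprocess


def get_default_remote_dir_sketch(name: str, running_locally: bool):
    """Hypothetical stand-in for RayTuneSearchEngine.get_default_remote_dir."""
    if running_locally:
        # Local mode: trial results stay on the local filesystem, no hadoop needed.
        return None
    default_remote_dir = f"hdfs:///tmp/{name}"  # assumed default path, for illustration
    # Shelling out to `hadoop fs` fails with "hadoop: command not found" when no
    # hadoop client is installed inside the pod.
    subprocess.run(f"hadoop fs -mkdir -p {default_remote_dir}",
                   shell=True, check=True)
    return default_remote_dir
```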

sgwhat commented 1 year ago

Solved in PR https://github.com/intel-analytics/BigDL/pull/8901. You can now install the latest nightly build of BigDL, e.g. `pip install --pre --upgrade bigdl`, to validate your programs 😄. @SjeYinTeoIntel
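As a quick sanity check after upgrading (assuming the installed distribution is named `bigdl`; adjust the name if you install component wheels such as `bigdl-chronos` directly), you can print the version that is actually picked up:

```python
# Check the installed BigDL version after `pip install --pre --upgrade bigdl`.
# The distribution name "bigdl" is an assumption; use the wheel you actually installed.
from importlib.metadata import version

print(version("bigdl"))
```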

sgwhat commented 1 year ago

A new issue appeared when creating the Spark context on EKS (K8s):

(screenshot of the error)

It is similar to the error we encountered in https://github.com/intel-analytics/BigDL/issues/8870 before.

sgwhat commented 1 year ago

Fixed in https://github.com/intel-analytics/BigDL/pull/8914. @SjeYinTeoIntel