intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc
Apache License 2.0

BigDL AutoProphet function not working when executing spark-submit to an AWS EKS cluster #8895

Closed: SjeYinTeoIntel closed this issue 1 year ago

SjeYinTeoIntel commented 1 year ago

Facing an issue when running the code below:

```python
from bigdl.orca import init_orca_context
from bigdl.chronos.autots.model.auto_prophet import AutoProphet

init_orca_context(cluster_mode="local", cores=1)
forecaster = AutoProphet(optimizer=opt, loss=los)  # opt and los are defined elsewhere in our pipeline (not shown)
```


Logs:

```
23/09/05 10:45:09 WARN SparkContext: Using an existing SparkContext; some configuration may not take effect.
23-09-05 10:45:09 [Thread-4] INFO Engine$:461 - Find existing spark context. Checking the spark conf...
23-09-05 10:45:09 [Thread-4] WARN Engine$:467 - Engine.init: Can not find spark.shuffle.reduceLocality.enabled. For details please check https://bigdl-project.github.io/master/#APIGuide/Engine/
23-09-05 10:45:09 [Thread-4] WARN Engine$:467 - Engine.init: Can not find spark.shuffle.blockTransferService. For details please check https://bigdl-project.github.io/master/#APIGuide/Engine/
23-09-05 10:45:09 [Thread-4] WARN Engine$:467 - Engine.init: Can not find spark.scheduler.minRegisteredResourcesRatio. For details please check https://bigdl-project.github.io/master/#APIGuide/Engine/
23-09-05 10:45:09 [Thread-4] WARN Engine$:467 - Engine.init: Can not find spark.scheduler.maxRegisteredResourcesWaitingTime. For details please check https://bigdl-project.github.io/master/#APIGuide/Engine/
23-09-05 10:45:09 [Thread-4] WARN Engine$:467 - Engine.init: Can not find spark.speculation. For details please check https://bigdl-project.github.io/master/#APIGuide/Engine/
23-09-05 10:45:09 [Thread-4] WARN Engine$:467 - Engine.init: Can not find spark.serializer. For details please check https://bigdl-project.github.io/master/#APIGuide/Engine/
23-09-05 10:45:09 [Thread-4] WARN Engine$:470 - Engine.init: spark.driver.extraJavaOptions should be -Dlog4j2.info, but it is -Dcom.amazonaws.services.s3.enableV4=true. For details please check https://bigdl-project.github.io/master/#APIGuide/Engine/
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.Sample
BigDLBasePickler registering: bigdl.dllib.utils.common Sample
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.EvaluatedResult
BigDLBasePickler registering: bigdl.dllib.utils.common EvaluatedResult
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.JTensor
BigDLBasePickler registering: bigdl.dllib.utils.common JTensor
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.JActivity
BigDLBasePickler registering: bigdl.dllib.utils.common JActivity
Launching Ray on cluster with Spark barrier mode
Start to launch ray driver
Executing command: ray start --address 172.31.184.254:30106 --num-cpus 0 --node-ip-address 172.31.211.226
2023-09-05 10:45:18,705 INFO scripts.py:904 -- Local node IP: 172.31.211.226
2023-09-05 10:45:18,844 SUCC scripts.py:916 -- --------------------
2023-09-05 10:45:18,844 SUCC scripts.py:917 -- Ray runtime started.
2023-09-05 10:45:18,844 SUCC scripts.py:918 -- --------------------
2023-09-05 10:45:18,844 INFO scripts.py:920 -- To terminate the Ray runtime, run
2023-09-05 10:45:18,844 INFO scripts.py:921 --   ray stop
```

```
2023-09-05 10:45:18,721 WARNING services.py:1832 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=0.94gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
[2023-09-05 10:45:18,842 I 205 205] global_state_accessor.cc:356: This node has an IP address of 172.31.211.226, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
```
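Side note on the object-store warning above: since this runs on EKS rather than plain Docker, the `--shm-size` advice roughly translates to backing /dev/shm with a memory-based emptyDir on the Spark pods. Below is a sketch of the Spark-on-Kubernetes properties involved; the volume name `ray-shm` and the `2g` limit are assumptions, and they would be passed however the Spark configs are normally supplied (e.g. `spark-submit --conf`).

```python
# Sketch only: Spark-on-K8s properties that mount a memory-backed emptyDir at /dev/shm
# so Ray's object store does not fall back to /tmp. The volume name "ray-shm" and the
# "2g" size limit are assumptions; tune them to the pod's memory budget.
shm_conf = {
    "spark.kubernetes.executor.volumes.emptyDir.ray-shm.mount.path": "/dev/shm",
    "spark.kubernetes.executor.volumes.emptyDir.ray-shm.mount.readOnly": "false",
    "spark.kubernetes.executor.volumes.emptyDir.ray-shm.options.medium": "Memory",
    "spark.kubernetes.executor.volumes.emptyDir.ray-shm.options.sizeLimit": "2g",
    # The driver pod can be configured analogously via the
    # spark.kubernetes.driver.volumes.emptyDir.* keys.
}
```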

File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/ray/ray_daemon.py", line 26 logging.info(f"Stopping pgid {pgid} by ray_daemon.") ^ SyntaxError: invalid syntax 2023-09-05 10:45:19,974 INFO worker.py:1431 -- Connecting to existing Ray cluster at address: 172.31.184.254:30106... 2023-09-05 10:45:19,987 INFO worker.py:1621 -- Connected to Ray cluster. RayContext(dashboard_url='', python_version='3.9.2', ray_version='2.6.3', ray_commit='8a434b4ee7cd48e60fa1531315d39901fac5d79e', protocol_version=None)

```
/bin/sh: line 1: hadoop: command not found
/bin/sh: line 1: hadoop: command not found

ERROR:bigdl.dllib.utils.log4Error:

****Usage Error****
/bin/sh: line 1: hadoop: command not found
/bin/sh: line 1: hadoop: command not found

ERROR:bigdl.dllib.utils.log4Error:

****Call Stack***
2023-09-05 10:45:20,019 - DataTransformation - MainThread - ERROR - Exception in processing job: 164_autoprophet_autoprophet_FEJKVP6
Exception: /bin/sh: line 1: hadoop: command not found
/bin/sh: line 1: hadoop: command not found
Traceback (most recent call last):
  File "/opt/easydata-app/python/operation/data_transformation.py", line 1236, in run
  File "/opt/easydata-app/python/transformation_analytics/ml_model_train_test.py", line 1088, in AutoProphet_Forecaster
  File "/usr/local/lib/python3.9/dist-packages/bigdl/chronos/autots/model/auto_prophet.py", line 112, in __init__
    self.auto_est = AutoEstimator(model_builder=model_builder,
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/automl/auto_estimator.py", line 53, in __init__
    self.searcher = SearchEngineFactory.create_engine(
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/automl/search/__init__.py", line 25, in create_engine
    return RayTuneSearchEngine(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/automl/search/ray_tune/ray_tune_search_engine.py", line 53, in __init__
    self.remote_dir = remote_dir or RayTuneSearchEngine.get_default_remote_dir(name)
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/automl/search/ray_tune/ray_tune_search_engine.py", line 65, in get_default_remote_dir
    process(command=f"hadoop fs -mkdir -p {default_remote_dir}; "
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/automl/search/utils.py", line 39, in process
    invalidInputError(False, err)
  File "/usr/local/lib/python3.9/dist-packages/bigdl/dllib/utils/log4Error.py", line 33, in invalidInputError
    raise RuntimeError(errMsg)
RuntimeError: /bin/sh: line 1: hadoop: command not found
/bin/sh: line 1: hadoop: command not found
```

```
2023-09-05 10:45:20,020 - DataTransformation - MainThread - INFO - Spark Session is stopped
INFO:DataTransformation:Spark Session is stopped
23/09/05 10:45:20 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed.
```

sgwhat commented 1 year ago

The reason for this error is that the Ray driver was not running within the current pod; as a result, bigdl.orca.automl attempts to call the hadoop command. We are currently validating a solution for this issue 🙂.
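For illustration, here is a minimal sketch (not the actual BigDL source; the function name, the HDFS path, and the `running_locally` flag are assumptions) of the behaviour the traceback points at: when the Ray driver is not considered local, the RayTune search engine tries to create a default remote directory with the hadoop CLI, which is missing from this image.

```python
# Illustrative sketch only (not the real BigDL implementation). It mirrors the call in
# ray_tune_search_engine.py shown in the traceback: when the Ray driver is not running
# locally, a default remote directory is created via the hadoop CLI.
import subprocess


def get_default_remote_dir_sketch(name: str, running_locally: bool):
    """Hypothetical stand-in for RayTuneSearchEngine.get_default_remote_dir."""
    if running_locally:
        # Local mode: trial results stay on the local filesystem, no hadoop needed.
        return None
    default_remote_dir = f"hdfs:///tmp/{name}"  # assumed default path, for illustration
    # Shelling out to `hadoop fs` fails with "hadoop: command not found" when no
    # hadoop client is installed inside the pod.
    subprocess.run(f"hadoop fs -mkdir -p {default_remote_dir}",
                   shell=True, check=True)
    return default_remote_dir
```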

sgwhat commented 1 year ago

Solved in PR https://github.com/intel-analytics/BigDL/pull/8901. You can now install the latest nightly build of BigDL, e.g. `pip install --pre --upgrade bigdl`, to validate your programs 😄. @SjeYinTeoIntel
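As a quick sanity check after upgrading (assuming the installed distribution is named `bigdl`; adjust the name if you install component wheels such as `bigdl-chronos` directly), you can print the version that is actually picked up:

```python
# Check the installed BigDL version after `pip install --pre --upgrade bigdl`.
# The distribution name "bigdl" is an assumption; use the wheel you actually installed.
from importlib.metadata import version

print(version("bigdl"))
```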

sgwhat commented 1 year ago

A new issue appeared when creating the Spark context on EKS (K8s):

(screenshot of the error)

It is similar to the error we encountered in https://github.com/intel-analytics/BigDL/issues/8870 before.

sgwhat commented 1 year ago

Fixed in https://github.com/intel-analytics/BigDL/pull/8914. @SjeYinTeoIntel