intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

AutoTS yarn-client mode cannot find the hadoop command when hadoop is not on the default Linux PATH. #6686

Open qiuxin2012 opened 1 year ago

qiuxin2012 commented 1 year ago

When Spark on YARN is initialized with Ray, AutoML in yarn-client mode cannot find the hadoop command if hadoop is not on the default Linux PATH. Apache Hadoop has this problem, while CDH doesn't.

ray::ImplicitFunc.train_buffered() (pid=1673826, ip=172.168.0.202, repr=<types.ImplicitFunc object at 0x7f33dd334350>)
  File "/tmp/hadoop-cpx/nm-local-dir/usercache/root/appcache/application_1669020546318_0004/container_1669020546318_0004_01_000003/python_env/lib/python3.7/site-packages/ray/tune/function_runner.py", line 262, in run
    self._entrypoint()
  File "/tmp/hadoop-cpx/nm-local-dir/usercache/root/appcache/application_1669020546318_0004/container_1669020546318_0004_01_000003/python_env/lib/python3.7/site-packages/ray/tune/function_runner.py", line 331, in entrypoint
    self._status_reporter.get_checkpoint())
  File "/home/cpx/miniconda3/envs/xin-chronos/lib/python3.7/site-packages/ray/tune/function_runner.py", line 597, in _trainable_func
    output = fn()
  File "/home/cpx/miniconda3/envs/xin-chronos/lib/python3.7/site-packages/bigdl/orca/automl/search/ray_tune/ray_tune_search_engine.py", line 369, in train_func
    put_ckpt_hdfs(remote_dir, checkpoint_filename)
  File "/tmp/hadoop-cpx/nm-local-dir/usercache/root/appcache/application_1669020546318_0004/container_1669020546318_0004_01_000003/python_env/lib/python3.7/site-packages/bigdl/orca/automl/search/utils.py", line 78, in put_ckpt_hdfs
    process(cmd)
  File "/tmp/hadoop-cpx/nm-local-dir/usercache/root/appcache/application_1669020546318_0004/container_1669020546318_0004_01_000003/python_env/lib/python3.7/site-packages/bigdl/orca/automl/search/utils.py", line 39, in process
    invalidInputError(False, err)
  File "/tmp/hadoop-cpx/nm-local-dir/usercache/root/appcache/application_1669020546318_0004/container_1669020546318_0004_01_000003/python_env/lib/python3.7/site-packages/bigdl/dllib/utils/log4Error.py", line 33, in invalidInputError
    raise RuntimeError(errMsg)
RuntimeError: ERROR: ld.so: object 'python_env/lib/libpython3.7m.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
/bin/sh: 1: hadoop: not found
/bin/sh: 1: hadoop: not found
2022-11-21 09:39:41,039 ERROR trial_runner.py:958 -- Trial train_func_643af_00016: Error processing event.
Traceback (most recent call last):
  File "/home/cpx/miniconda3/envs/xin-chronos/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 924, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/cpx/miniconda3/envs/xin-chronos/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 787, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/cpx/miniconda3/envs/xin-chronos/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/cpx/miniconda3/envs/xin-chronos/lib/python3.7/site-packages/ray/worker.py", line 1713, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.train_buffered() (pid=1673470, ip=172.168.0.202, repr=<types.ImplicitFunc object at 0x7f54452f7810>)
  File "/tmp/hadoop-cpx/nm-local-dir/usercache/root/appcache/application_1669020546318_0004/container_1669020546318_0004_01_000002/python_env/lib/python3.7/site-packages/ray/tune/trainable.py", line 255, in train_buffered
    result = self.train()
  File "/tmp/hadoop-cpx/nm-local-dir/usercache/root/appcache/application_1669020546318_0004/container_1669020546318_0004_01_000002/python_env/lib/python3.7/site-packages/ray/tune/trainable.py", line 314, in train
    result = self.step()
  File "/tmp/hadoop-cpx/nm-local-dir/usercache/root/appcache/application_1669020546318_0004/container_1669020546318_0004_01_000002/python_env/lib/python3.7/site-packages/ray/tune/function_runner.py", line 381, in step
    self._report_thread_runner_error(block=True)
  File "/tmp/hadoop-cpx/nm-local-dir/usercache/root/appcache/application_1669020546318_0004/container_1669020546318_0004_01_000002/python_env/lib/python3.7/site-packages/ray/tune/function_runner.py", line 532, in _report_thread_runner_error
    ("Trial raised an exception. Traceback:\n{}".format(err_tb_str)
ray.tune.error.TuneError: Trial raised an exception. Traceback:
ray::ImplicitFunc.train_buffered() (pid=1673470, ip=172.168.0.202, repr=<types.ImplicitFunc object at 0x7f54452f7810>)
  File "/tmp/hadoop-cpx/nm-local-dir/usercache/root/appcache/application_1669020546318_0004/container_1669020546318_0004_01_000002/python_env/lib/python3.7/site-packages/ray/tune/function_runner.py", line 262, in run
    self._entrypoint()
  File "/tmp/hadoop-cpx/nm-local-dir/usercache/root/appcache/application_1669020546318_0004/container_1669020546318_0004_01_000002/python_env/lib/python3.7/site-packages/ray/tune/function_runner.py", line 331, in entrypoint
    self._status_reporter.get_checkpoint())
  File "/home/cpx/miniconda3/envs/xin-chronos/lib/python3.7/site-packages/ray/tune/function_runner.py", line 597, in _trainable_func
    output = fn()
  File "/home/cpx/miniconda3/envs/xin-chronos/lib/python3.7/site-packages/bigdl/orca/automl/search/ray_tune/ray_tune_search_engine.py", line 369, in train_func
    put_ckpt_hdfs(remote_dir, checkpoint_filename)
  File "/tmp/hadoop-cpx/nm-local-dir/usercache/root/appcache/application_1669020546318_0004/container_1669020546318_0004_01_000002/python_env/lib/python3.7/site-packages/bigdl/orca/automl/search/utils.py", line 78, in put_ckpt_hdfs
    process(cmd)
  File "/tmp/hadoop-cpx/nm-local-dir/usercache/root/appcache/application_1669020546318_0004/container_1669020546318_0004_01_000002/python_env/lib/python3.7/site-packages/bigdl/orca/automl/search/utils.py", line 39, in process
    invalidInputError(False, err)
  File "/tmp/hadoop-cpx/nm-local-dir/usercache/root/appcache/application_1669020546318_0004/container_1669020546318_0004_01_000002/python_env/lib/python3.7/site-packages/bigdl/dllib/utils/log4Error.py", line 33, in invalidInputError
    raise RuntimeError(errMsg)
RuntimeError: ERROR: ld.so: object 'python_env/lib/libpython3.7m.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
/bin/sh: 1: hadoop: not found
/bin/sh: 1: hadoop: not found
qiuxin2012 commented 1 year ago

The executor cannot find the hadoop command. The user should add soft links under /usr/local, for example:

sudo ln -s $HADOOP_HOME/bin/hadoop /usr/local/bin
sudo ln -s $HADOOP_HOME/bin/hdfs /usr/local/bin
sudo ln -s $HADOOP_HOME/libexec /usr/local
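
As an alternative to the soft links, the lookup can be sketched in Python: try the PATH first, then fall back to $HADOOP_HOME/bin/hadoop. This is only a minimal sketch, not what bigdl does. `find_hadoop` is a hypothetical helper, and it assumes HADOOP_HOME is set in the executor's environment, which the trace above does not confirm.

```python
import os
import shutil


def find_hadoop(env=None):
    """Return a path to the hadoop executable, or None if it cannot be found.

    Hypothetical helper for illustration; not part of bigdl.
    """
    env = os.environ if env is None else env
    # First try the PATH seen by this process.
    found = shutil.which("hadoop", path=env.get("PATH", ""))
    if found:
        return found
    # Fall back to $HADOOP_HOME/bin/hadoop (assumes HADOOP_HOME is set).
    home = env.get("HADOOP_HOME")
    if home:
        candidate = os.path.join(home, "bin", "hadoop")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None
```

If `find_hadoop()` returns None on an executor, neither the PATH nor HADOOP_HOME reaches that process, and the soft links above are probably the simpler fix.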
qiuxin2012 commented 1 year ago

Got another error:

****************************Usage Error************************
get: `/tmp/auto_lstm/auto_lstm/train_func_0d583_00039/best.ckpt': File exists

****************************Call Stack*************************
Traceback (most recent call last):
  File "lstm.py", line 82, in <module>
    n_sampling=40)
  File "/home/cpx/miniconda3/envs/xin-chronos/lib/python3.7/site-packages/bigdl/chronos/autots/model/base_automodel.py", line 101, in fit
    self.best_model = self.auto_est._get_best_automl_model()
  File "/home/cpx/miniconda3/envs/xin-chronos/lib/python3.7/site-packages/bigdl/orca/automl/auto_estimator.py", line 244, in _get_best_automl_model
    self.best_trial = self.searcher.get_best_trial()
  File "/home/cpx/miniconda3/envs/xin-chronos/lib/python3.7/site-packages/bigdl/orca/automl/search/ray_tune/ray_tune_search_engine.py", line 222, in get_best_trial
    return self.get_best_trials(k=1)[0]
  File "/home/cpx/miniconda3/envs/xin-chronos/lib/python3.7/site-packages/bigdl/orca/automl/search/ray_tune/ray_tune_search_engine.py", line 234, in get_best_trials
    return [self._make_trial_output(t) for t in best_trials]
  File "/home/cpx/miniconda3/envs/xin-chronos/lib/python3.7/site-packages/bigdl/orca/automl/search/ray_tune/ray_tune_search_engine.py", line 234, in <listcomp>
    return [self._make_trial_output(t) for t in best_trials]
  File "/home/cpx/miniconda3/envs/xin-chronos/lib/python3.7/site-packages/bigdl/orca/automl/search/ray_tune/ray_tune_search_engine.py", line 239, in _make_trial_output
    get_ckpt_hdfs(self.remote_dir, model_path)
  File "/home/cpx/miniconda3/envs/xin-chronos/lib/python3.7/site-packages/bigdl/orca/automl/search/utils.py", line 92, in get_ckpt_hdfs
    process(cmd)
  File "/home/cpx/miniconda3/envs/xin-chronos/lib/python3.7/site-packages/bigdl/orca/automl/search/utils.py", line 39, in process
    invalidInputError(False, err)
  File "/home/cpx/miniconda3/envs/xin-chronos/lib/python3.7/site-packages/bigdl/dllib/utils/log4Error.py", line 33, in invalidInputError
    raise RuntimeError(errMsg)
RuntimeError: get: `/tmp/auto_lstm/auto_lstm/train_func_0d583_00039/best.ckpt': File exists

Stopping orca context

This error occurs when the driver runs on the same machine as an executor, because the driver and the executor use the same local tmp directory.
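
One way to avoid the collision, sketched here under assumptions, is to give each process its own local staging directory so the driver never downloads onto a file an executor already wrote. `private_local_path` is a hypothetical helper, not a bigdl API, and this is not necessarily how the issue was resolved upstream.

```python
import os
import tempfile


def private_local_path(shared_path, role):
    """Map a shared local path to one under a directory private to this process.

    Hypothetical helper for illustration; `role` is just a readable prefix
    such as "driver" or "executor".
    """
    # mkdtemp creates a new, uniquely named directory on every call.
    staging = tempfile.mkdtemp(prefix=f"{role}_")
    return os.path.join(staging, os.path.basename(shared_path))
```

For example, the driver could download to `private_local_path("/tmp/auto_lstm/auto_lstm/train_func_0d583_00039/best.ckpt", "driver")` instead of the shared path, so the `File exists` check never sees the executor's copy.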