intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

Automatically detect hdfs in AutoEstimator #3889

Open shanyu-sys opened 2 years ago

shanyu-sys commented 2 years ago

Problem description

Currently we invoke the hdfs command directly on the workers to save the intermediate results of trials. However, the directory containing hdfs may not be on $PATH, so a worker may raise an exception because the hdfs command cannot be found. Error log:

Traceback (most recent call last):
  File "/ads_storage/udap/ray/bigdl2-venv/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 924, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/ads_storage/udap/ray/bigdl2-venv/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 787, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/ads_storage/udap/ray/bigdl2-venv/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/ads_storage/udap/ray/bigdl2-venv/lib/python3.7/site-packages/ray/worker.py", line 1713, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.train_buffered() (pid=1094313, ip=10.154.180.24, repr=<types.ImplicitFunc object at 0x7f337849a160>)
  File "/dfs/10/yarn/nm/usercache/dw/appcache/application_1639751979565_40518/container_e55_1639751979565_40518_01_000010/bigdl2-venv.zip/bigdl2-venv/lib/python3.7/site-packages/ray/tune/trainable.py", line 255, in train_buffered
    result = self.train()
  File "/dfs/10/yarn/nm/usercache/dw/appcache/application_1639751979565_40518/container_e55_1639751979565_40518_01_000010/bigdl2-venv.zip/bigdl2-venv/lib/python3.7/site-packages/ray/tune/trainable.py", line 314, in train
    result = self.step()
  File "/dfs/10/yarn/nm/usercache/dw/appcache/application_1639751979565_40518/container_e55_1639751979565_40518_01_000010/bigdl2-venv.zip/bigdl2-venv/lib/python3.7/site-packages/ray/tune/function_runner.py", line 381, in step
    self._report_thread_runner_error(block=True)
  File "/dfs/10/yarn/nm/usercache/dw/appcache/application_1639751979565_40518/container_e55_1639751979565_40518_01_000010/bigdl2-venv.zip/bigdl2-venv/lib/python3.7/site-packages/ray/tune/function_runner.py", line 532, in _report_thread_runner_error
    ("Trial raised an exception. Traceback:\n{}".format(err_tb_str)
ray.tune.error.TuneError: Trial raised an exception. Traceback:
ray::ImplicitFunc.train_buffered() (pid=1094313, ip=10.154.180.24, repr=<types.ImplicitFunc object at 0x7f337849a160>)
  File "/dfs/10/yarn/nm/usercache/dw/appcache/application_1639751979565_40518/container_e55_1639751979565_40518_01_000010/bigdl2-venv.zip/bigdl2-venv/lib/python3.7/site-packages/ray/tune/function_runner.py", line 262, in run
    self._entrypoint()
  File "/dfs/10/yarn/nm/usercache/dw/appcache/application_1639751979565_40518/container_e55_1639751979565_40518_01_000010/bigdl2-venv.zip/bigdl2-venv/lib/python3.7/site-packages/ray/tune/function_runner.py", line 331, in entrypoint
    self._status_reporter.get_checkpoint())
  File "/ads_storage/udap/ray/bigdl2-venv/lib/python3.7/site-packages/ray/tune/function_runner.py", line 597, in _trainable_func
  File "/ads_storage/udap/ray/lib/bigdl-orca-spark_2.4.6-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/automl/search/ray_tune/ray_tune_search_engine.py", line 360, in train_func
  File "/dfs/10/yarn/nm/usercache/dw/appcache/application_1639751979565_40518/container_e55_1639751979565_40518_01_000010/bigdl-orca-spark_2.4.6-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/automl/search/utils.py", line 72, in put_ckpt_hdfs
    if remote_ckpt_basename not in get_remote_list(remote_dir):
  File "/dfs/10/yarn/nm/usercache/dw/appcache/application_1639751979565_40518/container_e55_1639751979565_40518_01_000010/bigdl-orca-spark_2.4.6-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/automl/search/utils.py", line 47, in get_remote_list
    s_output, _ = process(args)
  File "/dfs/10/yarn/nm/usercache/dw/appcache/application_1639751979565_40518/container_e55_1639751979565_40518_01_000010/bigdl-orca-spark_2.4.6-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/automl/search/utils.py", line 38, in process
    raise Exception(err)
Exception: /bin/sh: hdfs: command not found
/bin/sh: awk: command not found
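
For context, the failing call is a shell invocation along these lines; this is a minimal illustrative reconstruction of the process helper in bigdl/orca/automl/search/utils.py, not the exact source:

import subprocess

def process(command):
    # Runs e.g. "hdfs dfs -ls <remote_dir>" (piped through awk) via /bin/sh.
    # If hdfs or awk is not on the worker's $PATH, /bin/sh reports
    # "command not found" on stderr and we surface it as an exception.
    proc = subprocess.Popen(command, shell=True,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    if proc.returncode != 0:
        raise Exception(err.decode())
    return out.decode(), err.decode()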

One workaround is to add the hdfs directory to spark.executorEnv.PATH, e.g. --conf spark.executorEnv.PATH=$PATH:/opt/cloudera/parcels/CDH/bin. However, directly changing the executor PATH may be risky.

Solution

  1. Expose an environment variable, e.g. HDFS_PATH, for users to specify where to find hdfs. Internally, we should first check whether HDFS_PATH has been set, then fall back to the environment variables commonly used for the hdfs path, e.g. HDFS_HOME.... In the end, invoke hdfs by its absolute path (see the first sketch below).
  2. Use pyarrow internally instead of executing the hdfs command in a subprocess (see the second sketch below).
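
A minimal sketch of option 1, assuming the proposed HDFS_PATH override (that variable and the helper name resolve_hdfs_binary are illustrative, not existing BigDL API):

import os
import shutil

def resolve_hdfs_binary():
    # Return an absolute path to the hdfs command instead of relying on $PATH.
    # 1. Explicit user override proposed above (hypothetical variable name).
    override = os.environ.get("HDFS_PATH")
    if override:
        return os.path.join(override, "hdfs")
    # 2. Commonly used Hadoop environment variables.
    for var in ("HDFS_HOME", "HADOOP_HOME"):
        home = os.environ.get(var)
        if home:
            candidate = os.path.join(home, "bin", "hdfs")
            if os.path.isfile(candidate):
                return candidate
    # 3. Fall back to whatever is already on $PATH, if anything.
    found = shutil.which("hdfs")
    if found:
        return found
    raise FileNotFoundError("Cannot locate hdfs; set HDFS_PATH or HADOOP_HOME.")

And a sketch of option 2 using pyarrow's HadoopFileSystem, which talks to HDFS through libhdfs rather than the hdfs CLI (note libhdfs needs HADOOP_HOME and the Hadoop CLASSPATH set, so an environment requirement remains, just a different one). get_remote_list mirrors the helper from the traceback above:

from pyarrow import fs

def get_remote_list(remote_dir):
    # host="default" resolves the namenode from the Hadoop configuration (fs.defaultFS).
    hdfs = fs.HadoopFileSystem(host="default")
    infos = hdfs.get_file_info(fs.FileSelector(remote_dir))
    return [info.base_name for info in infos]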
jason-dai commented 2 years ago

What if there is no HDFS?

shanyu-sys commented 2 years ago

What if there is no HDFS?

I guess we could try using Ray Tune syncing; it supports distributed checkpointing with a shared directory (e.g. NFS), cloud storage (e.g. S3 or GS), or HDFS.
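
A rough sketch of how that could look, assuming the tune.SyncConfig API from the Ray 1.x line that the traceback above suggests; the trainable and the upload URI are placeholders:

from ray import tune

def trainable(config):
    # Dummy trainable that just reports a metric per step.
    for step in range(3):
        tune.report(score=config["x"] * step)

# With upload_dir set, Tune syncs trial results/checkpoints to remote storage
# itself, so the workers would no longer need the hdfs CLI on $PATH.
tune.run(
    trainable,
    config={"x": tune.uniform(0, 1)},
    sync_config=tune.SyncConfig(upload_dir="s3://my-bucket/tune-results"),
)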