Problem description
Currently we directly use the `hdfs` command on the worker to save the intermediate results in trials. However, the `hdfs` binary may not be on `$PATH`, so the worker may raise an exception for not finding `hdfs`.
Error log
```
Traceback (most recent call last):
  File "/ads_storage/udap/ray/bigdl2-venv/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 924, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/ads_storage/udap/ray/bigdl2-venv/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 787, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/ads_storage/udap/ray/bigdl2-venv/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/ads_storage/udap/ray/bigdl2-venv/lib/python3.7/site-packages/ray/worker.py", line 1713, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.train_buffered() (pid=1094313, ip=10.154.180.24, repr=<types.ImplicitFunc object at 0x7f337849a160>)
  File "/dfs/10/yarn/nm/usercache/dw/appcache/application_1639751979565_40518/container_e55_1639751979565_40518_01_000010/bigdl2-venv.zip/bigdl2-venv/lib/python3.7/site-packages/ray/tune/trainable.py", line 255, in train_buffered
    result = self.train()
  File "/dfs/10/yarn/nm/usercache/dw/appcache/application_1639751979565_40518/container_e55_1639751979565_40518_01_000010/bigdl2-venv.zip/bigdl2-venv/lib/python3.7/site-packages/ray/tune/trainable.py", line 314, in train
    result = self.step()
  File "/dfs/10/yarn/nm/usercache/dw/appcache/application_1639751979565_40518/container_e55_1639751979565_40518_01_000010/bigdl2-venv.zip/bigdl2-venv/lib/python3.7/site-packages/ray/tune/function_runner.py", line 381, in step
    self._report_thread_runner_error(block=True)
  File "/dfs/10/yarn/nm/usercache/dw/appcache/application_1639751979565_40518/container_e55_1639751979565_40518_01_000010/bigdl2-venv.zip/bigdl2-venv/lib/python3.7/site-packages/ray/tune/function_runner.py", line 532, in _report_thread_runner_error
    ("Trial raised an exception. Traceback:\n{}".format(err_tb_str)
ray.tune.error.TuneError: Trial raised an exception. Traceback:
ray::ImplicitFunc.train_buffered() (pid=1094313, ip=10.154.180.24, repr=<types.ImplicitFunc object at 0x7f337849a160>)
  File "/dfs/10/yarn/nm/usercache/dw/appcache/application_1639751979565_40518/container_e55_1639751979565_40518_01_000010/bigdl2-venv.zip/bigdl2-venv/lib/python3.7/site-packages/ray/tune/function_runner.py", line 262, in run
    self._entrypoint()
  File "/dfs/10/yarn/nm/usercache/dw/appcache/application_1639751979565_40518/container_e55_1639751979565_40518_01_000010/bigdl2-venv.zip/bigdl2-venv/lib/python3.7/site-packages/ray/tune/function_runner.py", line 331, in entrypoint
    self._status_reporter.get_checkpoint())
  File "/ads_storage/udap/ray/bigdl2-venv/lib/python3.7/site-packages/ray/tune/function_runner.py", line 597, in _trainable_func
  File "/ads_storage/udap/ray/lib/bigdl-orca-spark_2.4.6-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/automl/search/ray_tune/ray_tune_search_engine.py", line 360, in train_func
  File "/dfs/10/yarn/nm/usercache/dw/appcache/application_1639751979565_40518/container_e55_1639751979565_40518_01_000010/bigdl-orca-spark_2.4.6-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/automl/search/utils.py", line 72, in put_ckpt_hdfs
    if remote_ckpt_basename not in get_remote_list(remote_dir):
  File "/dfs/10/yarn/nm/usercache/dw/appcache/application_1639751979565_40518/container_e55_1639751979565_40518_01_000010/bigdl-orca-spark_2.4.6-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/automl/search/utils.py", line 47, in get_remote_list
    s_output, _ = process(args)
  File "/dfs/10/yarn/nm/usercache/dw/appcache/application_1639751979565_40518/container_e55_1639751979565_40518_01_000010/bigdl-orca-spark_2.4.6-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/automl/search/utils.py", line 38, in process
    raise Exception(err)
Exception: /bin/sh: hdfs: command not found
/bin/sh: awk: command not found
```
One work-around is to add the hdfs path to `spark.executorEnv.PATH`, e.g. `--conf spark.executorEnv.PATH=$PATH:/opt/cloudera/parcels/CDH/bin`. However, directly changing the executor `PATH` may be risky.
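For reference, a hypothetical `spark-submit` invocation with this work-around (the script name and the other options are placeholders, not from this job):

```
spark-submit \
  --master yarn \
  --conf spark.executorEnv.PATH=$PATH:/opt/cloudera/parcels/CDH/bin \
  automl_job.py
```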
Solution
Expose an environment variable, e.g. `HDFS_PATH`, for users to specify the path where `hdfs` can be found. Internally, we should first check whether `HDFS_PATH` has been set, then fall back to the commonly used environment variables for the hdfs path, e.g. `HDFS_HOME`, etc. In the end, use the absolute path for the `hdfs` command.
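A minimal sketch of that lookup order (the helper name and the `HADOOP_HOME` fallback are assumptions, not the final design):

```python
import os
import shutil

def resolve_hdfs_bin():
    """Resolve an absolute path to the hdfs binary."""
    # 1. Explicit user override: HDFS_PATH points at the directory
    #    containing the hdfs binary.
    hdfs_dir = os.environ.get("HDFS_PATH")
    if hdfs_dir:
        candidate = os.path.join(hdfs_dir, "hdfs")
        if os.access(candidate, os.X_OK):
            return candidate
    # 2. Commonly used Hadoop environment variables (HADOOP_HOME is an
    #    assumed additional candidate).
    for var in ("HDFS_HOME", "HADOOP_HOME"):
        home = os.environ.get(var)
        if home:
            candidate = os.path.join(home, "bin", "hdfs")
            if os.access(candidate, os.X_OK):
                return candidate
    # 3. Fall back to an ordinary PATH lookup.
    candidate = shutil.which("hdfs")
    if candidate:
        return candidate
    raise RuntimeError(
        "Cannot find the hdfs command; set HDFS_PATH to the directory "
        "that contains it.")
```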
Use `pyarrow` internally instead of directly executing the `hdfs` command in a `subprocess`.
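A rough sketch of what `get_remote_list` in `utils.py` could look like with `pyarrow` (this assumes the worker has the Hadoop native client available, i.e. `CLASSPATH`/`ARROW_LIBHDFS_DIR` set; older pyarrow versions expose the legacy `pyarrow.hdfs.connect` API instead):

```python
from pyarrow import fs

def get_remote_list(remote_dir):
    # "default" resolves the namenode from fs.defaultFS in the Hadoop config.
    hdfs = fs.HadoopFileSystem(host="default")
    # List the direct children of remote_dir and return their base names,
    # replacing the current `hdfs dfs -ls | awk` subprocess pipeline.
    infos = hdfs.get_file_info(fs.FileSelector(remote_dir))
    return [info.base_name for info in infos]
```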
We may also try Ray Tune's built-in syncing: it supports distributed checkpointing with a shared directory (e.g. NFS), cloud storage (S3 or GS), or HDFS.
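A sketch of how that could look (the exact API differs across Ray versions; the trainable and the namenode address are placeholders):

```python
from ray import tune

def my_trainable(config):
    # Placeholder trainable that reports a dummy metric.
    tune.report(loss=0.0)

tune.run(
    my_trainable,
    sync_config=tune.SyncConfig(
        # Tune syncs trial results/checkpoints here itself, so we no
        # longer need our own hdfs subprocess calls.
        upload_dir="hdfs://namenode:8020/user/dw/ray_results",
    ),
)
```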