intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

Automatically detect hdfs in AutoEstimator #3889

Open shanyu-sys opened 2 years ago

shanyu-sys commented 2 years ago

Problem description

Currently we invoke the hdfs command directly on the workers to save the intermediate results of trials. However, the directory containing hdfs may not be on $PATH, so a worker may raise an exception because the hdfs command cannot be found. Error log:

Traceback (most recent call last):
  File "/ads_storage/udap/ray/bigdl2-venv/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 924, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/ads_storage/udap/ray/bigdl2-venv/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 787, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/ads_storage/udap/ray/bigdl2-venv/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/ads_storage/udap/ray/bigdl2-venv/lib/python3.7/site-packages/ray/worker.py", line 1713, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.train_buffered() (pid=1094313, ip=10.154.180.24, repr=<types.ImplicitFunc object at 0x7f337849a160>)
  File "/dfs/10/yarn/nm/usercache/dw/appcache/application_1639751979565_40518/container_e55_1639751979565_40518_01_000010/bigdl2-venv.zip/bigdl2-venv/lib/python3.7/site-packages/ray/tune/trainable.py", line 255, in train_buffered
    result = self.train()
  File "/dfs/10/yarn/nm/usercache/dw/appcache/application_1639751979565_40518/container_e55_1639751979565_40518_01_000010/bigdl2-venv.zip/bigdl2-venv/lib/python3.7/site-packages/ray/tune/trainable.py", line 314, in train
    result = self.step()
  File "/dfs/10/yarn/nm/usercache/dw/appcache/application_1639751979565_40518/container_e55_1639751979565_40518_01_000010/bigdl2-venv.zip/bigdl2-venv/lib/python3.7/site-packages/ray/tune/function_runner.py", line 381, in step
    self._report_thread_runner_error(block=True)
  File "/dfs/10/yarn/nm/usercache/dw/appcache/application_1639751979565_40518/container_e55_1639751979565_40518_01_000010/bigdl2-venv.zip/bigdl2-venv/lib/python3.7/site-packages/ray/tune/function_runner.py", line 532, in _report_thread_runner_error
    ("Trial raised an exception. Traceback:\n{}".format(err_tb_str)
ray.tune.error.TuneError: Trial raised an exception. Traceback:
ray::ImplicitFunc.train_buffered() (pid=1094313, ip=10.154.180.24, repr=<types.ImplicitFunc object at 0x7f337849a160>)
  File "/dfs/10/yarn/nm/usercache/dw/appcache/application_1639751979565_40518/container_e55_1639751979565_40518_01_000010/bigdl2-venv.zip/bigdl2-venv/lib/python3.7/site-packages/ray/tune/function_runner.py", line 262, in run
    self._entrypoint()
  File "/dfs/10/yarn/nm/usercache/dw/appcache/application_1639751979565_40518/container_e55_1639751979565_40518_01_000010/bigdl2-venv.zip/bigdl2-venv/lib/python3.7/site-packages/ray/tune/function_runner.py", line 331, in entrypoint
    self._status_reporter.get_checkpoint())
  File "/ads_storage/udap/ray/bigdl2-venv/lib/python3.7/site-packages/ray/tune/function_runner.py", line 597, in _trainable_func
  File "/ads_storage/udap/ray/lib/bigdl-orca-spark_2.4.6-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/automl/search/ray_tune/ray_tune_search_engine.py", line 360, in train_func
  File "/dfs/10/yarn/nm/usercache/dw/appcache/application_1639751979565_40518/container_e55_1639751979565_40518_01_000010/bigdl-orca-spark_2.4.6-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/automl/search/utils.py", line 72, in put_ckpt_hdfs
    if remote_ckpt_basename not in get_remote_list(remote_dir):
  File "/dfs/10/yarn/nm/usercache/dw/appcache/application_1639751979565_40518/container_e55_1639751979565_40518_01_000010/bigdl-orca-spark_2.4.6-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/automl/search/utils.py", line 47, in get_remote_list
    s_output, _ = process(args)
  File "/dfs/10/yarn/nm/usercache/dw/appcache/application_1639751979565_40518/container_e55_1639751979565_40518_01_000010/bigdl-orca-spark_2.4.6-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/automl/search/utils.py", line 38, in process
    raise Exception(err)
Exception: /bin/sh: hdfs: command not found
/bin/sh: awk: command not found
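
For context, the failing call is a shell invocation along these lines; this is a minimal illustrative reconstruction of the process helper in bigdl/orca/automl/search/utils.py, not the exact source:

import subprocess

def process(command):
    # Runs e.g. "hdfs dfs -ls <remote_dir>" (piped through awk) via /bin/sh.
    # If hdfs or awk is not on the worker's $PATH, /bin/sh reports
    # "command not found" on stderr and we surface it as an exception.
    proc = subprocess.Popen(command, shell=True,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    if proc.returncode != 0:
        raise Exception(err.decode())
    return out.decode(), err.decode()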

One workaround is to add the hdfs directory to spark.executorEnv.PATH, e.g. --conf spark.executorEnv.PATH=$PATH:/opt/cloudera/parcels/CDH/bin. However, directly changing the executor PATH may be risky.

Solution

  1. Expose an environment variable, e.g. HDFS_PATH, for users to specify where to find hdfs. Internally, we should first check whether HDFS_PATH has been set, then fall back to the environment variables commonly used for the hdfs path, e.g. HDFS_HOME.... In the end, invoke hdfs by its absolute path (see the first sketch below).
  2. Use pyarrow internally instead of executing the hdfs command in a subprocess (see the second sketch below).
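
A minimal sketch of option 1, assuming the proposed HDFS_PATH override (that variable and the helper name resolve_hdfs_binary are illustrative, not existing BigDL API):

import os
import shutil

def resolve_hdfs_binary():
    # Return an absolute path to the hdfs command instead of relying on $PATH.
    # 1. Explicit user override proposed above (hypothetical variable name).
    override = os.environ.get("HDFS_PATH")
    if override:
        return os.path.join(override, "hdfs")
    # 2. Commonly used Hadoop environment variables.
    for var in ("HDFS_HOME", "HADOOP_HOME"):
        home = os.environ.get(var)
        if home:
            candidate = os.path.join(home, "bin", "hdfs")
            if os.path.isfile(candidate):
                return candidate
    # 3. Fall back to whatever is already on $PATH, if anything.
    found = shutil.which("hdfs")
    if found:
        return found
    raise FileNotFoundError("Cannot locate hdfs; set HDFS_PATH or HADOOP_HOME.")

And a sketch of option 2 using pyarrow's HadoopFileSystem, which talks to HDFS through libhdfs rather than the hdfs CLI (note libhdfs needs HADOOP_HOME and the Hadoop CLASSPATH set, so an environment requirement remains, just a different one). get_remote_list mirrors the helper from the traceback above:

from pyarrow import fs

def get_remote_list(remote_dir):
    # host="default" resolves the namenode from the Hadoop configuration (fs.defaultFS).
    hdfs = fs.HadoopFileSystem(host="default")
    infos = hdfs.get_file_info(fs.FileSelector(remote_dir))
    return [info.base_name for info in infos]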
jason-dai commented 2 years ago

What if there is no HDFS?

shanyu-sys commented 2 years ago

What if there is no HDFS?

I guess we could try using Ray Tune syncing; it supports distributed checkpointing with a shared directory (e.g. NFS), cloud storage (e.g. S3 or GS), or HDFS.
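
A rough sketch of how that could look, assuming the tune.SyncConfig API from the Ray 1.x line that the traceback above suggests; the trainable and the upload URI are placeholders:

from ray import tune

def trainable(config):
    # Dummy trainable that just reports a metric per step.
    for step in range(3):
        tune.report(score=config["x"] * step)

# With upload_dir set, Tune syncs trial results/checkpoints to remote storage
# itself, so the workers would no longer need the hdfs CLI on $PATH.
tune.run(
    trainable,
    config={"x": tune.uniform(0, 1)},
    sync_config=tune.SyncConfig(upload_dir="s3://my-bucket/tune-results"),
)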