Open krfricke opened 1 year ago
same error here
Any update? same error
Same error
Any update? I'm hitting a similar error.
Any update?
So far I've found a way to bypass this error. It seems the Ray framework cannot work properly with the default JNI-based `libhdfs` (i.e. `$HADOOP_HOME/lib/native/libhdfs.so` on Linux), so I tried a native Golang client (GitHub repo: HDFS for Go) to replace the default API. I also implemented an FSSpec interface based on Pydoop (GitHub repo: Pydoop FSSpec Interface) to replace the `pyarrow.fs.HadoopFileSystem` module, because Arrow always raises the following error in my use case (I need `ray.data.read_parquet` to read a 700 MB Parquet file from HDFS and `ray.train.RunConfig` to set the persistent storage in HDFS for model artifacts):
/arrow/cpp/src/arrow/status.cc:155: Failed to disconnect hdfs client: IOError: HDFS hdfsFS::Disconnect failed.
Here are my tests:
Environment: a Ray cluster on Kubernetes deployed by KubeRay.
Tasks: benchmarks of XGBoost model training and prediction using `ray.train` and `ray.data`.
Combinations:
1. `pyarrow` + Java version of `libhdfs` = IOError :no_entry: + Core Dump :rage:
2. `pydoopfsspec` + Java version of `libhdfs` = Core Dump (same as test 1) :sob:
3. `pydoopfsspec` + Golang version of `libhdfs` = Job Succeeded :satisfied:
Dumb inference according to the above equations:
- `pyarrow` leads to the IOError I encountered
- the core dump is caused by the incompatibility between Ray and the JVM of Hadoop's Java fs code.
More info for wise men who are using `ray.train` and `ray.data` and encountering the same errors:
1. Build the `libhdfs` binary according to HDFS for Go and install the Pydoop FSSpec Interface.
2. Set the following environment variables:
   export LD_PRELOAD="path/to/your/golang/libhdfs.so"
   export HADOOP_USER_NAME=<your HDFS username>
   export LIBHDFS_DEFAULT_FS=<HDFS host>:<HDFS port> (e.g. "hdfs:8020")
   export LIBHDFS_DEFAULT_USER=<your HDFS username>
   export HADOOP_CONF_DIR="/path/to/your/config/dir"
3. Try the following code:
import ray
from ray import data
from ray.train import RunConfig, ScalingConfig


def train(
    framework: str, data_path: str, num_workers: int, cpus_per_worker: int
) -> ray.train.Result:
    import fsspec
    import pyarrow.fs
    from pydoopfsspec import HadoopFileSystem

    # Register the Pydoop-based fsspec filesystem and wrap it so Ray/Arrow
    # use it instead of the JNI-based pyarrow.fs.HadoopFileSystem.
    fsspec.register_implementation("pydoop", HadoopFileSystem)
    hdfs = fsspec.filesystem("pydoop")
    fs = pyarrow.fs.PyFileSystem(pyarrow.fs.FSSpecHandler(hdfs))

    ds = data.read_parquet(data_path, filesystem=fs)

    # _FRAMEWORK_PARAMS maps a framework name to its trainer class and
    # parameters (defined elsewhere in the benchmark script).
    framework_params = _FRAMEWORK_PARAMS[framework]
    trainer_cls = framework_params["trainer_cls"]
    trainer = trainer_cls(
        params=framework_params["params"],
        scaling_config=ScalingConfig(
            num_workers=num_workers,
            resources_per_worker={"CPU": cpus_per_worker},
            trainer_resources={"CPU": 0},
        ),
        label_column="Label",
        datasets={"train": ds},
        run_config=RunConfig(
            storage_filesystem=fs,
            storage_path="<path to your persistent storage>",
            name=f"{framework}_benchmark",
        ),
    )
    result = trainer.fit()
    return result
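For completeness, a hypothetical invocation; the framework key, HDFS path, and worker counts below are placeholders, not values from the benchmark:

```python
# Placeholder values; adjust to your cluster and your _FRAMEWORK_PARAMS entries.
result = train(
    framework="xgboost",
    data_path="hdfs://namenode:8020/datasets/train.parquet",
    num_workers=4,
    cpus_per_worker=2,
)
print(result.metrics)
```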
Same error here. Is it possible to handle or fix this in ray[data]?
When using pyarrow outside Ray, I create a separate process to connect to HDFS. This allows me to completely isolate the HDFS logic from Ray's usage, which solves the conflict between the two.
@tanmc123 Could you please explain a bit more about your solution? How did you isolate the HDFS logic from Ray's usage? I would appreciate it!
I did something like this:
```python
import multiprocessing as mp
import os
import traceback

# The methods below belong to an HDFS utility wrapper class: `self.client` is the
# underlying HDFS filesystem client, and HDFSSubProcessException / HDFSUtilException
# are custom exceptions defined elsewhere.

def _run_in_subprocess(self, target, *args, **kwargs):
    """Run the target function in an isolated process and propagate errors."""
    result_queue = mp.Queue()
    p = mp.Process(target=target, args=args + (result_queue,), kwargs=kwargs)
    p.start()
    p.join()

    if not result_queue.empty():
        result = result_queue.get()
        if isinstance(result, Exception):
            main_process_trace = traceback.format_stack()
            main_process_trace = "".join(main_process_trace)
            raise HDFSSubProcessException(result, main_process_trace)
        return result

def _upload(self, local_path, hdfs_path, overwrite, result_queue):
    try:
        if self.exist(path=hdfs_path) and not overwrite:
            raise HDFSUtilException(f"File {hdfs_path} already exists")
        hdfs_dir = os.path.dirname(hdfs_path)
        if not self.exist(path=hdfs_dir):
            self.create_dir(path=hdfs_dir)
        if overwrite and self.exist(path=hdfs_path):
            self.delete(path=hdfs_path, recursive=True)
            self.logger.info(f"Old file {hdfs_path} deleted for overwrite")
        with open(local_path, 'rb') as local_file:
            with self.client.open_output_stream(hdfs_path) as f:
                f.write(local_file.read())
        self.logger.info(f"File {local_path} uploaded to {hdfs_path}")
        result_queue.put(None)
    except Exception as e:
        result_queue.put(e)

def upload(self, **kwargs):
    hdfs_path = kwargs.get("hdfs_path", "")
    local_path = kwargs.get("local_path", "")
    overwrite = kwargs.get("overwrite", False)
    self._run_in_subprocess(self._upload, local_path, hdfs_path, overwrite)
```
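A hypothetical usage sketch, assuming the methods above live on a wrapper class (called `HDFSUtil` here purely for illustration; the paths are placeholders):

```python
# Illustrative only: `HDFSUtil` stands in for the wrapper class holding the
# methods above; adjust the paths to your environment.
hdfs_util = HDFSUtil()
hdfs_util.upload(
    local_path="/tmp/checkpoint/model.pkl",
    hdfs_path="/user/me/checkpoints/model.pkl",
    overwrite=True,
)
```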
I located where the segfault occurred in my use case: the bug was caused by the threads spawned inside the `pyarrow.fs.copy_files()` function. I tested using a thread pool to concurrently upload each file in the directory as a whole rather than in chunks; the bug occurred far less often but still appeared (when I reduced the CPU limit for the Ray workers). Finally, I implemented a method that copies the files in a plain loop, and I haven't hit the segfault since. I only modified `python/ray/train/_internal/storage.py`; here is my modification: https://github.com/pwang697/ray/commit/9c766cea211369d091add43044169915abbabfa1
It's worth noting that I am using my customised Pydoop filesystem instance as shown in https://github.com/ray-project/ray/issues/36415#issuecomment-2241786012 , so I check for the `pyarrow.fs.PyFileSystem` instance type in line 138 of my code; I haven't tested this method on the default `pyarrow.fs.HadoopFileSystem`.
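For illustration, a minimal sketch of the loop-based copy idea (not the actual patch linked above; the function name, chunk size, and destination layout are assumptions), using only standard `pyarrow.fs` filesystem calls:

```python
import posixpath

import pyarrow.fs


def copy_files_sequentially(
    src: str,
    dst: str,
    source_fs: pyarrow.fs.FileSystem,
    dest_fs: pyarrow.fs.FileSystem,
    chunk_size: int = 4 * 1024 * 1024,
):
    """Copy every file under `src` to `dst` one at a time, avoiding the
    background threads spawned by pyarrow.fs.copy_files()."""
    selector = pyarrow.fs.FileSelector(src, recursive=True)
    for info in source_fs.get_file_info(selector):
        if info.type != pyarrow.fs.FileType.File:
            continue
        rel_path = posixpath.relpath(info.path, src)
        dest_path = posixpath.join(dst, rel_path)
        dest_fs.create_dir(posixpath.dirname(dest_path), recursive=True)
        # Stream the file contents in fixed-size chunks.
        with source_fs.open_input_stream(info.path) as in_stream, \
                dest_fs.open_output_stream(dest_path) as out_stream:
            while True:
                chunk = in_stream.read(chunk_size)
                if not chunk:
                    break
                out_stream.write(chunk)
```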
What happened + What you expected to happen
Using PyArrow fs with HDFS works fine outside a Ray session. However, after `ray.init()`, the same code results in a segmentation fault. Here is the log dump from Java: hs_err_pid9716.log
The segfault occurs almost every time, but not always.
It never occurs when Ray is not initialized. Thus there is probably some interference between the Ray session/global state and the Java/PyArrow/HDFS connection.
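A minimal sketch of the access pattern described (the HDFS host, port, and file path below are placeholders):

```python
import pyarrow.fs
import ray

# Reading via the JNI-based HDFS filesystem works on its own:
fs = pyarrow.fs.HadoopFileSystem(host="namenode", port=8020)
with fs.open_input_stream("/somewhere/data.parquet") as f:
    print(len(f.read()))

# ...but after initializing Ray in the same process, the same access
# pattern segfaults almost every time, per the report above:
ray.init()
fs = pyarrow.fs.HadoopFileSystem(host="namenode", port=8020)
with fs.open_input_stream("/somewhere/data.parquet") as f:
    print(len(f.read()))
```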
Versions / Dependencies
Ray latest master, hadoop 3.2.4, java openjdk version "1.8.0_362"
Reproduction script
./ci/env-install-hdfs.sh
/opt/hadoop-3.2.4/bin/hdfs dfs -put /tmp/somewhere hdfs://[host]:8020/somewhere
Issue Severity
High: It blocks me from completing my task.