ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
33.6k stars 5.71k forks source link

[data] support passing a FileSystem factory function for read APIs. #37484

Open raulchen opened 1 year ago

raulchen commented 1 year ago

Today, we can pass a pyarrow.fs.FileSystem to the read_xxx APIs. However, some FileSystem objects have native code and doesn't work well with Python serialization. We worked around pa.fs.S3FileSystem with a custom wrapper class _S3FileSystemWrapper. But this workaround is not extensible for other user-defined FileSystems. We should allow users to pass a factory function that produces a FileSystem object to prevent this issue.

bveeramani commented 1 year ago

@raulchen would it be possible to extend _S3FileSystemWrapper to work with arbitrary file systems?

raulchen commented 1 year ago

@raulchen would it be possible to extend _S3FileSystemWrapper to work with arbitrary file systems?

@bveeramani There is an issue that cannot be solved with the current approach. The problem is that the FileSystem object itself is serializable. but when it get deserialized on the read_task worker, memory address of some native code gets messed up, and causes segfault. Currently, this issue is worked around by delaying the construction of the actual FileSystem object with a custom wrapper that only carries the constructor arguments.

bveeramani commented 1 year ago

I see. Makes sense