Open raulchen opened 1 year ago
@raulchen would it be possible to extend `_S3FileSystemWrapper` to work with arbitrary file systems?
@bveeramani There is an issue that cannot be solved with the current approach. The FileSystem object itself is serializable, but when it gets deserialized on the read_task worker, a memory address in some native code gets corrupted, which causes a segfault. Currently, this is worked around by delaying construction of the actual FileSystem object, using a custom wrapper that only carries the constructor arguments.
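To make the workaround concrete, here is a minimal sketch of the "carry only the constructor arguments" idea. It is not the actual `_S3FileSystemWrapper` code: `FakeNativeFileSystem` and `LazyFileSystemWrapper` are hypothetical names, and the stand-in class simply refuses to be pickled so the example runs without pyarrow.

```python
import pickle


class FakeNativeFileSystem:
    """Hypothetical stand-in for a filesystem backed by native code."""

    def __init__(self, region):
        self.region = region

    def __reduce__(self):
        # Simulate an object that must not cross process boundaries.
        raise TypeError("FakeNativeFileSystem must not be pickled")


class LazyFileSystemWrapper:
    """Hypothetical wrapper: carries the constructor and its arguments only."""

    def __init__(self, fs_cls, **fs_kwargs):
        self._fs_cls = fs_cls
        self._fs_kwargs = fs_kwargs
        self._fs = None  # built on first use

    def unwrap(self):
        # Construct the real filesystem in whichever process calls this.
        if self._fs is None:
            self._fs = self._fs_cls(**self._fs_kwargs)
        return self._fs

    def __getstate__(self):
        # Drop any already-built filesystem so only the recipe is pickled.
        return {"_fs_cls": self._fs_cls, "_fs_kwargs": self._fs_kwargs, "_fs": None}


wrapper = LazyFileSystemWrapper(FakeNativeFileSystem, region="us-west-2")
wrapper.unwrap()  # building it locally does not change what gets pickled
restored = pickle.loads(pickle.dumps(wrapper))  # what a worker would receive
fs = restored.unwrap()  # the filesystem is constructed on the receiving side
```

The limitation raised in this thread still applies: a wrapper like this only helps if the filesystem can be rebuilt from picklable constructor arguments.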
I see. Makes sense
Today, we can pass a `pyarrow.fs.FileSystem` to the `read_xxx` APIs. However, some FileSystem objects are backed by native code and don't work well with Python serialization. We worked around this for `pa.fs.S3FileSystem` with a custom wrapper class, `_S3FileSystemWrapper`, but this workaround does not extend to other user-defined FileSystems. We should allow users to pass a factory function that produces a FileSystem object to prevent this issue.
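A rough sketch of what the factory-function proposal could look like from the caller's side. Everything here is an assumption for illustration: `filesystem_factory`, `read_task`, and `FakeFileSystem` are hypothetical names, not an existing API, and the stand-in class keeps the example runnable without pyarrow.

```python
import pickle
from functools import partial


class FakeFileSystem:
    """Hypothetical stand-in for a user-defined filesystem."""

    def __init__(self, endpoint):
        self.endpoint = endpoint

    def open_input_file(self, path):
        # A real filesystem would return a file handle; a string is enough here.
        return f"{self.endpoint}/{path}"


def read_task(path, filesystem_factory):
    """Hypothetical worker-side task: the filesystem is built where it is used."""
    fs = filesystem_factory()
    return fs.open_input_file(path)


# Only the factory (a class reference plus plain arguments) is serialized.
factory = partial(FakeFileSystem, endpoint="http://localhost:9000")
shipped = pickle.loads(pickle.dumps(factory))
result = read_task("bucket/data.parquet", shipped)
```

With a real filesystem, the factory might be something like `partial(pyarrow.fs.S3FileSystem, region="us-west-2")`. The point of the design is that the FileSystem object itself never goes through Python serialization.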