uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0

S3FSWrapper is deprecated as of s3fs 0.5.0 #609

Closed: dmcguire81 closed this issue 4 years ago

dmcguire81 commented 4 years ago

According to the s3fs maintainer, wrapping s3fs.S3FileSystem instances with pyarrow.filesystem.S3FSWrapper is deprecated, and even harmful (hence this defect). In other words, s3fs<0.5.0 requires that the wrapper be used, and s3fs>=0.5.0 requires that it not be used. Since petastorm can't do both at once, it should simply support s3fs>=0.5.0 and drop the wrapper.
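
For context, keeping both ranges working would mean gating on the installed s3fs version, roughly like the minimal sketch below (the helper name and the packaging-based version check are illustrative, not existing petastorm code):

import s3fs
from packaging import version
from s3fs import S3FileSystem


def make_s3_filesystem():
    fs = S3FileSystem()
    if version.parse(s3fs.__version__) < version.parse("0.5.0"):
        # Older s3fs only satisfies pyarrow's legacy filesystem interface
        # through the (now deprecated) S3FSWrapper.
        from pyarrow.filesystem import S3FSWrapper
        return S3FSWrapper(fs)
    # s3fs>=0.5.0 must be handed to pyarrow unwrapped; wrapping it hits
    # coroutines that are never awaited (see the repro below).
    return fs

Requiring s3fs>=0.5.0 and dropping the wrapper avoids having to carry that branch.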

dmcguire81 commented 4 years ago
import pyarrow.parquet as pq
from pyarrow.filesystem import S3FSWrapper
from s3fs import S3FileSystem

fs = S3FileSystem()
wrapped_fs = S3FSWrapper(fs)

dataset_url = "s3://some/small/partitioned/dataset"

try:
    # Old contract: s3fs<0.5.0 works when wrapped in pyarrow's S3FSWrapper.
    print("Trying with wrapper...")
    dataset = pq.ParquetDataset(dataset_url, filesystem=wrapped_fs, validate_schema=False)
    print("succeeded")
except TypeError:
    print("failed.")
    # New contract: s3fs>=0.5.0 must be passed to pyarrow unwrapped.
    print("Trying without wrapper...")
    dataset = pq.ParquetDataset(dataset_url, filesystem=fs, validate_schema=False)
    print("succeeded.")

With s3fs 0.4.2:

Trying with wrapper...
succeeded

With s3fs 0.5.0:

Trying with wrapper...
./env/lib/python3.7/site-packages/pyarrow/filesystem.py:394: RuntimeWarning: coroutine 'S3FileSystem._ls' was never awaited
  for key in list(self.fs._ls(path, refresh=refresh)):
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
failed.
Trying without wrapper...
succeeded.