uber / petastorm

Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.
Apache License 2.0

Replaced s3 and gcs connectors with fsspec to support additional filesystems #665

Closed tgaddair closed 3 years ago

tgaddair commented 3 years ago

fsspec is a library used by Pandas and Dask to read from and write to arbitrary filesystems (the fsspec documentation lists the filesystems supported out of the box).

fsspec is designed to interoperate with pyarrow filesystems. As such, any filesystem created by fsspec can be passed to pyarrow's ParquetDataset to read a dataset.

This PR replaces the existing S3 and GCS filesystem implementations in Petastorm by delegating to fsspec, which also adds support for similar object storage filesystems such as ADLS Gen2 (the Microsoft Azure equivalent). It also replaces s3_config_kwargs with the more general storage_options parameter used by Pandas and Dask.
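For readers unfamiliar with the pattern, here is a minimal sketch of what delegating to fsspec can look like. It is not the code in this PR: `protocol_of` and `open_parquet_dataset` are hypothetical names used only for illustration, and the sketch assumes `fsspec` and `pyarrow` are installed, along with the backend for the target scheme (for example `s3fs` for `s3://`).

```python
from urllib.parse import urlparse


def protocol_of(dataset_url):
    """Return the URL scheme; plain paths without a scheme map to 'file'."""
    # Hypothetical helper, shown only to make the dispatch visible.
    return urlparse(dataset_url).scheme or "file"


def open_parquet_dataset(dataset_url, storage_options=None):
    """Open a Parquet dataset through an fsspec filesystem (illustrative sketch)."""
    # Imported lazily so protocol_of() is usable without these packages.
    from fsspec.core import url_to_fs
    import pyarrow.parquet as pq

    # url_to_fs picks the filesystem implementation from the URL scheme and
    # forwards storage_options to that implementation's constructor.
    fs, path = url_to_fs(dataset_url, **(storage_options or {}))

    # pyarrow accepts the fsspec filesystem via the filesystem argument.
    return pq.ParquetDataset(path, filesystem=fs)
```

A call such as `open_parquet_dataset("s3://some-bucket/dataset", storage_options={"anon": True})` would then pass `anon=True` to the S3 backend; the bucket name here is a placeholder. Because the options are a plain dictionary keyed by backend, the same entry point serves S3, GCS, and ADLS without a per-provider argument such as s3_config_kwargs.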

codecov[bot] commented 3 years ago

Codecov Report

Merging #665 (f9ddbd3) into master (e89a7fe) will increase coverage by 0.55%. The diff coverage is 90.90%.


@@            Coverage Diff             @@
##           master     #665      +/-   ##
==========================================
+ Coverage   85.23%   85.79%   +0.55%     
==========================================
  Files          85       84       -1     
  Lines        4985     4928      -57     
  Branches      792      779      -13     
==========================================
- Hits         4249     4228      -21     
+ Misses        596      561      -35     
+ Partials      140      139       -1     
Impacted Files                       Coverage Δ
petastorm/reader.py                  89.67% <ø> (ø)
setup.py                             0.00% <ø> (ø)
petastorm/fs_utils.py                91.46% <90.00%> (-0.29%)
petastorm/etl/dataset_metadata.py    87.33% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update e89a7fe...f9ddbd3.

tgaddair commented 3 years ago

Thanks for taking a look @selitvin! I'm testing this with the following OS:

And the following filesystems:

I would suggest holding off on merging (if it looks good) until Monday, as I still want to run a few more tests this weekend to make sure everything is correct.

tgaddair commented 3 years ago

Hey @selitvin, thanks. I verified that all of the environments worked as expected. Please feel free to merge whenever you're ready. Also, creating a new release would be awesome.