uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Apache License 2.0

petastorm.make_reader from s3 bucket path fails #587

Open xb478 opened 4 years ago

xb478 commented 4 years ago

Versions: petastorm 0.9.4, pyarrow 1.0.1, python 3.6.9

I am trying to create a torch DataLoader via petastorm from a petastorm dataset stored on S3. For this purpose I ran the hello_world dataset generation locally and then uploaded the generated folder "hello_world_dataset" to my S3 bucket.

This is an "as simple as it gets" version, where I try to create a petastorm reader directly from the S3 bucket:

from petastorm import make_reader

S3_DATASET_URL = 's3:///hello_world_dataset'

with make_reader(S3_DATASET_URL) as reader:
    # Pure python
    for sample in reader:
        print(sample.id)

When I run this example I get the following output:

Traceback (most recent call last):
  File "python_hello_world.py", line 6, in <module>
    with make_reader(S3_DATASET_URL) as reader:
  File "/usr/local/lib/python3.6/dist-packages/petastorm/reader.py", line 138, in make_reader
    dataset_metadata.get_schema_from_dataset_url(dataset_url, hdfs_driver=hdfs_driver)
  File "/usr/local/lib/python3.6/dist-packages/petastorm/etl/dataset_metadata.py", line 395, in get_schema_from_dataset_url
    dataset = pq.ParquetDataset(path_or_paths, filesystem=fs, validate_schema=False, metadata_nthreads=10)
  File "/usr/local/lib/python3.6/dist-packages/pyarrow/parquet.py", line 1170, in __init__
    open_file_func=partial(_open_dataset_file, self._metadata)
  File "/usr/local/lib/python3.6/dist-packages/pyarrow/parquet.py", line 1348, in _make_manifest
    metadata_nthreads=metadata_nthreads)
  File "/usr/local/lib/python3.6/dist-packages/pyarrow/parquet.py", line 927, in __init__
    self._visit_level(0, self.dirpath, [])
  File "/usr/local/lib/python3.6/dist-packages/pyarrow/parquet.py", line 942, in _visit_level
    _, directories, files = next(fs.walk(base_path))
  File "/usr/local/lib/python3.6/dist-packages/pyarrow/filesystem.py", line 394, in walk
    for key in list(self.fs._ls(path, refresh=refresh)):
TypeError: 'coroutine' object is not iterable

When I try to access the parquet data directly with pq.ParquetDataset(f"s3://{bucket}/{path}", filesystem=fs), it works.

I also tried the S3FSWrapper examples shown here: https://stackoverflow.com/questions/56135465/python-reading-parquet-files-stored-on-s3-using-petastorm-generates-connection

but had no luck; it fails with the same error/traceback.
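
For reference, the direct pyarrow access that does work for me looks roughly like this (a sketch only; the bucket and prefix names are placeholders, and s3fs picks up AWS credentials from the environment):

import s3fs
import pyarrow.parquet as pq

# Placeholders for my actual bucket name and key prefix.
bucket = "my-bucket"
path = "hello_world_dataset"

# s3fs.S3FileSystem reads AWS credentials from the environment/credentials file.
fs = s3fs.S3FileSystem()

# Reading the same dataset directly through pyarrow succeeds.
dataset = pq.ParquetDataset(f"s3://{bucket}/{path}", filesystem=fs)
table = dataset.read()
print(table.num_rows)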

selitvin commented 4 years ago

Wasn't able to reproduce the failure; I tried pyarrow 0.15.1 and 1.0.1.

Here is what I did:

(.petastorm3.7) yevgeni@yevgeni-7530:~/uatc/dataset-toolkit$ git diff
diff --git a/examples/hello_world/petastorm_dataset/python_hello_world.py b/examples/hello_world/petastorm_dataset/python_hello_world.py
index eb0bb96..5cab7d6 100644
--- a/examples/hello_world/petastorm_dataset/python_hello_world.py
+++ b/examples/hello_world/petastorm_dataset/python_hello_world.py
@@ -20,7 +20,7 @@ from __future__ import print_function
 from petastorm import make_reader

-def python_hello_world(dataset_url='file:///tmp/hello_world_dataset'):
+def python_hello_world(dataset_url='s3://petastorm-test-storage/'):
     with make_reader(dataset_url) as reader:
         # Pure python
         for sample in reader:

and here is the content of my bucket:

$ aws s3 ls s3://petastorm-test-storage/
2020-08-26 18:10:09          0 _SUCCESS
2020-08-26 18:10:09       7438 _common_metadata
2020-08-26 18:10:11     725084 part-00000-0813a2ac-344c-4a6a-aae7-3414a12d3719-c000.snappy.parquet
2020-08-26 18:10:11     725084 part-00001-0813a2ac-344c-4a6a-aae7-3414a12d3719-c000.snappy.parquet

Do you see any difference between our setups?

xb478 commented 4 years ago

This is strange. My current setup runs in a Docker container (horovod/horovod:0.19.3-tf2.1.0-torch-mxnet1.6.0-py3.6-gpu), so to be sure I have control over what is installed, I repeated the experiment by setting up petastorm in a clean venv, this time with Python 3.8.2. The code is the same as above, except that I uploaded the parquet files generated by the hello world example directly to the root of the S3 bucket, just as you did. The result is the same as in my first message: TypeError: 'coroutine' object is not iterable

I wonder whether it is related to this: https://github.com/s3tools/s3cmd/issues/402

I am using AWS_REGION="eu-central-1", a region introduced in 2014 that therefore no longer supports the Signature Version 2 API. I also tested this with DragonDisk, which throws an error that it cannot authenticate because it lacks support for AWS4-HMAC-SHA256.

Could it be that this is also the case for s3 support via pyarrow in petastorm?
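
One way to rule that out (an assumption on my side that s3fs is the S3 client used under the hood; the bucket name is a placeholder) would be to force Signature Version 4 explicitly when creating the filesystem:

import s3fs

# config_kwargs is forwarded to botocore.client.Config, client_kwargs to the boto3 client.
fs = s3fs.S3FileSystem(
    client_kwargs={"region_name": "eu-central-1"},
    config_kwargs={"signature_version": "s3v4"},
)
print(fs.ls("my-bucket"))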

dmcguire81 commented 4 years ago

@selitvin this is related to the issue I opened with s3fs: between versions 0.4.2 and 0.5.1, it became inappropriate to wrap s3fs.S3FileSystem with an instance of pyarrow.filesystem.S3FSWrapper, because it now implements the relevant pyarrow.filesystem.FileSystem interface directly.

I don't think it's possible to fix this without making petastorm's transitive dependency on s3fs explicit, e.g. as an extra like petastorm[s3]. Then either the version needs to be pinned to <0.5, or the wrapper needs to be removed and the version pinned to >=0.5.
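
A hypothetical setup.py fragment illustrating that proposal (the extra name and version pins are illustrative only, not petastorm's actual packaging):

from setuptools import setup

setup(
    name="petastorm",
    # ... other arguments unchanged ...
    extras_require={
        # Option 1: keep the S3FSWrapper code path and pin below the async rewrite.
        "s3": ["s3fs>=0.4.2,<0.5"],
        # Option 2: drop the wrapper and require the new interface instead:
        # "s3": ["s3fs>=0.5"],
    },
)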

@xb478, pin to s3fs==0.4.2 in the meantime.

dmcguire81 commented 4 years ago

Filed #609.

westfly commented 1 year ago

I change s3::// prefix to s3a::// works