uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Apache License 2.0

OSError: Passed non-file path when passing an s3 path to materialize_dataset #645

Closed: Jomonsugi closed this issue 3 years ago

Jomonsugi commented 3 years ago

This issue could be related to this one; however, I am attempting to use materialize_dataset with an S3 path, not HDFS.

Given output_url='s3://my-bucket/petastorm/test', the following error occurs:

An error was encountered:
Passed non-file path: my-bucket/petastorm/test
Traceback (most recent call last):
  File "/usr/lib/environs/e-a-2019.03-py-3.7.3/lib/python3.7/site-packages/pyarrow/parquet.py", line 1182, in __init__
    open_file_func=partial(_open_dataset_file, self._metadata)
  File "/usr/lib/environs/e-a-2019.03-py-3.7.3/lib/python3.7/site-packages/pyarrow/parquet.py", line 1377, in _make_manifest
    .format(path))
OSError: Passed non-file path: my-bucket/petastorm/test
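For reference, the failing call looks roughly like this; a minimal sketch, where spark, MySchema (a petastorm Unischema), df, and the row group size are assumptions rather than code from the report:

from petastorm.etl.dataset_metadata import materialize_dataset

output_url = 's3://my-bucket/petastorm/test'

# Only output_url comes from the report above; everything else is a
# placeholder for whatever the session actually defines.
with materialize_dataset(spark, output_url, MySchema, 256):
    df.write.mode('overwrite').parquet(output_url)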

Although I can see that the error is traced to _make_manifest, seemingly when ParquetDataset is instantiated here, I haven't been able to figure out the root cause.
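For what it's worth, _make_manifest raises this OSError when the filesystem it is handed reports the path as neither a directory nor a file. A hypothetical way to exercise the same code path in isolation, using the legacy pyarrow API shown in the traceback (constructing the s3fs filesystem explicitly is an assumption, not something from the issue):

import s3fs
import pyarrow.parquet as pq

fs = s3fs.S3FileSystem()

# Same bucket-relative path that appears in the error message; if the
# filesystem cannot classify it, ParquetDataset raises the OSError above.
dataset = pq.ParquetDataset('my-bucket/petastorm/test', filesystem=fs)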

Strangely, while I was scoping out the use of petastorm over the past couple of weeks, with multiple iterations of materialize_dataset, it worked without a hitch; then this error popped up seemingly out of the blue. I found others hitting the same problem here, but thus far nothing that shakes the error.

Jomonsugi commented 3 years ago

I was able to "solve" this problem by writing a Spark dataframe in parquet format outside of materialize_dataset, and then running materialize_dataset with nothing but a print statement in its body so that the _common_metadata file is added.
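The first step is just a plain parquet write; a sketch, with df and output_url assumed from the surrounding session:

df.write.mode('overwrite').parquet(output_url)

The second step reopens the same location purely to add the metadata: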

from petastorm.etl.dataset_metadata import materialize_dataset

def add_petastorm_metadata():
    # The body is effectively a no-op; exiting the context manager is
    # what writes the _common_metadata file at the dataset root.
    with materialize_dataset(spark,
                             output_url,
                             schema,
                             row_group_size_mb):
        print('_common_metadata file added')

add_petastorm_metadata()

As for the root cause of s3fs throwing the error within the materialize_dataset context manager, I have no idea.
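One thing that might narrow it down (a diagnostic sketch under the assumption of an fsspec-based s3fs, not something from the issue): check how s3fs classifies the path, since _make_manifest raises exactly when the path is reported as neither a directory nor a file, and check whether the library versions drifted since the runs that worked:

import pyarrow
import s3fs

fs = s3fs.S3FileSystem()
path = 'my-bucket/petastorm/test'

# If both of these print False, pyarrow's _make_manifest will raise
# the 'Passed non-file path' OSError seen above.
print(fs.isdir(path))
print(fs.isfile(path))

# Version drift is one plausible explanation for an error that appears
# out of the blue after weeks of working runs.
print(pyarrow.__version__, s3fs.__version__)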