uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Apache License 2.0

OSError: Passed non-file path when passing an s3 path to materialize_dataset #645

Closed: Jomonsugi closed this issue 3 years ago

Jomonsugi commented 3 years ago

This issue could be related to this one; however, I am attempting to use materialize_dataset with an S3 path, not HDFS.

Given output_url='s3://my-bucket/petastorm/test', the following error occurs:

An error was encountered:
Passed non-file path: my-bucket/petastorm/test
Traceback (most recent call last):
  File "/usr/lib/environs/e-a-2019.03-py-3.7.3/lib/python3.7/site-packages/pyarrow/parquet.py", line 1182, in __init__
    open_file_func=partial(_open_dataset_file, self._metadata)
  File "/usr/lib/environs/e-a-2019.03-py-3.7.3/lib/python3.7/site-packages/pyarrow/parquet.py", line 1377, in _make_manifest
    .format(path))
OSError: Passed non-file path: my-bucket/petastorm/test
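For reference, the failing call looks roughly like this; a minimal sketch, where spark, MySchema (a petastorm Unischema), df, and the row group size are assumptions rather than code from the report:

from petastorm.etl.dataset_metadata import materialize_dataset

output_url = 's3://my-bucket/petastorm/test'

# Only output_url comes from the report above; everything else is a
# placeholder for whatever the session actually defines.
with materialize_dataset(spark, output_url, MySchema, 256):
    df.write.mode('overwrite').parquet(output_url)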

Although I can see that the error is traced to _make_manifest, seemingly when ParquetDataset is instantiated here, I haven't been able to figure out the root cause.
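For what it's worth, _make_manifest raises this OSError when the filesystem it is handed reports the path as neither a directory nor a file. A hypothetical way to exercise the same code path in isolation, using the legacy pyarrow API shown in the traceback (constructing the s3fs filesystem explicitly is an assumption, not something from the issue):

import s3fs
import pyarrow.parquet as pq

fs = s3fs.S3FileSystem()

# Same bucket-relative path that appears in the error message; if the
# filesystem cannot classify it, ParquetDataset raises the OSError above.
dataset = pq.ParquetDataset('my-bucket/petastorm/test', filesystem=fs)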

Strangely, while I was scoping out the use of petastorm over the past couple of weeks, with multiple iterations of materialize_dataset, it worked without a hitch; then this error popped up seemingly out of the blue. I found others hitting the same problem here, but thus far nothing that shakes the error.

Jomonsugi commented 3 years ago

I was able to "solve" this problem by writing a Spark dataframe in parquet format outside of materialize_dataset, and then running materialize_dataset with nothing but a print statement in its body so that the _common_metadata file is added.
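The first step is just a plain parquet write; a sketch, with df and output_url assumed from the surrounding session:

df.write.mode('overwrite').parquet(output_url)

The second step reopens the same location purely to add the metadata: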

from petastorm.etl.dataset_metadata import materialize_dataset

def add_petastorm_metadata():
    # The body is effectively a no-op; exiting the context manager is
    # what writes the _common_metadata file at the dataset root.
    with materialize_dataset(spark,
                             output_url,
                             schema,
                             row_group_size_mb):
        print('_common_metadata file added')

add_petastorm_metadata()

As for the root cause of s3fs throwing the error within the materialize_dataset context manager, I have no idea.
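One thing that might narrow it down (a diagnostic sketch under the assumption of an fsspec-based s3fs, not something from the issue): check how s3fs classifies the path, since _make_manifest raises exactly when the path is reported as neither a directory nor a file, and check whether the library versions drifted since the runs that worked:

import pyarrow
import s3fs

fs = s3fs.S3FileSystem()
path = 'my-bucket/petastorm/test'

# If both of these print False, pyarrow's _make_manifest will raise
# the 'Passed non-file path' OSError seen above.
print(fs.isdir(path))
print(fs.isfile(path))

# Version drift is one plausible explanation for an error that appears
# out of the blue after weeks of working runs.
print(pyarrow.__version__, s3fs.__version__)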