uber / petastorm

Petastorm is a library that enables single-machine or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Apache License 2.0

walk method in GCSFSWrapper returns an empty string as one of the filenames #558

Open alekswithakayy opened 4 years ago

alekswithakayy commented 4 years ago

To reproduce:

import gcsfs
from petastorm.gcsfs_helpers.gcsfs_wrapper import GCSFSWrapper
path = "gs://your/bucket/path"
fs = GCSFSWrapper(gcsfs.GCSFileSystem())
_, directories, files = next(fs.walk(path))
print(files)
# prints ['', 'file1', 'file2'] -- note the spurious empty first entry
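
A caller-side workaround is to filter the empty entry out of the walk results before anything downstream tries to open it. A minimal sketch, assuming only that walk yields (root, directories, files) tuples as above; the walk_without_empty helper is hypothetical, not part of petastorm:

import gcsfs
from petastorm.gcsfs_helpers.gcsfs_wrapper import GCSFSWrapper

def walk_without_empty(fs, path):
    # Drop the spurious empty filename before anything downstream
    # (e.g. pyarrow) tries to open it as a real file.
    for root, directories, files in fs.walk(path):
        yield root, directories, [name for name in files if name]

fs = GCSFSWrapper(gcsfs.GCSFileSystem())
_, directories, files = next(walk_without_empty(fs, "gs://your/bucket/path"))
print(files)
# ['file1', 'file2']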

This becomes a problem in petastorm.utils.add_to_dataset_metadata, which contains the following line:

arrow_metadata = compat_get_metadata(dataset.pieces[0], dataset.fs.open)

The empty string ends up as pieces[0], and pyarrow ultimately throws the following error, since an empty string is not a valid filename:

Traceback (most recent call last):                                              
  File "build_petastorm_dataset.py", line 103, in <module>
    run(args)
  File "build_petastorm_dataset.py", line 79, in run
    .parquet(args.output_url)
  File "/opt/conda/default/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/opt/conda/default/lib/python3.6/site-packages/petastorm/etl/dataset_metadata.py", line 113, in materialize_dataset
    _generate_unischema_metadata(dataset, schema)
  File "/opt/conda/default/lib/python3.6/site-packages/petastorm/etl/dataset_metadata.py", line 206, in _generate_unischema_metadata
    utils.add_to_dataset_metadata(dataset, UNISCHEMA_KEY, serialized_schema)
  File "/opt/conda/default/lib/python3.6/site-packages/petastorm/utils.py", line 115, in add_to_dataset_metadata
    arrow_metadata = compat_get_metadata(dataset.pieces[0], dataset.fs.open)
  File "/opt/conda/default/lib/python3.6/site-packages/petastorm/compat.py", line 31, in compat_get_metadata
    arrow_metadata = piece.get_metadata()
  File "/opt/conda/default/lib/python3.6/site-packages/pyarrow/parquet.py", line 676, in get_metadata
    f = self.open()
  File "/opt/conda/default/lib/python3.6/site-packages/pyarrow/parquet.py", line 683, in open
    reader = self.open_file_func(self.path)
  File "/opt/conda/default/lib/python3.6/site-packages/pyarrow/parquet.py", line 1054, in _open_dataset_file
    buffer_size=dataset.buffer_size
  File "/opt/conda/default/lib/python3.6/site-packages/pyarrow/parquet.py", line 210, in __init__
    read_dictionary=read_dictionary, metadata=metadata)
  File "pyarrow/_parquet.pyx", line 1023, in pyarrow._parquet.ParquetReader.open
  File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Parquet file size is 0 bytes
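
My guess at the root cause is the GCS placeholder object for the directory itself: when walk relativizes each listed object path against the directory prefix, the placeholder relativizes to ''. A sketch of the filtering that would avoid it; the _relative_names helper and the listing format are assumptions for illustration, not the actual GCSFSWrapper internals:

def _relative_names(directory, object_paths):
    # Strip the directory prefix from each listed object path and drop
    # the empty name produced by the directory's own placeholder object.
    names = [path[len(directory):].lstrip("/") for path in object_paths]
    return [name for name in names if name]

object_paths = [
    "your/bucket/path/",       # directory placeholder -> relativizes to ''
    "your/bucket/path/file1",
    "your/bucket/path/file2",
]
print(_relative_names("your/bucket/path", object_paths))
# ['file1', 'file2']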

@megaserg @selitvin

megaserg commented 4 years ago

Yes, I recently realized the version I merged was full of bugs. I've fixed it; let me upstream the patch.

alekswithakayy commented 4 years ago

@megaserg any updates on this? Willing to help if needed...