uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Apache License 2.0

`parquet file size 0 bytes` when materializing dataset #671

Open ckchow opened 3 years ago

ckchow commented 3 years ago

I'm trying out petastorm on a Google Dataproc cluster, and when I try to materialize a dataset like the one below:

import numpy as np

from petastorm.codecs import NdarrayCodec
from petastorm.etl.dataset_metadata import materialize_dataset
from petastorm.unischema import Unischema, UnischemaField, dict_to_spark_row

Schema = Unischema('Schema', [
    UnischemaField('features', np.float32, (310,), NdarrayCodec(), False)
])

def make_dataset(output_uri):
    rowgroup_size_mb = 256
    with materialize_dataset(spark, output_uri, Schema, rowgroup_size_mb, use_summary_metadata=True):
        rows_rdd = blah.select('features').limit(1000) \
            .rdd \
            .map(lambda x: {'features': np.array(x['features'].toArray(), dtype=np.float32)}) \
            .map(lambda x: dict_to_spark_row(Schema, x))

        spark.createDataFrame(rows_rdd, Schema.as_spark_schema()) \
            .write \
            .mode('overwrite') \
            .parquet(output_uri)

I get pyarrow errors like "ArrowInvalid: Parquet file size is 0 bytes" when executing the above against a Google Storage URI like "gs://bucket/path/petastorm". Can anybody tell whether this is a petastorm issue, a pyarrow issue, or something else entirely?

library versions:

fsspec==0.9.0
gcsfs==0.8.0
petastorm==0.10.0
pyarrow==0.17.1
pyspark==3.1.1 (editable install, no version control)

The files are created and appear to be valid when inspected with the regular Spark parquet reader. Trying to open the files with a petastorm reader via

from petastorm import make_reader

with make_reader("gs://bucket/petastorm") as reader:
    pass

causes the same ArrowInvalid: Parquet file size is 0 bytes error, and I assume it has the same root cause, whatever that is.
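One quick check is to list the objects under the output directory and look for a zero-byte entry (such as a _SUCCESS marker or an empty metadata sidecar), since that is the usual trigger for this ArrowInvalid error. A minimal sketch, assuming the gcsfs version listed above and a placeholder bucket/petastorm path:

import gcsfs

# List every object under the materialized dataset and print its reported size;
# a 0-byte entry (e.g. _SUCCESS or an empty metadata file) would explain the
# "Parquet file size is 0 bytes" error.
fs = gcsfs.GCSFileSystem()
for entry in fs.ls('bucket/petastorm', detail=True):
    print(entry['name'], entry['size'])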

ckchow commented 3 years ago

raw = spark.read.parquet('gs://bucket/petastorm') yields

[Row(features=bytearray(b"\x93NUMPY\x01\x00v\x00{\'descr\': \'<f4\', \'fortran_order\': False, \'shape\': (310,), }                                                          \n\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x <...>

which looks like the NdarrayCodec worked.
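That \x93NUMPY prefix is the standard .npy magic header, which is what NdarrayCodec serializes. A quick sketch to confirm the payload decodes cleanly, assuming raw is the dataframe read above:

import io
import numpy as np

# Decode the first row's payload the way NdarrayCodec would on read:
# the bytes are a regular .npy buffer, so np.load reconstructs the array.
first = raw.first()
arr = np.load(io.BytesIO(bytes(first['features'])))
print(arr.dtype, arr.shape)  # expected: float32 (310,)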

selitvin commented 3 years ago

Would running this locally, writing either to local fs or to gs, work with the same code?
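For reference, a self-contained version of that local check could look like the sketch below. Assumptions: a local[2] Spark session, synthetic random features standing in for the blah dataframe, and a file:///tmp output path.

import numpy as np
from pyspark.sql import SparkSession

from petastorm import make_reader
from petastorm.codecs import NdarrayCodec
from petastorm.etl.dataset_metadata import materialize_dataset
from petastorm.unischema import Unischema, UnischemaField, dict_to_spark_row

TestSchema = Unischema('TestSchema', [
    UnischemaField('features', np.float32, (310,), NdarrayCodec(), False)
])

def local_smoke_test(output_uri='file:///tmp/petastorm_smoke_test'):
    spark = SparkSession.builder.master('local[2]').appName('petastorm-smoke').getOrCreate()
    with materialize_dataset(spark, output_uri, TestSchema, 256, use_summary_metadata=True):
        rows_rdd = spark.sparkContext.parallelize(range(1000)) \
            .map(lambda _: {'features': np.random.rand(310).astype(np.float32)}) \
            .map(lambda row: dict_to_spark_row(TestSchema, row))
        spark.createDataFrame(rows_rdd, TestSchema.as_spark_schema()) \
            .write.mode('overwrite').parquet(output_uri)

    # Read the materialized dataset back; if this works locally but fails on gs://,
    # the problem is more likely in the gs filesystem layer than in petastorm itself.
    with make_reader(output_uri) as reader:
        print(next(reader).features.shape)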

ckchow commented 3 years ago

When using an in-memory local cluster:

This feels like it's related to https://stackoverflow.com/questions/58646728/pyarrow-lib-arrowioerror-invalid-parquet-file-size-is-0-bytes, but I get the same size 0 issue even if I disable _SUCCESS files and don't write metadata.
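For completeness, the knobs referred to above are Hadoop/Parquet settings rather than petastorm options; a sketch of how they are typically set on an active session:

# Hadoop-level settings that suppress the _SUCCESS marker and Parquet summary
# metadata files when Spark writes the dataset.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set('mapreduce.fileoutputcommitter.marksuccessfuljobs', 'false')
hadoop_conf.set('parquet.enable.summary-metadata', 'false')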


More strange details:

with make_reader('gs://bucket/blah/part-00000-629b3fcc-6ee4-40b0-8b84-31513272952f-c000.snappy.parquet') as reader:
    pass

yields OSError: Passed non-file path: bucket/blah/petastorm/part-00000-629b3fcc-6ee4-40b0-8b84-31513272952f-c000.snappy.parquet


with make_reader('gs://churn_dev/v4/predict/2021-04-28/petastorm/') as reader:
    pass

yields ArrowInvalid: Parquet file size is 0 bytes

selitvin commented 3 years ago

Tried reproducing your issue using the examples/hello_world/petastorm_dataset/ scripts against gs storage. I was not able to reproduce the ArrowInvalid exception. I did observe a misleading OSError: Passed non-file path exception when in fact there were some permission issues accessing the bucket.

Can you try opening that parquet store using pyarrow (without petastorm)? That way we would know whether the problem stems from the software layers underneath petastorm or from petastorm itself.
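A minimal way to try that pyarrow-only read, assuming the gcsfs filesystem object can be handed to ParquetDataset and using a placeholder bucket/petastorm path:

import gcsfs
import pyarrow.parquet as pq

# Open the store with pyarrow directly, bypassing petastorm, to see at which
# layer "Parquet file size is 0 bytes" is raised.
fs = gcsfs.GCSFileSystem()
dataset = pq.ParquetDataset('bucket/petastorm', filesystem=fs)
table = dataset.read()
print(table.num_rows)
print(table.schema)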