Open ckchow opened 3 years ago
raw = spark.read.parquet('gs://bucket/petastorm')
yields
[Row(features=bytearray(b"\x93NUMPY\x01\x00v\x00{\'descr\': \'<f4\', \'fortran_order\': False, \'shape\': (310,), } \n\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x <...>
which looks like the NdArrayCodec worked.
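The bytes shown in the features column are in NumPy's .npy serialization format (the b"\x93NUMPY" magic header), which is what petastorm's NdArrayCodec emits. A minimal round-trip sketch, independent of petastorm, showing how such bytes can be produced and decoded:

```python
import io
import numpy as np

# Serialize an array the way NdArrayCodec does (NumPy .npy format),
# then decode the raw bytes you would see in a Spark Row.
arr = np.arange(4, dtype=np.float32)
buf = io.BytesIO()
np.save(buf, arr)
raw_bytes = buf.getvalue()          # begins with b"\x93NUMPY..."

decoded = np.load(io.BytesIO(raw_bytes))
print(decoded.dtype, decoded.shape)
```

This confirms the column holds valid NPY payloads even when read with the plain Spark parquet reader.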
Would running this locally and writing either to the local fs or to gs work with the same code?
When using an in-memory local cluster, a
file:///<blah>/temp/
URI works; the
gs://bucket/petastorm
URI as above does not. This feels like it's related to https://stackoverflow.com/questions/58646728/pyarrow-lib-arrowioerror-invalid-parquet-file-size-is-0-bytes, but I get the same size-0 issue even if I disable _SUCCESS files and don't write metadata.
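For reference, this is roughly how the _SUCCESS markers and parquet summary metadata were disabled (a configuration sketch, assuming a live SparkSession named spark; exact property names can vary by Spark/Hadoop version):

```python
# Suppress the _SUCCESS marker written by the Hadoop file output committer
spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
# Suppress parquet summary metadata files (_metadata / _common_metadata)
spark.conf.set("parquet.enable.summary-metadata", "false")
```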
More strange details
with make_reader('gs://bucket/blah/part-00000-629b3fcc-6ee4-40b0-8b84-31513272952f-c000.snappy.parquet') as reader:
pass
yields OSError: Passed non-file path: bucket/blah/petastorm/part-00000-629b3fcc-6ee4-40b0-8b84-31513272952f-c000.snappy.parquet
with make_reader('gs://churn_dev/v4/predict/2021-04-28/petastorm/') as reader:
pass
yields ArrowInvalid: Parquet file size is 0 bytes
Tried reproducing your issue using the examples/hello_world/petastorm_dataset/
scripts against gs storage. I was not able to reproduce the ArrowInvalid
exception. I did observe a misleading OSError: Passed non-file path
exception when in fact there were some permission issues accessing the bucket.
Can you try opening that parquet store using pyarrow directly (without petastorm)? That way, we would know whether the problem stems from software layers under petastorm or from petastorm itself.
I'm trying out petastorm on a Google Dataproc cluster, and when I try to materialize a dataset like the below, I get pyarrow errors like
ArrowInvalid: Parquet file size is 0 bytes
when executing the above on a Google Storage URI like "gs://bucket/path/petastorm". Can anybody tell if this is a petastorm issue, a pyarrow issue, or maybe something else?
Library versions:
The files are created and appear to be valid upon inspection with the regular Spark parquet reader. Trying to make a reader on the files via
causes the same error
ArrowInvalid: Parquet file size is 0 bytes
and I assume it's the same root cause, whatever it is.