Closed: Jomonsugi closed this issue 3 years ago
I was able to "solve" this problem by writing a spark dataframe in parquet format outside of materialize_dataset
and then running the materialize_dataset
with nothing but a print statement so that the _common_metadata
file is added.
```python
def add_petastorm_metadata():
    # Entering and exiting materialize_dataset with no writes inside
    # still adds the petastorm metadata (_common_metadata) on exit.
    with materialize_dataset(spark,
                             output_url,
                             schema,
                             row_group_size_mb):
        print('_common_metadata file added')

add_petastorm_metadata()
```
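For anyone who wants the full picture, here is a minimal end-to-end sketch of the workaround. The one-field `Unischema`, the row contents, and the `row_group_size_mb` value are illustrative placeholders, not my actual job; only the two-step pattern (write the Parquet data first, then enter `materialize_dataset` with an effectively empty body) is the point:

```python
import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

from petastorm.codecs import ScalarCodec
from petastorm.etl.dataset_metadata import materialize_dataset
from petastorm.unischema import Unischema, UnischemaField, dict_to_spark_row

# Hypothetical one-field schema, a stand-in for the real one.
MySchema = Unischema('MySchema', [
    UnischemaField('id', np.int32, (), ScalarCodec(IntegerType()), False),
])

output_url = 's3://my-bucket/petastorm/test'
row_group_size_mb = 256  # illustrative value

spark = SparkSession.builder.getOrCreate()

# Step 1: write the Parquet data *outside* of materialize_dataset.
rows = [dict_to_spark_row(MySchema, {'id': i}) for i in range(100)]
spark.createDataFrame(rows, MySchema.as_spark_schema()) \
    .write.mode('overwrite').parquet(output_url)

# Step 2: enter materialize_dataset with nothing but a print statement;
# on exit it writes the petastorm metadata (the _common_metadata file)
# for the dataset that already exists at output_url.
with materialize_dataset(spark, output_url, MySchema, row_group_size_mb):
    print('_common_metadata file added')
```

One trade-off worth noting: as far as I can tell, `materialize_dataset` normally applies `row_group_size_mb` to the Parquet writer configuration when you enter the context, so writing the data outside of it means that setting no longer influences the files themselves.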
As for the root cause of s3fs throwing the error inside the `materialize_dataset` context manager: no idea.
This issue could be related to this one; however, I am attempting to use `materialize_dataset` with an S3 path, not HDFS. Given `output_url = 's3://my-bucket/petastorm/test'`, the following error occurs:
Although I can see that the error is traced to `_make_manifest`, seemingly when `ParquetDataset` is initiated here, I haven't been able to figure out the root cause.

Strangely, while I was scoping out the use of petastorm over the past couple of weeks, with multiple iterations of `materialize_dataset`, it worked without a hitch; then this error popped up seemingly out of the blue. I found some empathy here, but thus far nothing to shake the error.
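To try to isolate the failure outside of petastorm, here is a sketch of a minimal probe (the path is the placeholder from above, and credentials come from whatever your environment provides): build the pyarrow `ParquetDataset` directly over `s3fs`, which is roughly the call that `_make_manifest` reaches.

```python
import pyarrow.parquet as pq
import s3fs

# Same placeholder path as above, without the scheme, since s3fs
# addresses objects as 'bucket/key'.
path = 'my-bucket/petastorm/test'

# s3fs picks up credentials from the usual AWS sources (environment
# variables, ~/.aws/credentials, an instance profile, ...).
fs = s3fs.S3FileSystem()

# Sanity check: are the Parquet files visible at all?
print(fs.ls(path))

# Construct the ParquetDataset directly; if the bug lives in the
# s3fs/pyarrow layer, it should reproduce here with petastorm out
# of the loop.
dataset = pq.ParquetDataset(path, filesystem=fs)
print(dataset.schema)
```

If this direct construction raises the same error, the problem is in the s3fs/pyarrow layer rather than in petastorm itself; if it succeeds, the issue is more likely in how `materialize_dataset` resolves the S3 filesystem.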