The Petastorm library enables single-machine or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can also be used from pure Python code.
I'm trying to run the code examples to understand how exactly to use Petastorm with PyTorch, but the generate_petastorm_mnist.py script fails with the following error:
IndexError: list index out of range
This is the stack trace:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<command-6237180> in <module>
1 # Make a temp dir that we'll clean up afterward
2 download_dir = tempfile.mkdtemp()
----> 3 mnist_data_to_petastorm_dataset(download_dir, "file:///tmp/daniel.haviv@databricks.com/mnistps")
4
5 if args.download_dir is None:
<command-6237171> in mnist_data_to_petastorm_dataset(download_dir, output_url, spark_master, parquet_files_count, mnist_data)
129 .write \
130 .option('compression', 'none') \
--> 131 .parquet(dset_output_url)
132
133
/databricks/python/lib/python3.7/contextlib.py in __exit__(self, type, value, traceback)
117 if type is None:
118 try:
--> 119 next(self.gen)
120 except StopIteration:
121 return False
/databricks/python/lib/python3.7/site-packages/petastorm/etl/dataset_metadata.py in materialize_dataset(spark, dataset_url, schema, row_group_size_mb, use_summary_metadata, filesystem_factory)
110 validate_schema=False)
111
--> 112 _generate_unischema_metadata(dataset, schema)
113 if not use_summary_metadata:
114 _generate_num_row_groups_per_file(dataset, spark.sparkContext, filesystem_factory)
/databricks/python/lib/python3.7/site-packages/petastorm/etl/dataset_metadata.py in _generate_unischema_metadata(dataset, schema)
190 assert schema
191 serialized_schema = pickle.dumps(schema)
--> 192 utils.add_to_dataset_metadata(dataset, UNISCHEMA_KEY, serialized_schema)
193
194
/databricks/python/lib/python3.7/site-packages/petastorm/utils.py in add_to_dataset_metadata(dataset, key, value)
113 arrow_metadata = pyarrow.parquet.read_metadata(f)
114 else:
--> 115 arrow_metadata = compat_get_metadata(dataset.pieces[0], dataset.fs.open)
116
117 base_schema = arrow_metadata.schema.to_arrow_schema()
IndexError: list index out of range