uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0

IndexError: list index out of range #527

Closed danielhaviv closed 4 years ago

danielhaviv commented 4 years ago

I'm trying to run the code examples to understand how exactly to use Petastorm with PyTorch, but the generate_petastorm_mnist.py script fails with the following error: IndexError: list index out of range

This is the stack trace:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<command-6237180> in <module>
      1 # Make a temp dir that we'll clean up afterward
      2 download_dir = tempfile.mkdtemp()
----> 3 mnist_data_to_petastorm_dataset(download_dir, "file:///tmp/daniel.haviv@databricks.com/mnistps")
      4 
      5 if args.download_dir is None:

<command-6237171> in mnist_data_to_petastorm_dataset(download_dir, output_url, spark_master, parquet_files_count, mnist_data)
    129                 .write \
    130                 .option('compression', 'none') \
--> 131                 .parquet(dset_output_url)
    132 
    133 

/databricks/python/lib/python3.7/contextlib.py in __exit__(self, type, value, traceback)
    117         if type is None:
    118             try:
--> 119                 next(self.gen)
    120             except StopIteration:
    121                 return False

/databricks/python/lib/python3.7/site-packages/petastorm/etl/dataset_metadata.py in materialize_dataset(spark, dataset_url, schema, row_group_size_mb, use_summary_metadata, filesystem_factory)
    110         validate_schema=False)
    111 
--> 112     _generate_unischema_metadata(dataset, schema)
    113     if not use_summary_metadata:
    114         _generate_num_row_groups_per_file(dataset, spark.sparkContext, filesystem_factory)

/databricks/python/lib/python3.7/site-packages/petastorm/etl/dataset_metadata.py in _generate_unischema_metadata(dataset, schema)
    190     assert schema
    191     serialized_schema = pickle.dumps(schema)
--> 192     utils.add_to_dataset_metadata(dataset, UNISCHEMA_KEY, serialized_schema)
    193 
    194 

/databricks/python/lib/python3.7/site-packages/petastorm/utils.py in add_to_dataset_metadata(dataset, key, value)
    113             arrow_metadata = pyarrow.parquet.read_metadata(f)
    114     else:
--> 115         arrow_metadata = compat_get_metadata(dataset.pieces[0], dataset.fs.open)
    116 
    117     base_schema = arrow_metadata.schema.to_arrow_schema()

IndexError: list index out of range
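The traceback bottoms out in petastorm's utils.add_to_dataset_metadata, where dataset.pieces[0] is indexed. The pieces list holds the parquet files pyarrow discovered under the output URL; if the write landed somewhere the metadata step cannot see, the list is empty and indexing it raises exactly this error. A minimal sketch of that failure mode, independent of petastorm:

```python
# Illustrative only: "pieces" stands in for the list of parquet files
# pyarrow discovers at the dataset path. When no files are found there,
# the list is empty and pieces[0] raises IndexError.
pieces = []  # no parquet files visible at the given output URL
try:
    first_piece = pieces[0]
except IndexError as e:
    print(e)  # list index out of range
```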
danielhaviv commented 4 years ago

I mistook /tmp for /dbfs/tmp on Databricks.
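In other words, on Databricks a file:// URL under a bare /tmp is node-local, while paths under the /dbfs/ FUSE mount are backed by DBFS and visible cluster-wide, so the metadata step can actually find the parquet files the Spark job wrote. A hypothetical helper (not part of petastorm) sketching the path correction:

```python
def to_dbfs_url(path: str) -> str:
    """Map a bare local path to a file:// URL on the DBFS FUSE mount.

    Illustrative helper only; it assumes the dataset should live on DBFS
    rather than on a single node's local disk.
    """
    if path.startswith("/dbfs/"):
        return "file://" + path
    return "file:///dbfs" + path

# The reporter's original URL pointed at node-local /tmp:
print(to_dbfs_url("/tmp/mnistps"))  # file:///dbfs/tmp/mnistps
```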