uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0

ModuleNotFoundError: No module named 'petastorm.codecs'; 'petastorm' is not a package #712

Closed. aseembits93 closed this issue 3 years ago.

aseembits93 commented 3 years ago

I get the following error when I try to run this script:

import numpy as np
import petastorm
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

from petastorm.codecs import ScalarCodec, CompressedImageCodec, NdarrayCodec
from petastorm.etl.dataset_metadata import materialize_dataset
from petastorm.unischema import dict_to_spark_row, Unischema, UnischemaField

# The schema defines what the dataset looks like
HelloWorldSchema = Unischema('HelloWorldSchema', [
    UnischemaField('id', np.int32, (), ScalarCodec(IntegerType()), False),
    UnischemaField('image1', np.uint8, (128, 256, 3), CompressedImageCodec('png'), False),
    UnischemaField('array_4d', np.uint8, (None, 128, 30, None), NdarrayCodec(), False),
])

def row_generator(x):
    """Returns a single entry in the generated dataset. Return a bunch of random values as an example."""
    return {'id': x,
            'image1': np.random.randint(0, 255, dtype=np.uint8, size=(128, 256, 3)),
            'array_4d': np.random.randint(0, 255, dtype=np.uint8, size=(4, 128, 30, 3))}

def generate_petastorm_dataset(output_url='file:///home/razerbladestealth/helloworld'):
    rowgroup_size_mb = 256

    spark = SparkSession.builder.config('spark.driver.memory', '2g').master('local[2]').getOrCreate()
    sc = spark.sparkContext

    # Wrap dataset materialization portion. Will take care of setting up spark environment variables as
    # well as save petastorm specific metadata
    rows_count = 10
    with materialize_dataset(spark, output_url, HelloWorldSchema, rowgroup_size_mb):

        rows_rdd = sc.parallelize(range(rows_count))\
            .map(row_generator)\
            .map(lambda x: dict_to_spark_row(HelloWorldSchema, x))

        spark.createDataFrame(rows_rdd, HelloWorldSchema.as_spark_schema()) \
            .coalesce(10) \
            .write \
            .mode('overwrite') \
            .parquet(output_url)

generate_petastorm_dataset()

Error output:

  File "petastorm.py", line 2, in <module>
    import petastorm
  File "/home/razerbladestealth/petastorm.py", line 6, in <module>
    from petastorm.codecs import ScalarCodec, CompressedImageCodec, NdarrayCodec
ModuleNotFoundError: No module named 'petastorm.codecs'; 'petastorm' is not a package

I suspect it has something to do with the order in which petastorm and pyspark are imported, as mentioned in another issue. Thanks!

selitvin commented 3 years ago

How did you install petastorm? It seems like it's not installed in your Python environment. Does python -c "import petastorm" work? What does pip list | grep petastorm show?

aseembits93 commented 3 years ago

pip list | grep petastorm shows petastorm 0.11.1, and python -c "import petastorm" works fine.

selitvin commented 3 years ago

What about python -c "import petastorm.codecs"?

aseembits93 commented 3 years ago

I figured out the problem. I naively named my file petastorm.py. After renaming it to anything else, everything works perfectly. Sorry for the trouble. Cheers!
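
For anyone hitting the same message: the traceback above shows import petastorm resolving to /home/razerbladestealth/petastorm.py, the script itself, instead of the installed package, which is why Python reports "'petastorm' is not a package". The sketch below is one possible way to check what an import name resolves to. The helper name describe_import is hypothetical (it is not part of petastorm), and it assumes the directories you pass in are the ones Python would search.

```python
import importlib
import importlib.machinery
import sys


def describe_import(name, search_dirs):
    """Report what `import name` would resolve to within search_dirs.

    Hypothetical helper for illustration; not part of petastorm.
    """
    # Drop cached directory listings so recently created files are seen.
    importlib.invalidate_caches()
    spec = importlib.machinery.PathFinder.find_spec(
        name, [str(d) for d in search_dirs]
    )
    if spec is None:
        return "not found"
    # A plain .py file has no submodule search locations, so
    # `import name.something` fails with "... is not a package".
    if spec.submodule_search_locations is None:
        return f"plain module at {spec.origin} (not a package)"
    return f"package at {spec.origin}"


if __name__ == "__main__":
    # When run as a script, sys.path[0] is the script's own directory,
    # so a local petastorm.py would be found before site-packages.
    print(describe_import("petastorm", sys.path))
```

A quicker manual check along the same lines is python -c "import petastorm; print(petastorm.__file__)", run from the directory that contains the script. If the printed path points at your own file and not at site-packages, the script is shadowing the library.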