ModuleNotFoundError: No module named 'petastorm.codecs'; 'petastorm' is not a package

aseembits93 commented 3 years ago

I get the following error when I try to run this script:

import numpy as np
import petastorm
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

from petastorm.codecs import ScalarCodec, CompressedImageCodec, NdarrayCodec
from petastorm.etl.dataset_metadata import materialize_dataset
from petastorm.unischema import dict_to_spark_row, Unischema, UnischemaField

# The schema defines what the dataset looks like
HelloWorldSchema = Unischema('HelloWorldSchema', [
    UnischemaField('id', np.int32, (), ScalarCodec(IntegerType()), False),
    UnischemaField('image1', np.uint8, (128, 256, 3), CompressedImageCodec('png'), False),
    UnischemaField('array_4d', np.uint8, (None, 128, 30, None), NdarrayCodec(), False),
])

def row_generator(x):
    """Returns a single entry in the generated dataset. Return a bunch of random values as an example."""
    return {'id': x,
            'image1': np.random.randint(0, 255, dtype=np.uint8, size=(128, 256, 3)),
            'array_4d': np.random.randint(0, 255, dtype=np.uint8, size=(4, 128, 30, 3))}

def generate_petastorm_dataset(output_url='file:///home/razerbladestealth/helloworld'):
    rowgroup_size_mb = 256

    spark = SparkSession.builder.config('spark.driver.memory', '2g').master('local[2]').getOrCreate()
    sc = spark.sparkContext

    # Wrap dataset materialization portion. Will take care of setting up spark environment variables as
    # well as save petastorm specific metadata
    rows_count = 10
    with materialize_dataset(spark, output_url, HelloWorldSchema, rowgroup_size_mb):

        rows_rdd = sc.parallelize(range(rows_count))\
            .map(row_generator)\
            .map(lambda x: dict_to_spark_row(HelloWorldSchema, x))

        spark.createDataFrame(rows_rdd, HelloWorldSchema.as_spark_schema()) \
            .coalesce(10) \
            .write \
            .mode('overwrite') \
            .parquet(output_url)

generate_petastorm_dataset()

  File "petastorm.py", line 2, in <module>
    import petastorm
  File "/home/razerbladestealth/petastorm.py", line 6, in <module>
    from petastorm.codecs import ScalarCodec, CompressedImageCodec, NdarrayCodec
ModuleNotFoundError: No module named 'petastorm.codecs'; 'petastorm' is not a package

I suspect it has something to do with the order in which petastorm and pyspark are imported, as mentioned in another issue. Thanks!

selitvin commented 3 years ago

How did you install petastorm? It seems like it's not installed in your Python environment. Does python -c "import petastorm" work? What does pip list | grep petastorm show?

aseembits93 commented 3 years ago

pip list | grep petastorm shows petastorm 0.11.1, and python -c "import petastorm" works fine.

selitvin commented 3 years ago

What about python -c "import petastorm.codecs"?

aseembits93 commented 3 years ago

I figured out the problem. I naively named my file petastorm.py, so import petastorm resolved to my own script instead of the installed package. After renaming the file to anything else, everything works perfectly. Sorry for the trouble. Cheers!
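
For anyone hitting the same "is not a package" message, a quick way to confirm this kind of name shadowing is to ask Python where it would load the module from. The following is a minimal sketch using only the standard library; the helper name describe_module is hypothetical and is not part of petastorm.

```python
import importlib.util


def describe_module(name):
    """Return (origin, is_package) for a top-level module name, or None if it cannot be found.

    describe_module is a hypothetical helper written for this illustration.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    # A real package has submodule_search_locations set; a lone .py file does not,
    # which is why importing a submodule from it fails with "is not a package".
    return spec.origin, spec.submodule_search_locations is not None


if __name__ == "__main__":
    # If the origin points at your own script (for example a local petastorm.py)
    # rather than at site-packages, a local file is shadowing the installed package.
    print(describe_module("petastorm"))
```

If the reported origin is a file in the working directory and is_package is False, renaming that file (and deleting any stale .pyc or __pycache__ entry left next to it) should let the installed package be imported again.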

uber / petastorm, issue #712