Open leonardozcm opened 3 years ago
Actually, when testing my code as follows:
```python
# from petastorm.codecs import CompressedImageCodec, NdarrayCodec, ScalarCodec
# from petastorm.etl.dataset_metadata import materialize_dataset
# from petastorm.unischema import Unischema, UnischemaField, dict_to_spark_row
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
from pyspark.sql.types import *
import os

se = SparkSession.builder.config(
    'spark.driver.memory', '2g').getOrCreate()
sc = se.sparkContext

dir = '/path/to/my/data'
file_list = os.listdir(dir)
data = sc.parallelize(file_list)
print(data.take(2))
```
It only works when the commented-out petastorm imports stay removed.
So could importing petastorm change the behavior of pyspark?
Thanks a lot for the report. I updated the example in #676. You can also run examples/hello_world/petastorm_dataset/generate_petastorm_dataset.py to reproduce the same example.
I am not sure what the root cause is, but the import order seems to make a difference. It is most likely a deficiency in the pickle implementation of newer pyspark versions, since the same issue did not occur before.
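As a minimal pure-Python illustration of the suspected mechanism (this is an assumption about how the petastorm import could interfere, not petastorm's actual code): an import executed earlier can register a custom reducer via `copyreg`, silently changing how every later `pickle` call serializes a type. The `Point` class and the doubling reducer below are purely hypothetical.

```python
import pickle
import copyreg

class Point:
    def __init__(self, x):
        self.x = x

# Default behavior: Point pickles via its __dict__ and round-trips unchanged.
before = pickle.loads(pickle.dumps(Point(1)))
assert before.x == 1

# Simulate an import-time side effect: a library module could run this at
# import, replacing how every subsequent pickle of Point works.
def reduce_point(p):
    # Reconstruct with a doubled value so the behavior change is observable.
    return (Point, (p.x * 2,))

copyreg.pickle(Point, reduce_point)

# Same call as before, but the global pickling behavior has changed.
after = pickle.loads(pickle.dumps(Point(1)))
assert after.x == 2
```

If pyspark's pickler is sensitive to such global registrations, importing petastorm first (which sets up its own serialization machinery) could plausibly alter how the RDD payload is serialized, which would match the symptom above.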
Yes, adjusting the import order of petastorm and pyspark solves this problem, thanks a lot!
Hi, when I am testing your dummy dataset generation, this error occurs:
envs: