uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0

ValueError: Cell is empty #675

Open · leonardozcm opened this issue 3 years ago

leonardozcm commented 3 years ago

Hi, when I am testing your dummy-dataset generation example, this error occurs:

(test37) root@5f137735679c:~/workspace/orca-lite-poc/examples# python offical_ut.py 
21/05/05 07:57:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Traceback (most recent call last):
  File "/opt/conda/envs/test37/lib/python3.7/site-packages/pyspark/serializers.py", line 437, in dumps
    return cloudpickle.dumps(obj, pickle_protocol)
  File "/opt/conda/envs/test37/lib/python3.7/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 101, in dumps
    cp.dump(obj)
  File "/opt/conda/envs/test37/lib/python3.7/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 540, in dump
    return Pickler.dump(self, obj)
  File "/opt/conda/envs/test37/lib/python3.7/pickle.py", line 437, in dump
    self.save(obj)
  File "/opt/conda/envs/test37/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/conda/envs/test37/lib/python3.7/pickle.py", line 789, in save_tuple
    save(element)
  File "/opt/conda/envs/test37/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/conda/envs/test37/lib/python3.7/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 722, in save_function
    *self._dynamic_function_reduce(obj), obj=obj
  File "/opt/conda/envs/test37/lib/python3.7/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 659, in _save_reduce_pickle5
    dictitems=dictitems, obj=obj
  File "/opt/conda/envs/test37/lib/python3.7/pickle.py", line 638, in save_reduce
    save(args)
  File "/opt/conda/envs/test37/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/conda/envs/test37/lib/python3.7/pickle.py", line 789, in save_tuple
    save(element)
  File "/opt/conda/envs/test37/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/conda/envs/test37/lib/python3.7/pickle.py", line 774, in save_tuple
    save(element)
  File "/opt/conda/envs/test37/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/conda/envs/test37/lib/python3.7/site-packages/dill/_dill.py", line 1177, in save_cell
    f = obj.cell_contents
ValueError: Cell is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "offical_ut.py", line 16, in <module>
    print(data.take(2))
  File "/opt/conda/envs/test37/lib/python3.7/site-packages/pyspark/rdd.py", line 1566, in take
    res = self.context.runJob(self, takeUpToNumLeft, p)
  File "/opt/conda/envs/test37/lib/python3.7/site-packages/pyspark/context.py", line 1233, in runJob
    sock_info = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
  File "/opt/conda/envs/test37/lib/python3.7/site-packages/pyspark/rdd.py", line 2950, in _jrdd
    self._jrdd_deserializer, profiler)
  File "/opt/conda/envs/test37/lib/python3.7/site-packages/pyspark/rdd.py", line 2828, in _wrap_function
    pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
  File "/opt/conda/envs/test37/lib/python3.7/site-packages/pyspark/rdd.py", line 2814, in _prepare_for_python_RDD
    pickled_command = ser.dumps(command)
  File "/opt/conda/envs/test37/lib/python3.7/site-packages/pyspark/serializers.py", line 447, in dumps
    raise pickle.PicklingError(msg)
_pickle.PicklingError: Could not serialize object: ValueError: Cell is empty
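
The last frames pinpoint the immediate failure: dill's save_cell reads obj.cell_contents, and in Python, reading a closure cell that was never assigned raises exactly this error. A minimal, petastorm-free illustration of that final step (hypothetical names, for context only):

def make_empty_cell():
    if False:
        x = None  # never executed, so the cell for x stays empty
    # The lambda closes over x, giving us a handle on the empty cell.
    return (lambda: x).__closure__[0]

cell = make_empty_cell()
print(cell.cell_contents)  # raises ValueError: Cell is empty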

envs:

OS: Linux 5f137735679c 3.10.0-1160.24.1.el7.x86_64
python: 3.7
java: openjdk-8-amd64
petastorm: 0.10.0
pyspark: 3.1.1
leonardozcm commented 3 years ago

Actually, the error occurs when testing the following code:

# Uncommenting these petastorm imports (placed before the pyspark
# imports) is what triggers the error:
# from petastorm.codecs import CompressedImageCodec, NdarrayCodec, ScalarCodec
# from petastorm.etl.dataset_metadata import materialize_dataset
# from petastorm.unischema import Unischema, UnischemaField, dict_to_spark_row
import os

from pyspark.sql import SparkSession

se = SparkSession.builder.config(
    'spark.driver.memory', '2g').getOrCreate()
sc = se.sparkContext

data_dir = '/path/to/my/data'  # placeholder path
file_list = os.listdir(data_dir)

# Shipping a plain list of file names to executors is enough to fail.
data = sc.parallelize(file_list)
print(data.take(2))

It only works while the petastorm imports above stay commented out.

leonardozcm commented 3 years ago

So maybe importing petastorm changes the behavior of pyspark?
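
That would be consistent with the traceback: the failing frame is in dill/_dill.py, and petastorm pulls dill in as a dependency. One plausible mechanism (an assumption, and version-dependent) is that older dill releases register their picklers in the shared pickle._Pickler.dispatch table that pyspark's vendored cloudpickle builds on, so whichever library is imported first determines how closure cells get pickled. A quick way to probe for this:

import pickle

# Snapshot the stock pure-Python pickler's dispatch table, which
# cloudpickle subclasses on Python 3.7.
before = set(pickle._Pickler.dispatch)

import dill  # petastorm imports dill transitively

# If dill patched the shared table, this prints the newly registered
# types (e.g. the closure cell type); with a well-behaved dill it is empty.
print(set(pickle._Pickler.dispatch) - before)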

selitvin commented 3 years ago

Thanks a lot for the report. I updated the example in #676. You can also launch examples/hello_world/petastorm_dataset/generate_petastorm_dataset.py to run the same example.

I am not sure what the root cause is, but the import order seems to make a difference. Most likely it is a deficiency in the pickling implementation of newer pyspark versions, since the same issue did not occur before.

leonardozcm commented 3 years ago

Yes, adjusting the import order of petastorm and pyspark solves this problem. Thanks a lot!
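
For reference, a minimal sketch of the reordering that worked here (an assumption based on the discussion above: import pyspark before any petastorm module, so that the vendored cloudpickle sets up its picklers before dill is loaded):

# Import pyspark first and create the SparkSession.
from pyspark.sql import SparkSession

se = SparkSession.builder.config(
    'spark.driver.memory', '2g').getOrCreate()

# Only now import petastorm; by this point pyspark's vendored cloudpickle
# is already initialized, so dill (pulled in by petastorm) should no
# longer affect the picklers it uses.
from petastorm.unischema import Unischema, UnischemaField, dict_to_spark_row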