uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0

ValueError: Cell is empty #675

Open · leonardozcm opened this issue 3 years ago

leonardozcm commented 3 years ago

Hi, when I am testing your dummy-dataset generation example, this error occurs:

(test37) root@5f137735679c:~/workspace/orca-lite-poc/examples# python offical_ut.py 
21/05/05 07:57:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Traceback (most recent call last):
  File "/opt/conda/envs/test37/lib/python3.7/site-packages/pyspark/serializers.py", line 437, in dumps
    return cloudpickle.dumps(obj, pickle_protocol)
  File "/opt/conda/envs/test37/lib/python3.7/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 101, in dumps
    cp.dump(obj)
  File "/opt/conda/envs/test37/lib/python3.7/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 540, in dump
    return Pickler.dump(self, obj)
  File "/opt/conda/envs/test37/lib/python3.7/pickle.py", line 437, in dump
    self.save(obj)
  File "/opt/conda/envs/test37/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/conda/envs/test37/lib/python3.7/pickle.py", line 789, in save_tuple
    save(element)
  File "/opt/conda/envs/test37/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/conda/envs/test37/lib/python3.7/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 722, in save_function
    *self._dynamic_function_reduce(obj), obj=obj
  File "/opt/conda/envs/test37/lib/python3.7/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 659, in _save_reduce_pickle5
    dictitems=dictitems, obj=obj
  File "/opt/conda/envs/test37/lib/python3.7/pickle.py", line 638, in save_reduce
    save(args)
  File "/opt/conda/envs/test37/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/conda/envs/test37/lib/python3.7/pickle.py", line 789, in save_tuple
    save(element)
  File "/opt/conda/envs/test37/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/conda/envs/test37/lib/python3.7/pickle.py", line 774, in save_tuple
    save(element)
  File "/opt/conda/envs/test37/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/conda/envs/test37/lib/python3.7/site-packages/dill/_dill.py", line 1177, in save_cell
    f = obj.cell_contents
ValueError: Cell is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "offical_ut.py", line 16, in <module>
    print(data.take(2))
  File "/opt/conda/envs/test37/lib/python3.7/site-packages/pyspark/rdd.py", line 1566, in take
    res = self.context.runJob(self, takeUpToNumLeft, p)
  File "/opt/conda/envs/test37/lib/python3.7/site-packages/pyspark/context.py", line 1233, in runJob
    sock_info = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
  File "/opt/conda/envs/test37/lib/python3.7/site-packages/pyspark/rdd.py", line 2950, in _jrdd
    self._jrdd_deserializer, profiler)
  File "/opt/conda/envs/test37/lib/python3.7/site-packages/pyspark/rdd.py", line 2828, in _wrap_function
    pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
  File "/opt/conda/envs/test37/lib/python3.7/site-packages/pyspark/rdd.py", line 2814, in _prepare_for_python_RDD
    pickled_command = ser.dumps(command)
  File "/opt/conda/envs/test37/lib/python3.7/site-packages/pyspark/serializers.py", line 447, in dumps
    raise pickle.PicklingError(msg)
_pickle.PicklingError: Could not serialize object: ValueError: Cell is empty
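
The last frames pinpoint the immediate failure: dill's save_cell reads obj.cell_contents, and in Python, reading a closure cell that was never assigned raises exactly this error. A minimal, petastorm-free illustration of that final step (hypothetical names, for context only):

def make_empty_cell():
    if False:
        x = None  # never executed, so the cell for x stays empty
    # The lambda closes over x, giving us a handle on the empty cell.
    return (lambda: x).__closure__[0]

cell = make_empty_cell()
print(cell.cell_contents)  # raises ValueError: Cell is empty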

envs:

OS: Linux 5f137735679c 3.10.0-1160.24.1.el7.x86_64
python: 3.7
java: openjdk-8-amd64
petastorm: 0.10.0
pyspark: 3.1.1
leonardozcm commented 3 years ago

Actually, the error occurs when testing the following code:

# Uncommenting these petastorm imports (placed before the pyspark
# imports) is what triggers the error:
# from petastorm.codecs import CompressedImageCodec, NdarrayCodec, ScalarCodec
# from petastorm.etl.dataset_metadata import materialize_dataset
# from petastorm.unischema import Unischema, UnischemaField, dict_to_spark_row
import os

from pyspark.sql import SparkSession

se = SparkSession.builder.config(
    'spark.driver.memory', '2g').getOrCreate()
sc = se.sparkContext

data_dir = '/path/to/my/data'  # placeholder path
file_list = os.listdir(data_dir)

# Shipping a plain list of file names to executors is enough to fail.
data = sc.parallelize(file_list)
print(data.take(2))

It only works while the petastorm imports above stay commented out.

leonardozcm commented 3 years ago

So maybe importing petastorm changes the behavior of pyspark?
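
That would be consistent with the traceback: the failing frame is in dill/_dill.py, and petastorm pulls dill in as a dependency. One plausible mechanism (an assumption, and version-dependent) is that older dill releases register their picklers in the shared pickle._Pickler.dispatch table that pyspark's vendored cloudpickle builds on, so whichever library is imported first determines how closure cells get pickled. A quick way to probe for this:

import pickle

# Snapshot the stock pure-Python pickler's dispatch table, which
# cloudpickle subclasses on Python 3.7.
before = set(pickle._Pickler.dispatch)

import dill  # petastorm imports dill transitively

# If dill patched the shared table, this prints the newly registered
# types (e.g. the closure cell type); with a well-behaved dill it is empty.
print(set(pickle._Pickler.dispatch) - before)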

selitvin commented 3 years ago

Thanks a lot for the report. I updated the example in #676. You can also launch examples/hello_world/petastorm_dataset/generate_petastorm_dataset.py to run the same example.

I am not sure what the root cause is, but the import order seems to make a difference. Most likely it is a deficiency in the pickling implementation of newer pyspark versions, since the same issue did not occur before.

leonardozcm commented 3 years ago

Yes, adjusting the import order of petastorm and pyspark solves this problem. Thanks a lot!
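
For reference, a minimal sketch of the reordering that worked here (an assumption based on the discussion above: import pyspark before any petastorm module, so that the vendored cloudpickle sets up its picklers before dill is loaded):

# Import pyspark first and create the SparkSession.
from pyspark.sql import SparkSession

se = SparkSession.builder.config(
    'spark.driver.memory', '2g').getOrCreate()

# Only now import petastorm; by this point pyspark's vendored cloudpickle
# is already initialized, so dill (pulled in by petastorm) should no
# longer affect the picklers it uses.
from petastorm.unischema import Unischema, UnischemaField, dict_to_spark_row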