passionke / starry

fast spark local mode
Apache License 2.0

java.io.NotSerializableException: com.github.passionke.starry.StarrySparkContext #9

Open 2efPer opened 6 years ago

2efPer commented 6 years ago

    org.apache.spark.SparkException: Task not serializable
        at org.apache.spark.util.StarryClosureCleaner$.ensureSerializable(StarryClosureCleaner.scala:46) ~[classes/:2.3.1]
        at org.apache.spark.util.StarryClosureCleaner$.clean(StarryClosureCleaner.scala:40) ~[classes/:2.3.1]
        at org.apache.spark.util.StarryClosureCleaner$.clean(StarryClosureCleaner.scala:23) ~[classes/:2.3.1]
        at com.github.passionke.starry.StarrySparkContext.clean(StarrySparkContext.scala:9) ~[classes/:na]
        at org.apache.spark.rdd.HadoopRDD.<init>(HadoopRDD.scala:105) ~[spark-core_2.11-2.3.1.jar:2.3.1]
        at org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1031) ~[spark-core_2.11-2.3.1.jar:2.3.1]
        at org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1021) ~[spark-core_2.11-2.3.1.jar:2.3.1]
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) ~[spark-core_2.11-2.3.1.jar:2.3.1]
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) ~[spark-core_2.11-2.3.1.jar:2.3.1]
        at org.apache.spark.SparkContext.withScope(SparkContext.scala:693) ~[spark-core_2.11-2.3.1.jar:2.3.1]
        at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1021) ~[spark-core_2.11-2.3.1.jar:2.3.1]
        at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:824) ~[spark-core_2.11-2.3.1.jar:2.3.1]
        at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:822) ~[spark-core_2.11-2.3.1.jar:2.3.1]
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) ~[spark-core_2.11-2.3.1.jar:2.3.1]
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) ~[spark-core_2.11-2.3.1.jar:2.3.1]
        at org.apache.spark.SparkContext.withScope(SparkContext.scala:693) ~[spark-core_2.11-2.3.1.jar:2.3.1]
        at org.apache.spark.SparkContext.textFile(SparkContext.scala:822) ~[spark-core_2.11-2.3.1.jar:2.3.1]
        at org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:387) ~[spark-mllib_2.11-2.3.1.jar:2.3.1]
        at org.apache.spark.ml.feature.CountVectorizerModel$CountVectorizerModelReader.load(CountVectorizer.scala:306) ~[spark-mllib_2.11-2.3.1.jar:2.3.1]
        at org.apache.spark.ml.feature.CountVectorizerModel$CountVectorizerModelReader.load(CountVectorizer.scala:301) ~[spark-mllib_2.11-2.3.1.jar:2.3.1]
        at org.apache.spark.ml.util.MLReadable$class.load(ReadWrite.scala:223) ~[spark-mllib_2.11-2.3.1.jar:2.3.1]
        at org.apache.spark.ml.feature.CountVectorizerModel$.load(CountVectorizer.scala:322) ~[spark-mllib_2.11-2.3.1.jar:2.3.1]
        at org.ml.cvmodel.Similarity.<init>(Similarity.scala:46) ~[classes/:na]
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[na:1.8.0_171]
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[na:1.8.0_171]
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[na:1.8.0_171]
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423) ~[na:1.8.0_171]
        at org.springframework.beans.BeanUtils.instantiateClass(BeanUtils.java:142) ~[spring-beans-4.3.10.RELEASE.jar:4.3.10.RELEASE]
        ... 32 common frames omitted
    Caused by: java.io.NotSerializableException: com.github.passionke.starry.StarrySparkContext
    Serialization stack:
        - object not serializable (class: com.github.passionke.starry.StarrySparkContext, value: com.github.passionke.starry.StarrySparkContext@3a8cea24)
        - field (class: org.apache.spark.SparkContext$$anonfun$hadoopFile$1, name: $outer, type: class org.apache.spark.SparkContext)
        - object (class org.apache.spark.SparkContext$$anonfun$hadoopFile$1, )
        - field (class: org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$30, name: $outer, type: class org.apache.spark.SparkContext$$anonfun$hadoopFile$1)
        - object (class org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$30, )
        at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40) ~[spark-core_2.11-2.3.1.jar:2.3.1]
        at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) ~[spark-core_2.11-2.3.1.jar:2.3.1]
        at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100) ~[spark-core_2.11-2.3.1.jar:2.3.1]
        at org.apache.spark.util.StarryClosureCleaner$.ensureSerializable(StarryClosureCleaner.scala:48) ~[classes/:2.3.1]
        ... 59 common frames omitted

Code:

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.ml.feature.CountVectorizerModel
    import com.github.passionke.starry.StarrySparkContext

    val sparkConf = new SparkConf()
    val sparkContext: StarrySparkContext = new StarrySparkContext(sparkConf)
    val spark: SparkSession = SparkSession.builder.getOrCreate()
    // Loading the model is what triggers the exception above
    val cvModel: CountVectorizerModel = CountVectorizerModel.load("file:///Users/2efper/model/cvmodel")

Environment:

passionke commented 6 years ago

Starry works fine with in-memory datasets. In your case, loading the model calls sc.textFile(path) under the hood, and that code path builds closures that capture the SparkContext; see the sketch below.
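
A minimal sketch of the difference, assuming a StarrySparkContext already exists (parallelize stands in for any in-memory source, and the file path is hypothetical):

    // Works: the map closure captures nothing, so it serializes cleanly.
    val okRdd = sparkContext.parallelize(Seq(1, 2, 3)).map(_ * 2)

    // Fails under Starry: SparkContext.hadoopFile wraps the read in closures
    // whose $outer is the non-serializable StarrySparkContext, as the
    // serialization stack above shows.
    val badRdd = sparkContext.textFile("file:///tmp/any-file.txt")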

In the normal Spark runtime, Spark uses a complex clean function:

    def clean[F <: AnyRef](f: F, checkSerializable: Boolean = true): F = {
    ...
    }

It sets unused fields of the closure f to null so that as many closures as possible can be serialized as tasks, but that cleaning is expensive in some cases. So in Starry we override the clean function with a much cheaper serializability check. That's why your sample code throws NotSerializableException.
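
For illustration, a minimal sketch of what such a serializability-only clean could look like (the names follow the stack trace above; the actual StarryClosureCleaner implementation may differ):

    import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

    object SimpleClosureCleaner {
      // Unlike Spark's ClosureCleaner, this does not rewrite the closure to
      // null out unused $outer references; it only verifies that the closure
      // serializes as-is. That is cheaper but stricter: a closure that merely
      // points at a non-serializable SparkContext fails here even if the
      // reference is never used.
      def clean[F <: AnyRef](f: F, checkSerializable: Boolean = true): F = {
        if (checkSerializable) ensureSerializable(f)
        f
      }

      private def ensureSerializable(f: AnyRef): Unit = {
        try {
          val oos = new ObjectOutputStream(new ByteArrayOutputStream())
          oos.writeObject(f) // throws NotSerializableException on any bad field
          oos.close()
        } catch {
          case e: NotSerializableException =>
            // Spark wraps this in a SparkException("Task not serializable");
            // a plain RuntimeException keeps this sketch self-contained.
            throw new RuntimeException("Task not serializable", e)
        }
      }
    }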