locationtech / geotrellis

GeoTrellis is a geographic data processing engine for high performance applications.
http://geotrellis.io

hadoopMultibandGeoTiffRDD threw an error on the remote Spark cluster: java.lang.ClassCastException #3510

Closed CaiCaiXian closed 1 year ago

CaiCaiXian commented 1 year ago

Describe the bug

When I was using hadoopMultibandGeoTiffRDD, it threw an error on the remote Spark cluster: `java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance of org.apache.spark.rdd.MapPartitionsRDD`. It works fine in local[*].

To Reproduce

Provide as applicable:

Code Example:

```scala
val inputRDD: RDD[(ProjectedExtent, MultibandTile)] = sc.hadoopMultibandGeoTiffRDD(formatInputPath)
```
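For reference, the imports this snippet relies on are roughly the following; the paths are inferred from the GeoTrellis 3.x package names that appear in the stack traces later in this thread, not taken from the original report:

```scala
// Assumed imports (GeoTrellis 3.x layout, inferred from the stack traces below)
import geotrellis.raster.MultibandTile
import geotrellis.spark.store.hadoop._ // adds sc.hadoopMultibandGeoTiffRDD via Implicits
import geotrellis.vector.ProjectedExtent
import org.apache.spark.rdd.RDD
```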

I use SparkSubmit to submit my jar in client mode to the remote Spark cluster:

```scala
val args = Array(
  "--class", "com.xxx.geotrellis.RasterUtils",
  "--master", "spark://172.xx.xx.x:7077",
  "--deploy-mode", "client",
  "--executor-memory", "1g",
  "--total-executor-cores", "1",
  "--conf", "spark.driver.host=172.xx.xx.xx",
  "--conf", "spark.executor.memoryOverhead=512m",
  "file:/F://myjar.jar",
  "--inputPath hdfs://172.xx.xx.xxx:9000/xxx.tif",
  "--outputPath hdfs://172.xx.xx.xxx:9000/test"
)
SparkSubmit.main(args)
```

Expected behavior

I expect it to work on the remote Spark cluster just as it does in local[*].


pomadchin commented 1 year ago

Hey @CaiCaiXian, could you post the whole file with the code? Most likely Spark forces serialization of application parts that should not be serialized. Due to the way the code is written, they get serialized, and that requires some adjustments.
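For context, the classic shape of the problem described here looks like the sketch below (all names are hypothetical; this is not code from the issue): a lambda that references a member of a non-serializable enclosing class captures `this` and drags the whole instance into the closure.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical illustration: Processor is not Serializable, so referencing
// `threshold` inside the lambda captures `this`, i.e. the entire instance.
class Processor(threshold: Int) {
  def bad(rdd: RDD[Int]): RDD[Int] =
    rdd.filter(_ > threshold) // the closure serializes the whole Processor

  def good(rdd: RDD[Int]): RDD[Int] = {
    val t = threshold // copy into a local val; only the Int gets captured
    rdd.filter(_ > t)
  }
}
```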

CaiCaiXian commented 1 year ago

> Hey @CaiCaiXian, could you post the whole file with the code? Most likely Spark forces serialization of application parts that should not be serialized. Due to the way the code is written, they get serialized, and that requires some adjustments.

I think I know the reason: I didn't use maven-assembly-plugin to package the jar, so when it was submitted to the remote Spark cluster it threw that error. I have fixed that now. But unfortunately, it threw a new error. Is it still a packaging issue?

```
Caused by: java.lang.NoClassDefFoundError: Could not initialize class geotrellis.raster.io.geotiff.TiffType$

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607)
  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182)
  at scala.Option.foreach(Option.scala:407)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2860)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2228)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2249)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2268)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2293)
  at org.apache.spark.rdd.RDD.count(RDD.scala:1274)
  at geotrellis.spark.store.GeoTiffInfoReader.readWindows(GeoTiffInfoReader.scala:87)
  at geotrellis.spark.store.GeoTiffInfoReader.readWindows$(GeoTiffInfoReader.scala:64)
  at geotrellis.spark.store.hadoop.HadoopGeoTiffInfoReader.readWindows(HadoopGeoTiffInfoReader.scala:30)
  at geotrellis.spark.store.hadoop.HadoopGeoTiffRDD$.apply(HadoopGeoTiffRDD.scala:129)
  at geotrellis.spark.store.hadoop.HadoopGeoTiffRDD$.apply(HadoopGeoTiffRDD.scala:160)
  at geotrellis.spark.store.hadoop.HadoopGeoTiffRDD$.multiband(HadoopGeoTiffRDD.scala:203)
  at geotrellis.spark.store.hadoop.HadoopGeoTiffRDD$.spatialMultiband(HadoopGeoTiffRDD.scala:252)
  at geotrellis.spark.store.hadoop.HadoopSparkContextMethods.hadoopMultibandGeoTiffRDD(HadoopSparkContextMethods.scala:96)
  at geotrellis.spark.store.hadoop.HadoopSparkContextMethods.hadoopMultibandGeoTiffRDD$(HadoopSparkContextMethods.scala:91)
  at geotrellis.spark.store.hadoop.Implicits$HadoopSparkContextMethodsWrapper.hadoopMultibandGeoTiffRDD(Implicits.scala:41)
  at geotrellis.spark.store.hadoop.HadoopSparkContextMethods.hadoopMultibandGeoTiffRDD(HadoopSparkContextMethods.scala:83)
  at geotrellis.spark.store.hadoop.HadoopSparkContextMethods.hadoopMultibandGeoTiffRDD$(HadoopSparkContextMethods.scala:82)
  at geotrellis.spark.store.hadoop.Implicits$HadoopSparkContextMethodsWrapper.hadoopMultibandGeoTiffRDD(Implicits.scala:41)
  at com.cjx.geospark.RasterUtils$.pyramid(RasterUtils.scala:52)
  at com.cjx.geospark.RasterUtils$.main(RasterUtils.scala:103)
  at com.cjx.geospark.RasterUtils.main(RasterUtils.scala)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
  at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958)
  at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
  at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
  at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
  at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)
  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
  at com.cjx.geospark.SubmitUtil$.submitClientMode(SubmitUtil.scala:60)
  at com.cjx.geospark.SubmitUtil$.main(SubmitUtil.scala:10)
  at com.cjx.geospark.SubmitUtil.main(SubmitUtil.scala)
Caused by: java.lang.NoClassDefFoundError: Could not initialize class geotrellis.raster.io.geotiff.TiffType$
  at geotrellis.raster.io.geotiff.reader.GeoTiffInfo$.read(GeoTiffInfo.scala:141)
  at geotrellis.spark.store.hadoop.HadoopGeoTiffInfoReader.getGeoTiffInfo(HadoopGeoTiffInfoReader.scala:53)
  at geotrellis.spark.store.GeoTiffInfoReader.$anonfun$readWindows$1(GeoTiffInfoReader.scala:77)
  at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
  at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:223)
  at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:302)
  at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1508)
  at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1435)
  at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1499)
  at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1322)
  at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:376)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:327)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  at org.apache.spark.scheduler.Task.run(Task.scala:136)
  at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:750)
```

Here is my function:

```scala
def pyramid(inputPath: String, outputPath: String)(implicit sc: SparkContext): Unit = {
  if (StrUtil.isBlankIfStr(inputPath) || StrUtil.isBlankIfStr(outputPath)) {
    println(inputPath)
    println(outputPath)
    throw new IllegalArgumentException("check the path!")
  }
  // read RDD
  val formatInputPath  = SparkFileUtil.correctPath(SparkFileUtil.getFormatPath(inputPath))
  val formatOutputPath = SparkFileUtil.correctPath(SparkFileUtil.getFormatPath(outputPath))
  // get layerName
  val layerName = FileUtil.getPrefix(new File(formatInputPath))
  val inputRDD: RDD[(ProjectedExtent, MultibandTile)] = sc.hadoopMultibandGeoTiffRDD(formatInputPath)
  // get metadata
  val (_, rasterMetaData) = CollectTileLayerMetadata.fromRDD(inputRDD, FloatingLayoutScheme(512))
  val tiledRDD: RDD[(SpatialKey, MultibandTile)] = inputRDD
    .tileToLayout(rasterMetaData.cellType, rasterMetaData.layout, Bilinear)
    .repartition(100)
  val layoutScheme = ZoomedLayoutScheme(WebMercator, tileSize = 256)
  val contextRDD = ContextRDD(tiledRDD, rasterMetaData)
  val reprojected: TileRDDReprojectMethods[SpatialKey, MultibandTile] = new TileRDDReprojectMethods(contextRDD)
  val (zoom, reprojectedRDD): (Int, RDD[(SpatialKey, MultibandTile)] with Metadata[TileLayerMetadata[SpatialKey]]) =
    reprojected.reproject(WebMercator, layoutScheme)
  val dirOutputPath = formatOutputPath + "/" + layerName
  val attributeStore = AttributeStore(dirOutputPath)
  val writer = LayerWriter(dirOutputPath)
  Pyramid.upLevels(reprojectedRDD, layoutScheme, zoom, Bilinear) { (rdd, z) =>
    val layerId = LayerId(layerName, z)
    if (attributeStore.layerExists(layerId)) {
      attributeStore match {
        case store: HadoopAttributeStore => new HadoopLayerManager(store).delete(layerId)
        case store: FileAttributeStore   => new FileLayerManager(store).delete(layerId)
      }
    }
    writer.write(layerId, rdd, ZCurveKeyIndexMethod)
  }
}
```

Plugin:

```xml
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-assembly-plugin</artifactId>
    <version>3.2.0</version>
    <configuration>
        <archive>
            <manifest>
                <mainClass>com.cjx.geospark.RasterUtils</mainClass>
            </manifest>
        </archive>
        <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
    </configuration>
    <executions>
        <execution>
            <id>make-assembly</id>
            <phase>package</phase>
            <goals>
                <goal>single</goal>
            </goals>
        </execution>
    </executions>
</plugin>
```

pomadchin commented 1 year ago

@CaiCaiXian yea, it is a packaging issue: `Caused by: java.lang.NoClassDefFoundError: Could not initialize class geotrellis.raster.io.geotiff.TiffType$`

Most likely it fails on circe codecs derivation: https://github.com/locationtech/geotrellis/blob/master/raster/src/main/scala/geotrellis/raster/io/geotiff/TiffType.scala#L32-L40

Check shapeless / circe deps in the classpath!
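One quick way to run that check at runtime (a diagnostic sketch using plain JVM reflection, not a GeoTrellis API) is to ask which jar each suspect class was loaded from:

```scala
// Prints the jar each class comes from; run on the driver, or wrap the
// body in an rdd.map(...) to probe the executors' classpath instead.
Seq("shapeless.HList", "io.circe.Encoder", "cats.kernel.Eq").foreach { name =>
  val source = Class.forName(name).getProtectionDomain.getCodeSource
  println(s"$name -> ${Option(source).map(_.getLocation).getOrElse("bootstrap/unknown")}")
}
```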

pomadchin commented 1 year ago

There is also an example of shading / assembly merge strategy rules in the repo: https://github.com/locationtech/geotrellis/blob/master/project/Settings.scala#L512-L532
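For a condensed picture, shading rules of that kind look roughly like this in sbt-assembly terms (the package names here are illustrative; the linked Settings.scala has the rules GeoTrellis actually uses):

```scala
// build.sbt sketch (sbt-assembly plugin enabled): relocate a conflicting
// library into a private namespace at assembly time
assembly / assemblyShadeRules := Seq(
  ShadeRule.rename("shapeless.**" -> "shaded.shapeless.@1").inAll
)
```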

CaiCaiXian commented 1 year ago

> There is also an example of shading / assembly merge strategy rules in the repo: https://github.com/locationtech/geotrellis/blob/master/project/Settings.scala#L512-L532

Thank you! I used maven-shade-plugin to rename the dependency cats-kernel_2.12:2.9.0, which conflicts with the dependency named cats-kernel_2.12:2.1.1 on the remote Spark cluster. It works!!

By the way, do you know how to fix this problem?

```
at 'geotrellis': define 1 overlapping resource:
```

pomadchin commented 1 year ago

@CaiCaiXian the merge strategy for .conf files should be merge, I think that's what is happening.
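In sbt-assembly terms that strategy looks roughly like the sketch below; with maven-shade-plugin, which this thread uses, the usual counterpart is the AppendingTransformer applied to reference.conf.

```scala
// build.sbt sketch (sbt-assembly plugin enabled): concatenate .conf files
// instead of keeping only one of the overlapping copies
assembly / assemblyMergeStrategy := {
  case "reference.conf" | "application.conf" => MergeStrategy.concat
  case x =>
    val oldStrategy = (assembly / assemblyMergeStrategy).value
    oldStrategy(x)
}
```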

CaiCaiXian commented 1 year ago

> @CaiCaiXian the merge strategy for .conf files should be merge, I think that's what is happening.

Thanks, it works! I'm sorry to keep bothering you; this may be the last question. When I write a layer, it throws an error:

```
Exception in thread "main" geotrellis.store.package$LayerWriteError: Failed to write Layer(name = "3199.00-614.00", zoom = 21)
  at geotrellis.spark.store.hadoop.HadoopLayerWriter._write(HadoopLayerWriter.scala:122)
  at geotrellis.spark.store.hadoop.HadoopLayerWriter._write(HadoopLayerWriter.scala:38)
  at geotrellis.spark.store.LayerWriter.write(LayerWriter.scala:152)
  at geotrellis.spark.store.LayerWriter.write$(LayerWriter.scala:144)
  at geotrellis.spark.store.hadoop.HadoopLayerWriter.write(HadoopLayerWriter.scala:38)
  at com.cjx.geospark.RasterUtils$.$anonfun$pyramid$8(RasterUtils.scala:87)
  at com.cjx.geospark.RasterUtils$.$anonfun$pyramid$8$adapted(RasterUtils.scala:74)
  at geotrellis.spark.pyramid.Pyramid$.runLevel$1(Pyramid.scala:337)
  at geotrellis.spark.pyramid.Pyramid$.upLevels(Pyramid.scala:345)
  at geotrellis.spark.pyramid.Pyramid$.upLevels(Pyramid.scala:368)
  at com.cjx.geospark.RasterUtils$.pyramid(RasterUtils.scala:74)
  at com.cjx.geospark.RasterUtils$.main(RasterUtils.scala:103)
Caused by: java.io.InvalidClassException: geotrellis.layer.TileLayerMetadata; local class incompatible: stream classdesc serialVersionUID = 3142813742075090433, local class serialVersionUID = -468075711590230574
  at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:699)
  at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2005)
  at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1852)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
  at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
  at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
  at scala.collection.immutable.List$SerializationProxy.readObject(List.scala:527)
  at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1184)
  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2322)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
  at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
  at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:87)
  at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:129)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:85)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
  at org.apache.spark.scheduler.Task.run(Task.scala:136)
  at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
```

Is this issue related to the JDK version?

Local JDK: 1.8.0_291; remote Spark JDK: 1.8.0_352.

pomadchin commented 1 year ago

Hey @CaiCaiXian, took me half a year to reply; in addition to the JDK mismatch, it could also be a Scala version mismatch.
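A quick way to confirm such a mismatch is to compare the runtime versions on both sides. A minimal diagnostic sketch, assuming a live SparkContext `sc`:

```scala
// Prints the JVM and Scala versions seen by the driver and by the executors
println(s"driver:   jvm=${System.getProperty("java.version")} scala=${scala.util.Properties.versionNumberString}")
sc.parallelize(0 until 8)
  .map(_ => s"executor: jvm=${System.getProperty("java.version")} scala=${scala.util.Properties.versionNumberString}")
  .distinct()
  .collect()
  .foreach(println)
```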

pomadchin commented 1 year ago

I'll close this issue for now, but don't hesitate to reopen it!