locationtech / geotrellis

GeoTrellis is a geographic data processing engine for high performance applications.
http://geotrellis.io

Error occurred when using Etl to slice the grid tiff file in hdfs #3508

Open KiktMa opened 1 year ago

KiktMa commented 1 year ago

An error occurred while using the ETL module from GeoTrellis to build a pyramid for raster data in HDFS and store it in Accumulo.

I am using GeoTrellis 2.1.0, Scala 2.11.7, Hadoop 2.7.7, Spark 2.3.4, and JDK 1.8.

After writing input.json, output.json, and backend-profiles.json, I use spark-submit to submit the task geotrellis.spark.etl.SinglebandIngest:

./bin/spark-submit --class geotrellis.spark.etl.SinglebandIngest --master yarn /usr/local/app/spark/spark-2.3.4/jars/geotrellis-spark-etl_2.11-2.1.0.jar --input file:///app/tif/json/input.json --output file:///app/tif/json/output.json --backend-profiles file:///app/tif/json/backend-profiles.json

Error output:

TaskSetManager:66 - Lost task 0.0 in stage 0.0 (TID 0, node1, executor 2): java.lang.NegativeArraySizeException
        at scala.reflect.ManifestFactory$$anon$6.newArray(Manifest.scala:93)
        at scala.reflect.ManifestFactory$$anon$6.newArray(Manifest.scala:91)
        at scala.Array$.ofDim(Array.scala:218)
        at geotrellis.raster.UByteArrayTile$.ofDim(UByteArrayTile.scala:239)
        at geotrellis.raster.UByteArrayTile$.empty(UByteArrayTile.scala:267)
        at geotrellis.raster.ArrayTile$.empty(ArrayTile.scala:431)
        at geotrellis.raster.io.geotiff.GeoTiffTile.mutable(GeoTiffTile.scala:698)
        at geotrellis.raster.io.geotiff.GeoTiffTile.toArrayTile(GeoTiffTile.scala:690)
        at geotrellis.spark.io.RasterReader$$anon$1.readFully(RasterReader.scala:67)
        at geotrellis.spark.io.RasterReader$$anon$1.readFully(RasterReader.scala:63)
        at geotrellis.spark.io.hadoop.HadoopGeoTiffRDD$$anonfun$apply$5$$anonfun$apply$6.apply(HadoopGeoTiffRDD.scala:148)
        at geotrellis.spark.io.hadoop.HadoopGeoTiffRDD$$anonfun$apply$5$$anonfun$apply$6.apply(HadoopGeoTiffRDD.scala:147)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:185)
        at scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1336)
        at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$14.apply(RDD.scala:1021)
        at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$14.apply(RDD.scala:1019)
        at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:2130)
        at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:2130)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

My TIFF file is only 180 MB. How can I solve this problem? I increased the driver memory to 2 GB, but the error persists.

pomadchin commented 1 year ago

Hello @KiktMa, we dropped support for the etl module, and it now lives in a stale state in a separate repo: https://github.com/geotrellis/spark-etl

Many of the old GeoTrellis issues have already been addressed; I'd recommend trying one of the most up-to-date versions.

Could I also ask you to post the gdalinfo output for the TIFF here? (GIS / other sensitive data can be omitted.)

KiktMa commented 1 year ago

Thank you for your reply. As this TIFF is a confidential file, I'm sorry, but I can't publish it. @pomadchin

pomadchin commented 1 year ago

@KiktMa only the gdalinfo output with no sensitive data is needed; no tags / extent / etc.

The point of the gdalinfo output is to understand the TIFF segment structure. The data I need in order to help is the Size and Band metadata (size & type):

gdalinfo file.tif
Driver: GTiff/GeoTIFF
Files: file.tif
Size is 6000, 6000
Coordinate System is <removed>
Metadata:
  <removed>
Image Structure Metadata:
  <removed>
Corner Coordinates:
<removed>
Band 1 Block=6000x1 Type=Int16, ColorInterp=<removed>

KiktMa commented 1 year ago

Files: D:\test_tif\caijian.tif
Size is 63472, 61105
Coordinate System is:
Metadata:
Image Structure Metadata:
Corner Coordinates:
Band 1 Block=63472x1 Type=Byte, ColorInterp=

I also want to ask: if I use GeoTrellis 3.5, how do I read the raster TIFF from HDFS, build a pyramid, and upload the result to Accumulo? @pomadchin

pomadchin commented 1 year ago

@KiktMa I think this is related to https://github.com/locationtech/geotrellis/issues/1691, so it is a duplicate of a known issue.

The solution is to try GDALRasterSource and / or to re-tile the TIFF so that it is TILED, using the command gdal_translate in.tif out.tif -co BIGTIFF=YES -co TILED=YES -co COMPRESS=LZW

An example of reading TIFFs via the RasterSource API and building a pyramid: https://github.com/pomadchin/vlm-performance/blob/feature/gt-3.x/src/main/scala/geotrellis/contrib/performance/IngestRasterSource.scala#L52-L72
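
For context on the NegativeArraySizeException in the trace above: the UByteArrayTile.ofDim frame suggests that the reader tries to allocate a single array for the whole raster. The Java sketch below assumes the array length is computed as cols * rows in 32-bit arithmetic (an assumption based on the stack trace, not confirmed in this thread) and uses the Size reported by gdalinfo. The class and method names are made up for illustration.

```java
public class TileSizeCheck {
    // Cell count in 32-bit int arithmetic, which silently wraps on overflow.
    static int cellCountInt(int cols, int rows) {
        return cols * rows;
    }

    // The same count in 64-bit arithmetic, which does not overflow here.
    static long cellCountLong(int cols, int rows) {
        return (long) cols * rows;
    }

    public static void main(String[] args) {
        // "Size is 63472, 61105" from the gdalinfo output in this thread.
        int cols = 63472;
        int rows = 61105;

        System.out.println(cellCountLong(cols, rows));                     // 3878456560
        System.out.println(cellCountInt(cols, rows));                      // -416510736
        System.out.println(cellCountLong(cols, rows) > Integer.MAX_VALUE); // true
    }
}
```

With one byte per cell (Type=Byte), the uncompressed band is about 3.9 GB, which is more than a single JVM array can hold no matter how much driver memory is configured. The 180 MB on disk is the compressed size.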

KiktMa commented 1 year ago

@pomadchin Hello, sorry to bother you again. I have another question. I have already stored the pyramid in Accumulo, but I cannot understand the structure of the table.

\x00\x00\x00\x00\x00\x1E"\xCB layer_slope:11: []    x\x9C\xED\xD1\xB1\x0D\xC2@\x00\x04A\xCB\x11\x01\xE4H$TBK\xDF\x82k\xA0\x18\xDA{L\x11\xE8\x83\x9D\xB9\x06N\xDA\xFD}\xFF\xDC\xAE\xC7\xE5\xDC\xF1\xDC\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xFE\xE01\xC6X\xFD\x81u^s\xCE\xD5\x1FXG\xFF6\

This is part of the table. I know that \x00\x00\x00\x00\x00\x1E"\xCB is the row id, but I am not familiar with this encoding. How should I parse the encoding of the value? There is also a timestamp in the table. I asked ChatGPT, but it gave a different answer, and now I am very confused.
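
As a starting point for reading the scan line above, here is a hedged Java sketch. It assumes the row id is an 8-byte big-endian long (likely the tile's space-filling-curve index) and that the entry reads as row id, column family layer_slope:11 (layer name and zoom level), empty qualifier, empty visibility [], then value. None of this is confirmed in the thread. The Accumulo shell prints printable bytes literally, so the " in the row id is byte 0x22 and the leading x in the value is byte 0x78. The bytes 0x78 0x9C form a standard zlib header, so the value is probably zlib-compressed data (possibly an Avro-encoded tile, which is also an assumption). The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;

public class AccumuloKeyPeek {
    // Interprets an 8-byte row id as a big-endian signed long.
    static long rowIdToIndex(byte[] rowId) {
        if (rowId.length != 8) {
            throw new IllegalArgumentException("expected 8 bytes");
        }
        return ByteBuffer.wrap(rowId).getLong(); // ByteBuffer is big-endian by default
    }

    // True when the first two bytes form a valid zlib header (RFC 1950):
    // compression method 8 (deflate) and a header checksum divisible by 31.
    static boolean looksLikeZlib(byte[] value) {
        if (value.length < 2) {
            return false;
        }
        int cmf = value[0] & 0xFF;
        int flg = value[1] & 0xFF;
        return (cmf & 0x0F) == 8 && ((cmf << 8) | flg) % 31 == 0;
    }

    public static void main(String[] args) {
        // \x00\x00\x00\x00\x00\x1E"\xCB from the scan output ('"' is 0x22).
        byte[] rowId = {0, 0, 0, 0, 0, 0x1E, 0x22, (byte) 0xCB};
        // First bytes of the value: x\x9C\xED\xD1 ('x' is 0x78).
        byte[] valuePrefix = {0x78, (byte) 0x9C, (byte) 0xED, (byte) 0xD1};

        System.out.println(rowIdToIndex(rowId));        // 1974987
        System.out.println(looksLikeZlib(valuePrefix)); // true
    }
}
```

If the zlib assumption holds, a complete (untruncated) value could be decompressed with java.util.zip.Inflater before any further decoding. The timestamp is part of every Accumulo key and is not part of the tile encoding.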