locationtech / geotrellis

GeoTrellis is a geographic data processing engine for high performance applications.
http://geotrellis.io

GDAL errors when reading repeatedly from one GDALRasterSource #3184

Closed metasim closed 4 years ago

metasim commented 4 years ago

This error originated in some RasterFrames work. We have a table where one column predominantly references the same file, and the analysis fails with one of a number of errors from GDALDataset, such as:

geotrellis.raster.gdal.GDALIOException: Unable to read in data. GDAL Error Code: 3
    at geotrellis.raster.gdal.GDALDataset$.readTile$extension(GDALDataset.scala:324)
...

or

geotrellis.raster.gdal.MalformedDataTypeException: Unable to determine NoData value. GDAL Exception Code: 3
    at geotrellis.raster.gdal.GDALDataset$.noDataValue$extension1(GDALDataset.scala:247)
...

(See below for extended output)

I removed RasterFrames from the mix, resulting in the test case below. (At this point I have not reduced it further to take Spark out of the mix with, say, Futures instead.) It should be noted that some of the reads complete successfully.

When I run it on my laptop it completes successfully, but when I run it on a beefier EC2 instance (m5a.2xlarge) it fails. I suspect the concurrency level and I/O throughput set the conditions. It appears to work when setting --master=local[1].

Edit: my laptop runs macOS, whereas the EC2 instance is Linux; that may be the pertinent variable rather than instance size. Ran in Docker locally with 4 cores and the job succeeded. Edit: configured Docker to run with 8 cores on my laptop and it failed!

Test Case

RSRead.scala

import org.apache.spark.sql.SparkSession
import geotrellis.raster._
import geotrellis.raster.gdal.GDALRasterSource

// implicit val spark = SparkSession.builder().
//    master("local[*]").appName("Hit me").getOrCreate()

val path = "https://s22s-rasterframes-integration-tests.s3.amazonaws.com/B08.jp2"

spark.range(1000).rdd.
    map(_ => path).
    flatMap(uri => {
      val rs = GDALRasterSource(uri)
      val grid = GridBounds(0, 0, rs.cols - 1, rs.rows - 1)
      val tileBounds = grid.split(256, 256).toSeq
      rs.readBounds(tileBounds)
    }).
    foreach(r => ())
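
To take Spark out of the mix entirely, the same read pattern could be driven from a plain thread pool via Futures. A sketch along those lines (untested; it assumes the same geotrellis-gdal artifacts and GDAL native libraries are on the classpath, and the pool size of 8 is an arbitrary stand-in for local[*]):

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

import geotrellis.raster._
import geotrellis.raster.gdal.GDALRasterSource

object RSReadFutures extends App {
  val path = "https://s22s-rasterframes-integration-tests.s3.amazonaws.com/B08.jp2"

  // Mirror Spark's local[*] parallelism with a fixed-size pool.
  implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(8))

  val jobs = (1 to 100).map { _ =>
    Future {
      val rs = GDALRasterSource(path)
      val grid = GridBounds(0, 0, rs.cols - 1, rs.rows - 1)
      val tileBounds = grid.split(256, 256).toSeq
      // Force the lazy reads; any GDALIOException should surface here.
      rs.readBounds(tileBounds).foreach(_ => ())
    }
  }

  Await.result(Future.sequence(jobs), 1.hour)
}
```

If this reproduces the failure, it would confirm the race is in the GDAL binding layer rather than anything Spark-specific.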

Execution Command

Using Spark 2.4.4, Scala 2.11.12, GDAL 2.4.3 (released 2019/10/28)

spark-shell --packages org.locationtech.geotrellis:geotrellis-gdal_2.11:3.2.0 --repositories https://dl.bintray.com/azavea/geotrellis -I RSRead.scala

Sample Backtrace

Full log output

```java org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 (TID 5, localhost, executor driver): geotrellis.raster.gdal.MalformedDataTypeException: Unable to deterime the min/max values in order to calculate CellType. GDAL Error Code: 3 at geotrellis.raster.gdal.GDALDataset$.cellType$extension1(GDALDataset.scala:299) at geotrellis.raster.gdal.GDALDataset$.readTile$extension(GDALDataset.scala:315) at geotrellis.raster.gdal.GDALDataset$$anonfun$readMultibandTile$extension$1.apply(GDALDataset.scala:333) at geotrellis.raster.gdal.GDALDataset$$anonfun$readMultibandTile$extension$1.apply(GDALDataset.scala:333) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.Iterator$class.foreach(Iterator.scala:891) at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at geotrellis.raster.gdal.GDALDataset$.readMultibandTile$extension(GDALDataset.scala:333) at geotrellis.raster.gdal.GDALRasterSource$$anonfun$readBounds$2.apply(GDALRasterSource.scala:107) at geotrellis.raster.gdal.GDALRasterSource$$anonfun$readBounds$2.apply(GDALRasterSource.scala:106) at scala.collection.Iterator$$anon$11.next(Iterator.scala:410) at geotrellis.raster.gdal.GDALRasterSource.read(GDALRasterSource.scala:156) at geotrellis.raster.RasterSource$$anonfun$readBounds$2.apply(RasterSource.scala:164) at geotrellis.raster.RasterSource$$anonfun$readBounds$2.apply(RasterSource.scala:164) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435) at 
scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) at scala.collection.Iterator$class.foreach(Iterator.scala:891) at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927) at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:123) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926) at scala.Option.foreach(Option.scala:257) at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126) at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:927) at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:925) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:363) at org.apache.spark.rdd.RDD.foreach(RDD.scala:925) ... 94 elided Caused by: geotrellis.raster.gdal.MalformedDataTypeException: Unable to deterime the min/max values in order to calculate CellType. 
GDAL Error Code: 3 at geotrellis.raster.gdal.GDALDataset$.cellType$extension1(GDALDataset.scala:299) at geotrellis.raster.gdal.GDALDataset$.readTile$extension(GDALDataset.scala:315) at geotrellis.raster.gdal.GDALDataset$$anonfun$readMultibandTile$extension$1.apply(GDALDataset.scala:333) at geotrellis.raster.gdal.GDALDataset$$anonfun$readMultibandTile$extension$1.apply(GDALDataset.scala:333) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.Iterator$class.foreach(Iterator.scala:891) at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at geotrellis.raster.gdal.GDALDataset$.readMultibandTile$extension(GDALDataset.scala:333) at geotrellis.raster.gdal.GDALRasterSource$$anonfun$readBounds$2.apply(GDALRasterSource.scala:107) at geotrellis.raster.gdal.GDALRasterSource$$anonfun$readBounds$2.apply(GDALRasterSource.scala:106) at scala.collection.Iterator$$anon$11.next(Iterator.scala:410) at geotrellis.raster.gdal.GDALRasterSource.read(GDALRasterSource.scala:156) at geotrellis.raster.RasterSource$$anonfun$readBounds$2.apply(RasterSource.scala:164) at geotrellis.raster.RasterSource$$anonfun$readBounds$2.apply(RasterSource.scala:164) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) at scala.collection.Iterator$class.foreach(Iterator.scala:891) at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927) at 
org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:123) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) ```

cc: @vpipkt

metasim commented 4 years ago

GDAL formats in environment

``` $ gdalinfo --formats Supported Formats: VRT -raster- (rw+v): Virtual Raster DERIVED -raster- (ro): Derived datasets using VRT pixel functions GTiff -raster- (rw+vs): GeoTIFF NITF -raster- (rw+vs): National Imagery Transmission Format RPFTOC -raster- (rovs): Raster Product Format TOC format ECRGTOC -raster- (rovs): ECRG TOC format HFA -raster- (rw+v): Erdas Imagine Images (.img) SAR_CEOS -raster- (rov): CEOS SAR Image CEOS -raster- (rov): CEOS Image JAXAPALSAR -raster- (rov): JAXA PALSAR Product Reader (Level 1.1/1.5) GFF -raster- (rov): Ground-based SAR Applications Testbed File Format (.gff) ELAS -raster- (rw+v): ELAS AIG -raster- (rov): Arc/Info Binary Grid AAIGrid -raster- (rwv): Arc/Info ASCII Grid GRASSASCIIGrid -raster- (rov): GRASS ASCII Grid SDTS -raster- (rov): SDTS Raster DTED -raster- (rwv): DTED Elevation Raster PNG -raster- (rwv): Portable Network Graphics JPEG -raster- (rwv): JPEG JFIF MEM -raster- (rw+): In Memory Raster JDEM -raster- (rov): Japanese DEM (.mem) GIF -raster- (rwv): Graphics Interchange Format (.gif) BIGGIF -raster- (rov): Graphics Interchange Format (.gif) ESAT -raster- (rov): Envisat Image Format FITS -raster- (rw+): Flexible Image Transport System BSB -raster- (rov): Maptech BSB Nautical Charts XPM -raster- (rwv): X11 PixMap Format BMP -raster- (rw+v): MS Windows Device Independent Bitmap DIMAP -raster- (rov): SPOT DIMAP AirSAR -raster- (rov): AirSAR Polarimetric Image RS2 -raster- (rovs): RadarSat 2 XML Product SAFE -raster- (rov): Sentinel-1 SAR SAFE Product PCIDSK -raster,vector- (rw+v): PCIDSK Database File PCRaster -raster- (rw+): PCRaster Raster File ILWIS -raster- (rw+v): ILWIS Raster Map SGI -raster- (rw+v): SGI Image File Format 1.0 SRTMHGT -raster- (rwv): SRTMHGT File Format Leveller -raster- (rw+v): Leveller heightfield Terragen -raster- (rw+v): Terragen heightfield GMT -raster- (rw): GMT NetCDF Grid Format netCDF -raster,vector- (rw+s): Network Common Data Format HDF4 -raster- (ros): Hierarchical Data Format Release 
4 HDF4Image -raster- (rw+): HDF4 Dataset ISIS3 -raster- (rw+v): USGS Astrogeology ISIS cube (Version 3) ISIS2 -raster- (rw+v): USGS Astrogeology ISIS cube (Version 2) PDS -raster- (rov): NASA Planetary Data System PDS4 -raster- (rw+vs): NASA Planetary Data System 4 VICAR -raster- (rov): MIPL VICAR file TIL -raster- (rov): EarthWatch .TIL ERS -raster- (rw+v): ERMapper .ers Labelled JP2OpenJPEG -raster,vector- (rwv): JPEG-2000 driver based on OpenJPEG library L1B -raster- (rovs): NOAA Polar Orbiter Level 1b Data Set FIT -raster- (rwv): FIT Image GRIB -raster- (rwv): GRIdded Binary (.grb, .grb2) RMF -raster- (rw+v): Raster Matrix Format WCS -raster- (rovs): OGC Web Coverage Service WMS -raster- (rwvs): OGC Web Map Service MSGN -raster- (rov): EUMETSAT Archive native (.nat) RST -raster- (rw+v): Idrisi Raster A.1 INGR -raster- (rw+v): Intergraph Raster GSAG -raster- (rwv): Golden Software ASCII Grid (.grd) GSBG -raster- (rw+v): Golden Software Binary Grid (.grd) GS7BG -raster- (rw+v): Golden Software 7 Binary Grid (.grd) COSAR -raster- (rov): COSAR Annotated Binary Matrix (TerraSAR-X) TSX -raster- (rov): TerraSAR-X Product COASP -raster- (ro): DRDC COASP SAR Processor Raster R -raster- (rwv): R Object Data Store MAP -raster- (rov): OziExplorer .MAP KMLSUPEROVERLAY -raster- (rwv): Kml Super Overlay PDF -raster,vector- (rw+vs): Geospatial PDF Rasterlite -raster- (rwvs): Rasterlite MBTiles -raster,vector- (rw+v): MBTiles PLMOSAIC -raster- (ro): Planet Labs Mosaics API CALS -raster- (rwv): CALS (Type 1) WMTS -raster- (rwv): OGC Web Map Tile Service SENTINEL2 -raster- (rovs): Sentinel 2 MRF -raster- (rw+v): Meta Raster Format PNM -raster- (rw+v): Portable Pixmap Format (netpbm) DOQ1 -raster- (rov): USGS DOQ (Old Style) DOQ2 -raster- (rov): USGS DOQ (New Style) PAux -raster- (rw+v): PCI .aux Labelled MFF -raster- (rw+v): Vexcel MFF Raster MFF2 -raster- (rw+): Vexcel MFF2 (HKV) Raster FujiBAS -raster- (rov): Fuji BAS Scanner Image GSC -raster- (rov): GSC Geogrid FAST -raster- 
(rov): EOSAT FAST Format BT -raster- (rw+v): VTP .bt (Binary Terrain) 1.3 Format LAN -raster- (rw+v): Erdas .LAN/.GIS CPG -raster- (rov): Convair PolGASP IDA -raster- (rw+v): Image Data and Analysis NDF -raster- (rov): NLAPS Data Format EIR -raster- (rov): Erdas Imagine Raw DIPEx -raster- (rov): DIPEx LCP -raster- (rwv): FARSITE v.4 Landscape File (.lcp) GTX -raster- (rw+v): NOAA Vertical Datum .GTX LOSLAS -raster- (rov): NADCON .los/.las Datum Grid Shift NTv1 -raster- (rov): NTv1 Datum Grid Shift NTv2 -raster- (rw+vs): NTv2 Datum Grid Shift CTable2 -raster- (rw+v): CTable2 Datum Grid Shift ACE2 -raster- (rov): ACE2 SNODAS -raster- (rov): Snow Data Assimilation System KRO -raster- (rw+v): KOLOR Raw ROI_PAC -raster- (rw+v): ROI_PAC raster RRASTER -raster- (rw+v): R Raster BYN -raster- (rw+v): Natural Resources Canada's Geoid ARG -raster- (rwv): Azavea Raster Grid format RIK -raster- (rov): Swedish Grid RIK (.rik) USGSDEM -raster- (rwv): USGS Optional ASCII DEM (and CDED) GXF -raster- (rov): GeoSoft Grid Exchange Format DODS -raster- (ro): DAP 3.x servers KEA -raster- (rw+): KEA Image Format (.kea) BAG -raster- (rwv): Bathymetry Attributed Grid HDF5 -raster- (rovs): Hierarchical Data Format Release 5 HDF5Image -raster- (rov): HDF5 Dataset NWT_GRD -raster- (rw+v): Northwood Numeric Grid Format .grd/.tab NWT_GRC -raster- (rov): Northwood Classified Grid Format .grc/.tab ADRG -raster- (rw+vs): ARC Digitized Raster Graphics SRP -raster- (rovs): Standard Raster Product (ASRP/USRP) BLX -raster- (rwv): Magellan topo (.blx) PostGISRaster -raster- (rws): PostGIS Raster driver SAGA -raster- (rw+v): SAGA GIS Binary Grid (.sdat, .sg-grd-z) IGNFHeightASCIIGrid -raster- (rov): IGN France height correction ASCII Grid XYZ -raster- (rwv): ASCII Gridded XYZ HF2 -raster- (rwv): HF2/HFZ heightfield raster OZI -raster- (rov): OziExplorer Image File CTG -raster- (rov): USGS LULC Composite Theme Grid E00GRID -raster- (rov): Arc/Info Export E00 GRID ZMap -raster- (rwv): ZMap Plus Grid 
NGSGEOID -raster- (rov): NOAA NGS Geoid Height Grids IRIS -raster- (rov): IRIS data (.PPI, .CAPPi etc) PRF -raster- (rov): Racurs PHOTOMOD PRF RDA -raster- (ro): DigitalGlobe Raster Data Access driver EEDAI -raster- (ros): Earth Engine Data API Image SIGDEM -raster- (rwv): Scaled Integer Gridded DEM .sigdem GPKG -raster,vector- (rw+vs): GeoPackage CAD -raster,vector- (rovs): AutoCAD Driver PLSCENES -raster,vector- (ro): Planet Labs Scenes API NGW -raster,vector- (rw+s): NextGIS Web GenBin -raster- (rov): Generic Binary (.hdr Labelled) ENVI -raster- (rw+v): ENVI .hdr Labelled EHdr -raster- (rw+v): ESRI .hdr Labelled ISCE -raster- (rw+v): ISCE raster HTTP -raster,vector- (ro): HTTP Fetching Wrapper ```
pomadchin commented 4 years ago

@metasim just to confirm, have you tried GDAL 2.4.4? https://github.com/OSGeo/gdal/issues/1244 Does the same issue happen with TIFFs, or only with JP2K?

metasim commented 4 years ago

> @metasim just to confirm, have you tried GDAL 2.4.4? OSGeo/gdal#1244

Not sure... I'll try that later today.

> Does the same issue happen with TIFFs, or only with JP2K?

Don't know. It took me a week to get to a repeatable test case, so those sorts of refinements are still needed.

metasim commented 4 years ago

@pomadchin Confirmed the bug still occurs under GDAL 2.4.4, released 2020/01/08:

``` 19:15:34 ERROR Executor: Exception in task 3.0 in stage 0.0 (TID 3) geotrellis.raster.gdal.MalformedDataException: Unable to construct a RasterExtent from the Transformation given. GDAL Error Code: 3 at geotrellis.raster.gdal.GDALDataset$.rasterExtent$extension1(GDALDataset.scala:143) at geotrellis.raster.gdal.GDALRasterSource.gridExtent$lzycompute(GDALRasterSource.scala:93) at geotrellis.raster.gdal.GDALRasterSource.gridExtent(GDALRasterSource.scala:93) at geotrellis.raster.RasterMetadata$class.cols(RasterMetadata.scala:52) at geotrellis.raster.RasterSource.cols(RasterSource.scala:44) at $line20.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$2.apply(RSRead.scala:36) at $line20.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$2.apply(RSRead.scala:34) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441) at scala.collection.Iterator$class.foreach(Iterator.scala:891) at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927) at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:123) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 19:15:34 ERROR Executor: Exception in task 0.0 in stage 0.0 
(TID 0) geotrellis.raster.gdal.MalformedDataException: A bandCount of <= 0 was found. GDAL Error Code: 3 at geotrellis.raster.gdal.GDALDataset$.bandCount$extension1(GDALDataset.scala:206) at geotrellis.raster.gdal.GDALDataset$.bandCount$extension0(GDALDataset.scala:196) at geotrellis.raster.gdal.GDALRasterSource.bandCount$lzycompute(GDALRasterSource.scala:82) at geotrellis.raster.gdal.GDALRasterSource.bandCount(GDALRasterSource.scala:82) at geotrellis.raster.RasterSource$$anonfun$readBounds$2.apply(RasterSource.scala:164) at geotrellis.raster.RasterSource$$anonfun$readBounds$2.apply(RasterSource.scala:164) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) at scala.collection.Iterator$class.foreach(Iterator.scala:891) at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927) at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:123) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 19:15:34 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, localhost, executor driver): geotrellis.raster.gdal.MalformedDataException: Unable to construct 
a RasterExtent from the Transformation given. GDAL Error Code: 3 at geotrellis.raster.gdal.GDALDataset$.rasterExtent$extension1(GDALDataset.scala:143) at geotrellis.raster.gdal.GDALRasterSource.gridExtent$lzycompute(GDALRasterSource.scala:93) at geotrellis.raster.gdal.GDALRasterSource.gridExtent(GDALRasterSource.scala:93) at geotrellis.raster.RasterMetadata$class.cols(RasterMetadata.scala:52) at geotrellis.raster.RasterSource.cols(RasterSource.scala:44) at $line20.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$2.apply(RSRead.scala:36) at $line20.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$2.apply(RSRead.scala:34) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441) at scala.collection.Iterator$class.foreach(Iterator.scala:891) at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927) at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:123) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 19:15:34 ERROR TaskSetManager: Task 3 in stage 0.0 failed 1 times; aborting job 19:15:34 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): 
geotrellis.raster.gdal.MalformedDataException: A bandCount of <= 0 was found. GDAL Error Code: 3 at geotrellis.raster.gdal.GDALDataset$.bandCount$extension1(GDALDataset.scala:206) at geotrellis.raster.gdal.GDALDataset$.bandCount$extension0(GDALDataset.scala:196) at geotrellis.raster.gdal.GDALRasterSource.bandCount$lzycompute(GDALRasterSource.scala:82) at geotrellis.raster.gdal.GDALRasterSource.bandCount(GDALRasterSource.scala:82) at geotrellis.raster.RasterSource$$anonfun$readBounds$2.apply(RasterSource.scala:164) at geotrellis.raster.RasterSource$$anonfun$readBounds$2.apply(RasterSource.scala:164) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) at scala.collection.Iterator$class.foreach(Iterator.scala:891) at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927) at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:123) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) [Stage 0:> (0 + 6) / 8]org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 0.0 failed 1 times, most recent failure: Lost task 3.0 in stage 
0.0 (TID 3, localhost, executor driver): geotrellis.raster.gdal.MalformedDataException: Unable to construct a RasterExtent from the Transformation given. GDAL Error Code: 3 at geotrellis.raster.gdal.GDALDataset$.rasterExtent$extension1(GDALDataset.scala:143) at geotrellis.raster.gdal.GDALRasterSource.gridExtent$lzycompute(GDALRasterSource.scala:93) at geotrellis.raster.gdal.GDALRasterSource.gridExtent(GDALRasterSource.scala:93) at geotrellis.raster.RasterMetadata$class.cols(RasterMetadata.scala:52) at geotrellis.raster.RasterSource.cols(RasterSource.scala:44) at $anonfun$2.apply(RSRead.scala:36) at $anonfun$2.apply(RSRead.scala:34) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441) at scala.collection.Iterator$class.foreach(Iterator.scala:891) at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927) at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:123) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126) at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:927) at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:925) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:363) at org.apache.spark.rdd.RDD.foreach(RDD.scala:925) ... 94 elided Caused by: geotrellis.raster.gdal.MalformedDataException: Unable to construct a RasterExtent from the Transformation given. 
GDAL Error Code: 3 at geotrellis.raster.gdal.GDALDataset$.rasterExtent$extension1(GDALDataset.scala:143) at geotrellis.raster.gdal.GDALRasterSource.gridExtent$lzycompute(GDALRasterSource.scala:93) at geotrellis.raster.gdal.GDALRasterSource.gridExtent(GDALRasterSource.scala:93) at geotrellis.raster.RasterMetadata$class.cols(RasterMetadata.scala:52) at geotrellis.raster.RasterSource.cols(RasterSource.scala:44) at $anonfun$2.apply(RSRead.scala:36) at $anonfun$2.apply(RSRead.scala:34) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441) at scala.collection.Iterator$class.foreach(Iterator.scala:891) at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927) at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:123) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) ```
pomadchin commented 4 years ago

@metasim perfect (in terms of debugging) :D

metasim commented 4 years ago

Just ran test against this GeoTIFF:

https://s3-us-west-2.amazonaws.com/landsat-pds/c1/L8/017/033/LC08_L1TP_017033_20181010_20181030_01_T1/LC08_L1TP_017033_20181010_20181030_01_T1_B4.TIF

And it does complete successfully. Perhaps it's a GDAL JP2 issue?

pomadchin commented 4 years ago

@metasim ¯\\_(ツ)_/¯ requires a bit more investigation; it could just be because of the random nature of this issue. I wish we could reproduce it on a laptop :/

metasim commented 4 years ago

In RasterFrames, I added a global thread lock to GDALRasterSource when JP2 files are being read, and the job completes (albeit extremely slowly). Another mark pointing toward a race condition.

Screen Shot 2020-02-10 at 11 17 24 AM
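The workaround above can be sketched roughly like this (names are illustrative, not the actual RasterFrames change; the point is that a single JVM-wide lock serializes every call into the native GDAL code):

```scala
// Hedged sketch of the workaround: funnel every GDAL read through one
// JVM-wide lock so only a single thread is inside native code at a time.
// `readTile` stands in for the real GDALRasterSource.read call.
object GlobalGdalLock

def locked[A](readTile: => A): A =
  GlobalGdalLock.synchronized(readTile)
```

This trades away all read parallelism for safety, which would explain the severe slowdown.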
pomadchin commented 4 years ago

@metasim sounds really sad, slow, and not too reliable

metasim commented 4 years ago

Looking to try to reproduce at a lower level.
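One hypothetical way to take Spark out of the picture, as mentioned in the original report, is to drive the same concurrent reads with plain Futures. Here `read` is a parameter so the harness is self-contained; in the real test it would be something like `GDALRasterSource(path).read()`:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical Spark-free harness: fire `n` concurrent reads and wait for
// all of them. Any GDAL race should surface as a failed Future.
def hammer[A](n: Int)(read: () => A): Seq[A] =
  Await.result(Future.sequence(Seq.fill(n)(Future(read()))), 5.minutes)
```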

metasim commented 4 years ago

Wondering if this might be the cause (fixed in 3.0.2):

https://github.com/OSGeo/gdal/blob/ee535a1a3f5b35b0d231e1faac89ac1f889f7988/gdal/NEWS#L232-L238

pomadchin commented 4 years ago

@metasim I think it makes sense to try to use GDAL 3.0.4

metasim commented 4 years ago

Working on it.

metasim commented 4 years ago

@pomadchin gdal-warp-bindings won't link against 3.0.4... looks like it's requiring the 2.x line.

java.lang.UnsatisfiedLinkError: /tmp/nativeutils837692180397/libgdalwarp_bindings.so: libgdal.so.20: cannot open shared object file: No such file or directory
metasim commented 4 years ago

I was able to hack together a new gdal-warp-bindings for Linux linked against GDAL 3.0.4. Good news is that they link:

from pyrasterframes.utils import gdal_version
gdal_version()
...
'GDAL 3.0.4, released 2020/01/28'

Bad news is that the bug is still there. 😢

geotrellis.raster.gdal.MalformedDataException: Unable to construct a RasterExtent from the Transformation given. GDAL Error Code: 3
    at geotrellis.raster.gdal.GDALDataset$.rasterExtent$extension1(GDALDataset.scala:143)
    at geotrellis.raster.gdal.GDALRasterSource.gridExtent$lzycompute(GDALRasterSource.scala:93)
    at geotrellis.raster.gdal.GDALRasterSource.gridExtent(GDALRasterSource.scala:93)

JupyterLab.pdf

metasim commented 4 years ago

BTW, it may be worth trying to run the Test Case on a non-AWS Linux machine or Docker container. My laptop is MacOS, so OS is a variable changed between local vs remote execution. It may not have to do with it being EC2 or a particular instance size.

metasim commented 4 years ago

Test case using custom build

Custom gdal-warp-bindings built against GDAL 3.0.4, Custom GeoTrellis 3.2.x build.

First create a shell in the environment:

$ docker run -it s22s/rasterframes-notebook:0.9.0-astraea.452747b4  bash
$ wget https://gist.githubusercontent.com/metasim/5332ac959d97d9747921197cd4307948/raw/662687c9b5c52083b007b451b6530f0505b2c9fc/ParallelJP2.scala && echo ':load ParallelJP2.scala' | spark-shell --jars /opt/conda/lib/python3.7/site-packages/pyrasterframes/jars/pyrasterframes-assembly-0.9.0-astraea.452747b4.jar 

Note: Running this locally does not fail. Maybe 8 or more cores are needed?

Edit: With Docker on MacOS configured with all 8 cores, the job above does indeed fail.

metasim commented 4 years ago

Custom gdal-warp-bindings

Create the file testing.list in gdal-warp-bindings/Docker with this:

deb  [ allow-insecure=yes ] http://http.us.debian.org/debian testing main non-free contrib
deb-src  [ allow-insecure=yes ] http://http.us.debian.org/debian testing main non-free contrib

Replace the # Build GDAL 2.4.3 Linux section of gdal-warp-bindings/Docker/Dockerfile.environment with this:

COPY testing.list  /etc/apt/sources.list.d/
RUN apt-get update -q && apt-get install -y -q --allow-unauthenticated libgdal-dev=3.0.4+dfsg-1

Build the image, and note its ID or tag it.

In the gdal-warp-bindings directory, run:

docker run -it --rm \
      -v $(pwd):/workdir \
      -e CC=gcc -e CXX=g++ \
      -e CFLAGS="-Wall -Wno-sign-compare -Werror -O0 -ggdb3 -DSO_FINI -D_GNU_SOURCE" \
      -e BOOST_ROOT="/usr/local/include/boost_1_69_0" \
      -e JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64" \
      <image tag or ID from above> make -j4 -C src tests

Note the location of file gdal-warp-bindings/src/main/gdalwarp.jar.

Edit geotrellis/project/Dependencies.scala and replace

val gdalWarp            = "com.azavea.gdal"              % "gdal-warp-bindings"      % Version.gdalWarp

with

val gdalWarp = "com.azavea.gdal" % "gdal-warp-bindings"  % Version.gdalWarp from("file:/path/to/gdal-warp-bindings/src/main/gdalwarp.jar")

Build GeoTrellis.

metasim commented 4 years ago

Tweaking Number of Cores in Docker

Edit: I was running this at home over mediocre WiFi. The office environment is 1Gbps wired.

vpipkt commented 4 years ago

Update on the script to reproduce it. From within the docker container:

 $ PROJ_LIB=/opt/conda/share/proj spark-shell --master local[8] --jars /opt/conda/lib/python3.7/site-packages/pyrasterframes/jars/pyrasterframes-assembly-0.9.0-astraea.452747b4.jar 
 scala> :load ParallelJP2.scala

Although I do not reproduce the failure with 8 cores.

Test case using custom build

Custom gdal-warp-bindings built against GDAL 3.0.4, Custom GeoTrellis 3.2.x build.

First create a shell in the environment:

$ docker run -it s22s/rasterframes-notebook:0.9.0-astraea.452747b4  bash
$ wget https://gist.githubusercontent.com/metasim/5332ac959d97d9747921197cd4307948/raw/662687c9b5c52083b007b451b6530f0505b2c9fc/ParallelJP2.scala
$ spark-shell --jars /opt/conda/lib/python3.7/site-packages/pyrasterframes/jars/pyrasterframes-assembly-0.9.0-astraea.452747b4.jar 
scala> :load ParallelJP2.scala

Note: Running this locally does not fail. Maybe 8 or more cores are needed?

Edit: With Docker on MacOS configured with all 8 cores, the job above does indeed fail.

metasim commented 4 years ago

@vpipkt What happens if you leave out the --master local[8]? I did not specify the number of cores that way.... I just left it to Spark defaults, but configured Docker to have 8 cores.

metasim commented 4 years ago

@vpipkt Also, if you re-run it, can you do docker pull s22s/rasterframes-notebook:0.9.0-astraea.452747b4 first? I updated it to have the PROJ_LIB done for you.

vpipkt commented 4 years ago

I pulled the image again (image id 26d9771deb79), and ran again omitting the explicit --master local[8] and did not reproduce the bug. :-(

metasim commented 4 years ago

Same.... on wired internet at work it's passing. 😠 These results were from running it at home on mediocre WiFi.

metasim commented 4 years ago

When using my phone's hotspot with 8 cores, it fails.

metasim commented 4 years ago

Bandwidth Limiting on MacOS

The "Additional Tools for Xcode 11" package includes a tool called Network Link Conditioner that simulates slow or error-prone networks:

Screen Shot 2020-02-12 at 10 08 21 AM

When using this tool (and remembering to flip the "On" switch), the test fails.

Edit: If it disappears from your System Preferences after install, do this: https://agilewarrior.wordpress.com/2018/10/31/trouble-installing-link-conditioner/

pomadchin commented 4 years ago

It's also possible to reproduce it on EC2 m4.4xlarge:

$ sudo yum install tc
$ sudo tc qdisc add dev eth0 root netem delay 500ms
$ docker run -it --cpus=8 -u root s22s/rasterframes-notebook:0.9.0-astraea.452747b4  bash
$ wget https://gist.githubusercontent.com/metasim/5332ac959d97d9747921197cd4307948/raw/662687c9b5c52083b007b451b6530f0505b2c9fc/ParallelJP2.scala
$ spark-shell --jars /opt/conda/lib/python3.7/site-packages/pyrasterframes/jars/pyrasterframes-assembly-0.9.0-astraea.452747b4.jar 
> :load ParallelJP2.scala

P.S. if you want your fast connection back:

sudo tc qdisc del dev eth0 root netem

P.P.S. It is weird that I could not reproduce it as part of a unit test.

pomadchin commented 4 years ago

I could not make it fail for the TIFF case in the same environment as well.

pomadchin commented 4 years ago

Hey @metasim @vpipkt , check out these steps please:

$ docker run -it --cpus 8 -v ${PWD}/geotrellis:/geotrellis daunnc/gdalwarpenv:0.2 bash
// also throw some aws credentials into the container
$ spark-shell --packages org.locationtech.geotrellis:geotrellis-gdal_2.11:3.2.1-SNAPSHOT --repositories https://dl.bintray.com/azavea/geotrellis --jars /geotrellis/gdalwarp.jar 

The program:

import org.apache.spark.sql.SparkSession
import geotrellis.raster._
import geotrellis.raster.gdal.GDALRasterSource
import geotrellis.raster.gdal.config._

// this one is optional
GDALOptionsConfig.registerOptions(
  "CPL_DEBUG" -> "ON",
   "GDAL_DISABLE_READDIR_ON_OPEN" -> "YES",
   "CPL_VSIL_CURL_ALLOWED_EXTENSIONS" -> ".tif"
)

val path = "https://s22s-rasterframes-integration-tests.s3.amazonaws.com/B08.jp2"

spark.range(1000).rdd.
    map(_ => path).
    flatMap(uri => {
      val rs = GDALRasterSource(uri)
      val grid = GridBounds(0, 0, rs.cols - 1, rs.rows - 1)
      val tileBounds = grid.split(256, 256).toSeq
      rs.readBounds(tileBounds)
    }).
    foreach(r => ())

TL;DR: the container has GeoTrellis with bindings built against GDAL 3.0.4. I also noticed a typo in the bindings code (this fix is applied): https://github.com/geotrellis/gdal-warp-bindings/pull/81

Let me know whether it works for you or not. It looks like it worked on EC2. ¯\_(ツ)_/¯

P.P.S. Ouch, I remembered that it was a mounted volume, so probably nothing is persisted ): I will rebuild the image today or tomorrow; or you can check that fix without it by rebuilding the bindings with the fix applied.

vpipkt commented 4 years ago

@pomadchin @metasim looking at it now...

pomadchin commented 4 years ago

@vpipkt in parallel I will try to provide you a container with all deps built in

vpipkt commented 4 years ago

@pomadchin it is probably best if you can provide that.

pomadchin commented 4 years ago

@vpipkt

$ docker run -it --cpus 8 daunnc/gdalwarpenv:0.5 bash
// also throw some aws credentials into the container
$ spark-shell --packages org.locationtech.geotrellis:geotrellis-gdal_2.11:3.2.1-SNAPSHOT --repositories https://dl.bintray.com/azavea/geotrellis --jars /home/jovyan/gdalwarp.jar
import org.apache.spark.sql.SparkSession
import geotrellis.raster._
import geotrellis.raster.gdal.GDALRasterSource
import geotrellis.raster.gdal.config._

// this one is optional
GDALOptionsConfig.registerOptions(
  "CPL_DEBUG" -> "ON",
   "GDAL_DISABLE_READDIR_ON_OPEN" -> "YES",
   "CPL_VSIL_CURL_ALLOWED_EXTENSIONS" -> ".tif,.jp2"
)

val path = "s3://geotrellis-test/daunnc/B08.jp2"

spark.range(1000).rdd.
    map(_ => path).
    flatMap(uri => {
      val rs = GDALRasterSource(uri)
      val grid = GridBounds(0, 0, rs.cols - 1, rs.rows - 1)
      val tileBounds = grid.split(256, 256).toSeq
      rs.readBounds(tileBounds)
    }).
    foreach(r => ())

Check it out in your test env; this time everything is included

P.S. I don't expect that it would fix all the problems actually but can produce a new stacktrace as well

metasim commented 4 years ago

@pomadchin Is gdalwarpenv private?

$ docker pull  daunnc/gdalwarpenv:0.5
Error response from daemon: manifest for daunnc/gdalwarpenv:0.5 not found: manifest unknown: manifest unknown
pomadchin commented 4 years ago

@metasim oops, fixed! (accidentally deleted it from the registry)

metasim commented 4 years ago

Passes locally with 8 cores and DSL-level networking...

pomadchin commented 4 years ago

@metasim I also noticed that jp2k reads behave somewhat differently (talking about both the network and the reads themselves). Will wait until you submit some prod / staging jobs!

metasim commented 4 years ago

Will wait until you submit some prod / staging jobs!

WDYM? Looking for us to test with a real job?

pomadchin commented 4 years ago

@metasim yep; as I still feel that there can be a window for some problems (just want to be sure)

metasim commented 4 years ago

Will do.

metasim commented 4 years ago

@pomadchin Not looking good. The original job failed in the same way. Doing some extra checking to make sure I deployed the right thing. The md5sum of libgdalwarp_bindings.so is 00ecbde671e5cd93ebbba0aa4967ef3b. What we'd need to check next is whether we're using the exact same GDAL distribution you are. Where/when/how did you get the one in the Docker image?

pomadchin commented 4 years ago

@metasim that is the proper checksum. The GDAL version is 3.0.4 and it is from the image s22s/rasterframes-notebook:0.9.0-astraea.452747b4 (I just committed a running container). Can you show me the new stacktrace? I think it is slightly different.

Is it something like the exception below?

geotrellis.raster.gdal.GDALIOException: Unable to read in data. GDAL Error Code: 3
    at geotrellis.raster.gdal.GDALDataset$.readTile$extension(GDALDataset.scala:324)
metasim commented 4 years ago

@pomadchin This is what I'm seeing:

geotrellis.raster.gdal.MalformedDataException: Unable to construct a RasterExtent from the Transformation given. GDAL Error Code: 3
    at geotrellis.raster.gdal.GDALDataset$.rasterExtent$extension1(GDALDataset.scala:143)
```java Py4JJavaError: An error occurred while calling o132.collectToPython. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 64 in stage 1.0 failed 1 times, most recent failure: Lost task 64.0 in stage 1.0 (TID 198, localhost, executor driver): geotrellis.raster.gdal.MalformedDataException: Unable to construct a RasterExtent from the Transformation given. GDAL Error Code: 3 at geotrellis.raster.gdal.GDALDataset$.rasterExtent$extension1(GDALDataset.scala:143) at geotrellis.raster.gdal.GDALRasterSource.gridExtent$lzycompute(GDALRasterSource.scala:93) at geotrellis.raster.gdal.GDALRasterSource.gridExtent(GDALRasterSource.scala:93) at geotrellis.raster.RasterMetadata$class.cols(RasterMetadata.scala:52) at geotrellis.raster.RasterSource.cols(RasterSource.scala:44) at geotrellis.raster.RasterSource.cols(RasterSource.scala:44) at geotrellis.raster.Grid.dimensions(Grid.scala:26) at geotrellis.raster.gdal.GDALRasterSource$$anonfun$readBounds$1.apply(GDALRasterSource.scala:105) at geotrellis.raster.gdal.GDALRasterSource$$anonfun$readBounds$1.apply(GDALRasterSource.scala:105) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441) at scala.collection.Iterator$$anon$12.next(Iterator.scala:445) at scala.collection.Iterator$$anon$11.next(Iterator.scala:410) at org.locationtech.rasterframes.ref.RFRasterSource.read(RFRasterSource.scala:61) at org.locationtech.rasterframes.ref.RasterRef.realizedTile$lzycompute(RasterRef.scala:57) at org.locationtech.rasterframes.ref.RasterRef.realizedTile(RasterRef.scala:55) at org.locationtech.rasterframes.ref.RasterRef$RasterRefTile.delegate(RasterRef.scala:72) at org.locationtech.rasterframes.tiles.FixedDelegatingTile.combine(FixedDelegatingTile.scala:31) at geotrellis.raster.Tile.dualCombine(Tile.scala:93) at geotrellis.raster.mapalgebra.local.LocalTileBinaryOp$class.apply(LocalTileBinaryOp.scala:56) at 
geotrellis.raster.mapalgebra.local.Subtract$.apply(Subtract.scala:29) at geotrellis.raster.mapalgebra.local.SubtractMethods$class.localSubtract(Subtract.scala:57) at geotrellis.raster.mapalgebra.local.Implicits$withTileLocalMethods.localSubtract(Implicits.scala:25) at org.locationtech.rasterframes.expressions.localops.NormalizedDifference.op(NormalizedDifference.scala:47) at org.locationtech.rasterframes.expressions.BinaryRasterOp$class.nullSafeEval(BinaryRasterOp.scala:66) at org.locationtech.rasterframes.expressions.localops.NormalizedDifference.nullSafeEval(NormalizedDifference.scala:44) at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:484) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_6_1$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at scala.collection.Iterator$$anon$11.next(Iterator.scala:410) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:256) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:123) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126) at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:363) at org.apache.spark.rdd.RDD.collect(RDD.scala:944) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:299) at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:3263) at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:3260) at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370) at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369) at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3260) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:745) Caused by: geotrellis.raster.gdal.MalformedDataException: Unable to construct a RasterExtent from the Transformation given. 
GDAL Error Code: 3 at geotrellis.raster.gdal.GDALDataset$.rasterExtent$extension1(GDALDataset.scala:143) at geotrellis.raster.gdal.GDALRasterSource.gridExtent$lzycompute(GDALRasterSource.scala:93) at geotrellis.raster.gdal.GDALRasterSource.gridExtent(GDALRasterSource.scala:93) at geotrellis.raster.RasterMetadata$class.cols(RasterMetadata.scala:52) at geotrellis.raster.RasterSource.cols(RasterSource.scala:44) at geotrellis.raster.RasterSource.cols(RasterSource.scala:44) at geotrellis.raster.Grid.dimensions(Grid.scala:26) at geotrellis.raster.gdal.GDALRasterSource$$anonfun$readBounds$1.apply(GDALRasterSource.scala:105) at geotrellis.raster.gdal.GDALRasterSource$$anonfun$readBounds$1.apply(GDALRasterSource.scala:105) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441) at scala.collection.Iterator$$anon$12.next(Iterator.scala:445) at scala.collection.Iterator$$anon$11.next(Iterator.scala:410) at org.locationtech.rasterframes.ref.RFRasterSource.read(RFRasterSource.scala:61) at org.locationtech.rasterframes.ref.RasterRef.realizedTile$lzycompute(RasterRef.scala:57) at org.locationtech.rasterframes.ref.RasterRef.realizedTile(RasterRef.scala:55) at org.locationtech.rasterframes.ref.RasterRef$RasterRefTile.delegate(RasterRef.scala:72) at org.locationtech.rasterframes.tiles.FixedDelegatingTile.combine(FixedDelegatingTile.scala:31) at geotrellis.raster.Tile.dualCombine(Tile.scala:93) at geotrellis.raster.mapalgebra.local.LocalTileBinaryOp$class.apply(LocalTileBinaryOp.scala:56) at geotrellis.raster.mapalgebra.local.Subtract$.apply(Subtract.scala:29) at geotrellis.raster.mapalgebra.local.SubtractMethods$class.localSubtract(Subtract.scala:57) at geotrellis.raster.mapalgebra.local.Implicits$withTileLocalMethods.localSubtract(Implicits.scala:25) at org.locationtech.rasterframes.expressions.localops.NormalizedDifference.op(NormalizedDifference.scala:47) at 
org.locationtech.rasterframes.expressions.BinaryRasterOp$class.nullSafeEval(BinaryRasterOp.scala:66) at org.locationtech.rasterframes.expressions.localops.NormalizedDifference.nullSafeEval(NormalizedDifference.scala:44) at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:484) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_6_1$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at scala.collection.Iterator$$anon$11.next(Iterator.scala:410) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:256) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:123) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ... 1 more ```
pomadchin commented 4 years ago

Ok, sounds pretty sad. I'll continue looking into the reasons for it.

metasim commented 4 years ago

Can you reproduce?

pomadchin commented 4 years ago

I could reproduce it at geotrellis.raster.gdal.GDALDataset$.readTile$extension a couple of times, but after applying the fix from the PR https://github.com/geotrellis/gdal-warp-bindings/pull/81 the behavior is no longer consistent.

metasim commented 4 years ago

@pomadchin Got any ideas on what to try next?

pomadchin commented 4 years ago

@metasim going to try to make it more reproducible (without the traffic slowdown); I also have the idea of writing a C/C++ unit test for this case, since it might be more concurrent than the JVM version.

If there were a C++ unit test for this case, that would be amazing; it would allow me to get a normal backtrace (hopefully)

pomadchin commented 4 years ago

~ removed this message since it was useless and confusing ~ The next step is to print the error message in some readable fashion, since the error code alone is not enough.

pomadchin commented 4 years ago

hey @metasim it may be the case that the Dockerfile I sent you contains a bad gdalwarp.jar

Will publish a gdal-warp-bindings version soon to avoid the confusion. It seems to me that I can't reproduce this error anymore.