locationtech / geotrellis

GeoTrellis is a geographic data processing engine for high performance applications.
http://geotrellis.io

GDAL errors when reading repeatedly from one GDALRasterSource #3184

Closed (metasim closed this 4 years ago)

metasim commented 4 years ago

This error originated in some RasterFrames work. We have a table where one column is predominantly the same file and the analysis fails with one of a number of errors from GDALDataset, such as:

geotrellis.raster.gdal.GDALIOException: Unable to read in data. GDAL Error Code: 3
    at geotrellis.raster.gdal.GDALDataset$.readTile$extension(GDALDataset.scala:324)
...

or

geotrellis.raster.gdal.MalformedDataTypeException: Unable to determine NoData value. GDAL Exception Code: 3
    at geotrellis.raster.gdal.GDALDataset$.noDataValue$extension1(GDALDataset.scala:247)
...

(See below for extended output)

I removed RasterFrames from the mix, resulting in the test case below. (At this point I have not reduced further to take Spark out of the mix with, say, Futures instead.) It should be noted that some of the reads complete successfully.

When I run it on my laptop it completes successfully, but when I run it on a beefier EC2 instance (m5a.2xlarge) it fails. I suspect the concurrency level and I/O throughput set the conditions. It appears to work when setting --master=local[1].

Edit: my laptop runs macOS, whereas the EC2 instance is Linux; that may be the pertinent variable rather than instance size. I ran the job in Docker locally with 4 cores and it succeeded. Edit: configured Docker to run with 8 cores on my laptop and it failed!

Test Case

RSRead.scala

import org.apache.spark.sql.SparkSession
import geotrellis.raster._
import geotrellis.raster.gdal.GDALRasterSource

// spark-shell already provides an implicit `spark` session; uncomment to run standalone:
// implicit val spark = SparkSession.builder().
//    master("local[*]").appName("Hit me").getOrCreate()

val path = "https://s22s-rasterframes-integration-tests.s3.amazonaws.com/B08.jp2"

// Read the same remote JP2 1000 times in parallel, tiling each read into 256x256 windows.
spark.range(1000).rdd.
    map(_ => path).
    flatMap(uri => {
      val rs = GDALRasterSource(uri)
      val grid = GridBounds(0, 0, rs.cols - 1, rs.rows - 1)
      val tileBounds = grid.split(256, 256).toSeq
      rs.readBounds(tileBounds)
    }).
    foreach(r => ()) // force evaluation; the tiles themselves are discarded

Execution Command

Using Spark 2.4.4, Scala 2.11.12, GDAL 2.4.3 (released 2019/10/28)

spark-shell --packages org.locationtech.geotrellis:geotrellis-gdal_2.11:3.2.0 --repositories https://dl.bintray.com/azavea/geotrellis -I RSRead.scala

Sample Backtrace

Full log output

```java
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 (TID 5, localhost, executor driver): geotrellis.raster.gdal.MalformedDataTypeException: Unable to deterime the min/max values in order to calculate CellType. GDAL Error Code: 3
    at geotrellis.raster.gdal.GDALDataset$.cellType$extension1(GDALDataset.scala:299)
    at geotrellis.raster.gdal.GDALDataset$.readTile$extension(GDALDataset.scala:315)
    at geotrellis.raster.gdal.GDALDataset$$anonfun$readMultibandTile$extension$1.apply(GDALDataset.scala:333)
    at geotrellis.raster.gdal.GDALDataset$$anonfun$readMultibandTile$extension$1.apply(GDALDataset.scala:333)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.Iterator$class.foreach(Iterator.scala:891)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
    at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at geotrellis.raster.gdal.GDALDataset$.readMultibandTile$extension(GDALDataset.scala:333)
    at geotrellis.raster.gdal.GDALRasterSource$$anonfun$readBounds$2.apply(GDALRasterSource.scala:107)
    at geotrellis.raster.gdal.GDALRasterSource$$anonfun$readBounds$2.apply(GDALRasterSource.scala:106)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
    at geotrellis.raster.gdal.GDALRasterSource.read(GDALRasterSource.scala:156)
    at geotrellis.raster.RasterSource$$anonfun$readBounds$2.apply(RasterSource.scala:164)
    at geotrellis.raster.RasterSource$$anonfun$readBounds$2.apply(RasterSource.scala:164)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
    at scala.collection.Iterator$class.foreach(Iterator.scala:891)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:927)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:925)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.foreach(RDD.scala:925)
    ... 94 elided
Caused by: geotrellis.raster.gdal.MalformedDataTypeException: Unable to deterime the min/max values in order to calculate CellType. GDAL Error Code: 3
    at geotrellis.raster.gdal.GDALDataset$.cellType$extension1(GDALDataset.scala:299)
    at geotrellis.raster.gdal.GDALDataset$.readTile$extension(GDALDataset.scala:315)
    at geotrellis.raster.gdal.GDALDataset$$anonfun$readMultibandTile$extension$1.apply(GDALDataset.scala:333)
    at geotrellis.raster.gdal.GDALDataset$$anonfun$readMultibandTile$extension$1.apply(GDALDataset.scala:333)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.Iterator$class.foreach(Iterator.scala:891)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
    at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at geotrellis.raster.gdal.GDALDataset$.readMultibandTile$extension(GDALDataset.scala:333)
    at geotrellis.raster.gdal.GDALRasterSource$$anonfun$readBounds$2.apply(GDALRasterSource.scala:107)
    at geotrellis.raster.gdal.GDALRasterSource$$anonfun$readBounds$2.apply(GDALRasterSource.scala:106)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
    at geotrellis.raster.gdal.GDALRasterSource.read(GDALRasterSource.scala:156)
    at geotrellis.raster.RasterSource$$anonfun$readBounds$2.apply(RasterSource.scala:164)
    at geotrellis.raster.RasterSource$$anonfun$readBounds$2.apply(RasterSource.scala:164)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
    at scala.collection.Iterator$class.foreach(Iterator.scala:891)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
```

cc: @vpipkt

pomadchin commented 4 years ago

@metasim @vpipkt I published https://bintray.com/azavea/geotrellis/gdal-warp-bindings/33.f513dc4

Could you try this binary on your production cluster? It looks like I am no longer able to reproduce it. (Probably some dependency dirt was involved.)

All the issues I could reproduce were gone after I did a clean rebuild of everything (with GDAL 3.0.4).

Checking everything out with https://bintray.com/azavea/geotrellis/gdal-warp-bindings/33.f513dc4

EDIT: I could reproduce it again ):

pomadchin commented 4 years ago

The biggest mystery now is that I can't catch the incorrect retval / CPLError on the C side. All messages are clear.

pomadchin commented 4 years ago

There is a chance that there are no errors on the GDAL side; it may be an incorrect error interpretation via the DOIT macros.

pomadchin commented 4 years ago

@metasim @vpipkt it turned out that GDAL simply has to spend more time operating on datasets backed by the OpenJPEG driver.

What happened: we maintain our own LRU cache of datasets, and a function call can exhaust the configured number of attempts to obtain a resource (i.e., the number of attempts to gain access to a dataset that is currently locked).
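To illustrate the failure mode, here is a minimal Scala sketch of a bounded-attempt acquisition loop (illustration only; the real logic lives on the C side of gdal-warp-bindings, and all names below are hypothetical):

import java.util.concurrent.locks.ReentrantLock

// Hypothetical stand-in for a cached GDAL dataset guarded by a lock.
final class DatasetHandle { val lock = new ReentrantLock() }

// Try up to `attempts` times to take the lock; give up if the dataset stays busy.
def withDataset[T](ds: DatasetHandle, attempts: Int)(use: DatasetHandle => T): T = {
  var i = 0
  while (i < attempts) {
    if (ds.lock.tryLock()) {  // succeeds only if no other thread holds the dataset
      try return use(ds)
      finally ds.lock.unlock()
    }
    i += 1                    // dataset busy: retry immediately (effectively sleep(0))
  }
  // Once the attempts are exhausted, an error code surfaces, which GeoTrellis
  // wraps in exceptions like the MalformedDataTypeException seen above.
  sys.error(s"could not acquire dataset after $attempts attempts")
}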

Try setting the number of attempts higher; for the test case above I set it to Int.MaxValue:

geotrellis.raster.gdal {
  acceptable-datasets = ["SOURCE", "WARPED"]
  number-of-attempts  = 2147483647
}

Yes, processing could be a bit slower now, but it doesn't fail. Could you check this setting on a driver?

Also, it looks like we don't need https://github.com/geotrellis/gdal-warp-bindings/pull/81, since it is not the reason the JP2K reads fail.

metasim commented 4 years ago

@pomadchin If I'm following this correctly: because JP2 files take longer to read (the lack of range reads is probably part of it), some sort of cache times out before the operation finishes, and that triggers the error condition? If so, that's great!

@vpipkt may have a better handle on this, but I've forgotten whether GT 3.2 depends on the version with the core-dump fix. If not, would it be possible to release a version of GT against it? Also, have you considered switching to semantic versioning in gdal-warp-bindings? It would help us keep track of this sort of thing.

pomadchin commented 4 years ago

@metasim yep, just to rephrase: the read from the cache times out. It tries 1048576 times (the default number-of-attempts), but the dataset is still busy, so it can't be read.

:+1: to the idea of switching to semver; we'll also try to release GT with https://github.com/geotrellis/gdal-warp-bindings/pull/76 applied ASAP.

pomadchin commented 4 years ago

@metasim also added an issue about semver https://github.com/geotrellis/gdal-warp-bindings/issues/82

metasim commented 4 years ago

@pomadchin 🙏 🙇 💯

pomadchin commented 4 years ago

Also added a new issue to make this exception easier to track in the future: https://github.com/geotrellis/gdal-warp-bindings/issues/83

metasim commented 4 years ago

Still getting this error in our environment, but I need to confirm that dependencyOverrides propagated to assembly generation:

geotrellis.raster.gdal.MalformedDataException: Unable to construct a RasterExtent from the Transformation given. GDAL Error Code: 3
    at geotrellis.raster.gdal.GDALDataset$.rasterExtent$extension1(GDALDataset.scala:143)
    at geotrellis.raster.gdal.GDALRasterSource.gridExtent$lzycompute(GDALRasterSource.scala:93)
    at geotrellis.raster.gdal.GDALRasterSource.gridExtent(GDALRasterSource.scala:93)

metasim commented 4 years ago

Confirmed the md5sum values are the same :-(

pomadchin commented 4 years ago

@metasim gotcha; I'll move this back into "in progress" and work on https://github.com/geotrellis/gdal-warp-bindings/issues/83 and https://github.com/geotrellis/gdal-warp-bindings/issues/84 next, so there will be a unique error thrown to make it clear when you still can't acquire a locked dataset.

pomadchin commented 4 years ago

After releasing https://github.com/geotrellis/gdal-warp-bindings/issues/83 I will ask you to run the tests again; if this error happens again, we'll add a parametrized timeout setting (it is sleep(0) by default).

metasim commented 4 years ago

The error happens in this notebook Docker environment:

s22s/rasterframes-notebook:0.9.0-1ce1ff3

The test triggering it is attached:

gt-3184-test.zip

Edit: I misinterpreted the error message in this environment. Ignore this for now while I revisit.

metasim commented 4 years ago

To be clear, we are still having the error identified here in our environment: https://github.com/locationtech/geotrellis/issues/3184#issuecomment-595326991

Just need to fix the RasterFrames notebook to reproduce.

pomadchin commented 4 years ago

@metasim could you also print all the available GDALOptions from the application.conf file?

println(geotrellis.raster.gdal.config.GDALOptionsConfig.conf)

metasim commented 4 years ago

GDALOptionsConfig(Map(CPL_VSIL_CURL_CHUNK_SIZE -> 1000000, CPL_VSIL_CURL_ALLOWED_EXTENSIONS -> .tif,.tiff,.jp2,.mrf,.idx,.lrc,.mrf.aux.xml,.vrt, AWS_REQUEST_PAYER -> requester, GDAL_HTTP_MAX_RETRY -> 4, GDAL_PAM_ENABLED -> NO, GDAL_DISABLE_READDIR_ON_OPEN -> YES, GDAL_CACHEMAX -> 512, GDAL_HTTP_RETRY_DELAY -> 1),List(SOURCE, WARPED),1048576)

Also: running against GDAL 2.4.4

pomadchin commented 4 years ago

@metasim look at the configuration; the last field (number-of-attempts) is still at the default:

GDALOptionsConfig(...,1048576)
metasim commented 4 years ago

Here's where the new setting is defined:

https://github.com/locationtech/rasterframes/blob/9560e454b696a2a9f497f82306d6b290bb09068c/core/src/main/resources/reference.conf#L27

pomadchin commented 4 years ago

@metasim it didn't get picked up; are you sure the assembly jar contains the appropriate configuration file?

metasim commented 4 years ago

Good question... I'll double check.

metasim commented 4 years ago

crap... the assembly merged the geotrellis reference.conf and the rasterframes reference.conf, with the former overriding the latter.

Suggestions on how to override a GT setting?... have an application.conf in the assembly?

pomadchin commented 4 years ago

@metasim application.conf can be the way, and you could probably also adjust the merge strategy for the reference.conf files.
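For example, a minimal sbt-assembly sketch of the merge-strategy route (a sketch only, assuming the standard assemblyMergeStrategy key; since the concatenation order isn't guaranteed, application.conf is often the cleaner fix):

// build.sbt: concatenate reference.conf files instead of letting one shadow the others
assemblyMergeStrategy in assembly := {
  case "reference.conf" => MergeStrategy.concat
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}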

pomadchin commented 4 years ago

Another option is to keep only your own reference.conf ._. Or to exclude the GDAL reference.conf entirely.

metasim commented 4 years ago

Testing with -Dgeotrellis.raster.gdal.number-of-attempts=2147483647 and so far it's still running.
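(On a real cluster the flag would also need to reach the executor JVMs, e.g. via something like --conf spark.executor.extraJavaOptions=-Dgeotrellis.raster.gdal.number-of-attempts=2147483647; in a local-mode run the driver JVM is the only one involved.)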

pomadchin commented 4 years ago

@metasim have you printed the GDALOptionsConfig after adding this option?

metasim commented 4 years ago

I moved the geotrellis overrides to an application.conf (as they should be) and can see the value getting set properly:

GDALOptionsConfig(Map(CPL_VSIL_CURL_CHUNK_SIZE -> 1000000, CPL_VSIL_CURL_ALLOWED_EXTENSIONS -> .tif,.tiff,.jp2,.mrf,.idx,.lrc,.mrf.aux.xml,.vrt, AWS_REQUEST_PAYER -> requester, GDAL_HTTP_MAX_RETRY -> 10, CPL_DEBUG -> ON, GDAL_PAM_ENABLED -> NO, GDAL_DISABLE_READDIR_ON_OPEN -> YES, GDAL_CACHEMAX -> 512, GDAL_HTTP_RETRY_DELAY -> 2),List(SOURCE, WARPED),2147483647)

metasim commented 4 years ago

Sadly, after all this, it still appears to be happening:

Caused by: geotrellis.raster.gdal.MalformedDataException: Unable to construct a RasterExtent from the Transformation given. GDAL Error Code: 4
    at geotrellis.raster.gdal.GDALDataset$.rasterExtent$extension1(GDALDataset.scala:143)
    at geotrellis.raster.gdal.GDALRasterSource.gridExtent$lzycompute(GDALRasterSource.scala:93)
Py4JJavaError: An error occurred while calling o123.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 135 in stage 1.0 failed 1 times, most recent failure: Lost task 135.0 in stage 1.0 (TID 192, localhost, executor driver): java.lang.IllegalArgumentException: Error fetching data for one of: GDALRasterSource(s3://sentinel-s2-l2a/tiles/22/L/EP/2019/5/31/0/R60m/B08.jp2), GDALRasterSource(s3://sentinel-s2-l2a/tiles/22/L/EP/2019/5/31/0/R60m/B12.jp2), GDALRasterSource(s3://sentinel-s2-l2a/tiles/22/L/EP/2019/9/13/0/R60m/B08.jp2), GDALRasterSource(s3://sentinel-s2-l2a/tiles/22/L/EP/2019/9/13/0/R60m/B12.jp2)
    at org.locationtech.rasterframes.expressions.generators.RasterSourceToRasterRefs.eval(RasterSourceToRasterRefs.scala:81)
    at org.apache.spark.sql.execution.GenerateExec$$anonfun$1$$anonfun$3.apply(GenerateExec.scala:95)
    at org.apache.spark.sql.execution.GenerateExec$$anonfun$1$$anonfun$3.apply(GenerateExec.scala:92)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
    at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:212)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: geotrellis.raster.gdal.MalformedDataException: Unable to construct a RasterExtent from the Transformation given. GDAL Error Code: 4
    at geotrellis.raster.gdal.GDALDataset$.rasterExtent$extension1(GDALDataset.scala:143)
    at geotrellis.raster.gdal.GDALRasterSource.gridExtent$lzycompute(GDALRasterSource.scala:93)
    at geotrellis.raster.gdal.GDALRasterSource.gridExtent(GDALRasterSource.scala:93)
    at geotrellis.raster.RasterMetadata$class.cols(RasterMetadata.scala:52)
    at geotrellis.raster.RasterSource.cols(RasterSource.scala:44)
    at org.locationtech.rasterframes.ref.SimpleRasterInfo$.apply(SimpleRasterInfo.scala:71)
    at org.locationtech.rasterframes.ref.GDALRasterSource$$anonfun$tiffInfo$1.apply(GDALRasterSource.scala:53)
    at org.locationtech.rasterframes.ref.GDALRasterSource$$anonfun$tiffInfo$1.apply(GDALRasterSource.scala:53)
    at scala.compat.java8.functionConverterImpls.AsJavaFunction.apply(FunctionConverters.scala:262)
    at com.github.benmanes.caffeine.cache.LocalCache.lambda$statsAware$0(LocalCache.java:139)
    at com.github.benmanes.caffeine.cache.UnboundedLocalCache.lambda$computeIfAbsent$2(UnboundedLocalCache.java:238)
    at java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1660)
    at com.github.benmanes.caffeine.cache.UnboundedLocalCache.computeIfAbsent(UnboundedLocalCache.java:234)
    at com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:108)
    at com.github.benmanes.caffeine.cache.LocalManualCache.get(LocalManualCache.java:62)
    at com.github.blemale.scaffeine.Cache.get(Cache.scala:40)
    at org.locationtech.rasterframes.ref.SimpleRasterInfo$.apply(SimpleRasterInfo.scala:49)
    at org.locationtech.rasterframes.ref.GDALRasterSource.tiffInfo(GDALRasterSource.scala:53)
    at org.locationtech.rasterframes.ref.GDALRasterSource.extent(GDALRasterSource.scala:57)
    at org.locationtech.rasterframes.ref.RFRasterSource.rasterExtent(RFRasterSource.scala:71)
    at org.locationtech.rasterframes.expressions.generators.RasterSourceToRasterRefs$$anonfun$1.apply(RasterSourceToRasterRefs.scala:65)
    at org.locationtech.rasterframes.expressions.generators.RasterSourceToRasterRefs$$anonfun$1.apply(RasterSourceToRasterRefs.scala:63)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at org.locationtech.rasterframes.expressions.generators.RasterSourceToRasterRefs.eval(RasterSourceToRasterRefs.scala:63)
    ... 29 more

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
    at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:299)
    at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:3263)
    at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:3260)
    at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
    at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3260)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Error fetching data for one of: GDALRasterSource(s3://sentinel-s2-l2a/tiles/22/L/EP/2019/5/31/0/R60m/B08.jp2), GDALRasterSource(s3://sentinel-s2-l2a/tiles/22/L/EP/2019/5/31/0/R60m/B12.jp2), GDALRasterSource(s3://sentinel-s2-l2a/tiles/22/L/EP/2019/9/13/0/R60m/B08.jp2), GDALRasterSource(s3://sentinel-s2-l2a/tiles/22/L/EP/2019/9/13/0/R60m/B12.jp2)
    at org.locationtech.rasterframes.expressions.generators.RasterSourceToRasterRefs.eval(RasterSourceToRasterRefs.scala:81)
    at org.apache.spark.sql.execution.GenerateExec$$anonfun$1$$anonfun$3.apply(GenerateExec.scala:95)
    at org.apache.spark.sql.execution.GenerateExec$$anonfun$1$$anonfun$3.apply(GenerateExec.scala:92)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
    at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:212)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more
Caused by: geotrellis.raster.gdal.MalformedDataException: Unable to construct a RasterExtent from the Transformation given. GDAL Error Code: 4
    at geotrellis.raster.gdal.GDALDataset$.rasterExtent$extension1(GDALDataset.scala:143)
    at geotrellis.raster.gdal.GDALRasterSource.gridExtent$lzycompute(GDALRasterSource.scala:93)
    at geotrellis.raster.gdal.GDALRasterSource.gridExtent(GDALRasterSource.scala:93)
    at geotrellis.raster.RasterMetadata$class.cols(RasterMetadata.scala:52)
    at geotrellis.raster.RasterSource.cols(RasterSource.scala:44)
    at org.locationtech.rasterframes.ref.SimpleRasterInfo$.apply(SimpleRasterInfo.scala:71)
    at org.locationtech.rasterframes.ref.GDALRasterSource$$anonfun$tiffInfo$1.apply(GDALRasterSource.scala:53)
    at org.locationtech.rasterframes.ref.GDALRasterSource$$anonfun$tiffInfo$1.apply(GDALRasterSource.scala:53)
    at scala.compat.java8.functionConverterImpls.AsJavaFunction.apply(FunctionConverters.scala:262)
    at com.github.benmanes.caffeine.cache.LocalCache.lambda$statsAware$0(LocalCache.java:139)
    at com.github.benmanes.caffeine.cache.UnboundedLocalCache.lambda$computeIfAbsent$2(UnboundedLocalCache.java:238)
    at java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1660)
    at com.github.benmanes.caffeine.cache.UnboundedLocalCache.computeIfAbsent(UnboundedLocalCache.java:234)
    at com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:108)
    at com.github.benmanes.caffeine.cache.LocalManualCache.get(LocalManualCache.java:62)
    at com.github.blemale.scaffeine.Cache.get(Cache.scala:40)
    at org.locationtech.rasterframes.ref.SimpleRasterInfo$.apply(SimpleRasterInfo.scala:49)
    at org.locationtech.rasterframes.ref.GDALRasterSource.tiffInfo(GDALRasterSource.scala:53)
    at org.locationtech.rasterframes.ref.GDALRasterSource.extent(GDALRasterSource.scala:57)
    at org.locationtech.rasterframes.ref.RFRasterSource.rasterExtent(RFRasterSource.scala:71)
    at org.locationtech.rasterframes.expressions.generators.RasterSourceToRasterRefs$$anonfun$1.apply(RasterSourceToRasterRefs.scala:65)
    at org.locationtech.rasterframes.expressions.generators.RasterSourceToRasterRefs$$anonfun$1.apply(RasterSourceToRasterRefs.scala:63)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at org.locationtech.rasterframes.expressions.generators.RasterSourceToRasterRefs.eval(RasterSourceToRasterRefs.scala:63)
    ... 29 more

Reproducible in this notebook environment:

docker run -p 8888:8888 s22s/rasterframes-notebook:0.9.0-9560e45

with this notebook and data: gt-3184-test.zip

You can run the notebook on the command line with ipython if you prefer.

metasim commented 4 years ago

Notebook viewable here:

https://gist.github.com/metasim/1734fee3eefc4474a0f269aa976394a0

pomadchin commented 4 years ago

@metasim okay, this is another error code after all: error code 4. Do you have any core dumps / anything like that?

How do I run this notebook?

Do I need an EC2 m4.4xlarge?

metasim commented 4 years ago

I get the error running on my 8 core macbook.

To run, unpack gt-3184-test.zip and execute this:

docker run --rm -v $PWD:/home/jovyan s22s/rasterframes-notebook:0.9.0-9560e45 ipython gt-3184.ipynb

Edit: no core dumps.

metasim commented 4 years ago

@pomadchin I have a sneaking suspicion that "GDAL Error Code: 4" here might be triggered by an AWS identity error: reading from a requester-pays bucket without a ~/.aws/credentials file or environment variables to tell S3 who you are. I just added a .aws/credentials file and the job has been running much longer than usual (I expect it to take 1.5 hrs), so it's not definitive, but I bet more error-message context would point that way.
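One way to sanity-check requester-pays access outside the JVM (a sketch, assuming a local GDAL CLI and AWS credentials in the environment; the path is one of the sources from the trace above):

AWS_REQUEST_PAYER=requester gdalinfo /vsis3/sentinel-s2-l2a/tiles/22/L/EP/2019/5/31/0/R60m/B08.jp2

If credentials are missing or the requester-pays flag isn't set, gdalinfo should fail with the same class of "open" error.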

pomadchin commented 4 years ago

@metasim Error code 4 means GDAL failed to open the dataset, so that could well be the case; waiting for confirmation from you, then. Thanks for the update!

metasim commented 4 years ago

Pretty sure this last round was a false alarm due to:

  1. dependencyOverrides not being transitive
  2. reference.conf values not getting overridden
  3. Missing AWS credentials
  4. Error codes masking true error causes

With all those things addressed, the original test case now completes.

I suggest this ticket be closed once an updated GT referencing com.azavea.gdal:gdal-warp-bindings:33.f746890 is published to Maven Central.

pomadchin commented 4 years ago

@metasim :+1:

pomadchin commented 4 years ago

gdal-warp-bindings 1.0.0 is published! Also look at the CHANGELOG for all the changes that were part of this release. Closing this now; feel free to reopen / open a new issue if something happens with it again!