Closed: metasim closed this issue 4 years ago
GDAL formats in environment
@metasim just to confirm, have you tried GDAL 2.4.4? https://github.com/OSGeo/gdal/issues/1244 Does the same issue happen with TIFFs, or only with JP2K?
> @metasim just to confirm, have you tried GDAL 2.4.4? OSGeo/gdal#1244
Not sure... I'll try that later today.
> Does the same issue happen with TIFFs, or only with JP2K?
Don't know. Took me a week to get to a repeatable test case, so those sorts of refinements are needed.
@pomadchin Confirmed the bug occurs under GDAL 2.4.4, released 2020/01/08
@metasim perfect (in terms of debugging) :D
Just ran test against this GeoTIFF:
And it does complete successfully. Perhaps it's a GDAL JP2 issue?
@metasim ¯\_(ツ)_/¯ it requires a bit more investigation; it could just be due to the random nature of this issue. I wish we could reproduce it on a laptop :/
In RasterFrames, I added a global thread lock to GDALRasterSource when JP2 files are being read, and the job completes (albeit extremely slowly). Another mark pointing toward a race condition.
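The workaround amounts to funneling every JP2 read through a single JVM-wide monitor. A minimal sketch of that pattern (the names are illustrative, not the actual RasterFrames change):

```scala
// Sketch of a global-lock workaround: serialize reads that might race
// inside GDAL through one JVM-wide monitor. Correct, but it defeats
// parallelism, which matches the "extremely slow" behavior observed.
object GDALReadLock {
  // Any read that might race inside GDAL goes through this wrapper.
  def locked[T](read: => T): T = synchronized(read)
}

// Hypothetical usage at a call site:
//   val tile = GDALReadLock.locked { rasterSource.read(bounds) }
```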
@metasim sounds really sad, slow, and not too reliable
Looking to try to reproduce at a lower level.
Wondering if this might be the cause (fixed in 3.0.2):
https://github.com/OSGeo/gdal/blob/ee535a1a3f5b35b0d231e1faac89ac1f889f7988/gdal/NEWS#L232-L238
@metasim I think it makes sense to try to use GDAL 3.0.4
Working on it.
@pomadchin gdal-warp-bindings won't link against 3.0.4... looks like it requires the 2.x line.
java.lang.UnsatisfiedLinkError: /tmp/nativeutils837692180397/libgdalwarp_bindings.so: libgdal.so.20: cannot open shared object file: No such file or directory
I was able to hack together a new gdal-warp-bindings build for Linux linked against GDAL 3.0.4. The good news is that they link:
```python
from pyrasterframes.utils import gdal_version
gdal_version()
...
'GDAL 3.0.4, released 2020/01/28'
```
Bad news is that the bug is still there. 😢
```java
geotrellis.raster.gdal.MalformedDataException: Unable to construct a RasterExtent from the Transformation given. GDAL Error Code: 3
  at geotrellis.raster.gdal.GDALDataset$.rasterExtent$extension1(GDALDataset.scala:143)
  at geotrellis.raster.gdal.GDALRasterSource.gridExtent$lzycompute(GDALRasterSource.scala:93)
  at geotrellis.raster.gdal.GDALRasterSource.gridExtent(GDALRasterSource.scala:93)
```
BTW, it may be worth trying to run the test case on a non-AWS Linux machine or Docker container. My laptop is macOS, so the OS is a variable changed between local and remote execution. It may not have to do with it being EC2 or a particular instance size.
Custom gdal-warp-bindings built against GDAL 3.0.4, custom GeoTrellis 3.2.x build.
First create a shell in the environment:
```shell
$ docker run -it s22s/rasterframes-notebook:0.9.0-astraea.452747b4 bash
$ wget https://gist.githubusercontent.com/metasim/5332ac959d97d9747921197cd4307948/raw/662687c9b5c52083b007b451b6530f0505b2c9fc/ParallelJP2.scala && echo ':load ParallelJP2.scala' | spark-shell --jars /opt/conda/lib/python3.7/site-packages/pyrasterframes/jars/pyrasterframes-assembly-0.9.0-astraea.452747b4.jar
```
Note: Running this locally does not fail. Maybe 8 or more cores are needed?
Edit: With Docker on MacOS configured with all 8 cores, the job above does indeed fail.
gdal-warp-bindings
Create the file `testing.list` in `gdal-warp-bindings/Docker` with this:
```
deb [ allow-insecure=yes ] http://http.us.debian.org/debian testing main non-free contrib
deb-src [ allow-insecure=yes ] http://http.us.debian.org/debian testing main non-free contrib
```
Replace the `# Build GDAL 2.4.3` Linux section of `gdal-warp-bindings/Docker/Dockerfile.environment` with this:
```dockerfile
COPY testing.list /etc/apt/sources.list.d/
RUN apt-get update -q && apt-get install -y -q --allow-unauthenticated libgdal-dev=3.0.4+dfsg-1
```
Build the image and note its ID, or tag it.
In the gdal-warp-bindings directory, run:
```shell
docker run -it --rm \
  -v $(pwd):/workdir \
  -e CC=gcc -e CXX=g++ \
  -e CFLAGS="-Wall -Wno-sign-compare -Werror -O0 -ggdb3 -DSO_FINI -D_GNU_SOURCE" \
  -e BOOST_ROOT="/usr/local/include/boost_1_69_0" \
  -e JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64" \
  <image tag or ID from above> make -j4 -C src tests
```
Note the location of the file `gdal-warp-bindings/src/main/gdalwarp.jar`.
Edit `geotrellis/project/Dependencies.scala` and replace

```scala
val gdalWarp = "com.azavea.gdal" % "gdal-warp-bindings" % Version.gdalWarp
```

with

```scala
val gdalWarp = "com.azavea.gdal" % "gdal-warp-bindings" % Version.gdalWarp from("file:/path/to/gdal-warp-bindings/src/main/gdalwarp.jar")
```
Build GeoTrellis.
Edit: I was running this at home over mediocre WiFi. The office environment is 1Gbps wired.
Update on the script to reproduce it. From within the docker container:
```shell
$ PROJ_LIB=/opt/conda/share/proj spark-shell --master local[8] --jars /opt/conda/lib/python3.7/site-packages/pyrasterframes/jars/pyrasterframes-assembly-0.9.0-astraea.452747b4.jar
scala> :load ParallelJP2.scala
```
Although I do not reproduce the failure with 8 cores.
@vpipkt What happens if you leave out the `--master local[8]`? I did not specify the number of cores that way... I just left it at Spark defaults, but configured Docker to have 8 cores.
@vpipkt Also, if you re-run it, can you do `docker pull s22s/rasterframes-notebook:0.9.0-astraea.452747b4` first? I updated it to have the PROJ_LIB setting done for you.
I pulled the image again (image id 26d9771deb79) and ran again omitting the explicit `--master local[8]`, and did not reproduce the bug. :-(
Same.... on wired internet at work it's passing. 😠 These results were from running it at home on mediocre WiFi.
When using my phone's hotspot with 8 cores, it fails.
The "Additional Tools for Xcode 11" package includes a tool called Network Link Conditioner that simulates slow or error-prone networks. Using this tool (and remembering to flip the "On" switch) causes the test to fail.
Edit: If it disappears from your System Preferences after install, do this: https://agilewarrior.wordpress.com/2018/10/31/trouble-installing-link-conditioner/
It's also possible to reproduce it on an EC2 m4.4xlarge:

```shell
$ sudo yum install tc
$ sudo tc qdisc add dev eth0 root netem delay 500ms
$ docker run -it --cpus=8 -u root s22s/rasterframes-notebook:0.9.0-astraea.452747b4 bash
$ wget https://gist.githubusercontent.com/metasim/5332ac959d97d9747921197cd4307948/raw/662687c9b5c52083b007b451b6530f0505b2c9fc/ParallelJP2.scala
$ spark-shell --jars /opt/conda/lib/python3.7/site-packages/pyrasterframes/jars/pyrasterframes-assembly-0.9.0-astraea.452747b4.jar
> :load ParallelJP2.scala
```

P.S. if you want your fast connection back:

```shell
sudo tc qdisc del dev eth0 root netem
```
P.P.S. It is weird that I could not reproduce it as part of a unit test. I could not make it fail for the TIFF case in the same environment either.
Hey @metasim @vpipkt , check out these steps please:
```shell
$ docker run -it --cpus 8 -v ${PWD}/geotrellis:/geotrellis daunnc/gdalwarpenv:0.2 bash
# also throw some aws credentials into the container
$ spark-shell --packages org.locationtech.geotrellis:geotrellis-gdal_2.11:3.2.1-SNAPSHOT --repositories https://dl.bintray.com/azavea/geotrellis --jars /geotrellis/gdalwarp.jar
```
The program:

```scala
import org.apache.spark.sql.SparkSession
import geotrellis.raster._
import geotrellis.raster.gdal.GDALRasterSource
import geotrellis.raster.gdal.config._

// this one is optional
GDALOptionsConfig.registerOptions(
  "CPL_DEBUG" -> "ON",
  "GDAL_DISABLE_READDIR_ON_OPEN" -> "YES",
  "CPL_VSIL_CURL_ALLOWED_EXTENSIONS" -> ".tif"
)

val path = "https://s22s-rasterframes-integration-tests.s3.amazonaws.com/B08.jp2"

spark.range(1000).rdd.
  map(_ => path).
  flatMap(uri => {
    val rs = GDALRasterSource(uri)
    val grid = GridBounds(0, 0, rs.cols - 1, rs.rows - 1)
    val tileBounds = grid.split(256, 256).toSeq
    rs.readBounds(tileBounds)
  }).
  foreach(r => ())
```
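For reference, the `grid.split(256, 256)` step above just enumerates 256×256 windows over the raster. A standalone sketch of that tiling arithmetic, with a simplified `Bounds` stand-in for GeoTrellis's `GridBounds` (illustrative, not the real implementation):

```scala
// Simplified stand-in for GeoTrellis's GridBounds.split: enumerate
// tileCols x tileRows windows covering a cols x rows raster, clamping
// the last row/column of windows at the raster edge.
case class Bounds(colMin: Int, rowMin: Int, colMax: Int, rowMax: Int)

def splitGrid(cols: Int, rows: Int, tileCols: Int, tileRows: Int): Seq[Bounds] =
  for {
    r <- 0 until rows by tileRows
    c <- 0 until cols by tileCols
  } yield Bounds(c, r, math.min(c + tileCols, cols) - 1, math.min(r + tileRows, rows) - 1)
```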
TL;DR: There is a GeoTrellis build with bindings built against GDAL 3.0.4 in the container. I also noticed a typo in the bindings code (this fix is applied): https://github.com/geotrellis/gdal-warp-bindings/pull/81
Let me know whether it works for you or not. It looks like it worked on EC2. ¯\_(ツ)_/¯
P.P.S. Ouch, I just remembered that it was a mounted volume, so nothing is likely persisted ): I will rebuild the image today or tomorrow, or you can check that fix without it by rebuilding the bindings with the fix applied.
@pomadchin @metasim looking at it now...
@vpipkt in parallel I will try to provide you a container with all deps built in
@pomadchin it is probably for the best if you can provide that.
@vpipkt
```shell
$ docker run -it --cpus 8 daunnc/gdalwarpenv:0.5 bash
# also throw some aws credentials into the container
$ spark-shell --packages org.locationtech.geotrellis:geotrellis-gdal_2.11:3.2.1-SNAPSHOT --repositories https://dl.bintray.com/azavea/geotrellis --jars /home/jovyan/gdalwarp.jar
```
```scala
import org.apache.spark.sql.SparkSession
import geotrellis.raster._
import geotrellis.raster.gdal.GDALRasterSource
import geotrellis.raster.gdal.config._

// this one is optional
GDALOptionsConfig.registerOptions(
  "CPL_DEBUG" -> "ON",
  "GDAL_DISABLE_READDIR_ON_OPEN" -> "YES",
  "CPL_VSIL_CURL_ALLOWED_EXTENSIONS" -> ".tif,.jp2"
)

val path = "s3://geotrellis-test/daunnc/B08.jp2"

spark.range(1000).rdd.
  map(_ => path).
  flatMap(uri => {
    val rs = GDALRasterSource(uri)
    val grid = GridBounds(0, 0, rs.cols - 1, rs.rows - 1)
    val tileBounds = grid.split(256, 256).toSeq
    rs.readBounds(tileBounds)
  }).
  foreach(r => ())
```
Check it out in your test env; this time everything is included.
P.S. I don't actually expect that it will fix all the problems, but it may produce a new stack trace as well.
@pomadchin Is gdalwarpenv private?
$ docker pull daunnc/gdalwarpenv:0.5
Error response from daemon: manifest for daunnc/gdalwarpenv:0.5 not found: manifest unknown: manifest unknown
@metasim oops, fixed! (accidentally deleted it from the registry)
Passes locally with 8 cores and DSL-level networking...
@metasim I also noticed that jp2k reads behave somewhat differently (talking about both the network and the reads themselves). I'll wait until you submit some prod/staging jobs!
> I'll wait until you submit some prod/staging jobs!
WDYM? Looking for us to test with a real job?
@metasim yep; I still feel there could be a window for some problems (just want to be sure)
Will do.
@pomadchin Not looking good. The original job failed in the same way. Doing some extra checking to make sure I deployed the right thing. The md5sum of `libgdalwarp_bindings.so` is `00ecbde671e5cd93ebbba0aa4967ef3b`. What we'd need to check next is whether we're using the exact same GDAL distribution you are. Where/when/how did you get the one you had in the Docker image?
@metasim it is the proper checksum. The GDAL version is 3.0.4 and it is from the image s22s/rasterframes-notebook:0.9.0-astraea.452747b4 (I just committed a running container). Can you show me a new stack trace? I think it is slightly different.
Is it something like the exception below?
```java
geotrellis.raster.gdal.GDALIOException: Unable to read in data. GDAL Error Code: 3
  at geotrellis.raster.gdal.GDALDataset$.readTile$extension(GDALDataset.scala:324)
```
@pomadchin This is what I'm seeing:
```java
geotrellis.raster.gdal.MalformedDataException: Unable to construct a RasterExtent from the Transformation given. GDAL Error Code: 3
  at geotrellis.raster.gdal.GDALDataset$.rasterExtent$extension1(GDALDataset.scala:143)
```
Ok, sounds pretty sad. I'll continue looking into its causes.
Can you reproduce?
I could reproduce it at geotrellis.raster.gdal.GDALDataset$.readTile$extension a couple of times, but after applying the fix from PR https://github.com/geotrellis/gdal-warp-bindings/pull/81 the behavior is no longer as consistent.
@pomadchin Got any ideas on what to try next?
@metasim going to try to make it more reproducible (without traffic slowdown); also I have an idea of writing a C/C++ unit test for this case, maybe it would be more concurrent than the JVM version.
If there were a C++ unit test for this case, that would be amazing; it would (hopefully) let me get a normal backtrace.
~ removed this message since it was useless and confusing ~ Next steps are to print the error message in some readable fashion, since the error code alone is not enough.
This error originated in some RasterFrames work. We have a table where one column is predominantly the same file, and the analysis fails with one of a number of errors from GDALDataset (see below for extended output).

I removed RasterFrames from the mix, resulting in the test case below. (At this point I have not further reduced it to get Spark out of the mix with, say, Futures instead.) It should be noted that some of the reads complete successfully. When I run it on my laptop it completes successfully, but when I run it on a beefier EC2 instance (m5a.2xlarge) it fails. I suspect the concurrency level and I/O throughput set the conditions. It appears to work when setting `--master=local[1]`.

Edit: my laptop is macOS, whereas the EC2 instance is Linux. That may be the pertinent variable instead of instance size.

Edit 2: Ran in Docker locally with 4 cores and the job succeeded. Configured Docker to run with 8 cores on my laptop and it failed!

Test Case
RSRead.scala
Execution Command
Using Spark 2.4.4, Scala 2.11.12, GDAL 2.4.3 (released 2019/10/28)
Sample Backtrace
Full log output
```java
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 (TID 5, localhost, executor driver): geotrellis.raster.gdal.MalformedDataTypeException: Unable to deterime the min/max values in order to calculate CellType. GDAL Error Code: 3
  at geotrellis.raster.gdal.GDALDataset$.cellType$extension1(GDALDataset.scala:299)
  at geotrellis.raster.gdal.GDALDataset$.readTile$extension(GDALDataset.scala:315)
  at geotrellis.raster.gdal.GDALDataset$$anonfun$readMultibandTile$extension$1.apply(GDALDataset.scala:333)
  at geotrellis.raster.gdal.GDALDataset$$anonfun$readMultibandTile$extension$1.apply(GDALDataset.scala:333)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.Iterator$class.foreach(Iterator.scala:891)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
  at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.AbstractTraversable.map(Traversable.scala:104)
  at geotrellis.raster.gdal.GDALDataset$.readMultibandTile$extension(GDALDataset.scala:333)
  at geotrellis.raster.gdal.GDALRasterSource$$anonfun$readBounds$2.apply(GDALRasterSource.scala:107)
  at geotrellis.raster.gdal.GDALRasterSource$$anonfun$readBounds$2.apply(GDALRasterSource.scala:106)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
  at geotrellis.raster.gdal.GDALRasterSource.read(GDALRasterSource.scala:156)
  at geotrellis.raster.RasterSource$$anonfun$readBounds$2.apply(RasterSource.scala:164)
  at geotrellis.raster.RasterSource$$anonfun$readBounds$2.apply(RasterSource.scala:164)
  at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
  at scala.collection.Iterator$class.foreach(Iterator.scala:891)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
  at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927)
  at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  at org.apache.spark.scheduler.Task.run(Task.scala:123)
  at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
  at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:927)
  at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:925)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
  at org.apache.spark.rdd.RDD.foreach(RDD.scala:925)
  ... 94 elided
Caused by: geotrellis.raster.gdal.MalformedDataTypeException: Unable to deterime the min/max values in order to calculate CellType. GDAL Error Code: 3
  at geotrellis.raster.gdal.GDALDataset$.cellType$extension1(GDALDataset.scala:299)
  at geotrellis.raster.gdal.GDALDataset$.readTile$extension(GDALDataset.scala:315)
  at geotrellis.raster.gdal.GDALDataset$$anonfun$readMultibandTile$extension$1.apply(GDALDataset.scala:333)
  at geotrellis.raster.gdal.GDALDataset$$anonfun$readMultibandTile$extension$1.apply(GDALDataset.scala:333)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.Iterator$class.foreach(Iterator.scala:891)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
  at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.AbstractTraversable.map(Traversable.scala:104)
  at geotrellis.raster.gdal.GDALDataset$.readMultibandTile$extension(GDALDataset.scala:333)
  at geotrellis.raster.gdal.GDALRasterSource$$anonfun$readBounds$2.apply(GDALRasterSource.scala:107)
  at geotrellis.raster.gdal.GDALRasterSource$$anonfun$readBounds$2.apply(GDALRasterSource.scala:106)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
  at geotrellis.raster.gdal.GDALRasterSource.read(GDALRasterSource.scala:156)
  at geotrellis.raster.RasterSource$$anonfun$readBounds$2.apply(RasterSource.scala:164)
  at geotrellis.raster.RasterSource$$anonfun$readBounds$2.apply(RasterSource.scala:164)
  at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
  at scala.collection.Iterator$class.foreach(Iterator.scala:891)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
  at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927)
  at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  at org.apache.spark.scheduler.Task.run(Task.scala:123)
  at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
```
cc: @vpipkt