locationtech / rasterframes

Geospatial Raster support for Spark DataFrames
http://rasterframes.io
Apache License 2.0

java.lang.UnsupportedOperationException: Reading gdal://vsihdfs/ not supported #524

Open yurigba opened 3 years ago

yurigba commented 3 years ago

I am using pyrasterframes in cluster mode, by invoking the pyspark shell as specified here:

pyspark \
 --conf spark.pyspark.virtualenv.enabled=true \
 --conf spark.pyspark.virtualenv.type=native \
 --conf spark.pyspark.virtualenv.requirements=/opt/requirements.txt \
 --conf spark.pyspark.virtualenv.bin.path=/opt/python36/bin \
 --master yarn \
 --num-executors 1 \
 --driver-memory 8g \
 --executor-memory 4g \
 --conf spark.yarn.appMasterEnv.PYSPARK3_PYTHON=/opt/python36/bin/python3.6 \
 --conf spark.sql.catalogImplementation=in-memory \
 --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
 --py-files /opt/pyrasterframes_2.11-0.9.0-python.zip \
 --packages org.locationtech.rasterframes:rasterframes_2.11:0.9.0,org.locationtech.rasterframes:pyrasterframes_2.11:0.9.0,org.locationtech.rasterframes:rasterframes-datasource_2.11:0.9.0 \
 --conf spark.jars.ivySettings=/home/user/.ivy2/ivysettings.xml

The TIFF file used is from GLAD and can be downloaded freely here, though registration is required.

After loading the pyspark shell, we run the following commands:

>>> import pyrasterframes
>>> from pyrasterframes.rasterfunctions import *
>>> spark.withRasterFrames()
>>> spark.read.raster('hdfs://DEVEL/user/884.tif').select(rf_crs("proj_raster").alias("value")).first()
Row(value=Row(crsProj4='+proj=longlat +datum=WGS84 +no_defs '))

We see that reading TIFF files via HadoopGeoTiffRasterSource works. However, our intent is to read JP2 and zip files as well, and for those GDAL is needed.
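
For context, the end goal is a read like the one below. The .jp2 path is hypothetical (for illustration only); as noted above, this is the case that requires GDAL, since those formats have no JVM-only reader:

>>> df = spark.read.raster("gdal://vsihdfs/hdfs://DEVEL/user/scene.jp2")  # hypothetical path
>>> df.select(rf_crs("proj_raster").alias("value")).first()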

To check if GDAL is recognized by pyrasterframes, we run:

>>> from pyrasterframes.utils import gdal_version
>>> gdal_version()
'GDAL 2.4.4, released 2020/01/08'
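
Note that gdal_version() runs on the driver. As a cross-check that the GDAL native library also loads on the executors, the following sketch uses plain PySpark (it assumes the osgeo Python bindings are installed in the executor virtualenv; it checks the Python bindings, whereas RasterFrames uses GDAL through its JVM bindings):

>>> def executor_gdal_version(_):
...     from osgeo import gdal  # import on the executor, not the driver
...     return gdal.VersionInfo("RELEASE_NAME")
...
>>> spark.sparkContext.parallelize([0], numSlices=1).map(executor_gdal_version).collect()

If GDAL loads on the executor, this returns the release string, here something like ['2.4.4'].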

When trying to use the GDAL driver (as stated here), we get the following error:

>>> spark.read.raster("gdal://vsihdfs/hdfs://DEVEL/user/884.tif").select(rf_crs("proj_raster").alias("value")).first()
20/11/26 17:30:44 WARN TaskSetManager: Lost task 61.0 in stage 109.0 (TID 1611, xxxxxxxxxxxxx, executor 1): java.lang.IllegalArgumentException: Error fetching data for one of:
        at org.locationtech.rasterframes.expressions.generators.RasterSourceToRasterRefs.eval(RasterSourceToRasterRefs.scala:81)
        at org.apache.spark.sql.execution.GenerateExec$$anonfun$1$$anonfun$3.apply(GenerateExec.scala:95)
        at org.apache.spark.sql.execution.GenerateExec$$anonfun$1$$anonfun$3.apply(GenerateExec.scala:92)
        at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
        at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:212)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:123)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.UnsupportedOperationException: Reading 'gdal://vsihdfs/hdfs://DEVEL/user/884.tif' not supported
        at org.locationtech.rasterframes.ref.RFRasterSource$$anonfun$apply$1.apply(RFRasterSource.scala:123)
        at org.locationtech.rasterframes.ref.RFRasterSource$$anonfun$apply$1.apply(RFRasterSource.scala:111)
        at scala.compat.java8.functionConverterImpls.AsJavaFunction.apply(FunctionConverters.scala:262)
        at com.github.benmanes.caffeine.cache.LocalCache.lambda$statsAware$0(LocalCache.java:139)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.lambda$doComputeIfAbsent$14(BoundedLocalCache.java:2379)
        at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1853)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:2377)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:2360)
        at com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:108)
        at com.github.benmanes.caffeine.cache.LocalManualCache.get(LocalManualCache.java:62)
        at com.github.blemale.scaffeine.Cache.get(Cache.scala:40)
        at org.locationtech.rasterframes.ref.RFRasterSource$.apply(RFRasterSource.scala:110)
        at org.locationtech.rasterframes.expressions.transformers.URIToRasterSource.nullSafeEval(URIToRasterSource.scala:56)
        at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:393)
        at org.locationtech.rasterframes.expressions.generators.RasterSourceToRasterRefs$$anonfun$1.apply(RasterSourceToRasterRefs.scala:64)
        at org.locationtech.rasterframes.expressions.generators.RasterSourceToRasterRefs$$anonfun$1.apply(RasterSourceToRasterRefs.scala:63)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.AbstractTraversable.map(Traversable.scala:104)
        at org.locationtech.rasterframes.expressions.generators.RasterSourceToRasterRefs.eval(RasterSourceToRasterRefs.scala:63)
        ... 32 more

20/11/26 17:30:45 ERROR TaskSetManager: Task 61 in stage 109.0 failed 4 times; aborting job
20/11/26 17:30:45 WARN TaskSetManager: Lost task 3.0 in stage 109.0 (TID 1618, xxxxxxxxxxxxxxxx, executor 1): TaskKilled (Stage cancelled)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/hdp/3.1.0.0-78/spark2/python/pyspark/sql/dataframe.py", line 1266, in first
    return self.head()
  File "/usr/hdp/3.1.0.0-78/spark2/python/pyspark/sql/dataframe.py", line 1254, in head
    rs = self.head(1)
  File "/usr/hdp/3.1.0.0-78/spark2/python/pyspark/sql/dataframe.py", line 1256, in head
    return self.take(n)
  File "/usr/hdp/3.1.0.0-78/spark2/python/pyspark/sql/dataframe.py", line 573, in take
    return self.limit(num).collect()
  File "/usr/hdp/3.1.0.0-78/spark2/python/pyspark/sql/dataframe.py", line 534, in collect
    sock_info = self._jdf.collectToPython()
  File "/usr/hdp/3.1.0.0-78/spark2/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/hdp/3.1.0.0-78/spark2/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/hdp/3.1.0.0-78/spark2/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o554.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 61 in stage 109.0 failed 4 times, most recent failure: Lost task 61.3 in stage 109.0 (TID 1617, xxxxxxxxxxxxxxxxxxx, executor 1): java.lang.IllegalArgumentException: Error fetching data for one of:
        at org.locationtech.rasterframes.expressions.generators.RasterSourceToRasterRefs.eval(RasterSourceToRasterRefs.scala:81)
        at org.apache.spark.sql.execution.GenerateExec$$anonfun$1$$anonfun$3.apply(GenerateExec.scala:95)
        at org.apache.spark.sql.execution.GenerateExec$$anonfun$1$$anonfun$3.apply(GenerateExec.scala:92)
        at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
        at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:212)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:123)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.UnsupportedOperationException: Reading 'gdal://vsihdfs/hdfs://DEVEL/user/884.tif' not supported
        at org.locationtech.rasterframes.ref.RFRasterSource$$anonfun$apply$1.apply(RFRasterSource.scala:123)
        at org.locationtech.rasterframes.ref.RFRasterSource$$anonfun$apply$1.apply(RFRasterSource.scala:111)
        at scala.compat.java8.functionConverterImpls.AsJavaFunction.apply(FunctionConverters.scala:262)
        at com.github.benmanes.caffeine.cache.LocalCache.lambda$statsAware$0(LocalCache.java:139)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.lambda$doComputeIfAbsent$14(BoundedLocalCache.java:2379)
        at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1853)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:2377)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:2360)
        at com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:108)
        at com.github.benmanes.caffeine.cache.LocalManualCache.get(LocalManualCache.java:62)
        at com.github.blemale.scaffeine.Cache.get(Cache.scala:40)
        at org.locationtech.rasterframes.ref.RFRasterSource$.apply(RFRasterSource.scala:110)
        at org.locationtech.rasterframes.expressions.transformers.URIToRasterSource.nullSafeEval(URIToRasterSource.scala:56)
        at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:393)
        at org.locationtech.rasterframes.expressions.generators.RasterSourceToRasterRefs$$anonfun$1.apply(RasterSourceToRasterRefs.scala:64)
        at org.locationtech.rasterframes.expressions.generators.RasterSourceToRasterRefs$$anonfun$1.apply(RasterSourceToRasterRefs.scala:63)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.AbstractTraversable.map(Traversable.scala:104)
        at org.locationtech.rasterframes.expressions.generators.RasterSourceToRasterRefs.eval(RasterSourceToRasterRefs.scala:63)
        ... 32 more

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1891)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1879)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1878)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:927)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2112)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2061)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2050)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:738)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
        at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:365)
        at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
        at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:3263)
        at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:3260)
        at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
        at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
        at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
        at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3260)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Error fetching data for one of:
        at org.locationtech.rasterframes.expressions.generators.RasterSourceToRasterRefs.eval(RasterSourceToRasterRefs.scala:81)
        at org.apache.spark.sql.execution.GenerateExec$$anonfun$1$$anonfun$3.apply(GenerateExec.scala:95)
        at org.apache.spark.sql.execution.GenerateExec$$anonfun$1$$anonfun$3.apply(GenerateExec.scala:92)
        at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
        at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:212)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:123)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        ... 1 more
Caused by: java.lang.UnsupportedOperationException: Reading 'gdal://vsihdfs/hdfs://DEVEL/user/884.tif' not supported
        at org.locationtech.rasterframes.ref.RFRasterSource$$anonfun$apply$1.apply(RFRasterSource.scala:123)
        at org.locationtech.rasterframes.ref.RFRasterSource$$anonfun$apply$1.apply(RFRasterSource.scala:111)
        at scala.compat.java8.functionConverterImpls.AsJavaFunction.apply(FunctionConverters.scala:262)
        at com.github.benmanes.caffeine.cache.LocalCache.lambda$statsAware$0(LocalCache.java:139)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.lambda$doComputeIfAbsent$14(BoundedLocalCache.java:2379)
        at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1853)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:2377)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:2360)
        at com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:108)
        at com.github.benmanes.caffeine.cache.LocalManualCache.get(LocalManualCache.java:62)
        at com.github.blemale.scaffeine.Cache.get(Cache.scala:40)
        at org.locationtech.rasterframes.ref.RFRasterSource$.apply(RFRasterSource.scala:110)
        at org.locationtech.rasterframes.expressions.transformers.URIToRasterSource.nullSafeEval(URIToRasterSource.scala:56)
        at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:393)
        at org.locationtech.rasterframes.expressions.generators.RasterSourceToRasterRefs$$anonfun$1.apply(RasterSourceToRasterRefs.scala:64)
        at org.locationtech.rasterframes.expressions.generators.RasterSourceToRasterRefs$$anonfun$1.apply(RasterSourceToRasterRefs.scala:63)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.AbstractTraversable.map(Traversable.scala:104)
        at org.locationtech.rasterframes.expressions.generators.RasterSourceToRasterRefs.eval(RasterSourceToRasterRefs.scala:63)
        ... 32 more
yurigba commented 3 years ago

Curiously, reading the file directly with the gdal library in pyspark works:

>>> from osgeo import gdal, gdalconst
>>> ds = gdal.Open("/vsihdfs/hdfs://DEVEL/user/884.tif", gdalconst.GA_ReadOnly)
20/11/26 23:49:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/11/26 23:49:22 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
>>> print(ds.GetMetadata())
{'AREA_OR_POINT': 'Area'}
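
A slightly extended version of the same check (standard GDAL Python API only, nothing RasterFrames-specific) that fails loudly when the /vsihdfs handler cannot open the file:

>>> from osgeo import gdal, gdalconst
>>> ds = gdal.Open("/vsihdfs/hdfs://DEVEL/user/884.tif", gdalconst.GA_ReadOnly)
>>> if ds is None:
...     # gdal.Open returns None instead of raising when the open fails
...     raise RuntimeError("GDAL could not open the file via /vsihdfs")
...
>>> ds.RasterXSize, ds.RasterYSize, ds.RasterCount  # dimensions and band count
>>> ds.GetProjectionRef()  # CRS as WKT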
yurigba commented 3 years ago

Looking at the Scala code, the issue may be related to this:

https://github.com/locationtech/rasterframes/blob/68ebe65b25a99fa3cbd390c8f20a78162f7d9b5d/core/src/main/scala/org/locationtech/rasterframes/ref/RFRasterSource.scala#L134-L138

In spark-shell, GDALRasterSource.hasGDAL returns true on the cluster, yet the function gdalOnly may be returning false when parsing the URI. The function returns true when given a valid java.net.URI with a .jp2 extension, but false when given anything else:

scala> RFRasterSource.IsGDAL.gdalOnly(URI.create("a.jp2"))
Boolean = true

scala> RFRasterSource.IsGDAL.gdalOnly(URI.create("a"))
Boolean = false

Considering this, it doesn't check directly whether GDAL is present, but whether the path is GDAL-readable. This is consistent with the error received when a JP2 file is passed to spark.read.raster.

My (un)educated guess is that gdalOnly is being given a URI it does not expect.
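
To make the hypothesis concrete, here is a minimal Python sketch of the dispatch behavior described above. This is not the actual Scala implementation; the function names and the extension set are assumptions for illustration only:

from urllib.parse import urlparse

GDAL_ONLY_EXTENSIONS = {".jp2"}  # assumption: formats with no JVM-only reader

def gdal_only(uri):
    return any(uri.lower().endswith(ext) for ext in GDAL_ONLY_EXTENSIONS)

def pick_source(uri, has_gdal):
    scheme = urlparse(uri).scheme
    if (scheme == "gdal" or gdal_only(uri)) and has_gdal:
        return "GDALRasterSource"
    if scheme == "hdfs":
        return "HadoopGeoTiffRasterSource"
    raise NotImplementedError("Reading '%s' not supported" % uri)

pick_source("hdfs://DEVEL/user/884.tif", True)                 # -> 'HadoopGeoTiffRasterSource'
pick_source("gdal://vsihdfs/hdfs://DEVEL/user/884.tif", True)  # -> 'GDALRasterSource' here

In this sketch the gdal:// URI resolves to the GDAL source, yet the real code throws "not supported" for it, which points at either the scheme match or the executor-side hasGDAL check not behaving as expected.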

gerdansantos commented 3 years ago

Solved by shipping libhdfs.so.0.0.0 to the executors with the parameter --files=/usr/hdp/3.1.0.0-78/usr/lib/libhdfs.so.0.0.0 on pyspark.
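
For reference, the fix slots into the original invocation as one extra flag (the library path is the one from this thread; adjust it for your Hadoop distribution):

pyspark \
 --files /usr/hdp/3.1.0.0-78/usr/lib/libhdfs.so.0.0.0 \
 ...  # remaining flags as in the original command above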

yurigba commented 3 years ago

Since this is a very specific yet very important detail, it would be a good idea to check whether this library is loaded when reading from HDFS (with GDAL).
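
A hedged way to test exactly that, running the open on an executor rather than on the driver (the path is the file from this thread; assumes the osgeo bindings are in the executor virtualenv):

def can_open_vsihdfs(_):
    from osgeo import gdal  # imported on the executor
    # Opening via /vsihdfs requires libhdfs to be loadable on the executor;
    # shipping it with --files as above should make this return True.
    return gdal.Open("/vsihdfs/hdfs://DEVEL/user/884.tif") is not None

spark.sparkContext.parallelize([0], numSlices=1).map(can_open_vsihdfs).collect()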