locationtech / rasterframes

Geospatial Raster support for Spark DataFrames
http://rasterframes.io
Apache License 2.0

Read multiple bands of different sizes using Rasterframes #450

Closed Atika9 closed 4 years ago

Atika9 commented 4 years ago

When I try to read multiple bands of different sizes (a six-band catalog read; see the sketch near the end of this comment), I get this error:

Py4JJavaError                             Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/IPython/core/formatters.py in __call__(self, obj)
    339                 pass
    340             else:
--> 341                 return printer(obj)
    342             # Finally look for special method names
    343             method = get_real_method(obj, self.print_method)

/opt/conda/lib/python3.7/site-packages/pyrasterframes/rf_ipython.py in spark_df_to_html(df, num_rows, truncate)
    178 def spark_df_to_html(df, num_rows=5, truncate=False):
    179     from pyrasterframes import RFContext
--> 180     return RFContext.active().call("_dfToHTML", df._jdf, num_rows, truncate)
    181 
    182 

/opt/conda/lib/python3.7/site-packages/pyrasterframes/rf_context.py in call(name, *args)
     75     def call(name, *args):
     76         f = RFContext.active().lookup(name)
---> 77         return f(*args)
     78 
     79     @staticmethod

/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

/usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling o58._dfToHTML.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 61 in stage 9.0 failed 1 times, most recent failure: Lost task 61.0 in stage 9.0 (TID 201, localhost, executor driver): java.lang.IllegalArgumentException: Error fetching data for one of: GDALRasterSource(https://datarabat.s3.us-east-2.amazonaws.com/b2.tif), GDALRasterSource(https://datarabat.s3.us-east-2.amazonaws.com/b3.tif), GDALRasterSource(https://datarabat.s3.us-east-2.amazonaws.com/b4.tif), GDALRasterSource(https://datarabat.s3.us-east-2.amazonaws.com/b8.tif), GDALRasterSource(https://datarabat.s3.us-east-2.amazonaws.com/b11.tif), GDALRasterSource(https://datarabat.s3.us-east-2.amazonaws.com/b12.tif)
    at org.locationtech.rasterframes.expressions.generators.RasterSourceToRasterRefs.eval(RasterSourceToRasterRefs.scala:82)
    at org.apache.spark.sql.execution.GenerateExec$$anonfun$1$$anonfun$3.apply(GenerateExec.scala:95)
    at org.apache.spark.sql.execution.GenerateExec$$anonfun$1$$anonfun$3.apply(GenerateExec.scala:92)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
    at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:212)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: transpose requires all collections have the same size
    at scala.collection.generic.GenericTraversableTemplate$class.fail$1(GenericTraversableTemplate.scala:213)
    at scala.collection.generic.GenericTraversableTemplate$$anonfun$transpose$1.apply(GenericTraversableTemplate.scala:225)
    at scala.collection.generic.GenericTraversableTemplate$$anonfun$transpose$1.apply(GenericTraversableTemplate.scala:217)
    at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
    at scala.collection.generic.GenericTraversableTemplate$class.transpose(GenericTraversableTemplate.scala:217)
    at scala.collection.AbstractTraversable.transpose(Traversable.scala:104)
    at org.locationtech.rasterframes.expressions.generators.RasterSourceToRasterRefs.eval(RasterSourceToRasterRefs.scala:75)
    ... 30 more

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:365)
    at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3389)
    at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2788)
    at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2788)
    at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
    at org.apache.spark.sql.Dataset.collect(Dataset.scala:2788)
    at org.locationtech.rasterframes.util.DataFrameRenderers$DFWithPrettyPrint.toHTML(DataFrameRenderers.scala:94)
    at org.locationtech.rasterframes.py.PyRFContext._dfToHTML(PyRFContext.scala:241)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Error fetching data for one of: GDALRasterSource(https://datarabat.s3.us-east-2.amazonaws.com/b2.tif), GDALRasterSource(https://datarabat.s3.us-east-2.amazonaws.com/b3.tif), GDALRasterSource(https://datarabat.s3.us-east-2.amazonaws.com/b4.tif), GDALRasterSource(https://datarabat.s3.us-east-2.amazonaws.com/b8.tif), GDALRasterSource(https://datarabat.s3.us-east-2.amazonaws.com/b11.tif), GDALRasterSource(https://datarabat.s3.us-east-2.amazonaws.com/b12.tif)
    at org.locationtech.rasterframes.expressions.generators.RasterSourceToRasterRefs.eval(RasterSourceToRasterRefs.scala:82)
    at org.apache.spark.sql.execution.GenerateExec$$anonfun$1$$anonfun$3.apply(GenerateExec.scala:95)
    at org.apache.spark.sql.execution.GenerateExec$$anonfun$1$$anonfun$3.apply(GenerateExec.scala:92)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
    at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:212)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more
Caused by: java.lang.IllegalArgumentException: transpose requires all collections have the same size
    at scala.collection.generic.GenericTraversableTemplate$class.fail$1(GenericTraversableTemplate.scala:213)
    at scala.collection.generic.GenericTraversableTemplate$$anonfun$transpose$1.apply(GenericTraversableTemplate.scala:225)
    at scala.collection.generic.GenericTraversableTemplate$$anonfun$transpose$1.apply(GenericTraversableTemplate.scala:217)
    at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
    at scala.collection.generic.GenericTraversableTemplate$class.transpose(GenericTraversableTemplate.scala:217)
    at scala.collection.AbstractTraversable.transpose(Traversable.scala:104)
    at org.locationtech.rasterframes.expressions.generators.RasterSourceToRasterRefs.eval(RasterSourceToRasterRefs.scala:75)
    ... 30 more

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/IPython/core/formatters.py in __call__(self, obj)
    339                 pass
    340             else:
--> 341                 return printer(obj)
    342             # Finally look for special method names
    343             method = get_real_method(obj, self.print_method)

/opt/conda/lib/python3.7/site-packages/pyrasterframes/rf_ipython.py in spark_df_to_markdown(df, num_rows, truncate)
    173 def spark_df_to_markdown(df, num_rows=5, truncate=False):
    174     from pyrasterframes import RFContext
--> 175     return RFContext.active().call("_dfToMarkdown", df._jdf, num_rows, truncate)
    176 
    177 

/opt/conda/lib/python3.7/site-packages/pyrasterframes/rf_context.py in call(name, *args)
     75     def call(name, *args):
     76         f = RFContext.active().lookup(name)
---> 77         return f(*args)
     78 
     79     @staticmethod

/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

/usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling o58._dfToMarkdown.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 61 in stage 19.0 failed 1 times, most recent failure: Lost task 61.0 in stage 19.0 (TID 403, localhost, executor driver): java.lang.IllegalArgumentException: Error fetching data for one of: GDALRasterSource(https://datarabat.s3.us-east-2.amazonaws.com/b2.tif), GDALRasterSource(https://datarabat.s3.us-east-2.amazonaws.com/b3.tif), GDALRasterSource(https://datarabat.s3.us-east-2.amazonaws.com/b4.tif), GDALRasterSource(https://datarabat.s3.us-east-2.amazonaws.com/b8.tif), GDALRasterSource(https://datarabat.s3.us-east-2.amazonaws.com/b11.tif), GDALRasterSource(https://datarabat.s3.us-east-2.amazonaws.com/b12.tif)
    at org.locationtech.rasterframes.expressions.generators.RasterSourceToRasterRefs.eval(RasterSourceToRasterRefs.scala:82)
    at org.apache.spark.sql.execution.GenerateExec$$anonfun$1$$anonfun$3.apply(GenerateExec.scala:95)
    at org.apache.spark.sql.execution.GenerateExec$$anonfun$1$$anonfun$3.apply(GenerateExec.scala:92)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
    at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:212)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: transpose requires all collections have the same size
    at scala.collection.generic.GenericTraversableTemplate$class.fail$1(GenericTraversableTemplate.scala:213)
    at scala.collection.generic.GenericTraversableTemplate$$anonfun$transpose$1.apply(GenericTraversableTemplate.scala:225)
    at scala.collection.generic.GenericTraversableTemplate$$anonfun$transpose$1.apply(GenericTraversableTemplate.scala:217)
    at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
    at scala.collection.generic.GenericTraversableTemplate$class.transpose(GenericTraversableTemplate.scala:217)
    at scala.collection.AbstractTraversable.transpose(Traversable.scala:104)
    at org.locationtech.rasterframes.expressions.generators.RasterSourceToRasterRefs.eval(RasterSourceToRasterRefs.scala:75)
    ... 30 more

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:365)
    at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3389)
    at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2788)
    at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2788)
    at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
    at org.apache.spark.sql.Dataset.collect(Dataset.scala:2788)
    at org.locationtech.rasterframes.util.DataFrameRenderers$DFWithPrettyPrint.toMarkdown(DataFrameRenderers.scala:74)
    at org.locationtech.rasterframes.py.PyRFContext._dfToMarkdown(PyRFContext.scala:236)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Error fetching data for one of: GDALRasterSource(https://datarabat.s3.us-east-2.amazonaws.com/b2.tif), GDALRasterSource(https://datarabat.s3.us-east-2.amazonaws.com/b3.tif), GDALRasterSource(https://datarabat.s3.us-east-2.amazonaws.com/b4.tif), GDALRasterSource(https://datarabat.s3.us-east-2.amazonaws.com/b8.tif), GDALRasterSource(https://datarabat.s3.us-east-2.amazonaws.com/b11.tif), GDALRasterSource(https://datarabat.s3.us-east-2.amazonaws.com/b12.tif)
    at org.locationtech.rasterframes.expressions.generators.RasterSourceToRasterRefs.eval(RasterSourceToRasterRefs.scala:82)
    at org.apache.spark.sql.execution.GenerateExec$$anonfun$1$$anonfun$3.apply(GenerateExec.scala:95)
    at org.apache.spark.sql.execution.GenerateExec$$anonfun$1$$anonfun$3.apply(GenerateExec.scala:92)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
    at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:212)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more
Caused by: java.lang.IllegalArgumentException: transpose requires all collections have the same size
    at scala.collection.generic.GenericTraversableTemplate$class.fail$1(GenericTraversableTemplate.scala:213)
    at scala.collection.generic.GenericTraversableTemplate$$anonfun$transpose$1.apply(GenericTraversableTemplate.scala:225)
    at scala.collection.generic.GenericTraversableTemplate$$anonfun$transpose$1.apply(GenericTraversableTemplate.scala:217)
    at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
    at scala.collection.generic.GenericTraversableTemplate$class.transpose(GenericTraversableTemplate.scala:217)
    at scala.collection.AbstractTraversable.transpose(Traversable.scala:104)
    at org.locationtech.rasterframes.expressions.generators.RasterSourceToRasterRefs.eval(RasterSourceToRasterRefs.scala:75)
    ... 30 more

DataFrame[bleu: struct<tile_context:struct<extent:struct<xmin:double,ymin:double,xmax:double,ymax:double>,crs:struct<crsProj4:string>>,tile:udt>, green: struct<tile_context:struct<extent:struct<xmin:double,ymin:double,xmax:double,ymax:double>,crs:struct<crsProj4:string>>,tile:udt>, red: struct<tile_context:struct<extent:struct<xmin:double,ymin:double,xmax:double,ymax:double>,crs:struct<crsProj4:string>>,tile:udt>, nir: struct<tile_context:struct<extent:struct<xmin:double,ymin:double,xmax:double,ymax:double>,crs:struct<crsProj4:string>>,tile:udt>, swir1: struct<tile_context:struct<extent:struct<xmin:double,ymin:double,xmax:double,ymax:double>,crs:struct<crsProj4:string>>,tile:udt>, swir2: struct<tile_context:struct<extent:struct<xmin:double,ymin:double,xmax:double,ymax:double>,crs:struct<crsProj4:string>>,tile:udt>]
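
For reference, a read along these lines produces the schema above (a hypothetical reconstruction from the column names and the URIs in the trace; the exact call wasn't captured here):

import pandas as pd

# One-row catalog: the column names match the schema above; the URIs are the
# ones listed in the stack trace.
catalog = pd.DataFrame([{
    'bleu':  'https://datarabat.s3.us-east-2.amazonaws.com/b2.tif',
    'green': 'https://datarabat.s3.us-east-2.amazonaws.com/b3.tif',
    'red':   'https://datarabat.s3.us-east-2.amazonaws.com/b4.tif',
    'nir':   'https://datarabat.s3.us-east-2.amazonaws.com/b8.tif',
    'swir1': 'https://datarabat.s3.us-east-2.amazonaws.com/b11.tif',
    'swir2': 'https://datarabat.s3.us-east-2.amazonaws.com/b12.tif',
}])

df = spark.read.raster(catalog, catalog_col_names=list(catalog.columns))
df  # rendering df in the notebook triggers the errors above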

Any help is appreciated.

vpipkt commented 4 years ago

@Atika9 Thanks for filing the issue!

Reading bands of different resolutions is a known limitation of the RasterFrames RasterSource-based reader.

We may be able to work around it, and I'd be interested to see whether the workaround is performant for your use case. Below is a brief snippet that does the following:

1. Read the 10m bands, with a known tile size and spatial partitioning (available on the develop branch but not yet released).
2. Read the 20m bands:
   1. at half the 10m tile size,
   2. upsampled by a factor of 2,
   3. with the same spatial partitioning as the 10m bands.
3. Join the two together.

When I try this with the URIs in the stack trace you posted, it results in an empty DataFrame, because the 10m and 20m bands have different spatial extents. Note that band 11's upper coordinate is 10m higher than band 2's.

Reading a catalog does assume that the extent of the product is the same for each band; however, I don't think anything explicitly enforces it.

from pyspark.sql.functions import col, lit
from pyrasterframes.rasterfunctions import rf_resample, rf_crs, rf_extent

df2348 = spark.read.raster(
    [[f'https://datarabat.s3.us-east-2.amazonaws.com/b{b}.tif' for b in [2, 3, 4, 8]]],  
    tile_dimensions=(256, 256),
    spatial_index_partitions=200,
).select(
    # rename columns
    col('proj_raster_0').alias('b2'),
    col('proj_raster_1').alias('b3'),
    col('proj_raster_2').alias('b4'),
    col('proj_raster_3').alias('b8'),
)

df1112 = spark.read.raster(
    [[f'https://datarabat.s3.us-east-2.amazonaws.com/b{b}.tif' for b in [11, 12]]], 
    tile_dimensions=(128, 128),
    spatial_index_partitions=200,
).select(
    # upsample by 2 and rename columns
    rf_resample('proj_raster_0', lit(2)).alias('b11'),
    rf_resample('proj_raster_1', lit(2)).alias('b12'),
)

cond = [
    rf_crs(df2348.b2) == rf_crs(df1112.b11),
    rf_extent(df2348.b2) == rf_extent(df1112.b11)
]
df = df2348.join(df1112, cond)
df.printSchema()

Returns:

root
 |-- b2: struct (nullable = true)
 |    |-- tile_context: struct (nullable = true)
 |    |    |-- extent: struct (nullable = false)
 |    |    |    |-- xmin: double (nullable = false)
 |    |    |    |-- ymin: double (nullable = false)
 |    |    |    |-- xmax: double (nullable = false)
 |    |    |    |-- ymax: double (nullable = false)
 |    |    |-- crs: struct (nullable = false)
 |    |    |    |-- crsProj4: string (nullable = false)
 |    |-- tile: tile (nullable = false)
 |-- b3: struct (nullable = true)
 |    |-- tile_context: struct (nullable = true)
 |    |    |-- extent: struct (nullable = false)
 |    |    |    |-- xmin: double (nullable = false)
 |    |    |    |-- ymin: double (nullable = false)
 |    |    |    |-- xmax: double (nullable = false)
 |    |    |    |-- ymax: double (nullable = false)
 |    |    |-- crs: struct (nullable = false)
 |    |    |    |-- crsProj4: string (nullable = false)
 |    |-- tile: tile (nullable = false)
 |-- b4: struct (nullable = true)
 |    |-- tile_context: struct (nullable = true)
 |    |    |-- extent: struct (nullable = false)
 |    |    |    |-- xmin: double (nullable = false)
 |    |    |    |-- ymin: double (nullable = false)
 |    |    |    |-- xmax: double (nullable = false)
 |    |    |    |-- ymax: double (nullable = false)
 |    |    |-- crs: struct (nullable = false)
 |    |    |    |-- crsProj4: string (nullable = false)
 |    |-- tile: tile (nullable = false)
 |-- b8: struct (nullable = true)
 |    |-- tile_context: struct (nullable = true)
 |    |    |-- extent: struct (nullable = false)
 |    |    |    |-- xmin: double (nullable = false)
 |    |    |    |-- ymin: double (nullable = false)
 |    |    |    |-- xmax: double (nullable = false)
 |    |    |    |-- ymax: double (nullable = false)
 |    |    |-- crs: struct (nullable = false)
 |    |    |    |-- crsProj4: string (nullable = false)
 |    |-- tile: tile (nullable = false)
 |-- b11: struct (nullable = true)
 |    |-- tile_context: struct (nullable = true)
 |    |    |-- extent: struct (nullable = false)
 |    |    |    |-- xmin: double (nullable = false)
 |    |    |    |-- ymin: double (nullable = false)
 |    |    |    |-- xmax: double (nullable = false)
 |    |    |    |-- ymax: double (nullable = false)
 |    |    |-- crs: struct (nullable = false)
 |    |    |    |-- crsProj4: string (nullable = false)
 |    |-- tile: tile (nullable = false)
 |-- b12: struct (nullable = true)
 |    |-- tile_context: struct (nullable = true)
 |    |    |-- extent: struct (nullable = false)
 |    |    |    |-- xmin: double (nullable = false)
 |    |    |    |-- ymin: double (nullable = false)
 |    |    |    |-- xmax: double (nullable = false)
 |    |    |    |-- ymax: double (nullable = false)
 |    |    |-- crs: struct (nullable = false)
 |    |    |    |-- crsProj4: string (nullable = false)
 |    |-- tile: tile (nullable = false)
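
As an aside, you can confirm the extent mismatch outside of Spark. A quick sketch using rasterio (my tool choice here is an assumption; any GDAL-based tool reports the same information):

import rasterio

# Compare the native bounds and resolution of a 10m band (b2) and a 20m band (b11).
for b in (2, 11):
    with rasterio.open(f'https://datarabat.s3.us-east-2.amazonaws.com/b{b}.tif') as src:
        print(f'b{b}: bounds={src.bounds}, res={src.res}')
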
metasim commented 4 years ago

develop also has rasterJoin in it, which might be another option... it will upsample and handle extent offsets.
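
From Python that might look like the sketch below (an assumption on my part: it presumes a develop build where pyrasterframes attaches a raster_join method to DataFrame, and it reuses df2348 and df1112 from the snippet above):

# raster_join spatially joins the two frames, reprojecting/resampling the
# right-hand tiles onto the left-hand tile grid, so differing extents and
# resolutions are handled for you.
joined = df2348.raster_join(df1112)
joined.printSchema()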

vpipkt commented 4 years ago

rasterJoin can be a pretty expensive operation, but it may be the best option.

The join on extents is more likely to result in rows getting dropped. Here is another example with data that "almost" matches up in terms of extents; there is a very small floating-point difference between them.

The MODIS LST is a nominal 1000m product, and NBAR a nominal 500m product. They are laid out on the same grid.

from pyrasterframes.utils import create_rf_spark_session
from pyrasterframes.rasterfunctions import *  # rf_resample, rf_crs, rf_extent, rf_geometry, st_intersects, st_centroid, ...
from pyspark.sql.functions import col, lit

spark = create_rf_spark_session()

lst_url = '/vsis3/astraea-opendata/MOD11A1.006/21/06/2019349/MOD11A1.A2019349.h21v06.006.2019352050642_LSTD_B01.TIF'
nbar_url = '/vsis3/astraea-opendata/MCD43A4.006/21/06/2019357/MCD43A4.A2019357.h21v06.006.2020001035452_B01.TIF'

df_lst = spark.read.raster(lst_url,
                        tile_dimensions=(128, 128),
                        spatial_index_partitions=200) \
            .select(rf_resample('proj_raster', lit(2.0)).alias('LST'))

df_nbar = spark.read.raster(nbar_url,
                          tile_dimensions=(256, 256),
                          spatial_index_partitions=200) \
            .select(col("proj_raster").alias('NBAR'))

# Conditions to join on.
# The st_intersects works around the floating-point precision of how the 500m and 1000m products are gridded.

cond = [
    rf_crs(df_lst.LST) == rf_crs(df_nbar.NBAR),
    st_intersects(st_centroid(rf_geometry(df_lst.LST)), rf_geometry(df_nbar.NBAR))
       ]

df = df_lst.join(df_nbar, cond)

df.select((rf_extent('NBAR').xmin - rf_extent('LST').xmin).alias('xmindiff'),
          (rf_extent('NBAR').ymin - rf_extent('LST').ymin).alias('ymindiff'),
          (rf_extent('NBAR').xmax - rf_extent('LST').xmax).alias('xmaxdiff'),
          (rf_extent('NBAR').ymax - rf_extent('LST').ymax).alias('ymaxdiff')
         ).show(10, False)

Result:

+----------------------+----------------------+----------------------+----------------------+
|xmindiff              |ymindiff              |xmaxdiff              |ymaxdiff              |
+----------------------+----------------------+----------------------+----------------------+
|-2.999999560415745E-4 |-2.005002461373806E-4 |-3.1061330810189247E-4|-2.0448025315999985E-4|
|-3.1061330810189247E-4|-2.005002461373806E-4 |-3.2122666016221046E-4|-2.0448025315999985E-4|
|-2.999999560415745E-4 |-2.0448025315999985E-4|-3.1061330810189247E-4|-2.1509360522031784E-4|
|-3.1061330810189247E-4|-2.0448025315999985E-4|-3.2122666016221046E-4|-2.1509360522031784E-4|
|-2.999999560415745E-4 |-2.1509313955903053E-4|-3.1061330810189247E-4|-2.2570649161934853E-4|
|-3.1061330810189247E-4|-2.1509313955903053E-4|-3.2122666016221046E-4|-2.2570649161934853E-4|
|-3.2122666016221046E-4|-2.005002461373806E-4 |-3.3184001222252846E-4|-2.0448025315999985E-4|
|-3.3184001222252846E-4|-2.005002461373806E-4 |-3.4245336428284645E-4|-2.0448025315999985E-4|
|-3.4245336428284645E-4|-2.005002461373806E-4 |-3.5306671634316444E-4|-2.0448025315999985E-4|
|-3.2122666016221046E-4|-2.0448025315999985E-4|-3.3184001222252846E-4|-2.1509360522031784E-4|
+----------------------+----------------------+----------------------+----------------------+

Just out of curiosity, what is the worst offset here? These are in meters, per the MODIS MODLAND sinusoidal CRS.

from pyspark.sql.functions import abs as F_abs
from pyspark.sql.functions import max as F_max

df.select(F_max(F_abs(rf_extent('NBAR').xmin - rf_extent('LST').xmin)).alias('xmindiff'),
          F_max(F_abs(rf_extent('NBAR').ymin - rf_extent('LST').ymin)).alias('ymindiff'),
          F_max(F_abs(rf_extent('NBAR').xmax - rf_extent('LST').xmax)).alias('xmaxdiff'),
          F_max(F_abs(rf_extent('NBAR').ymax - rf_extent('LST').ymax)).alias('ymaxdiff'),
         ).show(10, False)

Result:

+--------------------+--------------------+---------------------+--------------------+
|xmindiff            |ymindiff            |xmaxdiff             |ymaxdiff            |
+--------------------+--------------------+---------------------+--------------------+
|3.955205902457237E-4|2.893866039812565E-4|3.9950013160705566E-4|2.999999560415745E-4|
+--------------------+--------------------+---------------------+--------------------+

You can also compare df_lst.count() against df.count() to verify.
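
For example, a quick sketch (the variable names are just illustrative):

rows_before = df_lst.count()    # rows prior to the join
rows_after = df.count()         # rows surviving the join
print(rows_before, rows_after)  # a large drop means the condition is discarding tiles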

Atika9 commented 4 years ago

@vpipkt Thank you for your reply.

Using your code, the display() and show() functions return empty results.

df.select('b2','b3','b4','b8','b11','b12')


b2 | b3 | b4 | b8 | b11 | b12
-- | -- | -- | -- | -- | --

display(df)


b2 | b3 | b4 | b8 | b11 | b12
-- | -- | -- | -- | -- | --
 

df.select(rf_extent('b2'), rf_extent('b11')).show()

+-------------+--------------+
|rf_extent(b2)|rf_extent(b11)|
+-------------+--------------+
+-------------+--------------+

Thank you in advance for any help you can provide.  

vpipkt commented 4 years ago

@Atika9 this means that the join is empty: there are no rows where the cond column expression is true. That's because of the different spatial extents across the columns being joined, as I mentioned earlier.

I would encourage you to take a look at the join expression in my second example above, and also to consider rasterJoin, which, sad to say, is not yet documented (see https://github.com/locationtech/rasterframes/issues/323). Adapted to your columns, that join condition would look something like the sketch below.
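
A hedged sketch (untested against your data; it reuses df2348 and df1112 from my first comment and relaxes the exact-extent equality to a centroid intersection, as in the MODIS example):

from pyrasterframes.rasterfunctions import rf_crs, rf_geometry, st_centroid, st_intersects

cond = [
    rf_crs(df2348.b2) == rf_crs(df1112.b11),
    # tolerate small grid offsets between the 10m and 20m bands
    st_intersects(st_centroid(rf_geometry(df1112.b11)), rf_geometry(df2348.b2))
]
df = df2348.join(df1112, cond)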

We do take away from this two aspects of the use case that will inform our future development:

  1. Reading through RasterSourceDataSource with different extents across bands in a single catalog row
  2. Reading through RasterSourceDataSource with different resolutions across bands in a single catalog row

For the first item, we have hit related issues several times when working with, for example, Landsat data acquired on different dates over the same WRS grid cell. In those situations the exact scene extent varies by a few hundred meters across dates.

vpipkt commented 4 years ago

Stale issue; going to close. @Atika9, please shout if you need more support. Note that we have recently merged changes allowing greater control over how rf_resample works under the hood.