Closed · Atika9 closed this 4 years ago
@Atika9 Thanks for filing the issue!
Reading different resolutions across bands is a known limitation of the RasterFrames `RasterSource`.
We may be able to work around it, and I'd be interested to see whether the workaround is performant for your use case. Below is a brief snippet of code that works around it by:

1) Reading the 10m bands, with a known tile size and spatial partitioning (available on the `develop` branch but not yet released)
2) Reading the 20m bands
3) Joining the two together

When I try this with the URIs in the stack trace you posted, it results in an empty DataFrame, because the 10m and 20m bands have different spatial extents. Note that band 11's upper coordinate is 10m higher than band 2's.
Reading a catalog does assume that the extent of the product is the same for each band; however, I do not think anything explicitly enforces that.
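To make the extent mismatch concrete, here is a toy sketch in plain Python. The extent tuples stand in for RasterFrames extent structs, and the values are hypothetical Sentinel-2-like coordinates (not read from these files): two bands whose overall extents are offset by 10m fail an exact-equality check even though their footprints overlap almost completely.

```python
# Hypothetical extents (xmin, ymin, xmax, ymax) for two bands of one product,
# with the second band's y coordinates shifted 10 m "up", as observed for
# band 11 vs. band 2 above.
b2_extent  = (399960.0, 4090200.0, 509760.0, 4200000.0)
b11_extent = (399960.0, 4090210.0, 509760.0, 4200010.0)

# An exact-equality check, like the extent-based join condition, fails:
print(b2_extent == b11_extent)   # False

# ...even though the two footprints overlap almost entirely:
def overlap_fraction(a, b):
    """Fraction of extent a's area covered by the intersection of a and b."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    return (w * h) / area_a

print(round(overlap_fraction(b2_extent, b11_extent), 6))   # 0.999909
```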
```python
from pyspark.sql.functions import col, lit
from pyrasterframes.rasterfunctions import rf_resample, rf_crs, rf_extent

df2348 = spark.read.raster(
    [[f'https://datarabat.s3.us-east-2.amazonaws.com/b{b}.tif' for b in [2, 3, 4, 8]]],
    tile_dimensions=(256, 256),
    spatial_index_partitions=200,
).select(
    # rename columns
    col('proj_raster_0').alias('b2'),
    col('proj_raster_1').alias('b3'),
    col('proj_raster_2').alias('b4'),
    col('proj_raster_3').alias('b8'),
)

df1112 = spark.read.raster(
    [[f'https://datarabat.s3.us-east-2.amazonaws.com/b{b}.tif' for b in [11, 12]]],
    tile_dimensions=(128, 128),
    spatial_index_partitions=200,
).select(
    # upsample by 2x and rename columns
    rf_resample('proj_raster_0', lit(2)).alias('b11'),
    rf_resample('proj_raster_1', lit(2)).alias('b12'),
)

cond = [
    rf_crs(df2348.b2) == rf_crs(df1112.b11),
    rf_extent(df2348.b2) == rf_extent(df1112.b11),
]
df = df2348.join(df1112, cond)
df.printSchema()
```
Returns:

```
root
 |-- b2: struct (nullable = true)
 |    |-- tile_context: struct (nullable = true)
 |    |    |-- extent: struct (nullable = false)
 |    |    |    |-- xmin: double (nullable = false)
 |    |    |    |-- ymin: double (nullable = false)
 |    |    |    |-- xmax: double (nullable = false)
 |    |    |    |-- ymax: double (nullable = false)
 |    |    |-- crs: struct (nullable = false)
 |    |    |    |-- crsProj4: string (nullable = false)
 |    |-- tile: tile (nullable = false)
 |-- b3: struct (nullable = true)
 |    |-- tile_context: struct (nullable = true)
 |    |    |-- extent: struct (nullable = false)
 |    |    |    |-- xmin: double (nullable = false)
 |    |    |    |-- ymin: double (nullable = false)
 |    |    |    |-- xmax: double (nullable = false)
 |    |    |    |-- ymax: double (nullable = false)
 |    |    |-- crs: struct (nullable = false)
 |    |    |    |-- crsProj4: string (nullable = false)
 |    |-- tile: tile (nullable = false)
 |-- b4: struct (nullable = true)
 |    |-- tile_context: struct (nullable = true)
 |    |    |-- extent: struct (nullable = false)
 |    |    |    |-- xmin: double (nullable = false)
 |    |    |    |-- ymin: double (nullable = false)
 |    |    |    |-- xmax: double (nullable = false)
 |    |    |    |-- ymax: double (nullable = false)
 |    |    |-- crs: struct (nullable = false)
 |    |    |    |-- crsProj4: string (nullable = false)
 |    |-- tile: tile (nullable = false)
 |-- b8: struct (nullable = true)
 |    |-- tile_context: struct (nullable = true)
 |    |    |-- extent: struct (nullable = false)
 |    |    |    |-- xmin: double (nullable = false)
 |    |    |    |-- ymin: double (nullable = false)
 |    |    |    |-- xmax: double (nullable = false)
 |    |    |    |-- ymax: double (nullable = false)
 |    |    |-- crs: struct (nullable = false)
 |    |    |    |-- crsProj4: string (nullable = false)
 |    |-- tile: tile (nullable = false)
 |-- b11: struct (nullable = true)
 |    |-- tile_context: struct (nullable = true)
 |    |    |-- extent: struct (nullable = false)
 |    |    |    |-- xmin: double (nullable = false)
 |    |    |    |-- ymin: double (nullable = false)
 |    |    |    |-- xmax: double (nullable = false)
 |    |    |    |-- ymax: double (nullable = false)
 |    |    |-- crs: struct (nullable = false)
 |    |    |    |-- crsProj4: string (nullable = false)
 |    |-- tile: tile (nullable = false)
 |-- b12: struct (nullable = true)
 |    |-- tile_context: struct (nullable = true)
 |    |    |-- extent: struct (nullable = false)
 |    |    |    |-- xmin: double (nullable = false)
 |    |    |    |-- ymin: double (nullable = false)
 |    |    |    |-- xmax: double (nullable = false)
 |    |    |    |-- ymax: double (nullable = false)
 |    |    |-- crs: struct (nullable = false)
 |    |    |    |-- crsProj4: string (nullable = false)
 |    |-- tile: tile (nullable = false)
```
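For intuition about the `rf_resample(…, lit(2))` step above: a factor-2 upsample takes each 128×128 tile of the 20m bands to 256×256, so it lines up with the 10m tiles read at `tile_dimensions=(256, 256)`. Here is a minimal nearest-neighbour sketch in plain Python; it is illustrative only, and not necessarily the interpolation method RasterFrames applies internally.

```python
# Illustrative only: a nearest-neighbour upsample with plain Python lists,
# showing the shape change that aligns a 20 m tile grid with a 10 m one.

def upsample_nearest(tile, factor):
    """Repeat each cell `factor` times along both rows and columns."""
    return [[row[j // factor] for j in range(len(row) * factor)]
            for row in tile for _ in range(factor)]

tile_20m = [[float(r * 128 + c) for c in range(128)] for r in range(128)]   # 128x128
tile_10m = upsample_nearest(tile_20m, 2)                                    # 256x256

print(len(tile_20m), len(tile_10m))   # 128 256
# Each 20 m cell now covers a 2x2 block of identical "10 m" cells:
print(tile_10m[0][0] == tile_10m[1][1] == tile_20m[0][0])   # True
```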
The `develop` branch also has `rasterJoin` in it, which might be another option: it will upsample and handle extent offsets. `rasterJoin` can be a pretty expensive operation, but it may be the best option.
A join on exact extents is more likely to result in rows getting dropped. Here is another example with data that "almost" matches up in terms of extents: there is a very small floating point difference between them.
The MODIS LST is a nominal 1000m product, and NBAR a nominal 500m product. They are laid out on the same grid.
```python
from pyrasterframes.utils import create_rf_spark_session
from pyspark.sql.functions import col, lit
from pyrasterframes.rasterfunctions import (
    rf_resample, rf_crs, rf_extent, rf_geometry, st_intersects, st_centroid
)

spark = create_rf_spark_session()

lst_url = '/vsis3/astraea-opendata/MOD11A1.006/21/06/2019349/MOD11A1.A2019349.h21v06.006.2019352050642_LSTD_B01.TIF'
nbar_url = '/vsis3/astraea-opendata/MCD43A4.006/21/06/2019357/MCD43A4.A2019357.h21v06.006.2020001035452_B01.TIF'

df_lst = spark.read.raster(lst_url,
                           tile_dimensions=(128, 128),
                           spatial_index_partitions=200) \
    .select(rf_resample('proj_raster', lit(2.0)).alias('LST'))

df_nbar = spark.read.raster(nbar_url,
                            tile_dimensions=(256, 256),
                            spatial_index_partitions=200) \
    .select(col('proj_raster').alias('NBAR'))

# Conditions to join on.
# The st_intersects on the centroid works around the floating point precision
# of how the 500m and 1000m products are gridded.
cond = [
    rf_crs(df_lst.LST) == rf_crs(df_nbar.NBAR),
    st_intersects(st_centroid(rf_geometry(df_lst.LST)), rf_geometry(df_nbar.NBAR))
]
df = df_lst.join(df_nbar, cond)

# Per-row differences between the joined tiles' extents
df.select((rf_extent('NBAR').xmin - rf_extent('LST').xmin).alias('xmindiff'),
          (rf_extent('NBAR').ymin - rf_extent('LST').ymin).alias('ymindiff'),
          (rf_extent('NBAR').xmax - rf_extent('LST').xmax).alias('xmaxdiff'),
          (rf_extent('NBAR').ymax - rf_extent('LST').ymax).alias('ymaxdiff'),
          ).show(10, False)
```
Result:
```
+----------------------+----------------------+----------------------+----------------------+
|xmindiff              |ymindiff              |xmaxdiff              |ymaxdiff              |
+----------------------+----------------------+----------------------+----------------------+
|-2.999999560415745E-4 |-2.005002461373806E-4 |-3.1061330810189247E-4|-2.0448025315999985E-4|
|-3.1061330810189247E-4|-2.005002461373806E-4 |-3.2122666016221046E-4|-2.0448025315999985E-4|
|-2.999999560415745E-4 |-2.0448025315999985E-4|-3.1061330810189247E-4|-2.1509360522031784E-4|
|-3.1061330810189247E-4|-2.0448025315999985E-4|-3.2122666016221046E-4|-2.1509360522031784E-4|
|-2.999999560415745E-4 |-2.1509313955903053E-4|-3.1061330810189247E-4|-2.2570649161934853E-4|
|-3.1061330810189247E-4|-2.1509313955903053E-4|-3.2122666016221046E-4|-2.2570649161934853E-4|
|-3.2122666016221046E-4|-2.005002461373806E-4 |-3.3184001222252846E-4|-2.0448025315999985E-4|
|-3.3184001222252846E-4|-2.005002461373806E-4 |-3.4245336428284645E-4|-2.0448025315999985E-4|
|-3.4245336428284645E-4|-2.005002461373806E-4 |-3.5306671634316444E-4|-2.0448025315999985E-4|
|-3.2122666016221046E-4|-2.0448025315999985E-4|-3.3184001222252846E-4|-2.1509360522031784E-4|
+----------------------+----------------------+----------------------+----------------------+
```
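To see why the centroid condition absorbs offsets of that magnitude while exact extent equality does not, here is a toy sketch with plain tuples standing in for RasterFrames geometries. The extent values are hypothetical, chosen to mimic the ~3e-4 m offsets in the table.

```python
# Hypothetical extents (xmin, ymin, xmax, ymax): an NBAR tile and an LST tile
# shifted by a few 1e-4 m, roughly the magnitudes shown above.
nbar = (0.0, 0.0, 128000.0, 128000.0)
lst  = (-3.0e-4, -2.0e-4, 128000.0 - 3.1e-4, 128000.0 - 2.0e-4)

# Exact equality (the join condition from the first example) fails:
print(nbar == lst)   # False

# Centroid-within-extent (the st_intersects/st_centroid condition) succeeds:
def centroid(e):
    return ((e[0] + e[2]) / 2.0, (e[1] + e[3]) / 2.0)

def contains(e, pt):
    return e[0] <= pt[0] <= e[2] and e[1] <= pt[1] <= e[3]

print(contains(nbar, centroid(lst)))   # True
```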
Just out of curiosity, what is the worst offset here? These are in meters, per the MODIS MODLAND sinusoidal CRS:
```python
from pyspark.sql.functions import abs as F_abs
from pyspark.sql.functions import max as F_max

df.select(F_max(F_abs(rf_extent('NBAR').xmin - rf_extent('LST').xmin)).alias('xmindiff'),
          F_max(F_abs(rf_extent('NBAR').ymin - rf_extent('LST').ymin)).alias('ymindiff'),
          F_max(F_abs(rf_extent('NBAR').xmax - rf_extent('LST').xmax)).alias('xmaxdiff'),
          F_max(F_abs(rf_extent('NBAR').ymax - rf_extent('LST').ymax)).alias('ymaxdiff'),
          ).show(10, False)
```
Result:

```
+--------------------+--------------------+---------------------+--------------------+
|xmindiff            |ymindiff            |xmaxdiff             |ymaxdiff            |
+--------------------+--------------------+---------------------+--------------------+
|3.955205902457237E-4|2.893866039812565E-4|3.9950013160705566E-4|2.999999560415745E-4|
+--------------------+--------------------+---------------------+--------------------+
```
You can also check `df_lst.count()` versus `df.count()` to verify.
@vpipkt Thank you for your reply.
Using your code, the `display()` and `show()` functions return empty results.
```python
df.select('b2', 'b3', 'b4', 'b8', 'b11', 'b12')
```

b2 | b3 | b4 | b8 | b11 | b12
-- | -- | -- | -- | -- | --

```python
display(df)
```

b2 | b3 | b4 | b8 | b11 | b12
-- | -- | -- | -- | -- | --

```python
df.select(rf_extent('b2'), rf_extent('b11')).show()
```

```
+-------------+--------------+
|rf_extent(b2)|rf_extent(b11)|
+-------------+--------------+
+-------------+--------------+
```
Thank you in advance for any help you can provide.
@Atika9 this means that the `join` result is empty: there are no rows where the `cond` column expression is true. That is because of the different spatial extents across the columns meant to be joined, as I mentioned earlier.
I would encourage you to take a look at the join expression in my later example, and also to consider `rasterJoin`, which, sad to say, is not yet documented: https://github.com/locationtech/rasterframes/issues/323.
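For concreteness, here is a toy sketch (plain Python, hypothetical grid values) of how a constant 10m offset between two tilings empties an inner join keyed on exact extents:

```python
def tile_extents(x0, y0, tile_size, n):
    """Extents (xmin, ymin, xmax, ymax) of an n x n grid of square tiles
    whose lower-left origin is (x0, y0)."""
    return {
        (x0 + i * tile_size, y0 + j * tile_size,
         x0 + (i + 1) * tile_size, y0 + (j + 1) * tile_size)
        for i in range(n) for j in range(n)
    }

# Hypothetical tilings: same tile size, origins offset by 10 m in y,
# as with the 10 m and 20 m bands above.
grid_a = tile_extents(399960.0, 4090200.0, 2560.0, 10)
grid_b = tile_extents(399960.0, 4090210.0, 2560.0, 10)

# An inner join on exact extent equality keeps only the shared keys:
print(len(grid_a & grid_b))   # 0  -> every row dropped, hence the empty DataFrame
```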
We take two aspects from this use case that inform our future development.
First, we have experienced a related issue several times when working with, e.g., Landsat data acquired across different dates for the same WRS grid. In those situations, the exact scene extent varies across dates by a few hundred meters.
Stale issue; going to close. @Atika9 please shout if there is more support you need. Note that we have recently merged changes allowing greater control of how `rf_resample` works under the hood.
When I try to read multiple bands of different sizes I get this error:
Any help is appreciated.