rspatial / terra

R package for spatial data handling https://rspatial.github.io/terra/reference/terra-package.html
GNU General Public License v3.0
534 stars 88 forks source link

Support for flexible queries in `terra::query`? #1157

Closed cboettig closed 1 year ago

cboettig commented 1 year ago

@rhijmans thanks once again for this amazing package. Very impressed at the power of proxy objects when working with large remote vector layers, e.g. I find syntax like this compelling:

url <- "/vsicurl/https://minio.carlboettiger.info/biodiversity/World-Protected-Areas-May2023.gdb"
v <- terra::vect(url, proxy=TRUE)

letting us instantly see the file metadata without downloading the rather large object or bringing it all in to RAM. If I want to do some more exploration on the remote object, e.g. look at unique values of the "NAMES" variable, it seems the only option to do so is to start again with the query:

v_names <- terra::vect(url, query = "SELECT DISTINCT NAME FROM WDPA_WDOECM_poly_May2023_all")

In other contexts, we could instead have piped v to terra::query(), but it seems to have no ability to support arbitrary queries like SELECT DISTINCT. It would be great if query() could gain an additional argument to take an arbitrary OGR-SQL query, like you already allow us to do inside vect.

I also think a more complicated but more elegant solution would be to permit the use of common dplyr verbs that can be translated to SQL, the way dbplyr and arrow already do such meta-programming translations. Then the above query could be:

url <- "/vsicurl/https://minio.carlboettiger.info/biodiversity/World-Protected-Areas-May2023.gdb"
v <- terra::vect(url, proxy=TRUE)

v |> dplyr::distinct(NAMES) 

etc. Of course there would be edge cases that don't work (i.e. where some dplyr code that would work on a tibble cannot be translated into an OGR-SQL query), but that is already the case for users using dplyr to interact with, say, SQL databases or arrow/Parquet files) .

Relatedly, it looks like query() is always non-lazy in terra? I find the pattern used in dplyr::collect() to trigger evaluation (also used by most other paradigms with lazy-evaluation) to be really convenient. It's already nice that vect() supports lazy evaluation this way, but it might be nice to flush out the support for lazy operations more thoroughly. (e.g. for a more explicitly spatial data example, I find the lazy-evaluation in gdalcubes really thoughtful and clever).

And of course apologies if there are already ways to do this kind of lazy eval in terra that I'm overlooking!

mdsumner commented 1 year ago

Apologies to divert a little ... I came here to make a case for also including the SQL dialect, so that (say) SQLITE functions can be used no matter what the format ... but then remembered can do this by VRT construction (within limits of terra passing the VRT directly to GDAL, and in what it will accept as vect output):

url <- "/vsicurl/https://minio.carlboettiger.info/biodiversity/World-Protected-Areas-May2023.gdb"
terra::vect(url, proxy = TRUE)
# class       : SpatVectorProxy
# geometry    : polygons 
# dimensions  : 273868, 30  (geometries, attributes)
# extent      : 175.7782, 175.7809, -37.42824, -37.4256  (xmin, xmax, ymin, ymax)
# source      : World-Protected-Areas-May2023.gdb (WDPA_WDOECM_poly_May2023_all)
# coord. ref. : lon/lat WGS 84 (EPSG:4326) 
# names       : WDPAID WDPA_PID PA_DEF  NAME ORIG_NAME DESIG DESIG_ENG DESIG_TYPE IUCN_CAT INT_CRIT (and 20 more)
# type        :  <num>    <chr>  <chr> <chr>     <chr> <chr>     <chr>      <chr>    <chr>    <chr>              
#   Warning message:
#   [vect] Reading layer: WDPA_WDOECM_poly_May2023_all
# Other layers: WDPA_WDOECM_point_May2023_all, WDPA_WDOECM_source_May2023_all

## (SHAPE is what the geometry presents as the geometry column to GDAL from GDB)
vrt <- '<OGRVRTDataSource>
        <OGRVRTLayer name="limit10">
          <SrcDataSource>/vsicurl/https://minio.carlboettiger.info/biodiversity/World-Protected-Areas-May2023.gdb</SrcDataSource>
          <SrcSQL dialect="sqlite">SELECT WDPAID, SHAPE from WDPA_WDOECM_poly_May2023_all LIMIT 10</SrcSQL>
    </OGRVRTLayer>
</OGRVRTDataSource>'

terra::vect(vrt)
# class       : SpatVector 
# geometry    : polygons 
# dimensions  : 10, 1  (geometries, attributes)
# extent      : 170.4084, 176.8883, -45.79126, -37.4256  (xmin, xmax, ymin, ymax)
# source      : OGRVRTDataSource> (limit10)
# coord. ref. : lon/lat WGS 84 (EPSG:4326) 
# names       :    WDPAID
 ...

I think DISTINCT won't work well because of the format, and 'ConvexHull(SHAPE)' as an example is not accepted by terra - but there's a lot of scope to explore here. Users can expose what can and can't be done with GDAL via terra, which seems a positive seam for contributions here.

rhijmans commented 1 year ago

I have added a new argument "sql" to terra::query so that you can now do

url <- "/vsicurl/https://minio.carlboettiger.info/biodiversity/World-Protected-Areas-May2023.gdb"
v <- terra::vect(url, proxy=TRUE)
q <- query(v, sql="SELECT * FROM WDPA_WDOECM_poly_May2023_all LIMIT 3")
PMassicotte commented 1 year ago

Wow that was fast!

Trying it right now

PMassicotte commented 1 year ago

Indeed, works like a charm. Thank you @rhijmans

cboettig commented 1 year ago

wonderful, thanks @rhijmans . I feel this pretty much addresses the top issue so feel free to close unless you think it's worth further considerations of syntax for these queries.

rhijmans commented 1 year ago

It is now also possible, by specifying what="attributes" to do an "attribute-only" query as in @cboettig's example.

library(terra)
f <- system.file("ex/lux.shp", package="terra")
v <- vect(f, proxy=TRUE)
query(v, sql= "SELECT DISTINCT NAME_1 FROM lux", what="att")
#        NAME_1
#1     Diekirch
#2 Grevenmacher
#3   Luxembourg

Thank you much for the ideas for lazy evaluation, but I will leave pondering about that for another day.