mdsumner / ghrsst.coop


exploring python pathways #3

Open mdsumner opened 6 months ago

mdsumner commented 6 months ago

Read the source NetCDF with osgeo.gdal, and (in a very basic vsimem tempfile way) transfer to xarray via rioxarray.

(code assumes GDAL_HTTP_HEADERS is set to your Earthdata "Authorization: Bearer <token>" header, and we must use GDAL >= 3.7 for the simple vrt:// syntax to fix up the extent, crs, and scaling of the ghrsst).

Using arguments of osgeo.gdal.Warp() https://gdal.org/api/python/osgeo.gdal.html#osgeo.gdal.WarpOptions for setting the output grid (this only matters if we try to Read; otherwise it's done with a VRT tempfile). Obviously a front-end would use whatever xarray arguments instead. (I don't know if you can set the crs/extent/dimension with that; you can with odc, but not as flexibly as this, I think.)

from datetime import datetime
from os import path
from osgeo import gdal
from osgeo import gdalconst

## note that GDAL_HTTP_HEADERS="Authorization: Bearer <token>" must be set, see https://urs.earthdata.nasa.gov/documentation/for_users/user_token
def open_ghrsst(datestring, subdataset = "analysed_sst"):
    gdal.UseExceptions()
    ##datestring = '2002-06-01'
    dt    = datetime.strptime(datestring, '%Y-%m-%d')
    year  = datetime.strftime(dt, "%Y")
    month = datetime.strftime(dt, "%m")
    day   = datetime.strftime(dt, "%d")
    jday  = datetime.strftime(dt, "%j")

    filename = f'/vsicurl/https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/{year}{month}{day}090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc'

    if subdataset == "analysed_sst":
        dsn = f"vrt://NetCDF:\"{filename}\":{subdataset}?a_srs=OGC:CRS84&a_ullr=-180,89.9905,180,-89.9905&a_scale=0.001&a_offset=25"
    else:
        dsn = f"vrt://NetCDF:\"{filename}\":{subdataset}?a_srs=OGC:CRS84&a_ullr=-180,89.9905,180,-89.9905"
    return(gdal.Open(dsn))

def read_ghrsst(datestring, subdataset = "analysed_sst", **kwargs): 
    ds = open_ghrsst(datestring, subdataset)
    wds = gdal.Warp("/vsimem/ghrsst.vrt", ds, **kwargs)
    return(wds)

## relevant args are outputBounds, width, height (can leave one empty for proper aspect ratio)
## dstSRS, xRes, yRes, resampleAlg, etc. which correspond to gdalwarp opts
#ds = read_ghrsst("2023-12-01", outputBounds = [-93,  41, -76, 49], width = 512)
ds = read_ghrsst("2023-12-01", outputBounds = [-93,  46, -86, 49], width = 512)

## to xarray?
import xarray
import rioxarray
rioxarray.open_rasterio(ds.GetDescription())

@cboettig a small start on a family of examples I'd like to flesh out, it's not very fast to read (which is why I'm working on COG versions of these)

mdsumner commented 6 months ago

Now, rather than poke around with an in-memory osgeo.gdal dataset, return the VRT text and show examples of longlat and projected:

from datetime import datetime
from os import path
from osgeo import gdal
from osgeo import gdalconst

## note that GDAL_HTTP_HEADERS="Authorization: Bearer <token>" must be set, see https://urs.earthdata.nasa.gov/documentation/for_users/user_token
def open_ghrsst(datestring, subdataset = "analysed_sst"):
    gdal.UseExceptions()
    ##datestring = '2002-06-01'
    dt    = datetime.strptime(datestring, '%Y-%m-%d')
    year  = datetime.strftime(dt, "%Y")
    month = datetime.strftime(dt, "%m")
    day   = datetime.strftime(dt, "%d")
    jday  = datetime.strftime(dt, "%j")
    #archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/
    #filename = f'/vsicurl/https://podaac-opendap.jpl.nasa.gov/opendap/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/{year}/{jday}/{year}{month}{day}090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc'
    filename = f'/vsicurl/https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/{year}{month}{day}090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc'
    if subdataset == "analysed_sst":
        dsn = f"vrt://NetCDF:\"{filename}\":{subdataset}?a_srs=OGC:CRS84&a_ullr=-180,89.9905,180,-89.9905&a_scale=0.001&a_offset=25"
    else:
        dsn = f"vrt://NetCDF:\"{filename}\":{subdataset}?a_srs=OGC:CRS84&a_ullr=-180,89.9905,180,-89.9905"
    return(gdal.Open(dsn))

def read_ghrsst(datestring, subdataset = "analysed_sst", **kwargs): 
    ds = open_ghrsst(datestring, subdataset)
    wds = gdal.Warp("/tmp/ghrsst.vrt", ds, **kwargs)
    vrt_out = wds.GetMetadata("xml:VRT")[0]
    wds.Close()
    return(vrt_out)

#ds = read_ghrsst("2023-12-01", outputBounds = [-93,  41, -76, 49], width = 512)
import xarray
import rioxarray

## this is the actual VRT text, the contents of a file.vrt
vrt = read_ghrsst("2023-12-01", outputBounds = [-93,  46, -86, 49], width = 512)
xarray1 = rioxarray.open_rasterio(vrt)

vrt = read_ghrsst("2023-12-01", outputBounds = [350000 * x for x in [-1, -1, 1, 1]], dstSRS = "+proj=laea +lon_0=-90 +lat_0=47.5")
xarray2 = rioxarray.open_rasterio(vrt)
mdsumner commented 6 months ago

wrapping that up in R so I can make the map I want ...

It's so much faster from the COG (note that it's all fast until we ask for the data, because everything is virtualized until we go for the ".data" property of the xarray to get actual values). I'll work on more comparable examples.

py <- "
from datetime import datetime
from os import path
from osgeo import gdal
from osgeo import gdalconst

## note that GDAL_HTTP_HEADERS='Authorization: Bearer <token>' must be set, see https://urs.earthdata.nasa.gov/documentation/for_users/user_token
def open_ghrsst(datestring, subdataset = 'analysed_sst'):
    gdal.UseExceptions()
    dt    = datetime.strptime(datestring, '%Y-%m-%d')
    year  = datetime.strftime(dt, '%Y')
    month = datetime.strftime(dt, '%m')
    day   = datetime.strftime(dt, '%d')
    jday  = datetime.strftime(dt, '%j')
    filename = f'/vsicurl/https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/{year}{month}{day}090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc'
    if subdataset == 'analysed_sst':
        dsn = f'vrt://NetCDF:{filename}:{subdataset}?a_srs=OGC:CRS84&a_ullr=-180,89.9905,180,-89.9905&a_scale=0.001&a_offset=25'
    else:
        dsn = f'vrt://NetCDF:{filename}:{subdataset}?a_srs=OGC:CRS84&a_ullr=-180,89.9905,180,-89.9905'
    return(gdal.Open(dsn))

def read_ghrsst(datestring, subdataset = 'analysed_sst', **kwargs): 
    ds = open_ghrsst(datestring, subdataset)
    wds = gdal.Warp('/tmp/ghrsst.vrt', ds, **kwargs)
    vrt_out = wds.GetMetadata('xml:VRT')[0]
    wds.Close()
    return(vrt_out)

#ds = read_ghrsst('2023-12-01', outputBounds = [-93,  41, -76, 49], width = 512)
import xarray
import rioxarray

## this is the actual VRT text, the contents of a file.vrt
vrt = read_ghrsst('2023-12-01', outputBounds = [-93,  46, -92, 47], width = 512)
xarray1 = rioxarray.open_rasterio(vrt)

vrt = read_ghrsst('2023-12-01', outputBounds = [35000 * x for x in [-1, -1, 1, 1]], dstSRS = '+proj=laea +lon_0=-90 +lat_0=47.5')
xarray2 = rioxarray.open_rasterio(vrt)

## or, forget all this and hit up a COG!
xarray3 = rioxarray.open_rasterio('vrt:///vsicurl/https://github.com/mdsumner/cog-example/raw/main/cog/ghrsst/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.tif?projwin=-93,49,-76,41')
"

x <- reticulate::py_run_string(py)

extent_from_xarray <- function(x) {
    xx <- x$x$values
    yy <- x$y$values
    resx <- diff(xx[1:2])
    resy <- diff(yy[1:2])
    c(xx[1] - resx/2, xx[length(xx)] + resx/2, yy[1] - resy/2, yy[length(yy)] + resy/2)
}
dim_from_xarray <- function(x) {
    dim <- x$shape
    unlist(dim)[3:2]
}
mat_from_xarray <- function(x) {
    mat <- x$data[1L,,]
    mat[dim(mat)[1]:1, ]
}

ex <- extent_from_xarray(x$xarray3)
ximage::ximage(mat_from_xarray(x$xarray3), ex, asp = 1/mean(ex[3:4] * pi/180), col = hcl.colors(64))
library(mapdata)
#> Loading required package: maps
maps::map("worldHires", add = T)

Created on 2023-12-24 with reprex v2.0.2

mdsumner commented 6 months ago

I have to sort out the nodata and scaling handling. I sort of expected rioxarray to do that, but no; hardcoded it is, this to get Celsius from the GHRSST:

ximage::ximage(mat_from_xarray(x$xarray3) / 1000 + 25, ex, asp = 1/mean(ex[3:4] * pi/180), col = hcl.colors(64), zlim = c(2.649, 15.037))

[image]

cboettig commented 6 months ago

This is very cool, I'll have to play with these examples. I'm still curious about scaling across large time slices and the relative performance. I see https://github.com/espm-157/nasa-topst-env-justice/blob/main/drafts/earthaccess.ipynb

mdsumner commented 6 months ago

ah ok I'll look at that - the scaling here was to keep the integer packing of GHRSST (better than compressing doubles) but to change the offset so that the output was Celsius, not Kelvin - best of both worlds in my opinion (but there's a mix of support for scaling on read that I'm not across yet - terra and sf do, vapour does not ... for example).
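The arithmetic behind that offset trick, assuming MUR's documented native packing (scale 0.001, Kelvin offset 298.15), is a one-liner worth checking:

```python
# MUR GHRSST stores analysed_sst as packed integers with scale 0.001 and a
# Kelvin offset of 298.15 (per the product metadata). Keeping the scale but
# declaring a_offset=25 makes the same packed integers unscale straight to
# Celsius, since 298.15 K - 273.15 = 25 C.
NATIVE_SCALE = 0.001
NATIVE_OFFSET_K = 298.15                     # native offset (Kelvin)
CELSIUS_OFFSET = NATIVE_OFFSET_K - 273.15    # the a_offset=25 used above

def unscale(raw, scale, offset):
    """Unpack a stored integer into a physical value."""
    return raw * scale + offset

raw = 1234  # an arbitrary packed value
kelvin = unscale(raw, NATIVE_SCALE, NATIVE_OFFSET_K)
celsius = unscale(raw, NATIVE_SCALE, CELSIUS_OFFSET)
assert abs((kelvin - 273.15) - celsius) < 1e-9
```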

mdsumner commented 6 months ago

here's an example using our (IDEA program, AAD) catalog of OISST files - we keep these locally for now but could replace them with the cloud-ready netcdf urls

here speed is hardly a problem because the files are so tiny (1440x720, vs the 36000x17999 GHRSST), so the netcdf limitations don't really matter

next I will do some warping to flip the hemispheres to Atlantic view, reproject and subset, etc.

## this is a nascent project in the IDEA program, to publish our catalog of sources (to externalize our internal tooling)
idea <- arrow::read_parquet("https://github.com/mdsumner/idea_sources_public/raw/main/idea_sources_public.parquet")
files <- idea

pattern <- c("avhrr", "^.*www.ncei.noaa.gov.*sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/.*\\.nc$")
for (pat in pattern) {
    files <- dplyr::filter(files, stringr::str_detect(file, pat))
}

files$dsn <- sprintf("/vsicurl/https://%s", files$file)
files <- dplyr::mutate(files, date = as.POSIXct(as.Date(stringr::str_extract(basename(file), 
        "[0-9]{8}"), "%Y%m%d"), tz = "UTC")) |> 
    dplyr::distinct(date, .keep_all = TRUE) |> dplyr::arrange( date) |> 
    dplyr::transmute(dsn, date)

## add sds and crs so we can warp easy
make_sds <- function(x, sds = c("sst", "anom", "err", "ice")) {
    sds <- match.arg(sds)
    sprintf("vrt://NETCDF:\"%s\":%s?a_srs=OGC:CRS84", x, sds)
}

files$sds <- make_sds(files$dsn, "sst")

rioxarray <- reticulate::import("rioxarray")                   
gdal <- reticulate::import("osgeo.gdal")

## do this weird roundtrip to avoid this error:
#rioxarray$open_rasterio(tail(files$sds, 1), decode_times = FALSE)
#Error: KeyError: 'NETCDF_DIM_time_VALUES'

i <- nrow(files)  ## get the latest
dsn <- files$sds[i]
ds <- gdal$Translate("/tmp/oisst.vrt", gdal$Open(dsn))
ds$SetMetadata(reticulate::py_none())
#> [1] 0

r <- rioxarray$open_rasterio(ds$GetMetadata("xml:VRT"), mask_and_scale = TRUE)

extent_from_xarray <- function(x) {
    xx <- x$x$values
    yy <- x$y$values
    resx <- diff(xx[1:2])
    resy <- diff(yy[1:2])
    c(xx[1] - resx/2, xx[length(xx)] + resx/2, yy[1] - resy/2, yy[length(yy)] + resy/2)
}
dim_from_xarray <- function(x) {
    dim <- x$shape
    unlist(dim)[3:2]
}
mat_from_xarray <- function(x) {
    mat <- x$data[1L,,]
    mat[dim(mat)[1]:1, ]
}

ximage::ximage(mat_from_xarray(r), extent_from_xarray(r), col = hcl.colors(64))
title(files$date[i])

Created on 2023-12-24 with reprex v2.0.2

mdsumner commented 6 months ago

the warper is buried under .rio.reproject, which is entirely geared to transforming an entire grid to a crs ... but we can generate the rasterio Affine (note that its argument order is different to the GDAL geotransform) and pass that in with the crs and the dimension (shape) for our target grid:
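The ordering difference can be sketched without either library: a GDAL geotransform reads (xmin, xres, xskew, ymax, yskew, yres), while the affine/rasterio Affine constructor wants (xres, xskew, xmin, yskew, yres, ymax). A tiny illustrative reordering helper:

```python
def gt_to_affine_args(gt):
    """Reorder a GDAL geotransform
    (xmin, xres, xskew, ymax, yskew, yres)
    into rasterio/affine argument order
    (xres, xskew, xmin, yskew, yres, ymax)."""
    xmin, xres, xskew, ymax, yskew, yres = gt
    return (xres, xskew, xmin, yskew, yres, ymax)

# a north-up 0.25-degree global grid
gt = (-180.0, 0.25, 0.0, 90.0, 0.0, -0.25)
args = gt_to_affine_args(gt)
assert args == (0.25, 0.0, -180.0, 0.0, -0.25, 90.0)
```

(the affine package also ships `Affine.from_gdal()`, which does this same reordering for you)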

## this is a nascent project in the IDEA program, to publish our catalog of sources (to externalize our internal tooling)
idea <- arrow::read_parquet("https://github.com/mdsumner/idea_sources_public/raw/main/idea_sources_public.parquet")
files <- idea

pattern <- c("avhrr", "^.*www.ncei.noaa.gov.*sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/.*\\.nc$")
for (pat in pattern) {
    files <- dplyr::filter(files, stringr::str_detect(file, pat))
}

files$dsn <- sprintf("/vsicurl/https://%s", files$file)
files <- dplyr::mutate(files, date = as.POSIXct(as.Date(stringr::str_extract(basename(file), 
        "[0-9]{8}"), "%Y%m%d"), tz = "UTC")) |> 
    dplyr::distinct(date, .keep_all = TRUE) |> dplyr::arrange( date) |> 
    dplyr::transmute(dsn, date)

## add sds and crs so we can warp easy
make_sds <- function(x, sds = c("sst", "anom", "err", "ice")) {
    sds <- match.arg(sds)
    sprintf("vrt://NETCDF:\"%s\":%s?a_srs=OGC:CRS84", x, sds)
}

files$sds <- make_sds(files$dsn, "sst")

rioxarray <- reticulate::import("rioxarray")                   
gdal <- reticulate::import("osgeo.gdal")

## do this weird roundtrip to avoid this error:
#rioxarray$open_rasterio(tail(files$sds, 1), decode_times = FALSE)
#Error: KeyError: 'NETCDF_DIM_time_VALUES'

i <- nrow(files)  ## get the latest
dsn <- files$sds[i]
ds <- gdal$Translate("/tmp/oisst.vrt", gdal$Open(dsn))
ds$SetMetadata(reticulate::py_none())
#> [1] 0

r <- rioxarray$open_rasterio(ds$GetMetadata("xml:VRT"), mask_and_scale = TRUE)
a <- r$to_numpy()

extent_from_xarray <- function(x) {
    xx <- x$x$values
    yy <- x$y$values
    resx <- diff(xx[1:2])
    resy <- diff(yy[1:2])
    c(xx[1] - resx/2, xx[length(xx)] + resx/2, yy[1] - resy/2, yy[length(yy)] + resy/2)
}
dim_from_xarray <- function(x) {
    dim <- x$shape
    unlist(dim)[3:2]
}
mat_from_xarray <- function(x) {
    mat <- x$data[1L,,]
    mat[dim(mat)[1]:1, ]
}

ximage::ximage(mat_from_xarray(r), extent_from_xarray(r), col = hcl.colors(64))
title(files$date[i])


## this works
rr <- r$rio$reproject("+proj=laea +lon_0=147 +lat_0=-42")
ximage::ximage(mat_from_xarray(rr), extent_from_xarray(rr), col = hcl.colors(64), asp = 1)


## but we want to get a specific target grid, so 
ex <- c(-1, 1, -1, 1) * 6e6
## we'll have to do our own snap/crop/aspect ratio stuff
transform <- affinity::extent_dim_to_gt(ex, c(1024, 1024))
## rasterio uses a different order to GDAL (lol aint life fun)
affine <- reticulate::import("affine")
aff <- affine$Affine( transform["xres"], 0, transform["xmin"], 0, transform["yres"], transform["ymax"])
rr <- r$rio$reproject("+proj=laea +lon_0=147 +lat_0=-42" , transform = aff, shape =  reticulate::tuple(1024L, 1024L))
ximage::ximage(mat_from_xarray(rr), extent_from_xarray(rr), col = hcl.colors(64), asp = 1)

Created on 2023-12-24 with reprex v2.0.2

mdsumner commented 6 months ago

it's almost fast enough to animate

## but we want to get a specific target grid, so 
ex <- c(-1, 1, -1, 1) * 5e6
## we'll have to do our own snap/crop/aspect ratio stuff
dm <- c(1024L, 1024L)%/%2L
transform <- affinity::extent_dim_to_gt(ex, dm)
## rasterio uses a different order to GDAL (lol aint life fun)
affine <- reticulate::import("affine")
aff <- affine$Affine( transform["xres"], 0, transform["xmin"], 0, transform["yres"], transform["ymax"])
for (lon in seq(-180, 180, by = 5)) {
rr <- r$rio$reproject(sprintf("+proj=stere +lon_0=%i +lat_0=-90", lon) , transform = aff, shape =  reticulate::tuple(dm[1L], dm[2L]))
ximage::ximage(mat_from_xarray(rr), extent_from_xarray(rr), col = hcl.colors(64), asp = 1)
}
mdsumner commented 6 months ago

so, I followed your example at https://github.com/espm-157/nasa-topst-env-justice/blob/main/drafts/earthaccess.ipynb and went a bit sideways exploring how to load with rioxarray. Here I get all the local GHRSST cogs and set them up with the new (since 3.7) vrt syntax for projwin and unscale (note that mine are scaled to emit Celsius, not Kelvin, which is how I'd like the online COGs to work too). I am happy to see that rasterio does faithfully pass the gdal stuff along, so there's a lot of good here.

So this is fast because it's only over NFS on a local network, but I think it will be a fair comparison for actual online COGs where you can run the VMs network-close. The only step that takes any real time is the reduce by std().

Still got a lot to think about and learn in terms of nodata and masks (I'm not applying the ice mask but that would be good to do, especially where there is some ice)

The helpers are a bit of a pain because xarray does that netcdf thing where the y coords can be increasing or decreasing, so we have to handle that case based on the sign of the yres.
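The same direction handling, as a Python counterpart to the R helper (a minimal sketch; real code would pull `xx` and `yy` from the xarray object's `.x.values` and `.y.values`):

```python
def extent_from_centers(xx, yy):
    """Cell-edge extent (xmin, xmax, ymin, ymax) from 1D coordinate centers,
    allowing the y coordinate to be stored increasing or decreasing."""
    rx = xx[1] - xx[0]
    ry = yy[1] - yy[0]
    xmin, xmax = xx[0] - rx / 2, xx[-1] + rx / 2
    if ry < 0:  # decreasing y: the first value is the top row
        ymin, ymax = yy[-1] + ry / 2, yy[0] - ry / 2
    else:
        ymin, ymax = yy[0] - ry / 2, yy[-1] + ry / 2
    return (xmin, xmax, ymin, ymax)

# both orientations should give the same extent
north_up = extent_from_centers([0.5, 1.5, 2.5], [11.5, 10.5])
south_up = extent_from_centers([0.5, 1.5, 2.5], [10.5, 11.5])
assert north_up == south_up == (0.0, 3.0, 10.0, 12.0)
```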

extent_from_xarray <- function(x) {
    xx <- x$x$values
    yy <- x$y$values
    resx <- diff(xx[1:2])
    resy <- diff(yy[1:2])
    if (resy < 0) {
      c(xx[1] - resx/2, xx[length(xx)] + resx/2,  yy[length(yy)] - resy/2, yy[1] + resy/2)

    } else {
    c(xx[1] - resx/2, xx[length(xx)] + resx/2, yy[1] - resy/2, yy[length(yy)] + resy/2)
    }
}
dim_from_xarray <- function(x) {
    dim <- x$shape

    rev(tail(unlist(dim), 2))
}
## this is different from above examples (note that the ys are going the wrong way)
mat_from_xarray2 <- function(x) {
    mat <- x$data
    if (length(dim(mat)) > 2) mat <- mat[1L,,]
    t(mat)
}

rast_from_xarray <- function(x) {
  terra::rast(terra::ext(extent_from_xarray(x)), ncols = dim_from_xarray(x)[1], nrows = dim_from_xarray(x)[2], 
              crs = x$spatial_ref$crs_wkt, vals = as.vector(mat_from_xarray2(x)))

}

library(raadtools)
library(vapour)
files <- ghrsstfiles() |> dplyr::filter(format(date, "%Y") %in% c("2020", "2022"))
dim(files)

ex <- c(-93, -76, 41, 49)
projwin <- paste0(ex[c(1, 4, 2, 3)], collapse = ",")
dsn <- sprintf("vrt://%s?projwin=%s&unscale=true&ot=Float32", files$fullname, projwin)

gdal <- reticulate::import("osgeo.gdal")
gdal$UseExceptions()
ds = gdal$BuildVRT(buildvrt <- tempfile(fileext = ".vrt"), dsn, options = "-separate")
ds$Close()

riods = rioxarray$open_rasterio(buildvrt, chunks = {})

system.time(std <- riods$std("band"))
#  user  system elapsed 
#203.823   9.471 214.289 

rr <- rast_from_xarray(std)
## not really sure why these end up as zeros ... because, the nodata is a constant?
rr[!rr > 0] <- NA
plot(rr, col = hcl.colors(64))

(I keep getting different and confusing answers from test runs versus full runs, so I'll leave that down to confusion to figure out later - this plot in one of the tests gave me confidence ... but this is not an actual reprex)

[image]

Alex taught me to use "chunks = {}" for dask, but I haven't explored this in detail.

cboettig commented 6 months ago

this is cool, thanks! looking forward to playing around with these example more and getting somewhat firm reprex-able benchmarks. (Ideally something I can run on a free codespaces instance in a defined dockerfile environment :-) )

mdsumner commented 6 months ago

oh, I only just realized my {} is an R block and not a python dictionary, so it should probably be reticulate::dict() there for dask

cboettig commented 6 months ago

@mdsumner this is great. I'd really like to figure out how to seed a larger discussion about use of GDAL from python for netcdf data. Studying more examples, it's really started to crystallize for me how python users can consistently use an odc.stac approach for COGs, but for netcdf/zarr/h5/etc switch over to a "pure xarray" approach that doesn't route via GDAL, and I'd love to unpack that a bit more.

I do think there's some logic to the current situation, if only because, from what I can tell, the netcdf / h5 / zarr serializations are being used in a way that doesn't really make them "GDAL-friendly", i.e. they seem to leave out CRS / spatial projection metadata etc in a way that makes them cumbersome for GDAL but matches the "this is just an n-d array of numbers" worldview...

mdsumner commented 6 months ago

yes indeed ... in fact I have a nice blog story fomenting, I'll try to draft it out to see how it sits with you

it's a big story, too much for a single sitting, and I'm just focused on key gaps in the warper api especially, but I have enough workarounds to do some damage across a huge range of the array scene and its "problem" examples

the threads on Pangeo are good; this one I think has a very easy warper answer at the start, but it's more complex where that discussion goes later (and the offshoot discussion), so it's hard to gauge where and when to inject. I had some v good positive feedback from the vid chat a few weeks ago in their "what's next" discussion

https://discourse.pangeo.io/t/xarray-for-raster-data-dems-with-inconsistent-spatial-extent/821

also I had warm nods from NCI, about the community fracture between longlat arrays and map projections grids, I need to reach out there too for ongoing conversation

I thought I had a promising point of contact but sadly I got them offside early on, but also that person has been extremely busy with a new venture and perhaps less able to spend time with outsiders. We need an ally to guide us in 🙂

cboettig commented 6 months ago

I wrote up this draft, curious what you think: https://boettiger-lab.github.io/nasa-topst-env-justice/contents/xarray-benchmark.html Maybe it is too blunt a reaction to that Coiled post, but it tries to make some points I've been trying to make.

If you have a chance, I'd be curious if you might benchmark this operation on your end too? I've embedded timings there, you see the fsspec+xarray approach takes 1hr 16min, while my attempt with xarray+rioxarray+gdal takes about 44 minutes. I imagine it's possible to do significantly faster though, at least with cogs.

mdsumner commented 6 months ago

nice! why don't you use lon/lat based slicing here? does it not work in that context?

dds = ds1.sel(x=slice(18000-9300, 18000-7600), y = slice(9000+4100,9000+4900))
cboettig commented 6 months ago

oh nice, I was hoping you might have an insight on that; it definitely annoyed me. GDAL complains and I suppress the warnings. Maybe just because I am using an old gdal?? My rasterio seems to have gdal 3.6.4

import rasterio
rasterio.__gdal_version__

I'm still learning how to get python stuff to install and play nice...

mdsumner commented 6 months ago

ah yes, I normally grit my teeth with --no-binary ":all:" and go and do something else (I was advised that that is always the best approach, even though it's slow, so docker containerization becomes essential) - but it's also because I'm mostly working "abstractly" atm; I'm exploring bleeding edge GDAL features, not doing actual proc work

cboettig commented 6 months ago

(though on the R side, where I have gdal 3.8, gdal refuses to work without the suffix you gave me for wrapping the URL, like so:

vrt <- function(url) {
  prefix <-  "vrt://NETCDF:/vsicurl/"
  suffix <- ":analysed_sst?a_srs=OGC:CRS84&a_ullr=-180,90,180,-90"
  paste0(prefix, url, suffix)
}

iirc older gdal on R had the same effect, where it could read the file in but just didn't have proper coordinates. I'd love it if it could just work like the xarray fsspec example and get the correct coordinate system and units without me supplying them manually...

mdsumner commented 6 months ago

it should be ok with the extent (like, whatever GDAL finds is close enough ...) but you definitely need VRT of some sort for the CRS to be explicit (the warper doesn't have "a_srs"), and in fact I think I've decided -90,90 is wrong - it should contract by one half-pixel at top and bottom, which I think is the right interpretation of the 17999 rows and the actual values in "lat" (not 18000)

just another small thing: we put in a config for NetCDF to assign longlat when the data looks like it (and the file is not conformant): https://gdal.org/drivers/raster/netcdf.html#config-GDAL_NETCDF_ASSUME_LONGLAT - but I made it too restrictive, so it won't work with MURSST I think (it should allow slightly outside the 180,360,90 range for situations where you can be a tiny bit sloppy)

mdsumner commented 6 months ago

another thing to consider is generating the VRT on a bleeding-edge system and porting that to the actual runner, so something like this will give fully-formed vrt text:

vapour::vapour_vrt(</vsicurl/urls>, projection = "OGC:CRS84", sds = 1, extent = c(-180, 180, -90, 90))

and we could craft trad python/gdal code to do that, probably even with rasterio and older GDAL? that might be useful? You have to open each one ... but I've gotten around that in the past by just opening the first one and templating the rest.
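A hypothetical sketch of that templating idea: generate VRT text for the first file, then substitute the source path for each sibling (this assumes all files share the same grid and band structure, which holds for a homogeneous daily series):

```python
def template_vrt(vrt_text, first_path, other_paths):
    """Produce VRT text for sibling files by substituting the source path
    in VRT XML generated for the first file. Assumes identical grid/bands
    across files (true for a homogeneous daily series)."""
    if first_path not in vrt_text:
        raise ValueError("source path not found in the VRT text")
    return [vrt_text.replace(first_path, p) for p in other_paths]

# toy VRT fragment standing in for real gdal-generated output
vrt = ("<VRTDataset><SourceFilename>/vsicurl/https://example.org/a.nc"
       "</SourceFilename></VRTDataset>")
out = template_vrt(vrt, "/vsicurl/https://example.org/a.nc",
                   ["/vsicurl/https://example.org/b.nc"])
assert "b.nc" in out[0] and "a.nc" not in out[0]
```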

mdsumner commented 6 months ago

it works for me, but the coords are not called "lon","lat" - they're "x","y"

(I picked a short time subset)

dds = ds1.sel(x=slice(-93, -76), y=slice(41, 49))
>>> dds
<xarray.Dataset>
Dimensions:           (time: 5, x: 1700, y: 0)
Coordinates:
  * time              (time) object 2020-01-01 09:00:00 ... 2020-01-05 09:00:00
  * x                 (x) float64 -92.99 -92.98 -92.97 ... -76.02 -76.01 -76.0
  * y                 (y) float64
    spatial_ref       int64 0
Data variables:
    analysed_sst      (time, y, x) float32 dask.array<chunksize=(1, 0, 1535), meta=np.ndarray>
    analysis_error    (time, y, x) float32 dask.array<chunksize=(1, 0, 1535), meta=np.ndarray>
    mask              (time, y, x) float32 dask.array<chunksize=(1, 0, 1700), meta=np.ndarray>
    sea_ice_fraction  (time, y, x) float32 dask.array<chunksize=(1, 0, 1700), meta=np.ndarray>
    dt_1km_data       (time, y, x) float32 dask.array<chunksize=(1, 0, 1700), meta=np.ndarray>
    sst_anomaly       (time, y, x) float32 dask.array<chunksize=(1, 0, 1535), meta=np.ndarray>
Attributes: (12/47)
    acknowledgment:             Please acknowledge the use of these data with...
    cdm_data_type:              grid
    comment:                    MUR = "Multi-scale Ultra-high Resolution"
    Conventions:                CF-1.7
    creator_email:              ghrsst@podaac.jpl.nasa.gov
    creator_name:               JPL MUR SST project
    ...                         ...
    summary:                    A merged, multi-sensor L4 Foundation SST anal...
    time_coverage_end:          20200101T210000Z
    time_coverage_start:        20191231T210000Z
    title:                      Daily MUR SST, Final product
    uuid:                       27665bc0-d5fc-11e1-9b23-0800200c9a66
    westernmost_longitude:      -180

or is it my version/s?

rasterio.show_versions()
rasterio info:
  rasterio: 1.3.9
      GDAL: 3.9.0dev-98c3491967
      PROJ: 9.3.0
      GEOS: 3.12.0
 PROJ DATA: /home/ubuntu/.local/share/proj:/usr/share/proj
 GDAL DATA: None

System:
    python: 3.10.13 (main, Aug 25 2023, 13:20:03) [GCC 9.4.0]
executable: /usr/bin/python3.10
   machine: Linux-5.4.0-167-generic-x86_64-with-glibc2.31

Python deps:
    affine: 2.4.0
     attrs: 23.1.0
   certifi: 2023.11.17
     click: 8.1.7
     cligj: 0.7.2
    cython: None
     numpy: 1.26.2
    snuggs: 1.4.7
click-plugins: None
setuptools: 69.0.2
mdsumner commented 6 months ago

oh, obviously that did not work ... you have to flip the y (aargh, I noticed this recently):

 ds1.sel(x=slice(-93, -76), y=slice(49, 41))
<xarray.Dataset>
Dimensions:           (time: 5, x: 1700, y: 800)
Coordinates:
  * time              (time) object 2020-01-01 09:00:00 ... 2020-01-05 09:00:00
  * x                 (x) float64 -92.99 -92.98 -92.97 ... -76.02 -76.01 -76.0
  * y                 (y) float64 49.0 48.99 48.98 48.97 ... 41.03 41.02 41.01
    spatial_ref       int64 0
Data variables:
    analysed_sst      (time, y, x) float32 dask.array<chunksize=(1, 800, 1535), meta=np.ndarray>
    analysis_error    (time, y, x) float32 dask.array<chunksize=(1, 800, 1535), meta=np.ndarray>
    mask              (time, y, x) float32 dask.array<chunksize=(1, 242, 1700), meta=np.ndarray>
    sea_ice_fraction  (time, y, x) float32 dask.array<chunksize=(1, 242, 1700), meta=np.ndarray>
    dt_1km_data       (time, y, x) float32 dask.array<chunksize=(1, 242, 1700), meta=np.ndarray>
    sst_anomaly       (time, y, x) float32 dask.array<chunksize=(1, 800, 1535), meta=np.ndarray>
Attributes: (12/47)
    acknowledgment:             Please acknowledge the use of these data with...
    cdm_data_type:              grid
    comment:                    MUR = "Multi-scale Ultra-high Resolution"
    Conventions:                CF-1.7
    creator_email:              ghrsst@podaac.jpl.nasa.gov
    creator_name:               JPL MUR SST project
    ...                         ...
    summary:                    A merged, multi-sensor L4 Foundation SST anal...
    time_coverage_end:          20200101T210000Z
    time_coverage_start:        20191231T210000Z
    title:                      Daily MUR SST, Final product
    uuid:                       27665bc0-d5fc-11e1-9b23-0800200c9a66
    westernmost_longitude:      -180
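A small helper makes the slice direction-agnostic, so callers can always pass (low, high); a sketch, where `coords` would be e.g. `ds1.y.values`:

```python
def ordered_slice(coords, lo, hi):
    """Build a slice in the storage order of a 1D coordinate array, so the
    caller need not know whether it runs increasing or decreasing."""
    lo, hi = min(lo, hi), max(lo, hi)
    if coords[0] > coords[-1]:  # decreasing, like these MUR-style latitudes
        return slice(hi, lo)
    return slice(lo, hi)

assert ordered_slice([49.0, 48.99, 41.01], 41, 49) == slice(49, 41)
assert ordered_slice([41.01, 48.99, 49.0], 41, 49) == slice(41, 49)
```

so something like `ds1.sel(x=slice(-93, -76), y=ordered_slice(ds1.y.values, 41, 49))` would work either way.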
mdsumner commented 6 months ago

this is all pretty fantastic, and I would aspire to replace my R tools (currently using raster ...) with reticulate hooks into xarray, but I can't see it being worthwhile unless the proc can actually run close to the data (so maybe next I'll be into OpenEO and similar) - I can't really see myself setting up systems that spin up remote VMs with terraform (though I'm doing that for personal exploration)

cboettig commented 6 months ago

Oh thanks! That flipped y totally threw me. Compiled rasterio from source against gdal 3.8.2 and used your flipped-y subset for the latitudes and all is well.

I'm benchmarking now from a VM inside AWS us-west-2 where the data lives, and it's not all that much faster, I think... Still need to understand parallelization and use of ram / dask in this context; it seems to use more RAM than I expected. Also there's a lot of the run time where it's just sitting there with low cpu % use and very low network bandwidth use, just wasting time, not sure what's going on there. And even when eth0 traffic gets going, it's order 100 Mb/s, which is something like 1% of the max speed on that machine....

mdsumner commented 6 months ago

do you know how to tell xarray/rioxarray to open a multiband dataset and treat band as time?

mdsumner commented 6 months ago

oh, actually band is a dimension already - I just need squeeze to drop the degenerate "time"
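That would be something like `.squeeze("time")` on the opened dataset (the exact dim name depends on how the file opens); the shape effect, sketched with numpy:

```python
import numpy as np

# a (time, band, y, x) style array with a degenerate length-1 time axis,
# standing in for the opened dataset
arr = np.zeros((1, 3, 4, 5))
squeezed = arr.squeeze(axis=0)  # drop the length-1 axis, like xarray squeeze
assert squeezed.shape == (3, 4, 5)
```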

mdsumner commented 6 months ago

do you ever see this sort of stuff?

[image]

I've encountered it a few times.

cboettig commented 6 months ago

haven't seen those messages. though I'm not always attentive to messages. don't suppose they could be related to network issues?

mdsumner commented 6 months ago

I don't know, it gets in the way tho!

Currently stuck on passing literal vrt text in memory to the dataset open; it should just work but might need some fsspec magic?

mdsumner commented 6 months ago

ah, well these both do work (vsicurl to a derived VRT, and that VRT text in mem)

(I think I'm getting confused about when the auth is working, which happens when I try to work with temp files and so on, switching between the langs)

dsn <- "/vsicurl/https://raw.githubusercontent.com/mdsumner/ghrsst.coop/main/examples/20210101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc.vrt"

library(reticulate)

xarray <- import("xarray")
xarray$open_dataset(dsn, engine = "rasterio", chunks = {})

tx <- readr::read_file(gsub("/vsicurl/", "", dsn))
xarray$open_dataset(tx, engine = "rasterio", chunks = {})
# <xarray.Dataset>
# Dimensions:      (band: 1, x: 1700, y: 800)
# Coordinates:
#   * band         (band) int64 1
#   * x            (x) float64 -93.0 -92.99 -92.98 -92.97 ... -76.03 -76.02 -76.01
#   * y            (y) float64 49.0 48.99 48.98 48.97 ... 41.04 41.03 41.02 41.01
#     spatial_ref  int64 ...
# Data variables:
#     band_data    (band, y, x) float32 ...
# 
mdsumner commented 6 months ago

ah, it doesn't like the VRT text in memory with open_mfdataset

dsn <- "/vsicurl/https://raw.githubusercontent.com/mdsumner/ghrsst.coop/main/examples/20210101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc.vrt"
dsn1 <- "/vsicurl/https://raw.githubusercontent.com/mdsumner/ghrsst.coop/main/examples/20210102090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc.vrt"
library(reticulate)

xarray <- import("xarray")
xarray$open_dataset(dsn, engine = "rasterio", chunks = {})

tx <- readr::read_file(gsub("/vsicurl/", "", dsn))
tx1 <- c(readr::read_file(gsub("/vsicurl/", "", dsn)), readr::read_file(gsub("/vsicurl/", "", dsn1)))

xarray$open_mfdataset(tx1, engine = "rasterio", chunks = {})
Error: basic_string::substr: __pos (which is 1684) > this->size() (which is 1301)

and it doesn't like non-file files

xarray$open_mfdataset(c(dsn, dsn1), engine = "rasterio", chunks = {})
Error: FileNotFoundError: [Errno 2] No such file or directory: '/vsicurl/https://raw.githubusercontent.com/mdsumner/ghrsst.coop/main/examples/20210101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc.vrt'
mdsumner commented 6 months ago

ah, but this works!! so we can force it with temp VRT files or VRT urls without vsicurl (though it's ok with those in the singular case)


urls_vrt <- c("https://raw.githubusercontent.com/mdsumner/ghrsst.coop/main/examples/20210101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc.vrt", "https://raw.githubusercontent.com/mdsumner/ghrsst.coop/main/examples/20210102090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc.vrt")

xarray$open_mfdataset(urls_vrt, engine = "rasterio", chunks = {},   concat_dim="band", 
                            combine="nested")
<xarray.Dataset>
Dimensions:      (band: 2, x: 1700, y: 800)
Coordinates:
  * band         (band) int64 1 1
  * x            (x) float64 -93.0 -92.99 -92.98 -92.97 ... -76.03 -76.02 -76.01
  * y            (y) float64 49.0 48.99 48.98 48.97 ... 41.04 41.03 41.02 41.01
    spatial_ref  int64 0
Data variables:
    band_data    (band, y, x) float32 dask.array<chunksize=(1, 128, 128), meta=np.ndarray>
mdsumner commented 6 months ago

but no, looks like a bust - we can't use the subdataset syntax - so we can't use GDAL for the virtual crop ... which means we also can't warp to arbitrary grids in an xarray context ... not for these files at least

## created "files" with two years of these cropped SDS VRTs: 
## ..
##vrt/20221220090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc.vrt
##vrt/20221221090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc.vrt

ds <- xarray$open_mfdataset(files, engine = "rasterio", concat_dim = "band", combine = "nested", chunks = {})
std <- ds$std()
system.time(std$compute())
ERROR 4: NETCDF:/vsicurl/https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20220527090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc:analysed_sst: No such file or directory
Error: rasterio.errors.RasterioIOError: Read or write failed. NETCDF:/vsicurl/https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20220527090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc:analysed_sst: No such file or directoryTiming stopped at: 25.74 0 35.15
ERROR 4: NETCDF:/vsicurl/https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20220526090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc:analysed_sst: No such file or directory
mdsumner commented 6 months ago

I just forcibly removed relevant packages and reinstalled from source (python3.10) it was pretty painless:

sudo python3.10 -m pip uninstall rioxarray pandas xarray rasterio <etc> 
## but do that again in case there are crufty lib paths
sudo python3.10 -m pip uninstall rioxarray pandas xarray rasterio <etc> 
## do again until none found

## then
sudo python3.10 -m pip install rioxarray pandas xarray rasterio <etc> --no-binary ":all:"

now afaict everything is very up to date (I'm on a system with gdal-from-source), next is to make sure the HDF5 is up to date and find out where that error messaging is coming from: python, netcdf, hdf itself, or gdal

mdsumner commented 6 months ago

ok!! so we can do VRT text inline in memory, but it has to be a list() from R - is there a scalar/list thing in python that this treads on for open_mfdataset?

we still can't do actual data proc though, it's still looking for a file so that's a lazy-ops request for xarray I think to stop looking for files, why doesn't it ask gdal for the array data when the time comes?

this is excellent, creating files is a pita

later I'll worry about band being a dimension and whether we can restore that (I can't see in GDAL how to make an mdim object from multiple files so I haven't explored that)

The step of using inline text read from VRT urls can be replaced by f string templating the VRT (so we can do this in pre vrt:// versions of GDAL too).
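A minimal Python sketch of that f-string templating, using the geotransform, window, and scale/offset values copied from the example VRT shown below (the helper name `mur_vrt` and its defaults are illustrative, not from the repo):

```python
# Sketch: build the cropped-SDS VRT inline with an f-string instead of
# fetching pre-made .vrt files. All numeric values are copied from the
# example VRT in this thread; only the date varies per granule.
def mur_vrt(date_compact, x_off=8699, y_off=4099, x_size=1700, y_size=800):
    src = ("NETCDF:/vsicurl/https://archive.podaac.earthdata.nasa.gov/"
           "podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/"
           f"{date_compact}090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc"
           ":analysed_sst")
    return f"""<VRTDataset rasterXSize="{x_size}" rasterYSize="{y_size}">
  <SRS dataAxisToSRSAxisMapping="1,2">GEOGCS["WGS 84 (CRS84)",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563]],PRIMEM["Greenwich",0],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AXIS["Longitude",EAST],AXIS["Latitude",NORTH],AUTHORITY["OGC","CRS84"]]</SRS>
  <GeoTransform> -9.3005004165841427e+01, 1.0000000152592130e-02, 0, 4.9004998836693254e+01, 0, -9.9999997626146822e-03</GeoTransform>
  <VRTRasterBand dataType="Int16" band="1">
    <NoDataValue>-32768</NoDataValue>
    <Offset>298.15</Offset>
    <Scale>0.001</Scale>
    <SimpleSource>
      <SourceFilename relativeToVRT="0">{src}</SourceFilename>
      <SourceBand>1</SourceBand>
      <SourceProperties RasterXSize="36000" RasterYSize="17999" DataType="Int16" BlockXSize="2047" BlockYSize="1023" />
      <SrcRect xOff="{x_off}" yOff="{y_off}" xSize="{x_size}" ySize="{y_size}" />
      <DstRect xOff="0" yOff="0" xSize="{x_size}" ySize="{y_size}" />
    </SimpleSource>
  </VRTRasterBand>
</VRTDataset>"""

vrts = [mur_vrt(d) for d in ("20210101", "20210102")]
```

Each element of `vrts` can then go straight into `open_mfdataset(..., engine="rasterio")` as inline text, with no intermediate files.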

dsnlist <- c("/vsicurl/https://raw.githubusercontent.com/mdsumner/ghrsst.coop/main/examples/20210101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc.vrt", 
         "/vsicurl/https://raw.githubusercontent.com/mdsumner/ghrsst.coop/main/examples/20210102090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc.vrt")

urllist <- gsub("/vsicurl/", "", dsnlist)
txtlist <- vapply(urllist, function(.x) readr::read_file(.x), "")

writeLines(txtlist[1])
#> <VRTDataset rasterXSize="1700" rasterYSize="800">
#>   <SRS dataAxisToSRSAxisMapping="1,2">GEOGCS["WGS 84 (CRS84)",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563]],PRIMEM["Greenwich",0],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AXIS["Longitude",EAST],AXIS["Latitude",NORTH],AUTHORITY["OGC","CRS84"]]</SRS>
#>   <GeoTransform> -9.3005004165841427e+01,  1.0000000152592130e-02,  0.0000000000000000e+00,  4.9004998836693254e+01,  0.0000000000000000e+00, -9.9999997626146822e-03</GeoTransform>
#>     <VRTRasterBand dataType="Int16" band="1">
#>       <NoDataValue>-32768</NoDataValue>
#>     <Offset>298.15</Offset>
#>     <Scale>0.001</Scale>
#>     <SimpleSource>
#>       <SourceFilename relativeToVRT="0">NETCDF:/vsicurl/https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20210101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc:analysed_sst</SourceFilename>
#>       <SourceBand>1</SourceBand>
#>       <SourceProperties RasterXSize="36000" RasterYSize="17999" DataType="Int16" BlockXSize="2047" BlockYSize="1023" />
#>       <SrcRect xOff="8699" yOff="4099" xSize="1700" ySize="800" />
#>       <DstRect xOff="0" yOff="0" xSize="1700" ySize="800" />
#>     </SimpleSource>
#>   </VRTRasterBand>
#> </VRTDataset>
library(reticulate)
rioxarray <- import("rioxarray")
rasterio <- import("rasterio")
xarray <- import("xarray")
rioxarray$open_rasterio(txtlist[1])
#> <xarray.DataArray (band: 1, y: 800, x: 1700)>
#> [1360000 values with dtype=int16]
#> Coordinates:
#>   * band         (band) int64 1
#>   * x            (x) float64 -93.0 -92.99 -92.98 -92.97 ... -76.03 -76.02 -76.01
#>   * y            (y) float64 49.0 48.99 48.98 48.97 ... 41.04 41.03 41.02 41.01
#>     spatial_ref  int64 0
#> Attributes:
#>     _FillValue:    -32768
#>     scale_factor:  0.001
#>     add_offset:    298.15
rioxarray$open_rasterio(urllist[1])
#> <xarray.DataArray (band: 1, y: 800, x: 1700)>
#> [1360000 values with dtype=int16]
#> Coordinates:
#>   * band         (band) int64 1
#>   * x            (x) float64 -93.0 -92.99 -92.98 -92.97 ... -76.03 -76.02 -76.01
#>   * y            (y) float64 49.0 48.99 48.98 48.97 ... 41.04 41.03 41.02 41.01
#>     spatial_ref  int64 0
#> Attributes:
#>     _FillValue:    -32768
#>     scale_factor:  0.001
#>     add_offset:    298.15
rioxarray$open_rasterio(dsnlist[1])
#> <xarray.DataArray (band: 1, y: 800, x: 1700)>
#> [1360000 values with dtype=int16]
#> Coordinates:
#>   * band         (band) int64 1
#>   * x            (x) float64 -93.0 -92.99 -92.98 -92.97 ... -76.03 -76.02 -76.01
#>   * y            (y) float64 49.0 48.99 48.98 48.97 ... 41.04 41.03 41.02 41.01
#>     spatial_ref  int64 0
#> Attributes:
#>     _FillValue:    -32768
#>     scale_factor:  0.001
#>     add_offset:    298.15

rasterio$open(txtlist[1])
#> <open DatasetReader name='<VRTDataset rasterXSize="1700" rasterYSize="800">
#>   <SRS dataAxisToSRSAxisMapping="1,2">GEOGCS["WGS 84 (CRS84)",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563]],PRIMEM["Greenwich",0],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AXIS["Longitude",EAST],AXIS["Latitude",NORTH],AUTHORITY["OGC","CRS84"]]</SRS>
#>   <GeoTransform> -9.3005004165841427e+01,  1.0000000152592130e-02,  0.0000000000000000e+00,  4.9004998836693254e+01,  0.0000000000000000e+00, -9.9999997626146822e-03</GeoTransform>
#>     <VRTRasterBand dataType="Int16" band="1">
#>       <NoDataValue>-32768</NoDataValue>
#>     <Offset>298.15</Offset>
#>     <Scale>0.001</Scale>
#>     <SimpleSource>
#>       <SourceFilename relativeToVRT="0">NETCDF:/vsicurl/https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20210101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc:analysed_sst</SourceFilename>
#>       <SourceBand>1</SourceBand>
#>       <SourceProperties RasterXSize="36000" RasterYSize="17999" DataType="Int16" BlockXSize="2047" BlockYSize="1023" />
#>       <SrcRect xOff="8699" yOff="4099" xSize="1700" ySize="800" />
#>       <DstRect xOff="0" yOff="0" xSize="1700" ySize="800" />
#>     </SimpleSource>
#>   </VRTRasterBand>
#> </VRTDataset>
#> ' mode='r'>
rasterio$open(urllist[1])
#> <open DatasetReader name='https://raw.githubusercontent.com/mdsumner/ghrsst.coop/main/examples/20210101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc.vrt' mode='r'>
rasterio$open(dsnlist[1])
#> <open DatasetReader name='/vsicurl/https://raw.githubusercontent.com/mdsumner/ghrsst.coop/main/examples/20210101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc.vrt' mode='r'>

xarray$open_dataset(txtlist[1], engine = "rasterio",  chunks = {})
#> <xarray.Dataset>
#> Dimensions:      (band: 1, x: 1700, y: 800)
#> Coordinates:
#>   * band         (band) int64 1
#>   * x            (x) float64 -93.0 -92.99 -92.98 -92.97 ... -76.03 -76.02 -76.01
#>   * y            (y) float64 49.0 48.99 48.98 48.97 ... 41.04 41.03 41.02 41.01
#>     spatial_ref  int64 ...
#> Data variables:
#>     band_data    (band, y, x) float32 ...

## this is good, but note use of as.list (!!)
xarray$open_mfdataset(as.list(txtlist), engine = "rasterio", concat_dim = "band", combine = "nested", chunks = {}, parallel = TRUE)
#> <xarray.Dataset>
#> Dimensions:      (band: 2, x: 1700, y: 800)
#> Coordinates:
#>   * band         (band) int64 1 1
#>   * x            (x) float64 -93.0 -92.99 -92.98 -92.97 ... -76.03 -76.02 -76.01
#>   * y            (y) float64 49.0 48.99 48.98 48.97 ... 41.04 41.03 41.02 41.01
#>     spatial_ref  int64 0
#> Data variables:
#>     band_data    (band, y, x) float32 dask.array<chunksize=(1, 128, 128), meta=np.ndarray>

## doesn't like this (looking for a file)
#xarray$open_mfdataset(dsnlist, engine = "rasterio", concat_dim = "band", combine = "nested", chunks = {}, parallel = TRUE)
# Error: FileNotFoundError: [Errno 2] No such file or directory: '/vsicurl/https://raw.githubusercontent.com/mdsumner/ghrsst.coop/main/examples/20210102090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc.vrt'

## this is good
xarray$open_mfdataset(urllist, engine = "rasterio", concat_dim = "band", combine = "nested", chunks = {}, parallel = TRUE)
#> <xarray.Dataset>
#> Dimensions:      (band: 2, x: 1700, y: 800)
#> Coordinates:
#>   * band         (band) int64 1 1
#>   * x            (x) float64 -93.0 -92.99 -92.98 -92.97 ... -76.03 -76.02 -76.01
#>   * y            (y) float64 49.0 48.99 48.98 48.97 ... 41.04 41.03 41.02 41.01
#>     spatial_ref  int64 0
#> Data variables:
#>     band_data    (band, y, x) float32 dask.array<chunksize=(1, 128, 128), meta=np.ndarray>

Created on 2024-01-05 with reprex v2.0.2
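The Scale/Offset/NoDataValue triplet carried by that VRT is what turns the packed Int16 into Kelvin; a hypothetical sketch of what the unscaling amounts to (the constants come from the VRT above, the function name is mine):

```python
import numpy as np

# Unpack MUR analysed_sst the way the VRT metadata describes:
# Kelvin = packed_int16 * 0.001 + 298.15, with -32768 as NoData.
SCALE, OFFSET, NODATA = 0.001, 298.15, -32768

def unpack_sst(packed):
    packed = np.asarray(packed, dtype=np.int16)
    return np.where(packed == NODATA, np.nan, packed * SCALE + OFFSET)

k = unpack_sst([10000, 0, -32768])  # -> ~[308.15, 298.15, nan] Kelvin
```

This is the same arithmetic rioxarray applies when it honours `scale_factor`/`add_offset`, shown here only to make the packed values in the reprex concrete.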

cboettig commented 6 months ago

this sounds promising? the last example looks like it opens the list of urls, right? though the temporal dimension is coming through as "band", right? but you say:

we still can't do actual data proc though, it's still looking for a file so that's a lazy-ops request for xarray I think to stop looking for files, why doesn't it ask gdal for the array data when the time comes?

so it's just reading the vrt metadata and can't access the actual data?

mdsumner commented 6 months ago

yep

worth pursuing 🙏

mdsumner commented 6 months ago

back to R, tile up the target domain and parallelize across the sources (I also scan the 98 tiles and trim that to 74 because some of them have no data, and every time step is the same in that regard)

I'm only doing 20 time steps here as a test ...

this has visited the same data and done an equivalent task for 20 time steps, in 150 seconds - so it's nowhere near as fast but I'll keep exploring - I just like having control over the steps and this will generalize to actual regridding with the warper

dates <- seq(as.Date("2021-01-01"), by = "1 day", length.out = 20)
template <- "https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/%s090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc"
urls <- sprintf(template, format(dates, "%Y%m%d"))

dsn <- sprintf("NETCDF:/vsicurl/%s:analysed_sst", urls)

#info <- vapour::vapour_raster_info(dsn[1L])
## grid we want (we can do exact snapping discussion later)
ex <- c(-93, -76, 41, 49)
res <- c(0.01, 0.01)

#

## so we do
library(vapour)
d <- gdal_raster_data(dsn[1L], target_ext = ex, target_res = res)

## but we want to do 720 files at once, so chunk it up
library(grout)  ## hypertidy/grout
g <- grout(attr(d, "dimension"), extent = attr(d, "extent"), blocksize = c(128, 128))
tind <- tile_index(g)

## scan for which blocks have no data and not read them
test <- rep(TRUE, nrow(tind))
for (i in seq_along(test)) {
  ex <- unlist(tind[i, c("xmin", "xmax", "ymin", "ymax")])

  d <- gdal_raster_data(dsn[1L], target_ext = ex, target_res = res)
  if (all(is.na(d[[1]]))) test[i] <- FALSE
}

tind$test <- test

library(parallel)
cl <- makeCluster(31)
readfun <- function(x) {
    vapour::gdal_raster_data(x, target_ext = extent, target_res = res)
}
#ximage::ximage(readfun(dsn[1]))

tind1 <- tind[tind$test, ]
lstd <- vector("list", nrow(tind1))
for (i in seq_len(nrow(tind1))) {
  extent <- unlist(tind1[i, c("xmin", "xmax", "ymin", "ymax")])
  parallel::clusterExport(cl, c("extent", "res"))

 d <- parLapply(cl, dsn, readfun)
 lstd[[i]] <- apply(matrix(unlist(d), length(d[[1]][[1]])), 1, sd)
}

stopCluster(cl)

rg <- range(unlist(lapply(lstd, range, na.rm = TRUE)))
rg
## now image the grid by part reconstruction
ex <- c(-93, -76, 41, 49)

plot(ex[1:2], ex[3:4], asp = 1/cos(mean(ex[3:4]) * pi/180), type = "n")
for (i in seq_along(lstd)) {
  ximage::ximage(matrix(lstd[[i]], 128, byrow = TRUE), unlist(tind1[i, c("xmin", "xmax", "ymin", "ymax")]), add = TRUE,  zlim = rg, col = hcl.colors(64))  
}

there are slight discrepancies in the tile extents because I haven't used the actual extent to snap our window to (I'll fix that up)
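The grout-style tiling can be sketched language-agnostically; assuming the 1700×800 window at 0.01° resolution with 128×128 blocks, this reproduces the 98-tile count mentioned above (function name and layout are mine, not grout's API):

```python
import math

def tile_index(ncol, nrow, extent, res, block=(128, 128)):
    """Split a grid into block-sized tiles, returning each tile's extent.

    extent = (xmin, xmax, ymin, ymax); res = (xres, yres).
    Edge tiles are clipped to the grid, mirroring grout::tile_index().
    """
    xmin, xmax, ymin, ymax = extent
    bx, by = block
    tiles = []
    for j in range(math.ceil(nrow / by)):        # block rows, from the top
        for i in range(math.ceil(ncol / bx)):    # block columns, from the left
            x0, y0 = i * bx, j * by
            xs = min(bx, ncol - x0)              # clip the ragged edge tiles
            ys = min(by, nrow - y0)
            tiles.append((xmin + x0 * res[0],
                          xmin + (x0 + xs) * res[0],
                          ymax - (y0 + ys) * res[1],
                          ymax - y0 * res[1]))
    return tiles

tiles = tile_index(1700, 800, (-93, -76, 41, 49), (0.01, 0.01))
# 14 block columns x 7 block rows = 98 tiles
```

Deriving the window from the grid index (rather than multiplying the target extent out in floating point) is what keeps the tile extents snapped; the "slight discrepancies" above come from skipping that step.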

(screenshot: tiled standard-deviation map)

mdsumner commented 6 months ago

not clear that tiling so finely is worthwhile when the domain is 1700*800, I'm trying a different size

I'll check the timings on your xarray way too, just need to commit to having that happen - but I'm not great at batch python running, will try to get a bit of experience

mdsumner commented 6 months ago

ok this is good, got a full python batch running now - it took ~100 minutes for the R way (I end up using reticulate to get the url list because templating that doesn't work like it does for the "opendap" urls that are in the python script of this repo)

I realized I don't yet have a sense of the network-lag I'm subject to in Tasmania ...

I will also try running both on my local copies of the COGs and the original netcdfs, which will be closer to the best possible perf

mdsumner commented 6 months ago

the code just because I'm a bit chaotic at this end ...

the python

import earthaccess
import rioxarray
import rasterio
import xarray as xr
import numpy as np
import time
t = time.time()

results = earthaccess.search_data(
    short_name="MUR-JPL-L4-GLOB-v4.1",
    temporal=("2020-01-01", "2021-12-31")
)

import warnings
warnings.filterwarnings('ignore')
import os
from pathlib import Path
cookies = os.path.expanduser("~/.urs_cookies")
Path(cookies).touch()

## pull out the URLs
data_links = [granule.data_links(access="external") for granule in results]
url_links = [f'{link[0]}' for link in data_links]

# and here we go
with rasterio.Env(GDAL_INGESTED_BYTES_AT_OPEN="32000",
                  GDAL_HTTP_MULTIPLEX="YES",
                  GDAL_HTTP_MERGE_CONSECUTIVE_RANGES="YES",
                  GDAL_HTTP_VERSION="2",
                  GDAL_NUM_THREADS="ALL_CPUS",
                  GDAL_DISABLE_READDIR_ON_OPEN="EMPTY_DIR",
                  GDAL_HTTP_COOKIEFILE=cookies, 
                  GDAL_HTTP_COOKIEJAR=cookies, 
                  GDAL_HTTP_NETRC=True):
    ds1 = xr.open_mfdataset(url_links, 
                           engine = "rasterio", 
                           concat_dim="time", 
                           combine="nested",
                           chunks = {}
                           )

dds = ds1.sel(x=slice(-93, -76), y=slice(49, 41))
std = dds.analysed_sst.std("time")

std.compute()
print(std)

a  = std.values

print(np.nanmax(a))
print( time.time() - t)
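As an aside, the earthaccess search needs network access and auth; the granule URLs it returns follow the fixed MUR naming pattern used elsewhere in this thread, so the list can also be templated offline (a sketch under that assumption, not a substitute for the search, which also confirms the granules exist):

```python
from datetime import date, timedelta

# One MUR granule per day, named by its 09:00:00 UTC timestamp.
TEMPLATE = ("https://archive.podaac.earthdata.nasa.gov/"
            "podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/"
            "{:%Y%m%d}090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc")

def mur_urls(start, end):
    d, out = start, []
    while d <= end:
        out.append(TEMPLATE.format(d))
        d += timedelta(days=1)
    return out

urls = mur_urls(date(2020, 1, 1), date(2021, 12, 31))  # 731 daily granules
```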

the R

library(reticulate)
py <- "
import earthaccess
import rioxarray
import rasterio
import xarray as xr
import numpy as np
import time

results = earthaccess.search_data(
    short_name='MUR-JPL-L4-GLOB-v4.1',
    temporal=('2020-01-01', '2021-12-31')
)
## pull out the URLs
data_links = [granule.data_links(access='external') for granule in results]
url_links = [f'{link[0]}' for link in data_links]

"

x <- py_run_string(py)
urls <- x$url_links

dsn <- sprintf("NETCDF:/vsicurl/%s:analysed_sst", urls)

info <- vapour::vapour_raster_info(dsn[1L])
## grid we want (we can do exact snapping discussion later)
ex0 <- c(-93, -76, 41, 49)
res <- c(0.01, 0.01)

#

## so we do
library(vapour)
d <- gdal_raster_data(dsn[1L], target_ext = ex0, target_res = res)

## but we want to do 720 files at once, so chunk it up
library(grout) ## hypertidy/grout
g <- grout(attr(d, "dimension"), extent = attr(d, "extent"), blocksize = c(attr(d, "dimension") %/% 4))
tind <- tile_index(g)

## scan for which blocks have no data and not read them
test <- rep(TRUE, nrow(tind))
for (i in seq_along(test)) {
  ex <- unlist(tind[i, c("xmin", "xmax", "ymin", "ymax")])

  d <- gdal_raster_data(dsn[1L], target_ext = ex, target_res = res)
  if (all(is.na(d[[1]]))) test[i] <- FALSE
}

tind$test <- test
tind1 <- tind[tind$test, ]

library(parallel)
cl <- makeCluster(31)
readfun <- function(x) {
    vapour::gdal_raster_data(x, target_ext = extent, target_res = res)
}

extent <- ex0

#ximage::ximage(readfun(dsn[1]))

lstd <- vector("list", nrow(tind1))
for (i in seq_len(nrow(tind1))) {
  extent <- unlist(tind1[i, c("xmin", "xmax", "ymin", "ymax")])
  parallel::clusterExport(cl, c("extent", "res"))

 d <- parLapply(cl, dsn, readfun)
 lstd[[i]] <- apply(matrix(unlist(d), length(d[[1]][[1]])), 1, sd)
}

stopCluster(cl)

rg <- range(unlist(lapply(lstd, range, na.rm = TRUE)))
rg
## now image the grid by part reconstruction

plot(ex0[1:2], ex0[3:4], asp = 1/cos(mean(ex0[3:4]) * pi/180), type = "n")
for (i in seq_along(lstd)) {
  ximage::ximage(matrix(lstd[[i]], 128, byrow = TRUE), unlist(tind1[i, c("xmin", "xmax", "ymin", "ymax")]), add = TRUE,  zlim = rg, col = hcl.colors(64))  
}
cboettig commented 6 months ago

With gdalcubes, the best I can do for this computation from my desktop is about ~ 13 minutes. Interestingly, it's faster to crop the files but serialize each cropped file to disk (~12 minutes), and then compute the standard deviation over the two years' worth against disk (~ 7 seconds), rather than doing the reduce_time() call in streaming mode (data_gd |> gdalcubes::crop(extent) |> reduce_time("sd(sst)")).

The reason appears to be related to I/O speeds on my network -- in the former case, most of the 12 minutes is spent with my network pulling 700 - 999Mb/s (the max speed on my network card), while in the latter case network reads stay consistently at 100 Mb/s - probably because in the second case it has to do the extra computation on the fly as it streams the data in, which slows down the speed at which it can read??

library(earthdatalogin)
library(gdalcubes)

urls <- edl_search(short_name = "MUR-JPL-L4-GLOB-v4.1",
                   temporal = c("2020-01-01", "2021-12-31"))

vrt <- function(url) {
  prefix <-  "vrt://NETCDF:/vsicurl/"
  suffix <- ":analysed_sst?a_srs=OGC:CRS84&a_ullr=-180,90,180,-90"
  paste0(prefix, url, suffix)
}

gdalcubes_options(parallel = parallel::detectCores()*2)
with_gdalcubes()

url_dates <- as.Date(gsub(".*(\\d{8})\\d{6}.*", "\\1", urls), format="%Y%m%d")

data_gd <- gdalcubes::stack_cube(vrt(urls),
                                   datetime_values = url_dates,
                                   band_names = "sst")

extent = list(left=-93, right=-76, bottom=41, top=49,
              t0="2020-01-01", t1="2021-12-31")

bench::bench_time({ # 12 minutes
  data_gd |> 
    gdalcubes::crop(extent) |> 
    write_ncdf("~/sst.nc")
})

bench::bench_time({  # 7 seconds
 ncdf_cube("/home/cboettig/sst.nc") |>
    reduce_time("sd(sst)") |>
    plot(col = viridisLite::viridis(10))
})
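The `gsub()` date extraction above has a direct Python equivalent, useful when building `datetime_values` from the URL list (a sketch; the example filename follows the MUR pattern, and the helper name is mine):

```python
import re
from datetime import datetime

def url_date(url):
    # Pull the YYYYMMDD prefix out of the 14-digit timestamp in the
    # filename, mirroring gsub(".*(\\d{8})\\d{6}.*", "\\1", urls) in R.
    m = re.search(r"(\d{8})\d{6}", url)
    return datetime.strptime(m.group(1), "%Y%m%d").date()

d = url_date("https://archive.podaac.earthdata.nasa.gov/"
             "podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/"
             "20200101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc")
```

Requiring eight digits immediately followed by six more means the version strings in the filename (v02.0, fv04.1) can't match; only the timestamp does.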
mdsumner commented 6 months ago

how many cores do you have? I see that xarray happily saturates without my intervention, don't yet know how to control that

mdsumner commented 6 months ago

I agree with the general idea that intensive work requires local files or at least temp files - and that's not terrible. But I think we can find a lot of good here across Python and R (and the real libs). For a long time I've been thinking that caching is just not my problem - there could be all sorts of networking layers simply making caches where there's demand, and I still think that will happen at some higher level

mdsumner commented 6 months ago

not absolutely sure yet, but it looks like my R way takes 100 minutes, and your rioxarray way takes nearly 4 hours here (that's in the ballpark of your rioxarray comparison to gdalcubes I think).

I'll keep checking, and do those local comparisons too. I also haven't compared this earthdata source with the ones I use in this repo (the /opendap/ ones) not sure if they are aliased or a real copy.

mdsumner commented 6 months ago

gdalcubes will absolutely rip at native resolution, but will fall down when you want a remote COG at lower resolution - same as terra::project without by_util = TRUE. It would be good if there was one set of R GDAL bindings, but I don't expect I can summon up the effort to make just that one thing work across all of them - it was quite a lot of work just for terra and it's still not great

mdsumner commented 6 months ago

can you open this at all with xarray, via any means?

https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202202/oisst-avhrr-v02r01.20220218.nc
cboettig commented 5 months ago

nope, doesn't work for me either. all 6 engines ( ['netcdf4', 'h5netcdf', 'scipy', 'rasterio', 'store', 'zarr']) say file not found, even on a locally downloaded copy. weird because gdalinfo reads from either the url or local copy fine. any chance this is related to https://github.com/corteva/rioxarray/issues/174 ?

cboettig commented 5 months ago

how many cores do you have? I see that xarray happily saturates without my intervention, don't yet know how to control that

128 cores on that desktop :-) . Though really still wrapping my head a lot around parallelization issues. For GDAL, I set these env vars for cloud use, including "GDAL_NUM_THREADS" = "ALL_CPUS". Perhaps ridiculously, I have found that additionally it helps to set gdalcubes parallelization over the actual core count -- gdalcubes_options(parallel = parallel::detectCores()*2) -- my understanding is that because the jobs over network are I/O limited rather than CPU limited, this gets gdalcubes to spawn more processes. each process is often using something like 10% - 20% of a core anyway, so by spawning more processes I can keep my network card closer to saturation speed (i.e. pulling down close to 900 Mb/s most of the time, rather than stuck in the low 10s or low 100s).

mdsumner commented 5 months ago

ah that's interesting

I think those GDAL vars only apply to the warper specifically