mdsumner / ghrsst.coop


exploring python pathways #3

Open mdsumner opened 6 months ago

mdsumner commented 6 months ago

Read the source NetCDF with osgeo.gdal, and (in a very basic vsimem tempfile way) transfer to xarray via rioxarray.

(the code assumes GDAL_HTTP_HEADERS is set to your earthdata "Authorization: Bearer <token>" header, and we must use GDAL >= 3.7 for the simple vrt:// syntax to fix up the extent, crs, and scaling of the GHRSST).

We use arguments of osgeo.gdal.Warp() (https://gdal.org/api/python/osgeo.gdal.html#osgeo.gdal.WarpOptions) to set the output grid (this only matters if we actually read; otherwise it's all done with a VRT tempfile). Obviously a front-end would use whatever xarray arguments instead. (I don't know if you can set the crs/extent/dimension with those; you can with odc, but not as flexibly as this, I think.)

from datetime import datetime
from os import path
from osgeo import gdal
from osgeo import gdalconst

## note that GDAL_HTTP_HEADERS="Authorization: Bearer <token>" must be set, see https://urs.earthdata.nasa.gov/documentation/for_users/user_token
def open_ghrsst(datestring, subdataset = "analysed_sst"):
    gdal.UseExceptions()
    ##datestring = '2002-06-01'
    dt    = datetime.strptime(datestring, '%Y-%m-%d')
    year  = datetime.strftime(dt, "%Y")
    month = datetime.strftime(dt, "%m")
    day   = datetime.strftime(dt, "%d")
    jday  = datetime.strftime(dt, "%j")

    filename = f'/vsicurl/https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/{year}{month}{day}090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc'

    if subdataset == "analysed_sst":
        dsn = f"vrt://NetCDF:\"{filename}\":{subdataset}?a_srs=OGC:CRS84&a_ullr=-180,89.9905,180,-89.9905&a_scale=0.001&a_offset=25"
    else: 
        dsn = f"vrt://NetCDF:\"{filename}\":{subdataset}?a_srs=OGC:CRS84&a_ullr=-180,89.9905,180,-89.9905"
    return gdal.Open(dsn)

def read_ghrsst(datestring, subdataset = "analysed_sst", **kwargs): 
    ds = open_ghrsst(datestring, subdataset)
    wds = gdal.Warp("/vsimem/ghrsst.vrt", ds, **kwargs)
    return wds

## relevant args are outputBounds, width, height (can leave one empty for proper aspect ratio)
## dstSRS, xRes, yRes, resampleAlg, etc. which correspond to gdalwarp opts
#ds = read_ghrsst("2023-12-01", outputBounds = [-93,  41, -76, 49], width = 512)
ds = read_ghrsst("2023-12-01", outputBounds = [-93,  46, -86, 49], width = 512)

## to xarray?
import xarray
import rioxarray
rioxarray.open_rasterio(ds.GetDescription())

@cboettig a small start on a family of examples I'd like to flesh out, it's not very fast to read (which is why I'm working on COG versions of these)

mdsumner commented 6 months ago

ah it does apply lots of places but probably not with these netcdfs:

https://github.com/search?q=repo%3AOSGeo%2Fgdal%20num_threads&type=code

maybe affects the decompression tho
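
For reference, a minimal sketch of where the threading knobs live in osgeo.gdal (whether they actually help with these netcdfs is exactly the question above; the input path is a placeholder):

from osgeo import gdal

gdal.UseExceptions()
## global config option, picked up by drivers and the warper where supported
gdal.SetConfigOption("GDAL_NUM_THREADS", "ALL_CPUS")

ds = gdal.Open("some_input.tif")  ## placeholder source
## the warper also takes an explicit multithread flag and a NUM_THREADS warp option
out = gdal.Warp("/vsimem/out.vrt", ds, multithread=True,
                warpOptions=["NUM_THREADS=ALL_CPUS"])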

goergen95 commented 6 months ago

Hi both,

there is really a lot of interesting stuff going on in this thread!! I am currently working on adding support for cloud-native resources in {mapme.biodiversity} and it seems like I am going down a similar road. Really appreciate reading your experiences.

More to the point of this thread, with the script below I managed to run the analysis from the mentioned blog post on my laptop within 25 min. Download speed hovered around 4 MB/s. Using wrap/unwrap seems to accumulate the rasters in RAM (amounting here to ~20GB), but writing the masked raster to disk should be an easy fix.

I would be interested to benchmark this script on some machine closer to the data. Unfortunately, I currently don't have the resources to spin up a VM in the US.


library(earthdatalogin) # remotes::install_github("boettiger-lab/earthdatalogin")
library(progressr)
library(future)
library(furrr)
library(terra)

urls <- edl_search(short_name = "MUR-JPL-L4-GLOB-v4.1",
                   temporal = c("2020-01-01", "2021-12-31"))
gdal_cloud_config()
vrt <- function(url, sd) {
    prefix <-  "vrt://NETCDF:/vsicurl/"
    suffix <- sprintf(":%s?a_srs=OGC:CRS84&a_ullr=-180,90,180,-90", sd)
    paste0(prefix, url, suffix)
}

win <- c(-93, -76, 41, 49)
vrts_sst <- sapply(urls, vrt, sd = "analysed_sst")
vrts_ice <- sapply(urls, vrt, sd = "sea_ice_fraction")

plan(multisession, workers = 16)
bench::bench_time({
    with_progress({
        p <- progressor(steps = length(urls))
        sst_filtered <- future_map(seq_len(length(urls)), function(i) {
            p()
            sst <- rast(vrts_sst[i], win = win)
            ice <- rast(vrts_ice[i], win = win)
            ice <- subst(ice, 0, 0.15, 1, 0)
            wrap(mask(sst, ice, maskvalues = 0))
        })
        sst_sd <- app(rast(lapply(sst_filtered, unwrap)), sd, na.rm = TRUE)
    })
})
plan(sequential)

plot(sst_sd)
mdsumner commented 6 months ago

interesting! thanks @goergen95 I still have a bit of cleaning up to do to make sure I understand my benchmarks but will take this on

@cboettig for reference I usually have 32 cores, but I'm looking forward to utilizing more HPC and AWS soon

mdsumner commented 6 months ago

and, also I'd completely missed that earthdatalogin provides brilliant commonality to the python tools, that's excellent - I can probably use that to replace some of our own catalogue already (unpicking from our local data lib is one of the longer term tasks)

I'll follow up with that dim thing in rioxarray, the oisst has degenerate z and t dims (but doesn't have even basic crs compliance ... which I requested of them long ago - adding these new features in GDAL for vrt strings and netcdf meta-assumptions was my way of unpicking our dependence on our R functions)

mdsumner commented 6 months ago

a specific question, can you show your xarray.show_versions() ? I'm confused about the libnetcdf none thing and the horrible HDF5 messages I get

INSTALLED VERSIONS
------------------
commit: None
python: 3.10.13 (main, Aug 25 2023, 13:20:03) [GCC 9.4.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-169-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_AU.UTF-8
LOCALE: ('en_AU', 'UTF-8')
libhdf5: 1.10.4
libnetcdf: None

xarray: 2023.12.0
pandas: 2.1.4
numpy: 1.26.2
scipy: None
netCDF4: None
pydap: None
h5netcdf: 1.3.0
h5py: 3.10.0
Nio: None
zarr: 2.16.1
cftime: 1.6.3
nc_time_axis: None
iris: None
bottleneck: None
dask: 2023.12.1
distributed: None
matplotlib: 3.8.2
cartopy: None
seaborn: None
numbagg: None
fsspec: 2023.12.2
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 69.0.2
pip: 23.3.2
conda: None
pytest: 7.4.3
mypy: None
IPython: None
sphinx: 7.2.6

my netcdf is 4.9.2 and otherwise working

mdsumner commented 6 months ago

oh! it's probably that I don't have the python package netCDF4 installed and it's falling back to HDF5 ... lol that's probably been making my life complicated for a while ... since gdal has its own bindings for its python ... doh

fixed!!

INSTALLED VERSIONS

commit: None
python: 3.10.13 (main, Aug 25 2023, 13:20:03) [GCC 9.4.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-169-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_AU.UTF-8
LOCALE: ('en_AU', 'UTF-8')
libhdf5: 1.10.4
libnetcdf: 4.9.2

xarray: 2023.12.0
pandas: 2.1.4
numpy: 1.26.2
scipy: None
netCDF4: 1.6.5
pydap: None
h5netcdf: 1.3.0
h5py: 3.10.0
Nio: None
zarr: 2.16.1
cftime: 1.6.3
nc_time_axis: None
iris: None
bottleneck: None
dask: 2023.12.1
distributed: None
matplotlib: 3.8.2
cartopy: None
seaborn: None
numbagg: None
fsspec: 2023.12.2
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 69.0.2
pip: 23.3.2
conda: None
pytest: 7.4.3
mypy: None
IPython: None
sphinx: 7.2.6

mdsumner commented 6 months ago

I think my current take is that we should write a new, real GDAL engine for xarray using osgeo.gdal; it would be like an empty terra::rast() and simply call back to the warper from its list of sources, translating from/to xarray's degenerate rectilinear coords as needed
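
For what that might look like, here's a very rough sketch of the shape of such an engine, assuming xarray's BackendEntrypoint API; the class name, variable/coordinate names, and the eager single-band read are all hypothetical, and a real version would call the warper lazily from its list of sources:

import numpy as np
import xarray
from osgeo import gdal
from xarray.backends import BackendEntrypoint

class GDALBackendEntrypoint(BackendEntrypoint):
    ## hypothetical engine: open any GDAL-readable dsn and hand back a Dataset
    def open_dataset(self, filename_or_obj, *, drop_variables=None):
        gdal.UseExceptions()
        ds = gdal.Open(filename_or_obj)
        gt = ds.GetGeoTransform()
        nx, ny = ds.RasterXSize, ds.RasterYSize
        ## degenerate rectilinear coords derived from the geotransform (cell centres)
        x = gt[0] + (np.arange(nx) + 0.5) * gt[1]
        y = gt[3] + (np.arange(ny) + 0.5) * gt[5]
        data = ds.GetRasterBand(1).ReadAsArray()
        return xarray.Dataset(
            {"band1": (("y", "x"), data)},
            coords={"x": x, "y": y},
            attrs={"crs": ds.GetProjection()},
        )

## usage: the entrypoint class can be passed directly as the engine
## xds = xarray.open_dataset(dsn, engine=GDALBackendEntrypoint)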

there's going to be no end of problems with two extra layers of interpretation, with everything falling back on rasterio

but, odc might be enough - does it register as an xarray engine the way rasterio does? (I don't think so, I think you have to use it up front to return xarray objects, and then you're back at rasterio again ...)

for the dim thing though it might be enough to have an option for rioxarray to ignore dims, like rasterio does

cboettig commented 6 months ago

@mdsumner have you dug deeper at all into how odc.stac handles the interface between xarray and GDAL? is it mediating that via rioxarray/rasterio or some other mechanism? I should probably give it a try with these netcdfs just for comparison; but I do wish it were a bit more flexible (or I better understood how it worked), it doesn't seem to have anything analogous to gdalcubes::stack_cube() that can take an arbitrary list of urls/files rather than stac metadata (though I suppose it's not so hard to generate the stac json input even if there isn't a stac catalog, and of course earthdata has one). iirc, odc.stac depends on stackstac, which depends on rasterio, xarray, & dask but not rioxarray.

Hi @goergen95, neat to see you here! would love to chat more with you about this as well. I've been meaning to take a deeper look at your mapme.biodiversity work, especially the new branch that emphasizes the cloud platforms. I'm sure this thread here is a bit of a mess to follow. Somewhat orthogonal to the discussion here, but I think you also work with a lot of vector data, and I've been playing with methods for doing this same basic http-range-request strategy using the duckdb spatial module, which has been really nice. (It recently gained support for a quadkey index too.)

mdsumner commented 6 months ago

not yet, I'm getting closer - was great to crack into rioxarray!

I feel like this array vs grid stuff is here to stay and we'll be stuck in a rut for a long time now

cboettig commented 6 months ago

I took a crack at odc.stac / stackstac with netcdf but I think I'm hitting the issue where rasterio refuses to read netcdf like this, which it thinks is a multi-band object (i.e. if these netcdfs were cogs, presumably each variable like analysed_sst would have been serialized in a separate asset). Here's my notebook: https://github.com/espm-157/nasa-topst-env-justice/blob/main/drafts/stackstac-ncdf.ipynb

Does this look to you like the same issue you were hacking away with just recently in https://github.com/corteva/rioxarray/issues/174 ?

mdsumner commented 6 months ago

I will have a look at that, but I got bogged down in rioxarray - there are new features in GDAL that will break the way it gets netcdf values/metadata from vsicurl (because it's very slow to do that for unlimited dims) - it's fixable but outside my game atm

I've rejigged to leverage rioxarray to load VRT generated by GDAL, so lazy warping at its finest. I'm using my own COGs here but it will scale to other sources and I'll expand on that.

This def warp() is much like vapour::gdal_raster_dsn() with out_name set to tempfile(fileext = ".vrt")

library(reticulate)
library(raadtools)
files <- ghrsstfiles() |> dplyr::filter(date >= as.POSIXct("2020-01-01"), 
                                        date <= as.POSIXct("2021-12-31"))

reticulate::py_save_object(files$fullname, "a.pickle")

py <- '
import xarray
from osgeo import gdal
gdal.UseExceptions()
import tempfile
import os
import pickle
with open("a.pickle", "rb") as picklefile:
    dsn_links = pickle.load(picklefile)
picklefile.close()

ext = [-93, -76, 41, 49]

##dsn_links = [f"vrt://NETCDF:\"/vsicurl/{link}\":analysed_sst?a_ullr=-180,89.995,180,-89.995&a_srs=OGC:CRS84" for link in url_links]

def warp(dsn, target_dim = None, target_crs = None, target_res = None, target_ext = None, resample = None):
    ds = gdal.Open(dsn)
    outputBounds = None
    xRes = None
    yRes = None
    height = None
    width = None
    dstSRS = None
    resampleAlg = None
    if target_ext is not None: 
        outputBounds = [target_ext[0], target_ext[2], target_ext[1], target_ext[3]]
    if target_dim is not None: 
        width = target_dim[0]
        height = target_dim[1]
    if target_crs is not None:
        dstSRS = target_crs
    if target_res is not None: 
        xRes = target_res[0]
        yRes = target_res[1]
    if resample is not None: 
        resampleAlg = resample
    tf = tempfile.NamedTemporaryFile(suffix = ".vrt").name
    w = gdal.Warp(tf, ds, outputBounds = outputBounds, xRes = xRes, yRes = yRes, height = height, width = width, dstSRS = dstSRS, resampleAlg = resampleAlg)
    #return w.GetMetadata("xml:VRT")[0]
    return tf

wfile = [warp(dsn, target_ext = ext) for dsn in dsn_links]
x = xarray.open_mfdataset(wfile, engine = "rasterio", 
                       concat_dim="time", 
                       combine="nested",
                       parallel=True, chunks = {})

'
system.time({
x <- py_run_string(py)
std <- x$x$std("time")$compute()
a <- std$to_array()$to_numpy()

ximage::ximage(a[1,1,,], ext = c(-93, -76, 41, 49), col = hcl.colors(64))
})

it's just easier for me to experiment on speedy files and things I understand (I still have to generate all the tempfiles, it's not working to carry the VRT text directly or use vsimem for loading up xarray)

mdsumner commented 6 months ago

I was a bit surprised, that took 420 seconds. 32 cores, 727 local COG files (I missed the final two days of 2021). I'll compare to the local netcdfs next, and do it natively with xarray vs via this warper path.

image

mdsumner commented 6 months ago

It seems the COGs are 2x as fast as the NetCDFs used at native resolution. The nice thing about this code is that I can target any grid; here I use the COGs again - the only code change is the extent, resolution, and crs, in one easy place:

ext = [-1e7,1e7,-1e7,1e7]

wfile = [warp(dsn, target_ext = ext, target_crs = "EPSG:3031", target_res = [50000, 50000]) for dsn in dsn_links]

here is std computed from the 2020-2021 COGs at 50km resolution in 400 seconds

image

this would take much longer on the netcdfs because they have no overviews.

An aside, and the crux of a real post I need to write: this will all work the same for curvilinear grids as for rectilinear ones, we just need to craft VRT that contains the GEOLOCATION metadata - and I think that should be added to gdal_translate and the vrt:// syntax, to avoid much scaffolding in downstream packages.
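
As a sketch of what that looks like by hand today from osgeo.gdal (the dataset and lon/lat variable names here are made up; the metadata keys follow GDAL's geolocation-array convention and geoloc=True asks the warper to use them):

from osgeo import gdal
from osgeo import osr

gdal.UseExceptions()

## hypothetical curvilinear source with 2D lon/lat arrays alongside the variable
src = gdal.Open('NETCDF:"swath.nc":sst')
vrt = gdal.Translate("/vsimem/swath.vrt", src, format="VRT")

srs = osr.SpatialReference()
srs.ImportFromEPSG(4326)

## attach the geolocation arrays so the warper knows how to rectify the grid
vrt.SetMetadata(
    {
        "SRS": srs.ExportToWkt(),
        "X_DATASET": 'NETCDF:"swath.nc":lon',
        "Y_DATASET": 'NETCDF:"swath.nc":lat',
        "X_BAND": "1",
        "Y_BAND": "1",
        "PIXEL_OFFSET": "0",
        "LINE_OFFSET": "0",
        "PIXEL_STEP": "1",
        "LINE_STEP": "1",
    },
    "GEOLOCATION",
)

regular = gdal.Warp("/vsimem/regular.vrt", vrt, geoloc=True, dstSRS="OGC:CRS84")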

mdsumner commented 6 months ago

@mdsumner have you dug deeper at all into how odc.stac handles the interface between xarray and GDAL? is it mediating that via rioxarray/rasterio or some other mechanism? I should probably give it a try with these netcdfs just for comparison; but I do wish it were a bit more flexible (or I better understood how it worked), it doesn't seem to have anything analogous to gdalcubes::stack_cube() that can take an arbitrary list of urls/files rather than stac metadata (though I suppose it's not so hard to generate the stac json input even if there isn't a stac catalog, and of course earthdata has one). iirc, odc.stac depends on stackstac, which depends on rasterio, xarray, & dask but not rioxarray.

I don't think so, but it's pretty easy to create stac items - I might try this myself. We did this to draft a stac api for REMA, but I don't know if this will be the same as what odc.stac needs to load (and no idea about how time vs mosaic works):

from pystac import Item

import geopandas as gpd

gdf = gpd.read_file("https://data.pgc.umn.edu/elev/dem/setsm/REMA/indexes/REMA_Mosaic_Index_latest_gpkg.zip", 
  layer = "REMA_Mosaic_Index_v2_2m")

items = []

for row in gdf.iterfeatures():
    feature = row["properties"]
    stac_url = feature["s3url"].replace("https://polargeospatialcenter.github.io/stac-browser/#/external/", "https://")
    item = Item.from_file(stac_url)
    items.append(item.to_dict())

ah no, the stac_url is json not a tif - so there's still another level, but you can create stac json from a dsn and then do that!
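
A rough sketch of that last step, building a one-asset stac item directly from a dsn with pystac (the id, bbox, geometry, and datetime here are placeholders for the MUR example; whether the loaders then accept a netcdf dsn as an asset href is still the open question):

from datetime import datetime, timezone
from pystac import Asset, Item

dsn = 'NETCDF:"/vsicurl/https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20200101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc":analysed_sst'

bbox = [-180.0, -90.0, 180.0, 90.0]
geometry = {
    "type": "Polygon",
    "coordinates": [[[-180.0, -90.0], [180.0, -90.0], [180.0, 90.0],
                     [-180.0, 90.0], [-180.0, -90.0]]],
}

item = Item(
    id="mur-20200101-analysed_sst",
    geometry=geometry,
    bbox=bbox,
    datetime=datetime(2020, 1, 1, 9, 0, tzinfo=timezone.utc),
    properties={},
)
item.add_asset("analysed_sst", Asset(href=dsn, media_type="application/x-netcdf"))

stac_json = item.to_dict()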

cboettig commented 6 months ago

very cool! and yes, very excited about the ability to change extent, resolution, and projection on the fly too, I think that aspect is really being overlooked in the xarray-direct narratives.

I am curious how these benchmarks do over networked connections, though obviously harder to have a controlled benchmark there. I downloaded the full copy of the 2 years of netcdfs locally just to compare. Even when I limit it to 32 threads I can do the computation directly on the netcdfs using gdalcubes in 214 seconds.

> extent = list(left=-93, right=-76, bottom=41, top=49,
+               t0="2020-01-01", t1="2021-12-31")
> bench::bench_time({
+   proxy_cube |> 
+     gdalcubes::crop(extent) |> 
+     reduce_time("sd(sst)") |>
+     plot(col = viridisLite::viridis(10))
+ })
[==================================================>] 100 %
process    real 
  7.12s   3.56m 

Re ODC stac -- see my example https://github.com/espm-157/nasa-topst-env-justice/blob/main/drafts/stackstac-ncdf.ipynb . We can already get STAC JSON metadata from NASA, but it seems the problem is that rasterio refuses to read "multiband" assets when accessed this way (I don't entirely understand, since GDAL obviously can, and rasterio can read these ncdfs separately....)

mdsumner commented 6 months ago

well that seems about right, you have 4x the cores I have, so you are half the time it takes me to do the COGs. :)

I see about the multiband/multivar thing, thanks for pointing that out - do you know what the ds.count is? Do you know how to find out what was actually passed to rasterio.open? Because this is not equal to 1, so maybe it's only opening the outer layer (which can only list the subdatasets; its ds.count is 0).

AutoParallelRioReader._open(self)
    337 if ds.count != 1:
    338     ds.close()
    339     raise RuntimeError(
--> 340         f"Assets must have exactly 1 band, but file {self.url!r} has {ds.count}. "
    341         "We can't currently handle multi-band rasters (each band has to be "
    342         "a separate STAC asset), so you'll need to exclude this asset from your analysis."
    343     )

going on

item[0].get_self_href()
'https://cmr.earthdata.nasa.gov/stac/POCLOUD/collections/MUR-JPL-L4-GLOB-v4.1.v4.1/items/20200101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1'

## we see

 ds = rasterio.open("/vsicurl/https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20200101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc")

ds.count
0

and that ds.count != 1 check looks suspicious to me.

maybe it should be picking one

ds.subdatasets
['netcdf:/vsicurl/https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20200101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc:analysed_sst', 'netcdf:/vsicurl/https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20200101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc:analysis_error', 'netcdf:/vsicurl/https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20200101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc:mask', 'netcdf:/vsicurl/https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20200101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc:sea_ice_fraction', 'netcdf:/vsicurl/https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20200101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc:dt_1km_data', 'netcdf:/vsicurl/https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20200101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc:sst_anomaly']
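
for instance, picking a subdataset by hand opens fine with rasterio (a quick sketch continuing from the ds above; I'd expect count to be 1 here since there's a single time step):

sub = rasterio.open(ds.subdatasets[0])
print(sub.count)
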
cboettig commented 6 months ago

well that seems about right, you have 4x the cores I have, so you are half the time it takes me to do the COGs. :)

True, but this is on the netcdf files, not on COGs. And I'm setting the threads to 32, not 128 -- actually it's just as fast on 32 threads as on 128 threads, so I suspect the speed of my NVMe disk is rate-limiting, not CPU. What's your hard-disk setup?

This is on an NVMe disk with realized performance spec:

sudo hdparm -Tt /dev/nvme0n1

/dev/nvme0n1:
 Timing cached reads:   22582 MB in  1.99 seconds = 11324.73 MB/sec
 Timing buffered disk reads: 3032 MB in  3.00 seconds = 1010.58 MB/sec

Re the ds.count thing -- hmm... interesting! I hadn't tried to see what that would return, I'd assumed it was greater than 1, not 0, but what you say makes sense. yeah I haven't figured out how to sensibly re-write the urls to give it the dimension manually like that. Even if I just reach in and hack the URLs listed on the href field of the STAC catalog, it seems to be prepending the /vsicurl/ bit internally, so if I manually add a prefix things come out all mangled. I assume I can't use the suffix slice without the netcdf: prefix part?

cboettig commented 6 months ago

oh I see what you mean, ds.subdatasets is already being read by rasterio correctly, it's just that stackstac isn't expecting such a thing as ds.subdatasets to exist. but maybe it won't be too hard to add logic to look there.... this sounds promising!

mdsumner commented 6 months ago

oh, I see about your cores - my disk is an sshfs mount, it's all research cloud (openstack) so I don't actually know

mdsumner commented 6 months ago

so if I manually add a prefix things come out all mangled. I assume I can't use the suffix slice without the netcdf: prefix part?

you can now with GDAL 3.9:

vrt://{dsn}?sd_name=analysed_sst

add in a "&if=NetCDF" to be really sure

#ds = rasterio.open("vrt:///vsicurl/https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20200101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc?a_srs=OGC:CRS84&sd_name=analysed_sst&if=NetCDF")

ds = rasterio.open("vrt:///vsicurl/https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20200101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc?a_srs=OGC:CRS84&sd_name=analysed_sst")
ds.count
1
ds.files
 ['NETCDF:"/vsicurl/https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20200101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc":analysed_sst']

I can't see it being much fun including GDAL version checks in this stuff though ... I guess we have to train the stac world to turn assets into dsns, or get it all working with multidim (!!)
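
for the multidim route, a small sketch of the osgeo.gdal multidimensional API on the same MUR file (the window sizes are arbitrary; whether the downstream stac/xarray tooling can consume this is the open question):

from osgeo import gdal

gdal.UseExceptions()

url = "/vsicurl/https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20200101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc"

ds = gdal.OpenEx(url, gdal.OF_MULTIDIM_RASTER)
root = ds.GetRootGroup()
sst = root.OpenMDArray("analysed_sst")

## dimensions come back as real named dims, not flattened into subdatasets/bands
print([dim.GetName() for dim in sst.GetDimensions()])

## read a small window: one time step, a lat/lon chunk
chunk = sst.ReadAsArray(array_start_idx=[0, 0, 0], count=[1, 512, 512])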

mdsumner commented 6 months ago

I have no idea how to pass an R vector to python without that pickle hack ...

any ideas? I don't want to serialize thousands of file paths into python code and eval it (!)

mdsumner commented 6 months ago

@cboettig can you please try timing the gdalcubes way but set chunking like this:

chunking = c(1, 1023, 2047)

## corresponds to 

vapour::vapour_raster_info(<dsn>)$block

they're weird values ... like it was a mistake ... but it does seem to make a difference at least for RasterIO to make read calls aligned on the blocks precisely, and maybe for the warp calls in gdalcubes too

mdsumner commented 6 months ago

I appreciate these discussions, I've seen a lot of good comments about this problem (points in time over large areas is the general case)

https://discourse.pangeo.io/t/extracting-pixel-values-to-the-points-distributed-over-larger-area/3957/2

we need a toolkit of base level functionality - a tool can target a source optimally for defined cases, but no general user interface can manage a user problem against general sources

cboettig commented 5 months ago

@mdsumner have you seen: https://girder.github.io/large_image_wheels/

%pip install -U --find-links https://girder.github.io/large_image_wheels GDAL

gdal 3.9 in 5 seconds....

mdsumner commented 5 months ago

I hadn't, will explore - I guess this is the python version of r2u ?

btw, with ghcr.io/rocker-org/geospatial:dev-osgeo, do you think it should have pip and osgeo.gdal? I lost track of where it's meant to be at

cboettig commented 5 months ago

btw, with ghcr.io/rocker-org/geospatial:dev-osgeo, do you think it should have pip and osgeo.gdal? I lost track of where it's meant to be at

Yes, it should now have pip and osgeo.gdal out of the box in a default python environment (that lives in /opt/venv). However, I'm not sure it is compiling the osgeo.gdal from source, and I kinda think maybe it should be building rasterio from source too, but maybe that's asking for trouble... on my list to explore...

I hadn't, will explore - I guess this is the python version of r2u ?

perhaps, though I doubt Dirk would agree with that characterization. wheels are built with ancient versions of compilers to ensure compatibility across almost all platforms. r2u builds 'native' binaries, e.g. it gives ubuntu 22.04 users binaries compiled with the gcc version that ubuntu 22.04 ships with (etc). However, r2u doesn't offer something like prebuilt gdal 3.9.

Note that if you pip install -U --find-links https://girder.github.io/large_image_wheels GDAL you have full commandline gdal utilities too.

mdsumner commented 5 months ago

it works! (python package naming is so weird ...)


 ##docker run --rm -ti  ghcr.io/rocker-org/geospatial:dev-osgeo bash

gdalinfo  --version
#GDAL 3.8.2, released 2023/16/12
apt update && apt upgrade -y
apt install python3-pip -y
python3 -m pip install -U --find-links https://girder.github.io/large_image_wheels GDAL

gdalinfo --version
## GDAL 3.9.0dev-fa6ef24a21-dirty, released 2024/01/07

## it won't work with R package, not sure how to install from source in this 
##Rscript -e "terra::gdal()"  ->> 3.8.2

##python3
from osgeo import gdal
gdal.VersionInfo()
'3090000'
mdsumner commented 5 months ago

overall, I think I just don't fit the moulds - I will have to just settle in and make my own containers, building GDAL from source regularly is now one of the least difficult parts here 😂

and, what is up with python package names ... it's osgeo with submodules gdal, osr, ogr, gdal_array ... and they call it "GDAL" - I don't understand how you're supposed to figure out how to install, there's hyphens, periods, mixed caps, all the fun - (but also the cli, is "wheels" more like a kind of conda? an os package manager, not just python)

mdsumner commented 5 months ago

so weird, those cli utils go in python3.10:

 ls /usr/local/lib
cmake        libgdal.so           libgeos_c.so         libgeos.so         libproj.so.25        pkgconfig  python3.10
gdalplugins  libgdal.so.34        libgeos_c.so.1       libgeos.so.3.12.1  libproj.so.25.9.3.1  python2.7  R
jni          libgdal.so.34.3.8.2  libgeos_c.so.1.18.1  libproj.so         ocaml                python3

 ls /usr/local/lib/python3.10/dist-packages/
GDAL-3.9.0.dist-info  GDAL.libs  osgeo  osgeo_utils

ls /usr/local/lib/python3.10/dist-packages/osgeo
bin                                          _gdal.cpython-310-x86_64-linux-gnu.so  _ogr.cpython-310-x86_64-linux-gnu.so
gdal                                         gdalnumeric.py                         ogr.py
_gdal_array.cpython-310-x86_64-linux-gnu.so  gdal.py                                _osr.cpython-310-x86_64-linux-gnu.so
gdal_array.py                                _gnm.cpython-310-x86_64-linux-gnu.so   osr.py
_gdalconst.cpython-310-x86_64-linux-gnu.so   gnm.py                                 proj
gdalconst.py                                 __init__.py                            __pycache__

 ls /usr/local/lib/python3.10/dist-packages/osgeo/bin
applygeo      gdal_contour    gdalinfo           gdalsrsinfo     geod         invgeod    ogrinfo     __pycache__
cct           gdal_create     gdallocationinfo   gdaltindex      geotifcp     invproj    ogrlineref  sozip
cs2cs         gdaldem         gdalmanage         gdaltransform   gie          listgeo    ogrtindex
gdaladdo      gdalenhance     gdalmdiminfo       gdal_translate  gnmanalyse   makegeo    proj
gdalbuildvrt  gdal_footprint  gdalmdimtranslate  gdal_viewshed   gnmmanage    nearblack  projinfo
gdal-config   gdal_grid       gdal_rasterize     gdalwarp        __init__.py  ogr2ogr    projsync

this is entirely out of my experience, and doesn't seem like a good idea at all - thanks for the info about wheels and old compilers!

cboettig commented 5 months ago

re the cli utils -- yeah, python's packaging model is even looser and more flexible than R's; I believe it's a relatively common thing for python packages to install cli tools (though presumably most are binding python). e.g. of course many python modules are intended only to be used as cli tools rather than loaded with import and run from a python interpreter. it still feels weird to me too in this case, but I may use this if only as a very quick way to test the same code in 4 different versions of GDAL.

yes, ick! the import / naming conventions in python drive me bonkers as well. Again, I'm gathering this is because the python packaging system itself is just crazy flexible, in some ways not really a system at all so much as a set of (various competing) conventions.

I think the package name used by pip is the one from setup.py (or these days more often pyproject.toml, e.g. https://github.com/OSGeo/gdal/blob/master/swig/python/setup.py.in#L421), while the name you import is the name of a directory at the package root (i.e. next to setup.py, what we would think of as the R/ dir in an R package), https://github.com/OSGeo/gdal/tree/master/swig/python/osgeo. By convention these are the same but obviously not all the time! (And also, there's nothing in pypi or elsewhere to ensure unique namespaces of this folder name that you refer to, e.g. you could make your own package that is pip install bob but loads via import xarray)

Unlike our R/ dir, you can have arbitrarily many sub-directories (== submodules). The mapping of special characters is annoying, I think the basic rule is that you use . to replace / on a path (but can also be used of course to reference a function/method/variable inside a file), and that you replace any - in path or dir names with _ -- not sure if those rules apply to package names as well.

I know Hadley has always said that he thinks R is powerful precisely because it's so permissive about so much. I've definitely come to see python as even more permissive. anyway you probably knew most of all this already and I probably got it half wrong

mdsumner commented 5 months ago

this is all excellent, fwiw a lot of my problems trying to get everything aligned came down to needing to update pip! that and ensuring --no-binary is happening when needed and not otherwise

I agree about that mix of permissiveness and constraints; there are a lot of pro vs con axes here, and it's not simple precisely because there are so many of them.