mdsumner / pawsey-aad


specifications #2

Open mdsumner opened 3 months ago

mdsumner commented 3 months ago
mdsumner commented 1 month ago

loose todo list

mdsumner commented 1 month ago

pull from ghcr.io without the docker hub intermediary?

singularity pull --dir $MYSOFTWARE/sif_lib docker://ghcr.io/osgeo/gdal:ubuntu-small-3.9.2

mdsumner commented 1 week ago

A bit of a summary that I emailed, for my own reference.

The script in R/oisst_daily.R uses the new bucket-based approach with bowerbird.

bowerbird now writes directly to Acacia, and we have tested that along with programmatically making buckets publicly available.

Secrets can be passed as quoted names now, so we can put host and bucket user/pass secrets in a consistent framing. bowerbird will first look for an env var of that name, and if there isn't one it will use the string itself as the value.
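The lookup order described above can be sketched in a few lines. This is only an illustration of the idea in Python, not bowerbird's actual code; `resolve_secret` and the variable name `ACACIA_SECRET_KEY` are made up for the example.

```python
import os


def resolve_secret(name_or_value: str) -> str:
    """Return the env var called `name_or_value` if it is set, else the string itself."""
    return os.environ.get(name_or_value, name_or_value)


# Hypothetical variable name, set here only so the example is self-contained.
os.environ["ACACIA_SECRET_KEY"] = "s3cr3t"

print(resolve_secret("ACACIA_SECRET_KEY"))  # taken from the environment
print(resolve_secret("literal-value"))      # no such variable, so used as-is
```

The upside of this framing is that scripts only ever contain the name of the secret, and the same call works whether the value comes from the job environment or is passed in directly.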

We’re particularly interested in the {mirai} framework for async evaluation. It works tightly with the R {targets} system (think “make for R”), and already has wrappers for Slurm via the {crew.cluster} package.

I haven’t seen how to use it in anger for job scheduling yet on Pawsey, but there are examples around. If you have a chance to explore crew.cluster::crew_controller_slurm() and related functions that would be awesome, or if you can find others already using it on Pawsey that could help us a lot.

https://wlandau.github.io/crew.cluster/reference/crew_controller_slurm.html

All the deps needed are in this Docker image:

module load singularity/4.1.0-slurm
singularity pull --dir $MYSOFTWARE/sif_lib docker://ghcr.io/mdsumner/gdal-builds:rocker-gdal-dev-python

For us, part of the puzzle in crossing the divide from R to Python has been getting across VirtualiZarr, which is the successor to Python kerchunk and is closely related to the NASA/OPeNDAP system DMR++. These tools store references to the byte ranges, and the encoding, of chunks in files like NetCDF/GRIB/HDF that themselves aren’t cloud-friendly. That enables the files to be loaded as a Zarr store in xarray without any reformatting or copying at all (indexed by a big JSON, or by a Parquet store, which scales better). Creating kerchunk index collection descriptions for our object store will allow us to easily express what we have in R in an xarray context.
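To make the byte-range idea concrete, here is a rough stdlib-only sketch of a kerchunk-style reference set. It assumes the version 1 JSON layout, where metadata keys hold inline JSON strings and chunk keys hold `[url, offset, length]`. The file, the variable name `sst`, and the `read_chunk` helper are invented for illustration; in practice kerchunk or VirtualiZarr would generate the references and xarray would read them.

```python
import json
import os
import struct
import tempfile

# Stand-in for a legacy file: a 16-byte header, then two chunks of four float32 values.
header = b"H" * 16
chunk0 = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
chunk1 = struct.pack("<4f", 5.0, 6.0, 7.0, 8.0)
path = os.path.join(tempfile.mkdtemp(), "legacy.bin")
with open(path, "wb") as f:
    f.write(header + chunk0 + chunk1)

# Kerchunk-style references: Zarr metadata inline, chunks as [url, offset, length].
refs = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "sst/.zarray": json.dumps({
            "shape": [8], "chunks": [4], "dtype": "<f4",
            "compressor": None, "filters": None,
            "fill_value": None, "order": "C", "zarr_format": 2,
        }),
        "sst/0": [path, 16, 16],  # first chunk starts right after the header
        "sst/1": [path, 32, 16],  # second chunk follows the first
    },
}


def read_chunk(refs, key):
    """Fetch one uncompressed float32 chunk by seeking to its recorded byte range."""
    url, offset, length = refs["refs"][key]
    with open(url, "rb") as f:
        f.seek(offset)
        raw = f.read(length)
    return struct.unpack("<%df" % (length // 4), raw)


print(read_chunk(refs, "sst/1"))
```

The original file is never rewritten: the reference set is just an index, so the same bytes can sit on an object store and be fetched with range requests.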