mdsumner / pawsey-aad


singularity docker image #1

mdsumner commented 3 weeks ago

Get the docker image with R libraries ready for running bowerbird

I pull the rocker-gdal-dev-python image (published to Docker Hub as mdsumner/hypertidy:main) with:

module load singularity/4.1.0-slurm
singularity pull --dir $MYSOFTWARE/sif_lib docker://mdsumner/hypertidy:main

The Dockerfiles and build process are here:

https://github.com/mdsumner/gdal-builds/pkgs/container/gdal-builds

This workflow pushes the ghcr.io image to Docker Hub (I don't know offhand whether we can pull directly from GitHub Packages):

https://github.com/mdsumner/gdal-builds/blob/main/.github/workflows/pub-dockerhub.yml
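
In general Singularity can also pull straight from ghcr.io with the same docker:// scheme; a sketch, with a placeholder tag to be substituted from the gdal-builds package page:

# hypothetical tag: check the ghcr.io package page for the real one
singularity pull --dir $MYSOFTWARE/sif_lib docker://ghcr.io/mdsumner/gdal-builds:<tag>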

Run the Docker image and start R

First start an interactive session with reasonable memory and time (for example, 20 GB of memory for 1 hour on the copy partition).
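
A minimal sketch of such a request, assuming Slurm's salloc and that the partition name matches your Pawsey allocation:

# assumed resource request; adjust partition/account/time for your project
salloc --partition=copy --mem=20G --time=01:00:00 --nodes=1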

In this shell we have python3 and R, with all the required geospatial libraries and packages available.

module load singularity/4.1.0-slurm
singularity shell $MYSOFTWARE/sif_lib/hypertidy_main.sif

R

In R, load bowerbird and run some file synchronization

This step downloads files filtered to 2024 only; expect roughly 300-400 files in total and at least several minutes of download time.

library(bowerbird)
library(blueant)

## modify so only recent files are downloaded
art <- blueant::sources("Artist AMSR2 near-real-time 3.125km sea ice concentration") |>
     bb_modify_source(method = list(accept_follow = "avhrr/2024/",
                                    accept_download = "Antarctic3125/asi.*5\\.4\\.(tif)"))

sst <- blueant::sources("NOAA OI 1/4 Degree Daily SST AVHRR") |>
     bb_modify_source(method = list(accept_follow = "avhrr/2024.*", accept_download = ".*2024.*nc$"))

## local root for the synced data collection
datadir <- file.path(Sys.getenv("MYSCRATCH"), "bowerbird_files")
if (!dir.exists(datadir)) dir.create(datadir, recursive = TRUE)
cf <- bb_config(local_file_root = datadir)

## register both sources in the one configuration
cf <- bb_add(cf, rbind(art, sst))

status <- bb_sync(cf, verbose = TRUE, confirm_downloads_larger_than = -1)

## (interactive syncs prompt for confirmation above a certain file size, unless this is set to a negative value)

This process is an administration step, but it was started from scratch here; usually our process will check for new files and download only those that are needed. Our full configuration has a few dozen sources and is tending towards ~40 TB across some 1.2 million files.
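
That incremental behaviour is governed by the configuration; a small sketch, assuming bb_config's clobber argument (0 = never overwrite an existing file, 1 = overwrite only if the remote copy is newer):

## sketch: re-download a file only when the remote copy is newer
cf <- bb_config(local_file_root = datadir, clobber = 1)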

See this blog post https://ropensci.org/blog/2018/11/13/antarctic/ and the package documentation https://docs.ropensci.org/bowerbird/ for more on bowerbird.

We have chosen two different datasets (sea surface temperature in NetCDF, sea ice concentration in GeoTIFF) from sites that don't require authentication. Many other data sets now require authentication by password or token, so further input to the bowerbird configuration is needed, like this (we avoided that complication for now):


bb_modify_source(user = "<user>", password = "<pass>")

Confirm the files downloaded on the system

This should now list some hundreds of NetCDF and GeoTIFF files.

find $MYSCRATCH/bowerbird_files -type f
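
For a quick count rather than a full listing:

find $MYSCRATCH/bowerbird_files -type f | wc -l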

Start raadtools and find the files

(There is a lot more to say here!)

library(raadtools)
set_data_roots(file.path(Sys.getenv("MYSCRATCH"), "bowerbird_files"), refresh_cache = TRUE)
readsst(latest = TRUE)   ## most recent date in the collection

readsst(latest = FALSE)  ## earliest date in the collection
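
A specific day can also be requested; a small example, assuming the first argument of readsst accepts a date:

## read the SST grid for a given day (the date must exist in the synced files)
readsst("2024-06-05")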

We should see output like this:

class      : RasterLayer
dimensions : 720, 1440, 1036800  (nrow, ncol, ncell)
resolution : 0.25, 0.25  (x, y)
extent     : -180, 180, -90, 90  (xmin, xmax, ymin, ymax)
crs        : +proj=longlat +datum=WGS84 +no_defs
source     : memory
names      : Daily.sea.surface.temperature
values     : -1.8, 31.85  (min, max)
time       : 2024-06-05

class      : RasterLayer
dimensions : 720, 1440, 1036800  (nrow, ncol, ncell)
resolution : 0.25, 0.25  (x, y)
extent     : -180, 180, -90, 90  (xmin, xmax, ymin, ymax)
crs        : +proj=longlat +datum=WGS84 +no_defs
source     : memory
names      : Daily.sea.surface.temperature
values     : -1.8, 32.23  (min, max)
time       : 2020-01-01

@knservis @raymondben (thanks for the chat today!)

mdsumner commented 4 days ago

I bit the bullet and parallelized the sync job, here just for OISST as I get warmed up. (Obviously we will need to be careful to have the sync check the contents on Acacia rather than on local disk, and to use a more fine-grained call to rclone sync, or just a copy, for recent data.)

work.R

library(bowerbird)
library(blueant)

datadir <- file.path(Sys.getenv("MYSCRATCH"), "bowerbird_files")
if (!dir.exists(datadir)) dir.create(datadir, recursive = TRUE)
cf <- bb_config(local_file_root = datadir)

## one source per year so the sync can be spread across parallel workers
years <- 1982:2024
srcslist <- purrr::map(years, \(.x) {
  blueant::sources("NOAA OI 1/4 Degree Daily SST AVHRR") |>
    bb_modify_source(method = list(accept_follow = sprintf("avhrr/%i.*", .x),
                                   accept_download = sprintf(".*%i.*nc$", .x)))
})

## one configuration per yearly source
cflist <- lapply(srcslist, \(.x) bb_add(cf, .x))

## forked parallel workers via future/furrr
options(parallelly.fork.enable = TRUE, future.rng.onMisuse = "ignore")
library(furrr)
plan(multicore)
statuses <- future_map(cflist, \(.cf) bb_sync(.cf, verbose = TRUE, confirm_downloads_larger_than = -1))
saveRDS(statuses, sprintf("statuses_%s.rds", format(Sys.Date())))
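
To run this unattended, a sketch of a batch script that executes work.R inside the container; the partition, memory, and time values are assumptions to adjust for your allocation:

#!/bin/bash
#SBATCH --partition=copy      # assumed; match your allocation
#SBATCH --mem=20G
#SBATCH --time=04:00:00
#SBATCH --cpus-per-task=8     # workers available to plan(multicore)

module load singularity/4.1.0-slurm
singularity exec $MYSOFTWARE/sif_lib/hypertidy_main.sif Rscript work.R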