hytest-org / hytest

https://hytest-org.github.io/hytest/

Running pangeo-forge Beam recipes for rechunking on HPC (e.g. Denali) #350

Open rsignell-usgs opened 11 months ago

rsignell-usgs commented 11 months ago

Thanks to help from @yuvipanda and @cisaacstern, we are now able to run pangeo-forge Beam recipes on HPC systems like Denali, using local Beam runners.
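
Here "local Beam runner" means Beam's built-in DirectRunner, which `pangeo-forge-runner` drives for you. As a minimal sketch of the idea (the toy transform and the option values are illustrative assumptions, not taken from this issue):

```python
# Minimal sketch: run a Beam pipeline on all local cores via the DirectRunner.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# A toy stand-in transform; a real pangeo-forge recipe (see below) is a
# composite transform built the same way.
recipe = beam.Create([1, 2, 3]) | beam.Map(lambda x: x * 2)

options = PipelineOptions(
    direct_num_workers=0,                   # 0 = use every available core
    direct_running_mode="multi_processing",
)

with beam.Pipeline(runner="DirectRunner", options=options) as p:
    p | recipe
```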

Try it yourself:

  1. On Denali, create a `runner` conda environment using this `environment.yml` file:

     ```yaml
     name: runner
     channels:
       - conda-forge
     dependencies:
       - python=3.9.13
       - pangeo-forge-recipes=0.10.0
       - apache-beam=2.42.0
       - pandas<2.0
       - s3fs
       - pip:
         - git+https://github.com/pangeo-forge/pangeo-forge-runner.git@main
     ```
  2. Create a `local_config.py` file for the pangeo-forge-runner like this:

     ```python
     from pathlib import Path
     import os

     HERE = Path(__file__).parent

     DATA_PREFIX = HERE / 'data'
     os.makedirs(DATA_PREFIX, exist_ok=True)

     # Target output should be partitioned by job id
     c.TargetStorage.root_path = f"{DATA_PREFIX}/{{job_name}}/output"
     c.TargetStorage.fsspec_class = "fsspec.implementations.local.LocalFileSystem"

     c.InputCacheStorage.fsspec_class = c.TargetStorage.fsspec_class
     c.InputCacheStorage.fsspec_args = c.TargetStorage.fsspec_args

     # Input data cache should not be partitioned by job id, as we want to
     # fetch each datafile from the source only once
     c.InputCacheStorage.root_path = f"{DATA_PREFIX}/cache/input"
     # c.InputCacheStorage.root_path = ""   # (leave empty to skip input caching)

     c.MetadataCacheStorage.fsspec_class = c.TargetStorage.fsspec_class
     c.MetadataCacheStorage.fsspec_args = c.TargetStorage.fsspec_args

     # Metadata cache should be per job, as changing kwargs can change the metadata
     c.MetadataCacheStorage.root_path = f"{DATA_PREFIX}/{{job_name}}/cache/metadata"
     ```


  3. Run a recipe from the command line, which will execute the recipe in parallel using the available cores (all the cores on a node if you are using Denali; this is Beam's DirectRunner, sketched above):

     ```bash
     pangeo-forge-runner bake \
       --repo https://github.com/pforgetest/gpcp-from-gcs-feedstock/ \
       --ref beam-refactor \
       --config local_config.py \
       --prune \
       --Bake.job_name=test01
     ```


  4. Explore the resulting zarr dataset in the `./data/` directory!
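
A quick way to do that last step is with xarray. A minimal sketch, assuming the store path follows the `local_config.py` layout above (`data/<job_name>/output/<store_name>`); the store name `gpcp` is a guess, so check `./data/test01/output/` for the actual name:

```python
# Minimal sketch: open the zarr store written by the bake above.
# "gpcp" is an assumed store name; list ./data/test01/output/ to confirm.
import xarray as xr

ds = xr.open_zarr("data/test01/output/gpcp")
print(ds)           # dimensions, coordinates, data variables
print(ds.chunks)    # the chunking that was written
```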
rsignell-usgs commented 11 months ago

If you want to rechunk a pile of local files on Denali with a local recipe, instead of processing cloud data using a recipe in a repo, you can just point `--repo` at a local feedstock directory that contains the recipe, as in this example (make sure you grab a compute node first, of course):

```bash
cd /caldera/hytest_scratch/scratch/rsignell/pangeo-forge
cat local_config.py
ls ./gene_recipe/feedstock/
pangeo-forge-runner bake --repo ./gene_recipe -f local_config.py --Bake.job_name=test02 >& foo.log &
```

In this case, `./gene_recipe/feedstock/recipe.py` looks like this:

```python
#!/usr/bin/env python
# coding: utf-8

import apache_beam as beam
import fsspec
import xarray as xr

from pangeo_forge_recipes.patterns import pattern_from_file_sequence
from pangeo_forge_recipes.transforms import OpenURLWithFSSpec, OpenWithXarray, StoreToZarr

# Find the local netCDF files to rechunk
fs = fsspec.filesystem('file')
flist = fs.glob('/caldera/hytest_scratch/scratch/gzt5142/LOCA_hist/*historical.nc')

# Open one file to learn how many time steps each file holds
ds = xr.open_dataset(fs.open(flist[0]), mask_and_scale=False)

# Desired output chunking
chunk_plan = {
    'time': 720 * 4,
    'lon': 160,
    'lat': 160,
}

# Build a file pattern (just the first file here, for testing)
pattern = pattern_from_file_sequence([flist[0]], concat_dim='time', nitems_per_file=len(ds.time))

recipe = (
    beam.Create(pattern.items())
    | OpenURLWithFSSpec()
    | OpenWithXarray(file_type=pattern.file_type, xarray_open_kwargs={"decode_coords": "all"})
    | StoreToZarr(
        store_name="gene",
        combine_dims=pattern.combine_dim_keys,
        target_chunks=chunk_plan,
    )
)
```
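
Once the bake completes, it's worth confirming that the output actually carries the requested chunks. A minimal sketch, assuming the store landed at `data/test02/output/gene` per the `local_config.py` layout and the `store_name` above:

```python
# Minimal sketch: verify the rechunked store matches chunk_plan.
# Path = {DATA_PREFIX}/{job_name}/output plus store_name ("gene").
import xarray as xr

ds = xr.open_zarr("data/test02/output/gene")
for var in ds.data_vars:
    print(var, ds[var].encoding.get("chunks"))
```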
rsignell-usgs commented 11 months ago

This was my first "real" test, and unfortunately it's currently giving a blosc decompression error. But I think the pattern is good, so I wanted to "document" this before I head out on leave for a week or two. I will be checking in from time to time if someone wants to try another rechunking workflow using this approach on Denali.
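
One way to narrow down a decompression error like this (a hypothetical debugging step, not something tried in this thread) is to confirm that each input file reads cleanly outside of Beam:

```python
# Hypothetical sketch: fully load each input file so a corrupt or
# partially-written file fails here rather than inside the pipeline.
import fsspec
import xarray as xr

fs = fsspec.filesystem('file')
for path in fs.glob('/caldera/hytest_scratch/scratch/gzt5142/LOCA_hist/*historical.nc'):
    try:
        xr.open_dataset(fs.open(path), mask_and_scale=False).load()
        print('ok  ', path)
    except Exception as e:
        print('FAIL', path, e)
```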

gzt5142 commented 11 months ago

@rsignell-usgs -- would it be useful to create a shared conda environment, available to anybody who wants to do this, and loadable as a Linux 'module'?

Something like:

```console
denali# module load pgf-runner
<edit recipe.py>
denali# pangeo-forge-runner bake <...etc...>
```

It would be fairly straightforward to set this up alongside the hytest conda env we share on /caldera.

rsignell-usgs commented 11 months ago

Good idea @gzt5142 , that would seem useful indeed!

norlandrhagen commented 11 months ago

Super cool to see this in action @rsignell-usgs!

yuvipanda commented 11 months ago

THIS MAKES ME SO SO HAPPY @rsignell-usgs!

rsignell-usgs commented 11 months ago

If @yuvipanda is happy, we all are happy! 😇