hytest-org / hytest

https://hytest-org.github.io/hytest/

Running pangeo-forge Beam recipes for rechunking on HPC (e.g. Denali) #350

Open rsignell-usgs opened 11 months ago

rsignell-usgs commented 11 months ago

Thanks to help from @yuvipanda and @cisaacstern, we are now able to run pangeo-forge Beam recipes on HPC systems like Denali, using local Beam runners.
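
Here "local Beam runner" means Beam's built-in DirectRunner, which `pangeo-forge-runner` drives for you. As a minimal sketch of the idea (the toy transform and the option values are illustrative assumptions, not taken from this issue):

```python
# Minimal sketch: run a Beam pipeline on all local cores via the DirectRunner.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# A toy stand-in transform; a real pangeo-forge recipe (see below) is a
# composite transform built the same way.
recipe = beam.Create([1, 2, 3]) | beam.Map(lambda x: x * 2)

options = PipelineOptions(
    direct_num_workers=0,                   # 0 = use every available core
    direct_running_mode="multi_processing",
)

with beam.Pipeline(runner="DirectRunner", options=options) as p:
    p | recipe
```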

Try it yourself:

  1. On Denali, create a `runner` conda environment using this `environment.yml` file:

     ```yaml
     name: runner
     channels:
       - conda-forge
     dependencies:
       - python=3.9.13
       - pangeo-forge-recipes=0.10.0
       - apache-beam=2.42.0
       - pandas<2.0
       - s3fs
       - pip:
         - git+https://github.com/pangeo-forge/pangeo-forge-runner.git@main
     ```
  2. Create a `local_config.py` file for the pangeo-forge-runner like this:

     ```python
     from pathlib import Path
     import os

     HERE = Path(__file__).parent

     DATA_PREFIX = HERE / 'data'
     os.makedirs(DATA_PREFIX, exist_ok=True)

     # Target output should be partitioned by job id
     c.TargetStorage.root_path = f"{DATA_PREFIX}/{{job_name}}/output"
     c.TargetStorage.fsspec_class = "fsspec.implementations.local.LocalFileSystem"

     c.InputCacheStorage.fsspec_class = c.TargetStorage.fsspec_class
     c.InputCacheStorage.fsspec_args = c.TargetStorage.fsspec_args

     # Input data cache should not be partitioned by job id, as we want to
     # fetch each datafile from the source only once
     c.InputCacheStorage.root_path = f"{DATA_PREFIX}/cache/input"
     # c.InputCacheStorage.root_path = ""   # (leave empty to skip input caching)

     c.MetadataCacheStorage.fsspec_class = c.TargetStorage.fsspec_class
     c.MetadataCacheStorage.fsspec_args = c.TargetStorage.fsspec_args

     # Metadata cache should be per job, as changing kwargs can change the metadata
     c.MetadataCacheStorage.root_path = f"{DATA_PREFIX}/{{job_name}}/cache/metadata"
     ```


  3. Run a recipe from the command line, which will execute the recipe in parallel using the available cores (all the cores on a node if you are using Denali; this is Beam's DirectRunner, sketched above):

     ```bash
     pangeo-forge-runner bake \
       --repo https://github.com/pforgetest/gpcp-from-gcs-feedstock/ \
       --ref beam-refactor \
       --config local_config.py \
       --prune \
       --Bake.job_name=test01
     ```


  4. Explore the resulting zarr dataset in the `./data/` directory!
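
A quick way to do that last step is with xarray. A minimal sketch, assuming the store path follows the `local_config.py` layout above (`data/<job_name>/output/<store_name>`); the store name `gpcp` is a guess, so check `./data/test01/output/` for the actual name:

```python
# Minimal sketch: open the zarr store written by the bake above.
# "gpcp" is an assumed store name; list ./data/test01/output/ to confirm.
import xarray as xr

ds = xr.open_zarr("data/test01/output/gpcp")
print(ds)           # dimensions, coordinates, data variables
print(ds.chunks)    # the chunking that was written
```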
rsignell-usgs commented 11 months ago

If you want to rechunk a pile of local files on Denali with a local recipe, instead of processing cloud data using a recipe in a repo, you can just point `--repo` at a local feedstock directory that contains the recipe, as in this example (make sure you grab a compute node first, of course):

```bash
cd /caldera/hytest_scratch/scratch/rsignell/pangeo-forge
cat local_config.py
ls ./gene_recipe/feedstock/
pangeo-forge-runner bake --repo ./gene_recipe -f local_config.py --Bake.job_name=test02 >& foo.log &
```

In this case, `./gene_recipe/feedstock/recipe.py` looks like this:

```python
#!/usr/bin/env python
# coding: utf-8

import apache_beam as beam
import fsspec
import xarray as xr

from pangeo_forge_recipes.patterns import pattern_from_file_sequence
from pangeo_forge_recipes.transforms import OpenURLWithFSSpec, OpenWithXarray, StoreToZarr

# Find the local netCDF files to rechunk
fs = fsspec.filesystem('file')
flist = fs.glob('/caldera/hytest_scratch/scratch/gzt5142/LOCA_hist/*historical.nc')

# Open one file to learn how many time steps each file holds
ds = xr.open_dataset(fs.open(flist[0]), mask_and_scale=False)

# Desired output chunking
chunk_plan = {
    'time': 720 * 4,
    'lon': 160,
    'lat': 160,
}

# Build a file pattern (just the first file here, for testing)
pattern = pattern_from_file_sequence([flist[0]], concat_dim='time', nitems_per_file=len(ds.time))

recipe = (
    beam.Create(pattern.items())
    | OpenURLWithFSSpec()
    | OpenWithXarray(file_type=pattern.file_type, xarray_open_kwargs={"decode_coords": "all"})
    | StoreToZarr(
        store_name="gene",
        combine_dims=pattern.combine_dim_keys,
        target_chunks=chunk_plan,
    )
)
```
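
Once the bake completes, it's worth confirming that the output actually carries the requested chunks. A minimal sketch, assuming the store landed at `data/test02/output/gene` per the `local_config.py` layout and the `store_name` above:

```python
# Minimal sketch: verify the rechunked store matches chunk_plan.
# Path = {DATA_PREFIX}/{job_name}/output plus store_name ("gene").
import xarray as xr

ds = xr.open_zarr("data/test02/output/gene")
for var in ds.data_vars:
    print(var, ds[var].encoding.get("chunks"))
```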
rsignell-usgs commented 11 months ago

This was my first "real" test, and unfortunately it's currently giving a blosc decompression error. But I think the pattern is good, so I wanted to "document" this before I head out on leave for a week or two. I will be checking in from time to time if someone wants to try another rechunking workflow using this approach on Denali.
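
One way to narrow down a decompression error like this (a hypothetical debugging step, not something tried in this thread) is to confirm that each input file reads cleanly outside of Beam:

```python
# Hypothetical sketch: fully load each input file so a corrupt or
# partially-written file fails here rather than inside the pipeline.
import fsspec
import xarray as xr

fs = fsspec.filesystem('file')
for path in fs.glob('/caldera/hytest_scratch/scratch/gzt5142/LOCA_hist/*historical.nc'):
    try:
        xr.open_dataset(fs.open(path), mask_and_scale=False).load()
        print('ok  ', path)
    except Exception as e:
        print('FAIL', path, e)
```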

gzt5142 commented 11 months ago

@rsignell-usgs -- would it be useful to create a shared conda environment, available to anybody who wants to do this, and loadable as a Linux 'module'?

Something like:

```console
denali# module load pgf-runner
<edit recipe.py>
denali# pangeo-forge-runner bake <...etc...>
```

It would be fairly straightforward to set this up alongside the hytest conda env we share on /caldera.

rsignell-usgs commented 11 months ago

Good idea @gzt5142 , that would seem useful indeed!

norlandrhagen commented 11 months ago

Super cool to see this in action @rsignell-usgs!

yuvipanda commented 11 months ago

THIS MAKES ME SO SO HAPPY @rsignell-usgs!

rsignell-usgs commented 11 months ago

If @yuvipanda is happy, we all are happy! 😇