rsignell-usgs opened 11 months ago
If you want to rechunk a pile of local files on Denali with a local recipe, instead of processing cloud data using a recipe from a repo, you can just point to a local feedstock directory that contains the recipe, as in this example (make sure you grab a compute node first, of course):
```
cd /caldera/hytest_scratch/scratch/rsignell/pangeo-forge
cat local_config.py
ls ./gene_recipe/feedstock/
pangeo-forge-runner bake --repo ./gene_recipe -f local_config.py --Bake.job_name=test02 >& foo.log &
```
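For reference, a feedstock directory in the standard pangeo-forge layout contains the recipe module plus a `meta.yaml` that maps a recipe id to the recipe object, so the `ls` above should print something like this (a sketch, assuming the standard feedstock convention):

```
meta.yaml  recipe.py
```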
In this case the repo's `./feedstock/recipe.py` looks like this:
```python
#!/usr/bin/env python
# coding: utf-8
import apache_beam as beam
import fsspec
import xarray as xr

from pangeo_forge_recipes.patterns import pattern_from_file_sequence
from pangeo_forge_recipes.transforms import OpenURLWithFSSpec, OpenWithXarray, StoreToZarr

# Build the list of local NetCDF files to rechunk
fs = fsspec.filesystem('file')
flist = fs.glob('/caldera/hytest_scratch/scratch/gzt5142/LOCA_hist/*historical.nc')

# Open one file to inspect dimensions (here, the length of the time coordinate)
ds = xr.open_dataset(fs.open(flist[0]), mask_and_scale=False)

# Desired chunking for the output Zarr store
chunk_plan = {
    'time': 720 * 4,
    'lon': 160,
    'lat': 160,
}

# Single-file pattern for this first test; the full run would use all of flist
pattern = pattern_from_file_sequence([flist[0]], concat_dim='time', nitems_per_file=len(ds.time))

recipe = (
    beam.Create(pattern.items())
    | OpenURLWithFSSpec()
    | OpenWithXarray(file_type=pattern.file_type, xarray_open_kwargs={"decode_coords": "all"})
    | StoreToZarr(
        store_name="gene",
        combine_dims=pattern.combine_dim_keys,
        target_chunks=chunk_plan,
    )
)
```
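As an aside, when debugging errors like the blosc one mentioned below, it can help to apply the same composed transform to a pipeline directly under Beam's DirectRunner, rather than going through pangeo-forge-runner. A minimal sketch, not part of the original recipe:

```python
# Hypothetical interactive debug run of the recipe above.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# `recipe` is a composed PTransform, so it can be applied to a pipeline directly.
with beam.Pipeline(options=PipelineOptions(["--runner=DirectRunner"])) as p:
    p | recipe
```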
This was my first "real" test, and unfortunately it's currently giving a blosc decompression error. But I think the pattern is good, so I wanted to "document" it here before I head out on leave for a week or two. I'll be checking in from time to time in case someone wants to try another rechunking workflow using this approach on Denali.
@rsignell-usgs -- would it be useful to create a shared conda environment, available to anybody who wanted to do this, and loadable as a Linux 'module'? Something like:
```
denali# module load pgf-runner
<edit recipe.py>
denali# pangeo-forge-runner bake <...etc...>
```
It would be fairly straightforward to set this up alongside the hytest conda env we share on /caldera.
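For concreteness, the shared environment might be created along these lines (the env path under /caldera is illustrative, not an agreed-on location):

```
# Create a shared env at a fixed path that anyone on /caldera can activate
conda env create -p /caldera/hytest_scratch/envs/pgf-runner -f environment.yml
conda activate /caldera/hytest_scratch/envs/pgf-runner
```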
Good idea @gzt5142, that would seem useful indeed!
Super cool to see this in action @rsignell-usgs!
THIS MAKES ME SO SO HAPPY @rsignell-usgs!
If @yuvipanda is happy, we all are happy! 😇
Thanks to help from @yuvipanda and @cisaacstern, we are now able to run pangeo-forge Beam recipes on HPC systems like Denali, using local Beam runners.
Try it yourself:

- Create a `runner` conda environment using this `environment.yml` file.
- Create a `local_config.py` file for the pangeo-forge-runner like this:

```python
# `c` is the traitlets config object injected by pangeo-forge-runner.
import os
from pathlib import Path

HERE = Path(__file__).parent  # assumed preamble; not shown in the original excerpt
DATA_PREFIX = HERE / 'data'
os.makedirs(DATA_PREFIX, exist_ok=True)

# Target output should be partitioned by job id
c.TargetStorage.root_path = f"{DATA_PREFIX}/{{job_name}}/output"
c.TargetStorage.fsspec_class = "fsspec.implementations.local.LocalFileSystem"

c.InputCacheStorage.fsspec_class = c.TargetStorage.fsspec_class
c.InputCacheStorage.fsspec_args = c.TargetStorage.fsspec_args

# Input data cache should not be partitioned by job id, as we want to fetch
# each data file from the source only once
c.InputCacheStorage.root_path = f"{DATA_PREFIX}/cache/input"
c.InputCacheStorage.root_path = ""  # note: this second assignment overrides the line above

c.MetadataCacheStorage.fsspec_class = c.TargetStorage.fsspec_class
c.MetadataCacheStorage.fsspec_args = c.TargetStorage.fsspec_args

# Metadata cache should be per job, as changing kwargs can change metadata
c.MetadataCacheStorage.root_path = f"{DATA_PREFIX}/{{job_name}}/cache/metadata"
```

- Then bake the test recipe:

```
pangeo-forge-runner bake --repo https://github.com/pforgetest/gpcp-from-gcs-feedstock/ --ref beam-refactor --config local_config.py --prune --Bake.job_name=test01
```
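Once the bake finishes, the output can be opened to verify the result. A sketch, assuming the `local_config.py` above and running from the directory containing it; the final path component depends on the recipe's `StoreToZarr(store_name=...)`, so the name below is hypothetical:

```python
import xarray as xr

# Path pattern follows local_config.py: {DATA_PREFIX}/{job_name}/output/<store_name>
ds = xr.open_zarr("data/test01/output/gpcp")  # store name is hypothetical
print(ds)
```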