pangeo-forge / staged-recipes

A place to submit pangeo-forge recipes before they become fully fledged pangeo-forge feedstocks
https://pangeo-forge.readthedocs.io/en/latest/
Apache License 2.0

Proposed Recipe for historical NorESM2-LM vmo #128

Open jdldeauna opened 2 years ago

jdldeauna commented 2 years ago

Source Dataset

Meridional ocean mass transport from the CMIP6 NorESM2-LM historical model

Transformation / Alignment / Merging

I tried using this template to make a recipe, but this dataset is not available in the s3://esgf-world bucket:

netcdf_cat = 's3://cmip6-nc/esgf-world.csv.gz'
df_s3 = pd.read_csv(netcdf_cat, dtype='unicode')
df_vmo = df_s3.query("source_id=='NorESM2-LM' & table_id == 'Omon' & variable_id == 'vmo' & experiment_id == 'historical'")
# df_vmo is empty

I tried exploring the FilePattern method, but I'm confused about how to set it up because the CMIP6 ESGF URLs are written per decade (e.g., 'http://noresg.nird.sigma2.no/thredds/fileServer/esg_dataroot/cmor/CMIP6/CMIP/NCC/NorESM2-LM/historical/r1i1p1f1/Omon/vmo/gn/v20190815/vmo_Omon_NorESM2-LM_historical_r1i1p1f1_gn_185001-185912.nc'), with the last file covering only the final five years (...201001-201412.nc). Does FilePattern only accept a time_concat_dim object?

Output Dataset

Stored with the rest of the CMIP6 outputs in Pangeo Cloud - https://storage.googleapis.com/cmip6/pangeo-cmip6-noQC.json
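
For reference, the existing catalog can be queried like this (a sketch, assuming intake-esm is installed; the query values just mirror the dataset described above):

import intake

cat = intake.open_esm_datastore("https://storage.googleapis.com/cmip6/pangeo-cmip6-noQC.json")
subset = cat.search(
    source_id="NorESM2-LM",
    table_id="Omon",
    variable_id="vmo",
    experiment_id="historical",
)
print(subset.df)  # expected to be empty until this dataset is added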

cisaacstern commented 2 years ago

Thanks for opening this issue, @jdldeauna!

You should be able to create a FilePattern for the url pattern you describe as follows:

import datetime

import pandas as pd
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern

BASE_URL = "https://data.org/dataset"

dates = pd.date_range("1850-01", "2010-01", freq="10YS")

time_concat_dim = ConcatDim("date", keys=dates)

def make_url(date):
    """With a start date as input, return a url terminating in
    ``{start}-{end}.nc`` where end is 10 years after the start
    date for years other than 2010. If the start date is 2010,
    the end date will be 5 years after the start date.

    :param date: The start date.
    """
    # assign 10 year interval for all years aside from 2010
    freq = "10YS" if date.year != 2010 else "5YS"

    # make a time range based on the assigned interval
    start, end = pd.date_range(date, periods=2, freq=freq)

    # subtract one day from the end of the range
    end = end - datetime.timedelta(days=1)

    # return the url with the timestamp in '%Y%m' format
    return f"{BASE_URL}/{start.strftime('%Y%m')}-{end.strftime('%Y%m')}.nc"

pattern = FilePattern(make_url, time_concat_dim)

for _, url in pattern.items():
    print(url)

which prints

https://data.org/dataset/185001-185912.nc
https://data.org/dataset/186001-186912.nc
https://data.org/dataset/187001-187912.nc
https://data.org/dataset/188001-188912.nc
https://data.org/dataset/189001-189912.nc
https://data.org/dataset/190001-190912.nc
https://data.org/dataset/191001-191912.nc
https://data.org/dataset/192001-192912.nc
https://data.org/dataset/193001-193912.nc
https://data.org/dataset/194001-194912.nc
https://data.org/dataset/195001-195912.nc
https://data.org/dataset/196001-196912.nc
https://data.org/dataset/197001-197912.nc
https://data.org/dataset/198001-198912.nc
https://data.org/dataset/199001-199912.nc
https://data.org/dataset/200001-200912.nc
https://data.org/dataset/201001-201412.nc

If you instead assign

BASE_URL = 'http://noresg.nird.sigma2.no/thredds/fileServer/esg_dataroot/cmor/CMIP6/CMIP/NCC/NorESM2-LM/historical/r1i1p1f1/Omon/vmo/gn/v20190815/vmo_Omon_NorESM2-LM_historical_r1i1p1f1_gn_'

I believe you will get the urls for this dataset.
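
As a quick sanity check (a sketch, reusing the pattern defined above with the new BASE_URL), the first and last generated urls should end with the timestamps from the ESGF filenames:

urls = [url for _, url in pattern.items()]
assert urls[0].endswith("185001-185912.nc")
assert urls[-1].endswith("201001-201412.nc")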

Note also that we just pushed a big update to the recipe contribution docs this morning, which may be of help to you:

https://pangeo-forge.readthedocs.io/en/latest/pangeo_forge_cloud/recipe_contribution.html#required-files

Please let me know if this solves your problem. If not, I'm standing by to help. Really looking forward to bringing this data to the cloud together!

jdldeauna commented 2 years ago

Thanks for your help @cisaacstern! Right now I'm trying to decide how to chunk the dataset. When I try to open the URLs with xarray, I get this error:

urls = []
for _, url in pattern.items():
    urls.append(url)

ds = xr.open_mfdataset(urls)

OSError: [Errno -90] NetCDF: file not found: b'http://noresg.nird.sigma2.no/thredds/fileServer/esg_dataroot/cmor/CMIP6/CMIP/NCC/NorESM2-LM/historical/r1i1p1f1/Omon/vmo/gr/v20190815/vmo_Omon_NorESM2-LM_historical_r1i1p1f1_gr_185001-185912.nc'


I've tried using alternate thredds servers (e.g., http://esgf-data3.ceda.ac.uk/ and http://esgf3.dkrz.de/), which also return the same error. I'm confused because when I try accessing an individual URL, I'm able to download the file associated with it. I can also download files from other data sources using xr.open_mfdataset. Would appreciate any advice, thanks!

cisaacstern commented 2 years ago

Awesome progress, @jdldeauna!

You are correct that it appears these urls are hard to open over HTTP with xarray (if there is a way, I also could not find it).

As shown in the chunking section of the tutorial you linked, determining target_chunks only requires the first netCDF input file. So I just downloaded the first file:

!wget http://noresg.nird.sigma2.no/thredds/fileServer/esg_dataroot/cmor/CMIP6/CMIP/NCC/NorESM2-LM/historical/r1i1p1f1/Omon/vmo/gr/v20190815/vmo_Omon_NorESM2-LM_historical_r1i1p1f1_gr_185001-185912.nc
--2022-03-23 16:52:36--  http://noresg.nird.sigma2.no/thredds/fileServer/esg_dataroot/cmor/CMIP6/CMIP/NCC/NorESM2-LM/historical/r1i1p1f1/Omon/vmo/gr/v20190815/vmo_Omon_NorESM2-LM_historical_r1i1p1f1_gr_185001-185912.nc
Resolving noresg.nird.sigma2.no (noresg.nird.sigma2.no)... 158.36.100.36, 158.36.100.100, 2001:700:2:8500::36, ...
Connecting to noresg.nird.sigma2.no (noresg.nird.sigma2.no)|158.36.100.36|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1795188500 (1.7G) [application/x-netcdf]
Saving to: ‘vmo_Omon_NorESM2-LM_historical_r1i1p1f1_gr_185001-185912.nc’

vmo_Omon_NorESM2-LM 100%[===================>]   1.67G  14.4MB/s    in 2m 1s   

2022-03-23 16:54:38 (14.1 MB/s) - ‘vmo_Omon_NorESM2-LM_historical_r1i1p1f1_gr_185001-185912.nc’ saved [1795188500/1795188500]

and then opened it directly from the downloaded copy

import xarray as xr

local_path = "vmo_Omon_NorESM2-LM_historical_r1i1p1f1_gr_185001-185912.nc"
ds = xr.open_dataset(local_path)

which worked. Then, I applied the calculation from the tutorial you linked, except I bumped chunksize_optimal from 50e6 (in the tutorial) to 125e6 (because 50 MB seemed too conservative, and ~ 125 MB should be fine):

ntime = len(ds.time)       # the number of time slices
chunksize_optimal = 125e6  # desired chunk size in bytes
ncfile_size = ds.nbytes    # the netcdf file size
chunksize = max(int(ntime * chunksize_optimal / ncfile_size), 1)

target_chunks = ds.dims.mapping
target_chunks['time'] = chunksize

target_chunks   # a dictionary giving the chunk sizes in each dimension

which gives us

{'time': 3, 'bnds': 2, 'lev': 70, 'j': 385, 'i': 360, 'vertices': 4}

We can then check how this chunking scheme affects the chunk size for vmo; it turns out to be ~111 MB per chunk, which should be good:

ds_chunked = ds.chunk(target_chunks)
ds_chunked.vmo
(Screenshot: the repr of ds_chunked.vmo, showing ~111 MB chunks.)
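
For reference, the same number can be computed directly (a sketch, assuming vmo is float32 with dims (time, lev, j, i), as in this file):

import numpy as np

# bytes per chunk = product of the per-dimension chunk sizes times the itemsize
chunk_bytes = (
    target_chunks["time"]
    * target_chunks["lev"]
    * target_chunks["j"]
    * target_chunks["i"]
    * np.dtype("float32").itemsize
)
print(f"{chunk_bytes / 2**20:.0f} MiB per chunk")  # ~111
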
jdldeauna commented 2 years ago

This is great! I was able to create the recipe. Unfortunately, when I try to execute it, this error appears:

for input_key in recipe.iter_inputs():
    recipe.cache_input(input_key)

FileNotFoundError: http://noresg.nird.sigma2.no/thredds/fileServer/esg_dataroot/cmor/CMIP6/CMIP/NCC/NorESM2-LM/historical/r1i1p1f1/Omon/vmo/gr/v20190815/vmo_Omon_NorESM2-LM_historical_r1i1p1f1_gr_185001-185912.nc

I think it may have to do with the URL again, so I'm going to try and track down how best to download from the ESGF servers. Thanks again for all your help @cisaacstern !

cisaacstern commented 2 years ago

Awesome, @jdldeauna!

Could you share the complete Python code you are using to create the recipe which throws this FileNotFoundError?

I suspect there may be a way to resolve this, but I won't know for sure until I can try out the actual recipe. Thanks!

jdldeauna commented 2 years ago

Sure!

import datetime
import pandas as pd
import xarray as xr
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.recipes.xarray_zarr import XarrayZarrRecipe

BASE_URL = 'http://noresg.nird.sigma2.no/thredds/fileServer/esg_dataroot/cmor/CMIP6/CMIP/NCC/NorESM2-LM/historical/r1i1p1f1/Omon/vmo/gr/v20190815/vmo_Omon_NorESM2-LM_historical_r1i1p1f1_gr_'

dates = pd.date_range("1850-01", "2010-01", freq="10YS")

time_concat_dim = ConcatDim("time", keys=dates)

def make_url(time):
    """With a start date as input, return a url terminating in
    ``{start}-{end}.nc`` where end is 10 years after the start
    date for years other than 2010. If the start date is 2010,
    the end date will be 5 years after the start date.

    :param time: The start date.
    """
    # assign 10 year interval for all years aside from 2010
    freq = "10YS" if time.year != 2010 else "5YS"

    # make a time range based on the assigned interval
    start, end = pd.date_range(time, periods=2, freq=freq)

    # subtract one day from the end of the range
    end = end - datetime.timedelta(days=1)

    # return the url with the timestamp in '%Y%m' format
    return f"{BASE_URL}{start.strftime('%Y%m')}-{end.strftime('%Y%m')}.nc"

pattern = FilePattern(make_url, time_concat_dim)

# Decide on chunk size by downloading local copy of 1 file
local_path = "vmo_Omon_NorESM2-LM_historical_r1i1p1f1_gr_185001-185912.nc"
ds = xr.open_dataset(local_path)

ntime = len(ds.time)       # the number of time slices
chunksize_optimal = 125e6  # desired chunk size in bytes
ncfile_size = ds.nbytes    # the netcdf file size
chunksize = max(int(ntime * chunksize_optimal / ncfile_size), 1)

target_chunks = ds.dims.mapping
target_chunks['time'] = chunksize

# the netcdf lists some of the coordinate variables as data variables. This is a fix which we want to apply to each chunk.
def set_bnds_as_coords(ds):
    new_coords_vars = [var for var in ds.data_vars if 'bnds' in var or 'bounds' in var]
    ds = ds.set_coords(new_coords_vars)
    return ds

recipe = XarrayZarrRecipe(
    pattern, 
    target_chunks=target_chunks,
    process_chunk=set_bnds_as_coords,
    xarray_concat_kwargs={'join':'exact'},
)

Executing recipe:

import zarr

for input_key in recipe.iter_inputs():
    recipe.cache_input(input_key)

FileNotFoundError: http://noresg.nird.sigma2.no/thredds/fileServer/esg_dataroot/cmor/CMIP6/CMIP/NCC/NorESM2-LM/historical/r1i1p1f1/Omon/vmo/gr/v20190815/vmo_Omon_NorESM2-LM_historical_r1i1p1f1_gr_185001-185912.nc

cisaacstern commented 2 years ago

Thanks for sharing!

I don't seem to be able to test my hypothesis about FileNotFoundError because now I'm getting FSTimeoutError. And indeed trying to get this file now with wget (which worked great earlier) also just hangs (seemingly) indefinitely. 😅

Because not even wget appears to work, this looks like an issue with http://noresg.nird.sigma2.no/thredds/fileServer. If/when this server comes back up (i.e., wget works on it), I can test some ideas for the recipe. (The short version is that we open inputs with fsspec, and depending on the source server, sometimes tweaking FilePattern.fsspec_open_kwargs can help.)
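
For example (an untested sketch; the kwargs shown are just the kind of tweak that sometimes helps, not a known fix for this server):

pattern = FilePattern(
    make_url,
    time_concat_dim,
    fsspec_open_kwargs={"block_size": 0},  # stream the file rather than reading in blocks
)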

jdldeauna commented 2 years ago

Hi! I can use wget to download from the following servers: http://noresg.nird.sigma2.no/, http://esgf-data3.ceda.ac.uk/ and http://esgf3.dkrz.de/. However, when I try to execute the recipe with each of these servers, the cell just runs for a long time, too 😅. I'll look more into fsspec, thanks!

cisaacstern commented 2 years ago

@jdldeauna, glad the server is back up! Executing the recipe actually does work for me now, without changing any kwargs.

Note that it's to be expected that the input caching will take a long time, because each input is pretty large. To figure out the exact input size without downloading, we can use wget --spider:

$ wget --spider http://noresg.nird.sigma2.no/thredds/fileServer/esg_dataroot/cmor/CMIP6/CMIP/NCC/NorESM2-LM/historical/r1i1p1f1/Omon/vmo/gr/v20190815/vmo_Omon_NorESM2-LM_historical_r1i1p1f1_gr_185001-185912.nc
...
Length: 1795188500 (1.7G) [application/x-netcdf]
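
The same check can also be done from Python (a sketch using fsspec's HTTP filesystem; fs.size just asks the server for the content length):

import fsspec

fs = fsspec.filesystem("http")
first_url = next(url for _, url in pattern.items())
print(fs.size(first_url))  # 1795188500 bytes, i.e. ~1.7 GB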

So that will take a long time to cache! Without logging turned on, it would be totally understandable to assume that such a long-running process has just stalled. To turn on debug logs, if you have not already:

from pangeo_forge_recipes.recipes import setup_logging

setup_logging("DEBUG")

Then when running a subsetted test of the recipe with

recipe_pruned = recipe.copy_pruned()
runner_func = recipe_pruned.to_function()
runner_func()

we see progress logs like this during the caching stage

pangeo_forge_recipes.recipes.xarray_zarr - INFO - Caching input 'Index({DimIndex(name='time', index=0, sequence_len=17, operation=<CombineOp.CONCAT: 2>)})'
pangeo_forge_recipes.storage - INFO - Caching file 'http://noresg.nird.sigma2.no/thredds/fileServer/esg_dataroot/cmor/CMIP6/CMIP/NCC/NorESM2-LM/historical/r1i1p1f1/Omon/vmo/gr/v20190815/vmo_Omon_NorESM2-LM_historical_r1i1p1f1_gr_185001-185912.nc'
pangeo_forge_recipes.storage - INFO - Copying remote file 'http://noresg.nird.sigma2.no/thredds/fileServer/esg_dataroot/cmor/CMIP6/CMIP/NCC/NorESM2-LM/historical/r1i1p1f1/Omon/vmo/gr/v20190815/vmo_Omon_NorESM2-LM_historical_r1i1p1f1_gr_185001-185912.nc' to cache
pangeo_forge_recipes.storage - DEBUG - entering fs.open context manager for /var/folders/tt/4f941hdn0zq549zdwhcgg98c0000gn/T/tmphpgr686_/IvjofcX8/a27a03bf7f29029b4b34b6c3ee38c660-http_noresg.nird.sigma2.no_thredds_fileserver_esg_dataroot_cmor_cmip6_cmip_ncc_noresm2-lm_historical_r1i1p1f1_omon_vmo_gr_v20190815_vmo_omon_noresm2-lm_historical_r1i1p1f1_gr_185001-185912.nc
pangeo_forge_recipes.storage - DEBUG - FSSpecTarget.open yielding <fsspec.implementations.local.LocalFileOpener object at 0x7fe2b0d625e0>
pangeo_forge_recipes.storage - DEBUG - _copy_btw_filesystems total bytes copied: 10000000
pangeo_forge_recipes.storage - DEBUG - avg throughput over 0.05 min: 3.31 MB/sec
... etc

So this input is 1.7 GB, which means that at 3.31 MB/sec we'd expect caching it to take (1.7e9 / 3.3e6) / 60 = ~8.5 minutes.

Let me know if that makes sense and what else I can do to help! We'll definitely get this to work 😄

cisaacstern commented 2 years ago

Just because these inputs are so large, I might recommend assigning the cache to be a named local directory as follows:

from fsspec.implementations.local import LocalFileSystem
from pangeo_forge_recipes.storage import CacheFSSpecTarget

fs_local = LocalFileSystem()
recipe.storage_config.cache = CacheFSSpecTarget(fs_local, "cache")

Making this assignment before executing the test will write the files to a local directory named cache (you can give it any name you like by changing the argument to CacheFSSpecTarget above). If you don't do this, the recipe's default cache is a temporary local directory which Python will eventually delete automatically; that's a bit of a pain if you plan to iterate on this recipe and need to cache these large inputs more than once.

Just to confirm, are you working with a local installation? For working with such large inputs I would definitely recommend a local installation as opposed to the Sandbox. This is for the same reason: to avoid having to cache these inputs more than once.

jdldeauna commented 2 years ago

I was able to run a pruned copy of the recipe on my local installation, thanks @cisaacstern !

I did have a question about executing the recipe as outlined in the CMIP6 tutorial:

import zarr

for input_key in recipe.iter_inputs():
    recipe.cache_input(input_key)

# use recipe to create the zarr store:
recipe.prepare_target() 

# is it there?
zgroup = zarr.open(recipe.target_mapper)
print(zgroup.tree())

for chunk in recipe.iter_chunks():
    recipe.store_chunk(chunk)
recipe.finalize_target()


I was wondering whether this style of execution is preferred for CMIP6 files over this version:

recipe_pruned = recipe.copy_pruned()
runner_func = recipe_pruned.to_function()
runner_func()

cisaacstern commented 2 years ago

Awesome 🎉

If you are satisfied with the outcome of your local test, I invite you to Make a PR for your recipe so we can move towards including in the Pangeo Forge Cloud catalog.

Regarding execution style, good question. The execution style is intentionally completely independent of the source data. In this case, the inconsistency you're seeing is a product of the fact that we have not actually deprecated the manual stage functions yet. We plan to do this, as noted in:

After this happens, we'll remove all of the remaining docs references to the older style (including the one you just noted) and the recipe.to_function()() mode will become the default manual execution style. Until then, the docs (unfortunately) still have some relics of the old manual style kicking around. The result of that approach is the same, but since we're planning to deprecate it, I'm just not encouraging its use now.

jdldeauna commented 2 years ago

I invite you to Make a PR for your recipe

Yep, I'm working on putting together the meta.yaml and recipe.py file, so I can do the PR. :)

After this happens, we'll remove all of the remaining docs references to the older style (including the one you just noted) and the recipe.to_function()() mode will become the default manual execution style.

I see, thanks for the clarification!

jdldeauna commented 2 years ago

Some questions on the meta.yaml file:

  1. If I didn't develop in a Docker Image, what do I put under pangeo_notebook_version?

  2. Which bakery should I specify?

Thanks!

cisaacstern commented 2 years ago

If I didn't develop in a Docker Image, what do I put under pangeo_notebook_version?

Yeah, we need better documentation for this. You can put "2021.12.02".

Which bakery should I specify?

You can use the same one in the example you linked, "pangeo-ldeo-nsf-earthcube", which is currently the only one.

Right now, the default storage target for that bakery points to our Open Storage Network (OSN) bucket for Pangeo Forge. I believe your goal is to get this data into the existing CMIP6 catalog, so we'll have to see the best way to do that, which might involve adding an alternate storage target to this bakery, or perhaps just pointing the CMIP6 catalog listing for this dataset at our OSN bucket. Either is an option; I imagine @jbusecke can help us decide which is best.

jdldeauna commented 2 years ago

Alright! Re: bakeries, I was confused by this post and thought the links were pointing to different bakeries, but they were actually just pointing to templates. 😅

I believe your goal is to get this data into the existing CMIP6 catalog,

Yes, please let me know if I can help with making it accessible. Thanks for all your help!

jbusecke commented 2 years ago

Heya folks, this looks like a really cool addition.

I think this is a great way to get some CMIP6 output on the cloud while I am still working on a more automated way to upload new CMIP6 stores (https://github.com/pangeo-data/pangeo-cmip6-cloud/issues/31). Just wanted to flag that this might be 'temporary', since I think it would be nice to ultimately keep all the CMIP6 data in one spot/catalog.

jdldeauna commented 2 years ago

Sure! Just to follow my train of thought, I looked up the old Google Form which led me to Pangeo Forge, and I figured I would start with a staged recipe.

I think it would be nice to ultimately keep all CMIP6 stuff in one spot/catalog?

Sounds good! What would you recommend in terms of next steps for this dataset?

jbusecke commented 2 years ago

I think if this works, you should totally just work with it. I would be happy to iterate with you and @cisaacstern in the future to consolidate/automate the efforts here!