pangeo-forge / staged-recipes

A place to submit pangeo-forge recipes before they become fully fledged pangeo-forge feedstocks
https://pangeo-forge.readthedocs.io/en/latest/
Apache License 2.0
39 stars 63 forks source link

Pipeline for SODA 3.4.2 #23

Open shanicetbailey opened 3 years ago

shanicetbailey commented 3 years ago

SODA 3.4.2 Dataset

SODA version 3.4.2 is an ocean reanalysis that is available on the native grid in netcdf format. The native interlaced horizonal velocity and conserved tracer (e.g. temperature and salinity) grids form a tripolar Arakawa-B grid, varying from 0.1°x0.25° at high latitude to 1/4°x1/4° in the tropics (quasi-isotropic grid spacing increases from ~11.7km at 65 latitude to ~28.0km at the Equator, 1440x1070 grid points). The (1.5Gb sized) topography map (created by Whit Anderson of GFDL) is here.

Transformation / Alignment / Merging

NA

Output Dataset

Output of pipeline would ideally be stored as zarr files for optimal cloud computing performance.

rabernat commented 3 years ago

Just noting that each file is 1.8 GB.

rabernat commented 3 years ago

I have a functional recipe as follows:

import pandas as pd
from pangeo_forge.recipe import NetCDFtoZarrSequentialRecipe

url_base = 'https://dsrs.atmos.umd.edu/DATA/soda3.4.2/ORIGINAL/ocean/'

dates = pd.date_range(start='1991-01-30', end='2019-12-19', freq='5D')
date_string = [d.strftime('%Y_%m_%d') for d in dates]
urls = [url_base + f'soda3.4.2_5dy_ocean_or_{dstring}.nc' for dstring in date_string]

recipe = NetCDFtoZarrSequentialRecipe(
    input_urls=urls,
    sequence_dim="time",
    inputs_per_chunk=1
)
rabernat commented 3 years ago

I tried to run the recipe and found that considerable data is missing from 1992. These are the only 1992 files:

  | soda3.4.2_5dy_ocean_or_1992_12_05.nc | 2017-09-23 03:03 | 1.8G |  
  | soda3.4.2_5dy_ocean_or_1992_12_10.nc | 2017-09-23 03:09 | 1.8G |  
  | soda3.4.2_5dy_ocean_or_1992_12_15.nc | 2017-09-23 03:16 | 1.8G |  
  | soda3.4.2_5dy_ocean_or_1992_12_20.nc | 2017-09-23 03:23 | 1.8G |  
  | soda3.4.2_5dy_ocean_or_1992_12_25.nc | 2017-09-23 03:29 | 1.8G |  
  | soda3.4.2_5dy_ocean_or_1992_12_30.nc | 2017-09-23 03:36 | 1.8G |  

@stb2145 - could you bring this to the attention of the dataset owner?

In the meantime, I will restart the recipe with a start date if 1993-01-04

rabernat commented 3 years ago

p.s. Pangeo forge is curently downloading the data 😄

image

Will take a few days.

shanicetbailey commented 3 years ago

Something they noted on the download page is:

Note: we had a data loss event and we only have the soda3.4.2 ocean files on the original grid for 1/30/1991-5/20/1991 and beginning 12/5/1992, and for sea ice beginning 01/05/1993. We have all the regridded data.

shanicetbailey commented 3 years ago

Please upload the topography map as well (1.5GB), thank you!

rabernat commented 3 years ago

I almost had this recipe done! But then I accidentally deleted nearly 5 TB of SODA data (see https://github.com/dask/gcsfs/issues/364). 😭 😭 😭

I am starting again, but it will take a few days to re-download all the netCDF files.

shanicetbailey commented 3 years ago

Oh no! 😓 I'm sorry that happened, but thank you for making this a priority!

rabernat commented 3 years ago

The SODA dataset is mostly all there. The recipe process uncovered two corrupted files:

Reading these results in HDF5 errors. Consequently, there is missing data on those two days.

The dataset is accessible now on Pangeo Cloud.

import gcsfs
import xarray as xr

url = 'gs://pangeo-forge-us-central1/pangeo-forge/soda/soda3.4.2_5dy_ocean_or'
fs = gcsfs.GCSFileSystem()
ds = xr.open_zarr(fs.get_mapper(url), consolidated=True)
ds.info()
xarray.Dataset {
dimensions:
    nv = 2 ;
    st_edges_ocean = 51 ;
    st_ocean = 50 ;
    sw_edges_ocean = 51 ;
    sw_ocean = 50 ;
    time = 1970 ;
    xt_ocean = 1440 ;
    xu_ocean = 1440 ;
    yt_ocean = 1070 ;
    yu_ocean = 1070 ;

variables:
    float32 anompb(time, yt_ocean, xt_ocean) ;
        anompb:cell_methods = time: mean ;
        anompb:coordinates = geolon_t geolat_t ;
        anompb:long_name = T-cell bottom pressure - rho0*grav*ht ;
        anompb:time_avg_info = average_T1,average_T2,average_DT ;
        anompb:units = dbar ;
        anompb:valid_range = [-1000000.0, 1000000.0] ;
    timedelta64[ns] average_DT(time) ;
        average_DT:long_name = Length of average period ;
    datetime64[ns] average_T1(time) ;
        average_T1:long_name = Start time for average period ;
    datetime64[ns] average_T2(time) ;
        average_T2:long_name = End time for average period ;
    float32 evap_heat(time, yt_ocean, xt_ocean) ;
        evap_heat:cell_methods = time: mean ;
        evap_heat:coordinates = geolon_t geolat_t ;
        evap_heat:long_name = latent heat flux into ocean (<0 cools ocean) ;
        evap_heat:standard_name = surface_downward_latent_heat_flux ;
        evap_heat:time_avg_info = average_T1,average_T2,average_DT ;
        evap_heat:units = W/m^2 ;
        evap_heat:valid_range = [-10000000000.0, 10000000000.0] ;
    float32 hflux_cpl(time, yt_ocean, xt_ocean) ;
        hflux_cpl:cell_methods = time: mean ;
        hflux_cpl:coordinates = geolon_t geolat_t ;
        hflux_cpl:long_name = surface heat flux coming through coupler ;
        hflux_cpl:time_avg_info = average_T1,average_T2,average_DT ;
        hflux_cpl:units = Watts/m^2 ;
        hflux_cpl:valid_range = [-10000.0, 10000.0] ;
    float32 hflux_pme(time, yt_ocean, xt_ocean) ;
        hflux_pme:cell_methods = time: mean ;
        hflux_pme:coordinates = geolon_t geolat_t ;
        hflux_pme:long_name = heat flux (relative to 0C) from pme transfer of water across ocean surface ;
        hflux_pme:time_avg_info = average_T1,average_T2,average_DT ;
        hflux_pme:units = Watts/m^2 ;
        hflux_pme:valid_range = [-10000.0, 10000.0] ;
    float32 hflux_total(time, yt_ocean, xt_ocean) ;
        hflux_total:cell_methods = time: mean ;
        hflux_total:coordinates = geolon_t geolat_t ;
        hflux_total:long_name = surface heat flux from coupler plus restore (omits mass transfer heating) ;
        hflux_total:time_avg_info = average_T1,average_T2,average_DT ;
        hflux_total:units = Watts/m^2 ;
        hflux_total:valid_range = [-10000.0, 10000.0] ;
    float32 lw_heat(time, yt_ocean, xt_ocean) ;
        lw_heat:cell_methods = time: mean ;
        lw_heat:coordinates = geolon_t geolat_t ;
        lw_heat:long_name = longwave flux into ocean (<0 cools ocean) ;
        lw_heat:standard_name = surface_net_downward_longwave_flux ;
        lw_heat:time_avg_info = average_T1,average_T2,average_DT ;
        lw_heat:units = W/m^2 ;
        lw_heat:valid_range = [-10000000000.0, 10000000000.0] ;
    float32 mlp(time, yt_ocean, xt_ocean) ;
        mlp:cell_methods = time: mean ;
        mlp:coordinates = geolon_t geolat_t ;
        mlp:long_name = Depth of potential density mixed layer ;
        mlp:time_avg_info = average_T1,average_T2,average_DT ;
        mlp:units = m ;
        mlp:valid_range = [-1000000.0, 1000000.0] ;
    float32 mls(time, yt_ocean, xt_ocean) ;
        mls:cell_methods = time: mean ;
        mls:coordinates = geolon_t geolat_t ;
        mls:long_name = mixed layer depth determined by salinity criteria ;
        mls:time_avg_info = average_T1,average_T2,average_DT ;
        mls:units = m ;
        mls:valid_range = [0.0, 1000000.0] ;
    float32 mlt(time, yt_ocean, xt_ocean) ;
        mlt:cell_methods = time: mean ;
        mlt:coordinates = geolon_t geolat_t ;
        mlt:long_name = mixed layer depth determined by temperature criteria ;
        mlt:time_avg_info = average_T1,average_T2,average_DT ;
        mlt:units = m ;
        mlt:valid_range = [0.0, 1000000.0] ;
    float32 net_heating(time, yt_ocean, xt_ocean) ;
        net_heating:cell_methods = time: mean ;
        net_heating:coordinates = geolon_t geolat_t ;
        net_heating:long_name = surface ocean heat flux coming through coupler and mass transfer ;
        net_heating:time_avg_info = average_T1,average_T2,average_DT ;
        net_heating:units = Watts/m^2 ;
        net_heating:valid_range = [-10000.0, 10000.0] ;
    float64 nv(nv) ;
        nv:cartesian_axis = N ;
        nv:long_name = vertex number ;
        nv:units = none ;
    float32 prho(time, st_ocean, yt_ocean, xt_ocean) ;
        prho:cell_methods = time: mean ;
        prho:coordinates = geolon_t geolat_t ;
        prho:long_name = potential density referenced to 0 dbar ;
        prho:standard_name = sea_water_potential_density ;
        prho:time_avg_info = average_T1,average_T2,average_DT ;
        prho:units = kg/m^3 ;
        prho:valid_range = [-10.0, 100000.0] ;
    float32 salt(time, st_ocean, yt_ocean, xt_ocean) ;
        salt:cell_methods = time: mean ;
        salt:coordinates = geolon_t geolat_t ;
        salt:long_name = Practical Salinity ;
        salt:standard_name = sea_water_salinity ;
        salt:time_avg_info = average_T1,average_T2,average_DT ;
        salt:units = psu ;
        salt:valid_range = [-10.0, 100.0] ;
    float32 salt_flux_rstr(time, yt_ocean, xt_ocean) ;
        salt_flux_rstr:cell_methods = time: mean ;
        salt_flux_rstr:coordinates = geolon_t geolat_t ;
        salt_flux_rstr:long_name = sfc_salt_flux_restore: flux from restoring term ;
        salt_flux_rstr:time_avg_info = average_T1,average_T2,average_DT ;
        salt_flux_rstr:units = kg/(m^2*sec) ;
        salt_flux_rstr:valid_range = [-10000.0, 10000.0] ;
    float32 salt_flux_total(time, yt_ocean, xt_ocean) ;
        salt_flux_total:cell_methods = time: mean ;
        salt_flux_total:coordinates = geolon_t geolat_t ;
        salt_flux_total:long_name = sfc_salt_flux_total ;
        salt_flux_total:time_avg_info = average_T1,average_T2,average_DT ;
        salt_flux_total:units = kg/(m^2*sec) ;
        salt_flux_total:valid_range = [-10000.0, 10000.0] ;
    float32 sbd(time, yt_ocean, xt_ocean) ;
        sbd:cell_methods = time: mean ;
        sbd:coordinates = geolon_t geolat_t ;
        sbd:long_name = rate of mass transferred below the mixed layer base ;
        sbd:time_avg_info = average_T1,average_T2,average_DT ;
        sbd:units = kg/sec ;
        sbd:valid_range = [-1.0000000200408773e+20, 1.0000000200408773e+20] ;
    float32 sbd_dhdt(time, yt_ocean, xt_ocean) ;
        sbd_dhdt:cell_methods = time: mean ;
        sbd_dhdt:coordinates = geolon_t geolat_t ;
        sbd_dhdt:long_name = rate of mass transferred below the mixed layer base due to dh/dt ;
        sbd_dhdt:time_avg_info = average_T1,average_T2,average_DT ;
        sbd_dhdt:units = kg/sec ;
        sbd_dhdt:valid_range = [-1.0000000200408773e+20, 1.0000000200408773e+20] ;
    float32 sbd_horz(time, yt_ocean, xt_ocean) ;
        sbd_horz:cell_methods = time: mean ;
        sbd_horz:coordinates = geolon_t geolat_t ;
        sbd_horz:long_name = rate of mass transferred below the mixed layer base due to horz advect ;
        sbd_horz:time_avg_info = average_T1,average_T2,average_DT ;
        sbd_horz:units = kg/sec ;
        sbd_horz:valid_range = [-1.0000000200408773e+20, 1.0000000200408773e+20] ;
    float32 sbd_mld(time, yt_ocean, xt_ocean) ;
        sbd_mld:cell_methods = time: mean ;
        sbd_mld:coordinates = geolon_t geolat_t ;
        sbd_mld:long_name = mixed layer depth used for subduction diagnostics ;
        sbd_mld:time_avg_info = average_T1,average_T2,average_DT ;
        sbd_mld:units = m ;
        sbd_mld:valid_range = [-100.0, 1.0000000200408773e+20] ;
    float32 sbd_vert(time, yt_ocean, xt_ocean) ;
        sbd_vert:cell_methods = time: mean ;
        sbd_vert:coordinates = geolon_t geolat_t ;
        sbd_vert:long_name = rate of mass transferred below the mixed layer base due to vert advect ;
        sbd_vert:time_avg_info = average_T1,average_T2,average_DT ;
        sbd_vert:units = kg/sec ;
        sbd_vert:valid_range = [-1.0000000200408773e+20, 1.0000000200408773e+20] ;
    float32 sens_heat(time, yt_ocean, xt_ocean) ;
        sens_heat:cell_methods = time: mean ;
        sens_heat:coordinates = geolon_t geolat_t ;
        sens_heat:long_name = sensible heat into ocean (<0 cools ocean) ;
        sens_heat:standard_name = surface_downward_sensible_heat_flux ;
        sens_heat:time_avg_info = average_T1,average_T2,average_DT ;
        sens_heat:units = W/m^2 ;
        sens_heat:valid_range = [-10000000000.0, 10000000000.0] ;
    float32 ssh(time, yt_ocean, xt_ocean) ;
        ssh:cell_methods = time: mean ;
        ssh:coordinates = geolon_t geolat_t ;
        ssh:long_name = effective sea level (eta_t + patm/(rho0*g)) on T cells ;
        ssh:standard_name = sea_surface_height_above_geoid ;
        ssh:time_avg_info = average_T1,average_T2,average_DT ;
        ssh:units = meter ;
        ssh:valid_range = [-1000.0, 1000.0] ;
    float64 st_edges_ocean(st_edges_ocean) ;
        st_edges_ocean:cartesian_axis = Z ;
        st_edges_ocean:long_name = tcell zstar depth edges ;
        st_edges_ocean:positive = down ;
        st_edges_ocean:units = meters ;
    float64 st_ocean(st_ocean) ;
        st_ocean:cartesian_axis = Z ;
        st_ocean:edges = st_edges_ocean ;
        st_ocean:long_name = tcell zstar depth ;
        st_ocean:positive = down ;
        st_ocean:units = meters ;
    float64 sw_edges_ocean(sw_edges_ocean) ;
        sw_edges_ocean:cartesian_axis = Z ;
        sw_edges_ocean:long_name = ucell zstar depth edges ;
        sw_edges_ocean:positive = down ;
        sw_edges_ocean:units = meters ;
    float64 sw_ocean(sw_ocean) ;
        sw_ocean:cartesian_axis = Z ;
        sw_ocean:edges = sw_edges_ocean ;
        sw_ocean:long_name = ucell zstar depth ;
        sw_ocean:positive = down ;
        sw_ocean:units = meters ;
    float32 swflx(time, yt_ocean, xt_ocean) ;
        swflx:cell_methods = time: mean ;
        swflx:coordinates = geolon_t geolat_t ;
        swflx:long_name = shortwave flux into ocean (>0 heats ocean) ;
        swflx:standard_name = surface_net_downward_shortwave_flux ;
        swflx:time_avg_info = average_T1,average_T2,average_DT ;
        swflx:units = W/m^2 ;
        swflx:valid_range = [-10000000000.0, 10000000000.0] ;
    float32 taux(time, yu_ocean, xu_ocean) ;
        taux:cell_methods = time: mean ;
        taux:coordinates = geolon_c geolat_c ;
        taux:long_name = i-directed wind stress forcing u-velocity ;
        taux:standard_name = surface_downward_x_stress ;
        taux:time_avg_info = average_T1,average_T2,average_DT ;
        taux:units = N/m^2 ;
        taux:valid_range = [-10.0, 10.0] ;
    float32 tauy(time, yu_ocean, xu_ocean) ;
        tauy:cell_methods = time: mean ;
        tauy:coordinates = geolon_c geolat_c ;
        tauy:long_name = j-directed wind stress forcing v-velocity ;
        tauy:standard_name = surface_downward_y_stress ;
        tauy:time_avg_info = average_T1,average_T2,average_DT ;
        tauy:units = N/m^2 ;
        tauy:valid_range = [-10.0, 10.0] ;
    float32 temp(time, st_ocean, yt_ocean, xt_ocean) ;
        temp:cell_methods = time: mean ;
        temp:coordinates = geolon_t geolat_t ;
        temp:long_name = Potential temperature ;
        temp:standard_name = sea_water_potential_temperature ;
        temp:time_avg_info = average_T1,average_T2,average_DT ;
        temp:units = degrees C ;
        temp:valid_range = [-10.0, 500.0] ;
    object time(time) ;
        time:bounds = time_bounds ;
        time:calendar_type = JULIAN ;
        time:cartesian_axis = T ;
        time:long_name = time ;
    int64 time_bounds(time, nv) ;
        time_bounds:calendar = JULIAN ;
        time_bounds:long_name = time axis boundaries ;
        time_bounds:units = nanoseconds ;
    float32 u(time, st_ocean, yu_ocean, xu_ocean) ;
        u:cell_methods = time: mean ;
        u:coordinates = geolon_c geolat_c ;
        u:long_name = i-current ;
        u:standard_name = sea_water_x_velocity ;
        u:time_avg_info = average_T1,average_T2,average_DT ;
        u:units = m/sec ;
        u:valid_range = [-10.0, 10.0] ;
    float32 v(time, st_ocean, yu_ocean, xu_ocean) ;
        v:cell_methods = time: mean ;
        v:coordinates = geolon_c geolat_c ;
        v:long_name = j-current ;
        v:standard_name = sea_water_y_velocity ;
        v:time_avg_info = average_T1,average_T2,average_DT ;
        v:units = m/sec ;
        v:valid_range = [-10.0, 10.0] ;
    float32 wt(time, sw_ocean, yt_ocean, xt_ocean) ;
        wt:cell_methods = time: mean ;
        wt:coordinates = geolon_t geolat_t ;
        wt:long_name = dia-surface velocity T-points ;
        wt:time_avg_info = average_T1,average_T2,average_DT ;
        wt:units = m/sec ;
        wt:valid_range = [-100000.0, 100000.0] ;
    float64 xt_ocean(xt_ocean) ;
        xt_ocean:cartesian_axis = X ;
        xt_ocean:long_name = tcell longitude ;
        xt_ocean:units = degrees_E ;
    float64 xu_ocean(xu_ocean) ;
        xu_ocean:cartesian_axis = X ;
        xu_ocean:long_name = ucell longitude ;
        xu_ocean:units = degrees_E ;
    float64 yt_ocean(yt_ocean) ;
        yt_ocean:cartesian_axis = Y ;
        yt_ocean:long_name = tcell latitude ;
        yt_ocean:units = degrees_N ;
    float64 yu_ocean(yu_ocean) ;
        yu_ocean:cartesian_axis = Y ;
        yu_ocean:long_name = ucell latitude ;
        yu_ocean:units = degrees_N ;

// global attributes:
    :filename = soda3.4.2_5dy_ocean_or_1993_01_04.nc ;
    :grid_tile = 1 ;
    :grid_type = mosaic ;
    :title = MOM5_SODA_3.4.2 ;
}
shanicetbailey commented 3 years ago

Thank you so much @rabernat !

shanicetbailey commented 3 years ago

I think I will also need the transport terms no? It would be easier, as you've mentioned, to use SODA's transport variables rather than manually compute those terms ourselves. Here is the link to the 10-day transport terms: https://dsrs.atmos.umd.edu/DATA/soda3.4.2/ORIGINAL/transport/. Each file is 600MB.

Do you think we'll need the ice variables? If so, the link is in the original post but here also for your convenience: wget -r -l1 --no-parent --progress=bar -nd -A 'soda3.4.2_5dyice*.nc' https://dsrs.atmos.umd.edu/DATA/soda3.4.2/ORIGINAL/ice/

rabernat commented 3 years ago

Ah ok, I will make a recipe for the transport terms as well.

shanicetbailey commented 3 years ago
rabernat commented 3 years ago

The SSH files are evidently incomplete. There are only 2 files from 2017. Is this a known issue?

rabernat commented 3 years ago

The transport data is now at gs://pangeo-forge-us-central1/pangeo-forge/soda/soda3.4.2_10dy_transport_or.

Do we need ICE and SSH for now? Or can you move forward without them?

rabernat commented 3 years ago

For reference, the transport recipe looks like this

import pandas as pd
from pangeo_forge.recipe import NetCDFtoZarrSequentialRecipe

url_base = 'https://dsrs.atmos.umd.edu/DATA/soda3.4.2/ORIGINAL/'

dates = pd.date_range(start='1993-01-07', end='2019-12-17', freq='10D')
date_string = [d.strftime('%Y_%m_%d') for d in dates]

variable = "transport"
url_pattern = "https://dsrs.atmos.umd.edu/DATA/soda3.4.2/ORIGINAL/{variable}/soda3.4.2_10dy_{variable}_or_{date}.nc"
urls = [url_pattern.format(variable=variable, date=dstr) for dstr in date_string]

recipe = NetCDFtoZarrSequentialRecipe(
    input_urls=urls,
    sequence_dim="time",
    inputs_per_chunk=1,
    cache_inputs=True,
)
shanicetbailey commented 3 years ago

The SSH files are evidently incomplete. There are only 2 files from 2017. Is this a known issue?

This is all that's mentioned on the download page: (Note: we had a data loss event and we only have the soda3.4.2 ocean files on the original grid for 1/30/1991-5/20/1991 and beginning 12/5/1992, and for sea ice beginning 01/05/1993. We have all the regridded data.)

shanicetbailey commented 3 years ago

The transport data is now at gs://pangeo-forge-us-central1/pangeo-forge/soda/soda3.4.2_10dy_transport_or.

Do we need ICE and SSH for now? Or can you move forward without them?

I think I can move on without the other files. Thank you!

rabernat commented 3 years ago

Based on the conversation with @StephenGriffies, it sounds like we are missing the key pme_river field, which is needed for surface E-P-R. Shancie, can you reach out to the SODA folks and ask about this? Cc me and Steve. Thanks.

shanicetbailey commented 3 years ago

The transport data is now at gs://pangeo-forge-us-central1/pangeo-forge/soda/soda3.4.2_10dy_transport_or.

Do we need ICE and SSH for now? Or can you move forward without them?

Hi @rabernat , I am going to need the ICE files. When you can could you also create a recipe for this please?

ICE files: https://dsrs.atmos.umd.edu/DATA/soda3.4.2/ORIGINAL/ice/ (5-day, 77MB)

rabernat commented 3 years ago

Will ask @cisaacstern to help with this. Should be straightforward.

For reference, these SODA recipes were the very first ones I ran at scale.

cisaacstern commented 3 years ago

@stb2145, happy to help. Do you have a hard deadline for when you need the ICE zarr store completed? Want to make sure I prioritize appropriately.

shanicetbailey commented 3 years ago

Thanks @cisaacstern! I do not have a hard deadline, but would appreciate it if you could complete this in the next couple of weeks :)