Open shanicetbailey opened 3 years ago
Just noting that each file is 1.8 GB.
I have a functional recipe as follows:
import pandas as pd
from pangeo_forge.recipe import NetCDFtoZarrSequentialRecipe
url_base = 'https://dsrs.atmos.umd.edu/DATA/soda3.4.2/ORIGINAL/ocean/'
dates = pd.date_range(start='1991-01-30', end='2019-12-19', freq='5D')
date_string = [d.strftime('%Y_%m_%d') for d in dates]
urls = [url_base + f'soda3.4.2_5dy_ocean_or_{dstring}.nc' for dstring in date_string]
recipe = NetCDFtoZarrSequentialRecipe(
input_urls=urls,
sequence_dim="time",
inputs_per_chunk=1
)
I tried to run the recipe and found that considerable data is missing from 1992. These are the only 1992 files:
| soda3.4.2_5dy_ocean_or_1992_12_05.nc | 2017-09-23 03:03 | 1.8G |
| soda3.4.2_5dy_ocean_or_1992_12_10.nc | 2017-09-23 03:09 | 1.8G |
| soda3.4.2_5dy_ocean_or_1992_12_15.nc | 2017-09-23 03:16 | 1.8G |
| soda3.4.2_5dy_ocean_or_1992_12_20.nc | 2017-09-23 03:23 | 1.8G |
| soda3.4.2_5dy_ocean_or_1992_12_25.nc | 2017-09-23 03:29 | 1.8G |
| soda3.4.2_5dy_ocean_or_1992_12_30.nc | 2017-09-23 03:36 | 1.8G |
@stb2145 - could you bring this to the attention of the dataset owner?
In the meantime, I will restart the recipe with a start date if 1993-01-04
p.s. Pangeo forge is curently downloading the data 😄
Will take a few days.
Something they noted on the download page is:
Note: we had a data loss event and we only have the soda3.4.2 ocean files on the original grid for 1/30/1991-5/20/1991 and beginning 12/5/1992, and for sea ice beginning 01/05/1993. We have all the regridded data.
Please upload the topography map as well (1.5GB), thank you!
I almost had this recipe done! But then I accidentally deleted nearly 5 TB of SODA data (see https://github.com/dask/gcsfs/issues/364). 😭 😭 😭
I am starting again, but it will take a few days to re-download all the netCDF files.
Oh no! 😓 I'm sorry that happened, but thank you for making this a priority!
The SODA dataset is mostly all there. The recipe process uncovered two corrupted files:
Reading these results in HDF5 errors. Consequently, there is missing data on those two days.
The dataset is accessible now on Pangeo Cloud.
import gcsfs
import xarray as xr
url = 'gs://pangeo-forge-us-central1/pangeo-forge/soda/soda3.4.2_5dy_ocean_or'
fs = gcsfs.GCSFileSystem()
ds = xr.open_zarr(fs.get_mapper(url), consolidated=True)
ds.info()
xarray.Dataset {
dimensions:
nv = 2 ;
st_edges_ocean = 51 ;
st_ocean = 50 ;
sw_edges_ocean = 51 ;
sw_ocean = 50 ;
time = 1970 ;
xt_ocean = 1440 ;
xu_ocean = 1440 ;
yt_ocean = 1070 ;
yu_ocean = 1070 ;
variables:
float32 anompb(time, yt_ocean, xt_ocean) ;
anompb:cell_methods = time: mean ;
anompb:coordinates = geolon_t geolat_t ;
anompb:long_name = T-cell bottom pressure - rho0*grav*ht ;
anompb:time_avg_info = average_T1,average_T2,average_DT ;
anompb:units = dbar ;
anompb:valid_range = [-1000000.0, 1000000.0] ;
timedelta64[ns] average_DT(time) ;
average_DT:long_name = Length of average period ;
datetime64[ns] average_T1(time) ;
average_T1:long_name = Start time for average period ;
datetime64[ns] average_T2(time) ;
average_T2:long_name = End time for average period ;
float32 evap_heat(time, yt_ocean, xt_ocean) ;
evap_heat:cell_methods = time: mean ;
evap_heat:coordinates = geolon_t geolat_t ;
evap_heat:long_name = latent heat flux into ocean (<0 cools ocean) ;
evap_heat:standard_name = surface_downward_latent_heat_flux ;
evap_heat:time_avg_info = average_T1,average_T2,average_DT ;
evap_heat:units = W/m^2 ;
evap_heat:valid_range = [-10000000000.0, 10000000000.0] ;
float32 hflux_cpl(time, yt_ocean, xt_ocean) ;
hflux_cpl:cell_methods = time: mean ;
hflux_cpl:coordinates = geolon_t geolat_t ;
hflux_cpl:long_name = surface heat flux coming through coupler ;
hflux_cpl:time_avg_info = average_T1,average_T2,average_DT ;
hflux_cpl:units = Watts/m^2 ;
hflux_cpl:valid_range = [-10000.0, 10000.0] ;
float32 hflux_pme(time, yt_ocean, xt_ocean) ;
hflux_pme:cell_methods = time: mean ;
hflux_pme:coordinates = geolon_t geolat_t ;
hflux_pme:long_name = heat flux (relative to 0C) from pme transfer of water across ocean surface ;
hflux_pme:time_avg_info = average_T1,average_T2,average_DT ;
hflux_pme:units = Watts/m^2 ;
hflux_pme:valid_range = [-10000.0, 10000.0] ;
float32 hflux_total(time, yt_ocean, xt_ocean) ;
hflux_total:cell_methods = time: mean ;
hflux_total:coordinates = geolon_t geolat_t ;
hflux_total:long_name = surface heat flux from coupler plus restore (omits mass transfer heating) ;
hflux_total:time_avg_info = average_T1,average_T2,average_DT ;
hflux_total:units = Watts/m^2 ;
hflux_total:valid_range = [-10000.0, 10000.0] ;
float32 lw_heat(time, yt_ocean, xt_ocean) ;
lw_heat:cell_methods = time: mean ;
lw_heat:coordinates = geolon_t geolat_t ;
lw_heat:long_name = longwave flux into ocean (<0 cools ocean) ;
lw_heat:standard_name = surface_net_downward_longwave_flux ;
lw_heat:time_avg_info = average_T1,average_T2,average_DT ;
lw_heat:units = W/m^2 ;
lw_heat:valid_range = [-10000000000.0, 10000000000.0] ;
float32 mlp(time, yt_ocean, xt_ocean) ;
mlp:cell_methods = time: mean ;
mlp:coordinates = geolon_t geolat_t ;
mlp:long_name = Depth of potential density mixed layer ;
mlp:time_avg_info = average_T1,average_T2,average_DT ;
mlp:units = m ;
mlp:valid_range = [-1000000.0, 1000000.0] ;
float32 mls(time, yt_ocean, xt_ocean) ;
mls:cell_methods = time: mean ;
mls:coordinates = geolon_t geolat_t ;
mls:long_name = mixed layer depth determined by salinity criteria ;
mls:time_avg_info = average_T1,average_T2,average_DT ;
mls:units = m ;
mls:valid_range = [0.0, 1000000.0] ;
float32 mlt(time, yt_ocean, xt_ocean) ;
mlt:cell_methods = time: mean ;
mlt:coordinates = geolon_t geolat_t ;
mlt:long_name = mixed layer depth determined by temperature criteria ;
mlt:time_avg_info = average_T1,average_T2,average_DT ;
mlt:units = m ;
mlt:valid_range = [0.0, 1000000.0] ;
float32 net_heating(time, yt_ocean, xt_ocean) ;
net_heating:cell_methods = time: mean ;
net_heating:coordinates = geolon_t geolat_t ;
net_heating:long_name = surface ocean heat flux coming through coupler and mass transfer ;
net_heating:time_avg_info = average_T1,average_T2,average_DT ;
net_heating:units = Watts/m^2 ;
net_heating:valid_range = [-10000.0, 10000.0] ;
float64 nv(nv) ;
nv:cartesian_axis = N ;
nv:long_name = vertex number ;
nv:units = none ;
float32 prho(time, st_ocean, yt_ocean, xt_ocean) ;
prho:cell_methods = time: mean ;
prho:coordinates = geolon_t geolat_t ;
prho:long_name = potential density referenced to 0 dbar ;
prho:standard_name = sea_water_potential_density ;
prho:time_avg_info = average_T1,average_T2,average_DT ;
prho:units = kg/m^3 ;
prho:valid_range = [-10.0, 100000.0] ;
float32 salt(time, st_ocean, yt_ocean, xt_ocean) ;
salt:cell_methods = time: mean ;
salt:coordinates = geolon_t geolat_t ;
salt:long_name = Practical Salinity ;
salt:standard_name = sea_water_salinity ;
salt:time_avg_info = average_T1,average_T2,average_DT ;
salt:units = psu ;
salt:valid_range = [-10.0, 100.0] ;
float32 salt_flux_rstr(time, yt_ocean, xt_ocean) ;
salt_flux_rstr:cell_methods = time: mean ;
salt_flux_rstr:coordinates = geolon_t geolat_t ;
salt_flux_rstr:long_name = sfc_salt_flux_restore: flux from restoring term ;
salt_flux_rstr:time_avg_info = average_T1,average_T2,average_DT ;
salt_flux_rstr:units = kg/(m^2*sec) ;
salt_flux_rstr:valid_range = [-10000.0, 10000.0] ;
float32 salt_flux_total(time, yt_ocean, xt_ocean) ;
salt_flux_total:cell_methods = time: mean ;
salt_flux_total:coordinates = geolon_t geolat_t ;
salt_flux_total:long_name = sfc_salt_flux_total ;
salt_flux_total:time_avg_info = average_T1,average_T2,average_DT ;
salt_flux_total:units = kg/(m^2*sec) ;
salt_flux_total:valid_range = [-10000.0, 10000.0] ;
float32 sbd(time, yt_ocean, xt_ocean) ;
sbd:cell_methods = time: mean ;
sbd:coordinates = geolon_t geolat_t ;
sbd:long_name = rate of mass transferred below the mixed layer base ;
sbd:time_avg_info = average_T1,average_T2,average_DT ;
sbd:units = kg/sec ;
sbd:valid_range = [-1.0000000200408773e+20, 1.0000000200408773e+20] ;
float32 sbd_dhdt(time, yt_ocean, xt_ocean) ;
sbd_dhdt:cell_methods = time: mean ;
sbd_dhdt:coordinates = geolon_t geolat_t ;
sbd_dhdt:long_name = rate of mass transferred below the mixed layer base due to dh/dt ;
sbd_dhdt:time_avg_info = average_T1,average_T2,average_DT ;
sbd_dhdt:units = kg/sec ;
sbd_dhdt:valid_range = [-1.0000000200408773e+20, 1.0000000200408773e+20] ;
float32 sbd_horz(time, yt_ocean, xt_ocean) ;
sbd_horz:cell_methods = time: mean ;
sbd_horz:coordinates = geolon_t geolat_t ;
sbd_horz:long_name = rate of mass transferred below the mixed layer base due to horz advect ;
sbd_horz:time_avg_info = average_T1,average_T2,average_DT ;
sbd_horz:units = kg/sec ;
sbd_horz:valid_range = [-1.0000000200408773e+20, 1.0000000200408773e+20] ;
float32 sbd_mld(time, yt_ocean, xt_ocean) ;
sbd_mld:cell_methods = time: mean ;
sbd_mld:coordinates = geolon_t geolat_t ;
sbd_mld:long_name = mixed layer depth used for subduction diagnostics ;
sbd_mld:time_avg_info = average_T1,average_T2,average_DT ;
sbd_mld:units = m ;
sbd_mld:valid_range = [-100.0, 1.0000000200408773e+20] ;
float32 sbd_vert(time, yt_ocean, xt_ocean) ;
sbd_vert:cell_methods = time: mean ;
sbd_vert:coordinates = geolon_t geolat_t ;
sbd_vert:long_name = rate of mass transferred below the mixed layer base due to vert advect ;
sbd_vert:time_avg_info = average_T1,average_T2,average_DT ;
sbd_vert:units = kg/sec ;
sbd_vert:valid_range = [-1.0000000200408773e+20, 1.0000000200408773e+20] ;
float32 sens_heat(time, yt_ocean, xt_ocean) ;
sens_heat:cell_methods = time: mean ;
sens_heat:coordinates = geolon_t geolat_t ;
sens_heat:long_name = sensible heat into ocean (<0 cools ocean) ;
sens_heat:standard_name = surface_downward_sensible_heat_flux ;
sens_heat:time_avg_info = average_T1,average_T2,average_DT ;
sens_heat:units = W/m^2 ;
sens_heat:valid_range = [-10000000000.0, 10000000000.0] ;
float32 ssh(time, yt_ocean, xt_ocean) ;
ssh:cell_methods = time: mean ;
ssh:coordinates = geolon_t geolat_t ;
ssh:long_name = effective sea level (eta_t + patm/(rho0*g)) on T cells ;
ssh:standard_name = sea_surface_height_above_geoid ;
ssh:time_avg_info = average_T1,average_T2,average_DT ;
ssh:units = meter ;
ssh:valid_range = [-1000.0, 1000.0] ;
float64 st_edges_ocean(st_edges_ocean) ;
st_edges_ocean:cartesian_axis = Z ;
st_edges_ocean:long_name = tcell zstar depth edges ;
st_edges_ocean:positive = down ;
st_edges_ocean:units = meters ;
float64 st_ocean(st_ocean) ;
st_ocean:cartesian_axis = Z ;
st_ocean:edges = st_edges_ocean ;
st_ocean:long_name = tcell zstar depth ;
st_ocean:positive = down ;
st_ocean:units = meters ;
float64 sw_edges_ocean(sw_edges_ocean) ;
sw_edges_ocean:cartesian_axis = Z ;
sw_edges_ocean:long_name = ucell zstar depth edges ;
sw_edges_ocean:positive = down ;
sw_edges_ocean:units = meters ;
float64 sw_ocean(sw_ocean) ;
sw_ocean:cartesian_axis = Z ;
sw_ocean:edges = sw_edges_ocean ;
sw_ocean:long_name = ucell zstar depth ;
sw_ocean:positive = down ;
sw_ocean:units = meters ;
float32 swflx(time, yt_ocean, xt_ocean) ;
swflx:cell_methods = time: mean ;
swflx:coordinates = geolon_t geolat_t ;
swflx:long_name = shortwave flux into ocean (>0 heats ocean) ;
swflx:standard_name = surface_net_downward_shortwave_flux ;
swflx:time_avg_info = average_T1,average_T2,average_DT ;
swflx:units = W/m^2 ;
swflx:valid_range = [-10000000000.0, 10000000000.0] ;
float32 taux(time, yu_ocean, xu_ocean) ;
taux:cell_methods = time: mean ;
taux:coordinates = geolon_c geolat_c ;
taux:long_name = i-directed wind stress forcing u-velocity ;
taux:standard_name = surface_downward_x_stress ;
taux:time_avg_info = average_T1,average_T2,average_DT ;
taux:units = N/m^2 ;
taux:valid_range = [-10.0, 10.0] ;
float32 tauy(time, yu_ocean, xu_ocean) ;
tauy:cell_methods = time: mean ;
tauy:coordinates = geolon_c geolat_c ;
tauy:long_name = j-directed wind stress forcing v-velocity ;
tauy:standard_name = surface_downward_y_stress ;
tauy:time_avg_info = average_T1,average_T2,average_DT ;
tauy:units = N/m^2 ;
tauy:valid_range = [-10.0, 10.0] ;
float32 temp(time, st_ocean, yt_ocean, xt_ocean) ;
temp:cell_methods = time: mean ;
temp:coordinates = geolon_t geolat_t ;
temp:long_name = Potential temperature ;
temp:standard_name = sea_water_potential_temperature ;
temp:time_avg_info = average_T1,average_T2,average_DT ;
temp:units = degrees C ;
temp:valid_range = [-10.0, 500.0] ;
object time(time) ;
time:bounds = time_bounds ;
time:calendar_type = JULIAN ;
time:cartesian_axis = T ;
time:long_name = time ;
int64 time_bounds(time, nv) ;
time_bounds:calendar = JULIAN ;
time_bounds:long_name = time axis boundaries ;
time_bounds:units = nanoseconds ;
float32 u(time, st_ocean, yu_ocean, xu_ocean) ;
u:cell_methods = time: mean ;
u:coordinates = geolon_c geolat_c ;
u:long_name = i-current ;
u:standard_name = sea_water_x_velocity ;
u:time_avg_info = average_T1,average_T2,average_DT ;
u:units = m/sec ;
u:valid_range = [-10.0, 10.0] ;
float32 v(time, st_ocean, yu_ocean, xu_ocean) ;
v:cell_methods = time: mean ;
v:coordinates = geolon_c geolat_c ;
v:long_name = j-current ;
v:standard_name = sea_water_y_velocity ;
v:time_avg_info = average_T1,average_T2,average_DT ;
v:units = m/sec ;
v:valid_range = [-10.0, 10.0] ;
float32 wt(time, sw_ocean, yt_ocean, xt_ocean) ;
wt:cell_methods = time: mean ;
wt:coordinates = geolon_t geolat_t ;
wt:long_name = dia-surface velocity T-points ;
wt:time_avg_info = average_T1,average_T2,average_DT ;
wt:units = m/sec ;
wt:valid_range = [-100000.0, 100000.0] ;
float64 xt_ocean(xt_ocean) ;
xt_ocean:cartesian_axis = X ;
xt_ocean:long_name = tcell longitude ;
xt_ocean:units = degrees_E ;
float64 xu_ocean(xu_ocean) ;
xu_ocean:cartesian_axis = X ;
xu_ocean:long_name = ucell longitude ;
xu_ocean:units = degrees_E ;
float64 yt_ocean(yt_ocean) ;
yt_ocean:cartesian_axis = Y ;
yt_ocean:long_name = tcell latitude ;
yt_ocean:units = degrees_N ;
float64 yu_ocean(yu_ocean) ;
yu_ocean:cartesian_axis = Y ;
yu_ocean:long_name = ucell latitude ;
yu_ocean:units = degrees_N ;
// global attributes:
:filename = soda3.4.2_5dy_ocean_or_1993_01_04.nc ;
:grid_tile = 1 ;
:grid_type = mosaic ;
:title = MOM5_SODA_3.4.2 ;
}
Thank you so much @rabernat !
I think I will also need the transport terms no? It would be easier, as you've mentioned, to use SODA's transport variables rather than manually compute those terms ourselves. Here is the link to the 10-day transport terms: https://dsrs.atmos.umd.edu/DATA/soda3.4.2/ORIGINAL/transport/. Each file is 600MB.
Do you think we'll need the ice variables? If so, the link is in the original post but here also for your convenience: wget -r -l1 --no-parent --progress=bar -nd -A 'soda3.4.2_5dyice*.nc' https://dsrs.atmos.umd.edu/DATA/soda3.4.2/ORIGINAL/ice/
Ah ok, I will make a recipe for the transport terms as well.
The SSH files are evidently incomplete. There are only 2 files from 2017. Is this a known issue?
The transport data is now at gs://pangeo-forge-us-central1/pangeo-forge/soda/soda3.4.2_10dy_transport_or
.
Do we need ICE and SSH for now? Or can you move forward without them?
For reference, the transport recipe looks like this
import pandas as pd
from pangeo_forge.recipe import NetCDFtoZarrSequentialRecipe
url_base = 'https://dsrs.atmos.umd.edu/DATA/soda3.4.2/ORIGINAL/'
dates = pd.date_range(start='1993-01-07', end='2019-12-17', freq='10D')
date_string = [d.strftime('%Y_%m_%d') for d in dates]
variable = "transport"
url_pattern = "https://dsrs.atmos.umd.edu/DATA/soda3.4.2/ORIGINAL/{variable}/soda3.4.2_10dy_{variable}_or_{date}.nc"
urls = [url_pattern.format(variable=variable, date=dstr) for dstr in date_string]
recipe = NetCDFtoZarrSequentialRecipe(
input_urls=urls,
sequence_dim="time",
inputs_per_chunk=1,
cache_inputs=True,
)
The SSH files are evidently incomplete. There are only 2 files from 2017. Is this a known issue?
This is all that's mentioned on the download page: (Note: we had a data loss event and we only have the soda3.4.2 ocean files on the original grid for 1/30/1991-5/20/1991 and beginning 12/5/1992, and for sea ice beginning 01/05/1993. We have all the regridded data.)
The transport data is now at
gs://pangeo-forge-us-central1/pangeo-forge/soda/soda3.4.2_10dy_transport_or
.Do we need ICE and SSH for now? Or can you move forward without them?
I think I can move on without the other files. Thank you!
Based on the conversation with @StephenGriffies, it sounds like we are missing the key pme_river
field, which is needed for surface E-P-R. Shancie, can you reach out to the SODA folks and ask about this? Cc me and Steve. Thanks.
The transport data is now at
gs://pangeo-forge-us-central1/pangeo-forge/soda/soda3.4.2_10dy_transport_or
.Do we need ICE and SSH for now? Or can you move forward without them?
Hi @rabernat , I am going to need the ICE files. When you can could you also create a recipe for this please?
ICE files: https://dsrs.atmos.umd.edu/DATA/soda3.4.2/ORIGINAL/ice/ (5-day, 77MB)
Will ask @cisaacstern to help with this. Should be straightforward.
For reference, these SODA recipes were the very first ones I ran at scale.
@stb2145, happy to help. Do you have a hard deadline for when you need the ICE zarr store completed? Want to make sure I prioritize appropriately.
Thanks @cisaacstern! I do not have a hard deadline, but would appreciate it if you could complete this in the next couple of weeks :)
SODA 3.4.2 Dataset
SODA version 3.4.2 is an ocean reanalysis that is available on the native grid in netcdf format. The native interlaced horizonal velocity and conserved tracer (e.g. temperature and salinity) grids form a tripolar Arakawa-B grid, varying from 0.1°x0.25° at high latitude to 1/4°x1/4° in the tropics (quasi-isotropic grid spacing increases from ~11.7km at 65 latitude to ~28.0km at the Equator, 1440x1070 grid points). The (1.5Gb sized) topography map (created by Whit Anderson of GFDL) is here.
Transformation / Alignment / Merging
NA
Output Dataset
Output of pipeline would ideally be stored as zarr files for optimal cloud computing performance.