ocean-transport / lcs-ml

Lagrangian Coherent Structure identification for machine learning
1 stars 0 forks source link

Determine standards for saving ensemble #35

Open hscannell opened 2 years ago

hscannell commented 2 years ago

The ensemble will adapt the standards used in pyqg to stay consistent.

  1. Pickup file generated from the spin up simulation. Contains everything available from the model.to_dataset() method in pyqg.

  2. Each member in the ensemble will contain daily snapshots of 4 upper layer properties:

The dimensions of all 4 variables are (time, y0, x0).

Other things to consider:

hscannell commented 2 years ago

I've copied meta data and attributes from the pyqg pickup file. I've also included both long_name and units for the new variable (strain, vort, x, and y). The dimensions of all data variables are expanded to include ensemble member.

https://github.com/ocean-transport/lcs-ml/blob/32815173cdddb45fda8258267bcef967072a43f9/ensemble_particle_generator/ensemble_generator.py#L94-L110

A single zarr store will be created for each ensemble member. This gets appended to at daily increments during run_with_snapshots.

https://github.com/ocean-transport/lcs-ml/blob/32815173cdddb45fda8258267bcef967072a43f9/ensemble_particle_generator/ensemble_generator.py#L113-L119

rabernat commented 2 years ago

Thanks for this great work Hillary!

A few comments:

Could you call print(ds) and print(ds.info()) on one of the datasets and share the full output here on this issue?

hscannell commented 2 years ago
ds.info()
xarray.Dataset {
dimensions:
    n = 1 ;
    time = 77 ;
    y0 = 1024 ;
    x0 = 1024 ;

variables:
    int64 n(n) ;
        n:long_name = ensemble member ;
        n:units = none ;
    float64 strain(n, time, y0, x0) ;
        strain:long_name = particle strain magnitude ;
        strain:units = unitless ;
    timedelta64[ns] time(time) ;
        time:long_name = model time ;
    float64 vort(n, time, y0, x0) ;
        vort:long_name = particle relative vorticity ;
        vort:units = second -1 ;
    float64 x(n, time, y0, x0) ;
        x:long_name = particle position in the x direction ;
        x:units = grid point ;
    float64 x0(x0) ;
        x0:long_name = real space grid points in the x direction ;
        x0:units = grid point ;
    float64 y(n, time, y0, x0) ;
        y:long_name = particle position in the y direction ;
        y:units = grid point ;
    float64 y0(y0) ;
        y0:long_name = real space grid points in the y direction ;
        y0:units = grid point ;

// global attributes:
    :pyqg:L = 1200000 ;
    :pyqg:M = 262144 ;
    :pyqg:W = 1200000 ;
    :pyqg:beta = 1.3e-11 ;
    :pyqg:del2 = 0.8 ;
    :pyqg:delta = 0.25 ;
    :pyqg:dt = 600.0 ;
    :pyqg:filterfac = 23.6 ;
    :pyqg:nk = 257 ;
    :pyqg:nl = 512 ;
    :pyqg:ntd = 6 ;
    :pyqg:nx = 512 ;
    :pyqg:ny = 512 ;
    :pyqg:nz = 2 ;
    :pyqg:rd = 15000 ;
    :pyqg:rek = 5.787037037037037e-07 ;
    :pyqg:taveint = 86400.0 ;
    :pyqg:tavestart = 86400 ;
    :pyqg:tc = 141120 ;
    :pyqg:tmax = 311040000 ;
    :pyqg:twrite = 50000 ;
    :reference = https://pyqg.readthedocs.io/en/latest/index.html ;
    :title = pyqg: Python Quasigeostrophic Model ;
}
print(ds)

Screen Shot 2021-12-01 at 1 03 16 PM

hscannell commented 2 years ago

I'll fix the units for x,y,x0,y0 and remove the units=none.

Do we leave time as timedelta64[ns] and change units to nanoseconds?? Time prints out as nanoseconds even if I save timedelta64[D].

print(ds.time.values[0])
numpy.timedelta64(86400000000000,'ns')
rabernat commented 2 years ago

Thanks!

Strain is not unitless. Its units are the same as vorticity.

Don't worry about the timedelta64[ns] stuff. It's fine. That's just how xarray decodes all time variables. You can turn it off by opening the file with decode_times=False.

hscannell commented 2 years ago

Oh cool, I didn't realize that decode_times=False essentially assigns date time units depending on the input.

rabernat commented 2 years ago

When xarray opens any dataset, by default it does decoding of CF attributes. So the values you see in the opened dataset are not necessarily the same ones you see on disk.

If you do decode_times=False, you bypass this decoding and just get the raw values that are stored on disk. Note that your decoded time variable has no units, but the encoded one does. Look at ds.time.encoding on the decoded dataset to see these hidden parameters.

hscannell commented 2 years ago

Very cool, thanks! I see that units show up on the decoded dataset below, but not on the dataset with raw values.

ds = xr.open_zarr('/burg/abernathey/users/hillary/lcs/pyqg_ensemble/001.zarr', decode_times=False)
print(ds.time.encoding)

{'chunks': (1,), 'preferred_chunks': {'time': 1}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, 'dtype': dtype('int64')}

ds = xr.open_zarr('/burg/abernathey/users/hillary/lcs/pyqg_ensemble/001.zarr', decode_times=True)
print(ds.time.encoding)

{'chunks': (1,), 'preferred_chunks': {'time': 1}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, 'units': 'days', 'dtype': dtype('int64')}

hscannell commented 2 years ago

The output of the ensemble now includes the stream function p(n, time, y, x), where x and y are model grid points (length 512).

https://github.com/ocean-transport/lcs-ml/blob/5580d9e631ddd902fc381afb249a299695e9dfab/ensemble_particle_generator/ensemble_generator.py#L83-L84

I see that p was added to pyqg/xarray_output.py in this PR 🎉 , but because I never call m.to_dataset() here, I have to manually add it.

I renamed the x and y particle positions xpos(n, time, y0, x0) and ypos(n, time, y0, x0) respectively, where y0 and x0 are the particle grid points (length 1024). Does this make sense?

xarray.Dataset {
dimensions:
    n = 1 ;
    time = 1 ;
    y = 512 ;
    x = 512 ;
    y0 = 1024 ;
    x0 = 1024 ;

variables:
    int64 n(n) ;
        n:long_name = ensemble member ;
    float64 p(n, time, y, x) ;
        p:long_name = streamfunction in real space ;
        p:units = meters squared second ^-1 ;
    float64 strain(n, time, y0, x0) ;
        strain:long_name = particle strain magnitude ;
        strain:units = second ^-1 ;
    timedelta64[ns] time(time) ;
        time:long_name = model time ;
    float64 vort(n, time, y0, x0) ;
        vort:long_name = particle relative vorticity ;
        vort:units = second ^-1 ;
    float64 x(x) ;
        x:long_name = model grid points in the x direction ;
        x:units = meters ;
    float64 x0(x0) ;
        x0:long_name = particle grid points in the x direction ;
        x0:units = meters ;
    float64 xpos(n, time, y0, x0) ;
        xpos:long_name = particle position in the x direction ;
        xpos:units = meters ;
    float64 y(y) ;
        y:long_name = model grid points in the y direction ;
        y:units = meters ;
    float64 y0(y0) ;
        y0:long_name = particle grid points in the y direction ;
        y0:units = meters ;
    float64 ypos(n, time, y0, x0) ;
        ypos:long_name = particle position in the y direction ;
        ypos:units = meters ;

// global attributes:
    :pyqg:L = 1200000 ;
    :pyqg:M = 262144 ;
    :pyqg:W = 1200000 ;
    :pyqg:beta = 1.3e-11 ;
    :pyqg:del2 = 0.8 ;
    :pyqg:delta = 0.25 ;
    :pyqg:dt = 600.0 ;
    :pyqg:filterfac = 23.6 ;
    :pyqg:nk = 257 ;
    :pyqg:nl = 512 ;
    :pyqg:ntd = 6 ;
    :pyqg:nx = 512 ;
    :pyqg:ny = 512 ;
    :pyqg:nz = 2 ;
    :pyqg:rd = 15000 ;
    :pyqg:rek = 5.787037037037037e-07 ;
    :pyqg:taveint = 86400 ;
    :pyqg:tavestart = 86400 ;
    :pyqg:tc = 777600 ;
    :pyqg:tmax = 466560000 ;
    :pyqg:twrite = 50000 ;
    :reference = https://pyqg.readthedocs.io/en/latest/index.html ;
    :title = pyqg: Python Quasigeostrophic Model ;
}
hscannell commented 2 years ago

Screen Shot 2021-12-03 at 3 40 54 PM

Should we call vorticity and strain something else besides vort and strain? Like q and s? Although there is already a q in pyqg, so that might get confusing.