"Preprocessing" Outline

jbusecke commented 2 months ago

I am trying to understand, on an abstract level, all the steps that are needed/optional for the preprocessing.

With preprocessing I am assuming that we will take an arbitrary GCM output, run it through a bunch of steps and write out a copy of it as a handoff point for the ML side of things.

What I have gathered so far (please chime in with corrections, additions, etc. @adam-subel @LaureZanna @suryadheeshjith @IamShubhamGupto):

- Check that all 'metrics' needed are present, or reconstruct them:
  - cell area (for tracer, u, and v cells)
  - (wet) masks (again for tracer and u/v cells)
- Derive additional variables on the native grid (e.g. advective tracer fluxes example)
- Regrid (both vertically and horizontally) all desired variables onto a 'standard' grid
  - ? This should probably always preserve extensive quantities like heat and advective fluxes?
  - ? Should properties on the cell edges always be converted onto a unified grid point (tracer cell)?
  - ? Do we need to construct new masks/metrics for the regridded dataset? It seems that is done here at least for the wet mask.
- Optional: Convert variables into a dimension (to pull only a single chunk per time step)
- Save them out to a suitable location

At the end of this process, regardless of input, we will then have a dataset with known dimensions [x y (z) * t] and a known set of variables.
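To make the first step concrete, here is a minimal sketch of the "are all metrics present" check. The metric names (`area_t`, `wet_u`, etc.) are hypothetical placeholders, not names from any particular GCM; the real list would depend on the model output.

```python
# Hypothetical metric names, chosen only for illustration; real GCM output
# will use model-specific names.
REQUIRED_METRICS = {
    "area_t": "cell area on tracer cells",
    "area_u": "cell area on u cells",
    "area_v": "cell area on v cells",
    "wet_t": "wet mask on tracer cells",
    "wet_u": "wet mask on u cells",
    "wet_v": "wet mask on v cells",
}


def missing_metrics(variable_names, required=REQUIRED_METRICS):
    """Return the required metric names that are absent, sorted by name."""
    return sorted(set(required) - set(variable_names))


# Example: a dataset that only ships the tracer-cell metrics.
available = ["thetao", "uo", "vo", "area_t", "wet_t"]
missing = missing_metrics(available)
for name in missing:
    print(f"need to reconstruct {name}: {REQUIRED_METRICS[name]}")
```

Anything reported as missing would have to be reconstructed before the regridding step.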

adam-subel commented 2 months ago

This is a good outline. At the moment, the second bullet is not something we worry about most of the time, since all our data is direct output from the model. The only exception that has come up is wind stress, but most models provide that as an output.

For the regridding, no matter what method we use, our goal is to preserve the area-weighted means. In xesmf that means using conservative regridding; for coarsening, we do area averaging.
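As a rough illustration of the coarsening case, here is a NumPy-only sketch of block averaging with area weights. It is not the project's actual implementation; it assumes a 2D field with no land cells and dimensions that are exact multiples of the coarsening factor, and the function name `coarsen_area_weighted` is made up for this example.

```python
import numpy as np


def coarsen_area_weighted(field, area, factor):
    """Coarsen a 2D field by `factor` along both axes using area weights.

    Assumes both dimensions are exact multiples of `factor` and that every
    cell is valid (no land masking is applied in this sketch).
    Returns the coarse field and the coarse cell areas.
    """
    ny, nx = field.shape
    blocks = (ny // factor, factor, nx // factor, factor)
    # Sum of (value * area) and of area over each factor-by-factor block.
    weighted_sum = (field * area).reshape(blocks).sum(axis=(1, 3))
    coarse_area = area.reshape(blocks).sum(axis=(1, 3))
    return weighted_sum / coarse_area, coarse_area


rng = np.random.default_rng(0)
field = rng.normal(size=(4, 6))
area = rng.uniform(1.0, 2.0, size=(4, 6))

coarse, coarse_area = coarsen_area_weighted(field, area, factor=2)

# The area-weighted mean is the same before and after coarsening.
fine_mean = (field * area).sum() / area.sum()
coarse_mean = (coarse * coarse_area).sum() / coarse_area.sum()
print(np.isclose(fine_mean, coarse_mean))
```

Because each coarse value is the block's area integral divided by the block's area, the global area integral (and hence the area-weighted mean) is unchanged up to floating-point error.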

We collapse everything onto a single point: we are using the location of thetao, which should be on a tracer cell.

The mask we use when regridding is from the regridded thetao field from CESM2. When using coarsen, we define the new wet mask by marking a coarse-grid cell as ocean if it is 50% or more ocean on the high-resolution grid.
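A minimal sketch of the 50% rule for the coarsened wet mask, again in plain NumPy. One assumption to flag: the ocean fraction here is a simple count of fine cells per block, not an area-weighted fraction, and the function name `coarsen_wet_mask` is hypothetical.

```python
import numpy as np


def coarsen_wet_mask(wet_fine, factor, threshold=0.5):
    """Mark a coarse cell as ocean if at least `threshold` of its fine cells are ocean.

    Assumes a 2D boolean mask whose dimensions are exact multiples of
    `factor`. The fraction is an unweighted cell count (an assumption).
    """
    ny, nx = wet_fine.shape
    blocks = wet_fine.astype(float).reshape(
        ny // factor, factor, nx // factor, factor
    )
    ocean_fraction = blocks.mean(axis=(1, 3))
    return ocean_fraction >= threshold


wet_fine = np.array(
    [
        [1, 1, 0, 0],
        [1, 0, 0, 0],
        [1, 1, 1, 0],
        [0, 0, 1, 0],
    ],
    dtype=bool,
)

wet_coarse = coarsen_wet_mask(wet_fine, factor=2)
print(wet_coarse)
```

In this toy mask the top-right block is all land and becomes a land cell, while the blocks that are exactly half ocean are kept as ocean because the threshold is inclusive.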

The final variable shape should be [t (z) x y]. In practice, though, the z dimension should collapse into the channel dimension, as the final data being passed to the model would be [t c x y], where c includes both different depths and different variables.
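Here is one way the collapse into a channel dimension could look, as a NumPy sketch. The helper `stack_channels`, the channel labels, and the variables other than thetao are illustrative assumptions rather than the project's actual layout.

```python
import numpy as np


def stack_channels(variables):
    """Stack variables of shape [t, z, x, y] or [t, x, y] into [t, c, x, y].

    Each depth level of each variable becomes one channel. Returns the
    stacked array and a list of channel labels in channel order.
    """
    channels, labels = [], []
    for name, data in variables.items():
        if data.ndim == 3:
            # Surface-only variable: add a length-1 depth axis.
            data = data[:, np.newaxis]
        labels.extend(f"{name}_lev{k}" for k in range(data.shape[1]))
        channels.append(data)
    return np.concatenate(channels, axis=1), labels


nt, nz, nx, ny = 5, 3, 4, 6
variables = {
    "thetao": np.zeros((nt, nz, nx, ny)),  # 3D variable
    "so": np.ones((nt, nz, nx, ny)),  # second 3D variable (illustrative)
    "zos": np.full((nt, nx, ny), 2.0),  # surface variable (illustrative)
}

stacked, labels = stack_channels(variables)
print(stacked.shape, labels)
```

Keeping the label list alongside the array makes it possible to map each channel back to its variable and depth level later on.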

jbusecke commented 3 weeks ago

See #12 for an updated outline based on our most recent meeting.

m2lines / ocean_emulators

"Preprocessing" Outline #10