m2lines / ocean_emulators

https://m2lines.github.io/ocean_emulators/
Apache License 2.0

"Preprocessing" Outline #10

Open jbusecke opened 2 months ago

jbusecke commented 2 months ago

I am trying to understand, on an abstract level, all the steps that are needed/optional for the preprocessing.

By preprocessing, I am assuming that we take arbitrary GCM output, run it through a series of steps, and write out a copy as a handoff point for the ML side of things.

What I have gathered so far (please chime in with corrections, additions, etc @adam-subel @LaureZanna @suryadheeshjith @IamShubhamGupto ):

The goal is that at the end of this process, regardless of input, we will have a dataset of known dimensions [x y (z) * t] and a known set of variables.

adam-subel commented 2 months ago

This is a good outline. At the moment, the second bullet is not something we worry about most of the time, since all our data is direct output from the model. The only exception that has come up is wind stress, but most models have that as an output.

For the regridding, no matter what method we use, our goal is to preserve the area-weighted means. In xesmf that means using conservative regridding; for coarsening we do area averaging.
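To make the coarsening case concrete, here is a minimal NumPy sketch of area-weighted block averaging. It is an illustration under stated assumptions, not the code used in this repository: the function name `coarsen_area_weighted`, the `[y, x]` array layout, and the requirement that the grid divides evenly by the coarsening factor are all hypothetical choices made for the example.

```python
import numpy as np


def coarsen_area_weighted(field, area, wet, factor):
    """Coarsen a 2D [y, x] field by `factor` using area-weighted means.

    Hypothetical helper for illustration. Land cells (wet == False) get
    zero weight, so they do not contribute to the coarse-cell mean.
    Returns the coarse field and the wet area of each coarse cell.
    """
    ny, nx = field.shape
    if ny % factor or nx % factor:
        raise ValueError("grid shape must be divisible by the coarsening factor")

    weights = area * wet  # zero weight on land
    # Replace land values (often NaN) so they do not poison the block sums
    filled = np.where(wet > 0, field, 0.0)

    def block_sum(a):
        # Sum over non-overlapping factor x factor blocks
        return a.reshape(ny // factor, factor, nx // factor, factor).sum(axis=(1, 3))

    num = block_sum(filled * weights)
    den = block_sum(weights)
    # Coarse cells with no ocean at all become NaN
    coarse = np.where(den > 0, num / np.where(den > 0, den, 1.0), np.nan)
    return coarse, den


# Small synthetic example: random field, non-uniform cell areas, two land cells
rng = np.random.default_rng(0)
field = rng.normal(size=(4, 6))
area = rng.uniform(1.0, 2.0, size=(4, 6))
wet = np.ones((4, 6), dtype=bool)
wet[0, :2] = False
field[~wet] = np.nan

coarse, coarse_area = coarsen_area_weighted(field, area, wet, 2)

# The global area-weighted mean over ocean is the same before and after
fine_mean = np.nansum(np.where(wet, field, 0.0) * area) / (area * wet).sum()
coarse_mean = np.nansum(coarse * coarse_area) / coarse_area.sum()
```

Because each coarse value is weighted by the wet area it represents, the global area-weighted mean is preserved, which is the property conservative regridding also aims for.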

We collapse everything onto a single point: we use the location of thetao, which should be on a tracer cell.

The mask we use when regridding is from the regridded thetao field from CESM2. When using coarsen, we define the new wet mask by marking as ocean every coarse-grid cell that is 50% or more ocean on the high-resolution grid.
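The 50% rule for the coarsened wet mask can be sketched as follows. This is only an illustration: the helper name `coarsen_wet_mask` is hypothetical, and the ocean fraction here is computed by cell count within each block, which is an assumption (an area-weighted fraction would be an equally plausible reading of the rule).

```python
import numpy as np


def coarsen_wet_mask(wet_hi, factor, threshold=0.5):
    """Return a coarse wet mask from a high-resolution boolean [y, x] mask.

    Hypothetical helper: a coarse cell is ocean when at least `threshold`
    of its high-resolution cells are ocean (fraction by cell count, which
    is an assumption made for this sketch).
    """
    ny, nx = wet_hi.shape
    if ny % factor or nx % factor:
        raise ValueError("grid shape must be divisible by the coarsening factor")
    blocks = wet_hi.astype(float).reshape(ny // factor, factor, nx // factor, factor)
    ocean_fraction = blocks.mean(axis=(1, 3))
    return ocean_fraction >= threshold


wet_hi = np.array(
    [
        [1, 1, 0, 0],
        [1, 0, 0, 0],
        [1, 1, 1, 0],
        [1, 1, 0, 1],
    ],
    dtype=bool,
)
wet_coarse = coarsen_wet_mask(wet_hi, factor=2)
# Block ocean fractions are 0.75, 0.0, 1.0 and 0.5, so only the
# top-right coarse cell is land.
```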

The final variable shape should be [t (z) x y]. In practice, though, the z dimension should collapse into the channel dimension, as the final data passed to the model would be [t c x y], where c includes both different depths and different variables.
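As a rough sketch of how [t (z) x y] variables could be folded into [t c x y], the snippet below stacks depth levels and variables along one channel axis. The helper name `stack_channels`, the channel naming scheme, and the use of `zos` as an example 2D variable are assumptions for illustration, not part of the existing pipeline.

```python
import numpy as np


def stack_channels(variables):
    """Stack variables of shape [t, x, y] or [t, z, x, y] into [t, c, x, y].

    Hypothetical helper: each depth level of a 3D variable becomes its own
    channel, and each 2D variable contributes a single channel. Also returns
    a list of channel names in the order they were stacked.
    """
    channels, names = [], []
    for name, data in variables.items():
        if data.ndim == 3:
            # 2D variable: add a singleton channel axis
            channels.append(data[:, np.newaxis])
            names.append(name)
        elif data.ndim == 4:
            # 3D variable: the z axis is already in the channel position
            channels.append(data)
            names.extend(f"{name}_lev{k}" for k in range(data.shape[1]))
        else:
            raise ValueError(f"{name}: expected 3 or 4 dimensions, got {data.ndim}")
    return np.concatenate(channels, axis=1), names


rng = np.random.default_rng(0)
thetao = rng.normal(size=(5, 3, 4, 6))  # [t, z, x, y]
zos = rng.normal(size=(5, 4, 6))        # [t, x, y], illustrative 2D variable

stacked, channel_names = stack_channels({"thetao": thetao, "zos": zos})
```

Keeping the list of channel names alongside the array makes it possible to map model outputs back to a specific variable and depth level later.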

jbusecke commented 3 weeks ago

See #12 for an updated outline based on our most recent meeting.