refactor data step script into library (API) and consumer (CLI)

raehik commented 1 year ago

There are some pain points with the current data step.

- The user is not able to select the output path
  - MLflow places artifacts in the working directory, under mlruns. The path includes 2 long random strings.
- The mlflow run CLI is clunky
  - Appears restrictive -- no mutually-exclusive options?
  - The CLI is defined partly with argparse in cmip26.py and partly with MLflow (via MLproject, which gets used by mlflow run), i.e. some positional arguments are upgraded to (required) options in MLflow
- The top-level data step is defined in a Python script, cmip26.py
  - It calls into modules cleanly, but as it stands it's not ready to be packaged up.

This PR largely rewrites the data step. Unused code is removed. Stateful operations (globals) are moved into functions. The top-level script is now just a CLI and a handful of operations, mirroring how one would use it directly in Python.

- CLI is cleaner
  - You may also pass a YAML file containing the CLI options instead. (Makes sharing configurations much easier.)
- Internals are clearer and safer, using Python type annotations
  - e.g. BoundingBox, CO2 increase handling
- Whole step functionalized: Python interface is clear, though not explicitly documented
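
To illustrate the "clearer and safer internals" point, here is a minimal sketch of what a typed bounding box could look like. The field names, the validation rules and the `contains` helper are assumptions for illustration, not necessarily what the PR's BoundingBox actually defines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Hypothetical lat/long box; field names are assumptions, not the PR's."""

    lat_min: float
    lat_max: float
    long_min: float
    long_max: float

    def __post_init__(self) -> None:
        # Reject inverted ranges at construction time instead of deep in the pipeline.
        if self.lat_min > self.lat_max:
            raise ValueError("lat_min must not exceed lat_max")
        if self.long_min > self.long_max:
            raise ValueError("long_min must not exceed long_max")

    def contains(self, lat: float, long: float) -> bool:
        """Return True if the point lies inside the box (bounds inclusive)."""
        return (
            self.lat_min <= lat <= self.lat_max
            and self.long_min <= long <= self.long_max
        )


bbox = BoundingBox(lat_min=-20.0, lat_max=20.0, long_min=-60.0, long_max=-30.0)
print(bbox.contains(0.0, -45.0))
```

Passing one named, validated object around instead of four loose floats means argument-order mistakes show up as type errors or early exceptions.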

Some of the training step is touched too. Larger refactoring will be in another changeset.

Not done:

- Loading and processing of the dataset is somewhat general, but various internals still expect CM2.6 (and various CM2.6 coordinates/data variables).
- Jupyter notebooks are not updated. MLflow running will not work properly in them -- those calls should be replaced by explicit Python calls and explicit data locations instead of run IDs.

To-dos:

[x] Move from new.
[x] Move training step refactoring
[x] Tweak training step subdomain loading
[x] Re-add prints, progress bars
How to make --co2-increase flag work in MLproject
- Appears to be a limitation -- rewrote the README to use a direct invocation example
[x] Update CLI invocations in documentation

Related work to do post-merge:

Update Jupyter notebooks

raehik commented 1 year ago

I think the data step is ready, just needs some touching up before review. I'm adding some work on the training step here too, I'll move it out before review.

raehik commented 1 year ago

I can't seem to get the MLflow interface working nicely with the simplified CLI. By simplified, I mean --global_ {0,1}, --co2 {0,1} being replaced with --cyclize, --co2-increase. But that type of no-value option isn't supported by MLproject. I can't tell why; it seems like a very simple feature.
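
For reference, the no-value flags described above look roughly like this in plain argparse. This is a sketch only: the parser description and help strings are placeholders, and the real CLI may be defined differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Sketch of a parser with boolean switches; help strings are placeholders."""
    parser = argparse.ArgumentParser(description="data step CLI (sketch)")
    # store_true flags take no value: present means True, absent means False.
    parser.add_argument(
        "--cyclize",
        action="store_true",
        help="replaces the old --global_ {0,1} option",
    )
    parser.add_argument(
        "--co2-increase",
        action="store_true",
        help="replaces the old --co2 {0,1} option",
    )
    return parser


# argparse exposes --co2-increase as the attribute co2_increase.
args = build_parser().parse_args(["--co2-increase"])
print(args.cyclize, args.co2_increase)
```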

raehik commented 1 year ago

On testing, this produces forcing data ~4x larger than the current code does. Not sure what sort of errors would result in that, but I can go through the changes again. Lines that touch gaussian_filter and further up the call chain seem most likely.

raehik commented 12 months ago

Likely candidates:

- eddy_forcing was misused: both forcing_coarse and the edited u_v_dataset were returned as a tuple, but the function signature stated it returned a single dataset, and it was used as such. Maybe my simplifying changed behaviour here...?
- scipy.ndimage.gaussian_filter was used oddly, with more erroneous type annotations. Probably fine, but it needed some inspection.
- ...a lib call had the grid and velocities dataset args the wrong way round...
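
The first candidate is the kind of mismatch Python won't catch at runtime. A toy sketch follows; the function below is a hypothetical stand-in working on plain lists, not the real eddy_forcing.

```python
def eddy_forcing_sketch(velocities: list[float]) -> list[float]:
    """Hypothetical stand-in: annotated as returning one value, returns two."""
    forcing = [2.0 * v for v in velocities]
    # The bug being illustrated: a tuple is returned despite the annotation.
    return forcing, velocities  # type: ignore[return-value]


result = eddy_forcing_sketch([1.0, 2.0])
# Annotations are not enforced at runtime, so the caller silently gets a tuple.
print(type(result).__name__, len(result))
```

A static checker such as mypy would flag the return statement if the `type: ignore` comment were removed.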

raehik commented 12 months ago

No, I misread some clauses, like this early return (debug_mode is unused):

https://github.com/m2lines/gz21_ocean_momentum/blob/fff986c83e0b8288d84db5a302fe0ee8b30ee562/src/gz21_ocean_momentum/data/coarse.py#L192-L194

raehik commented 12 months ago

There were many small mistakes! I'm now getting identical outputs to main for the same configuration. Need to clean up the history and rejig some code I re-messied.

raehik commented 11 months ago

Cleaned up history and logging/debugging setup, sorted all the to-dos I can (prior-existing ones that I'm unsure how to resolve are annotated and left). Ready for review.

raehik commented 11 months ago

yoooo it automatically merged? I had no idea that would happen. I rebased dev onto data-step-refactor locally and pushed, and that's been processed as a merge on GitHub!

m2lines / gz21_ocean_momentum

refactor data step script into library (API) and consumer (CLI) #85