Closed dorchard closed 1 year ago
@arthurBarthe , can you please identify which of the notebooks can be deleted and which are needed to reproduce the paper, and also how to get the data/setup to use them?
For example, there is mention of a load_paper_net()
function, but to load data, it must be present somewhere from a run or so. Also, if I try running, for example, https://github.com/m2lines/gz21_ocean_momentum/blob/main/examples/jupyter-notebooks/offline_test_SWM.ipynb with the data from a previous mlflow run (does not work with a run without mlflow), the data is not found - it seems to look for netcdf files, I think, but in mlflow run as described in the README seems to not to create them (although it is marked as FINISHED, not FAILED, in the dataset)?
Hi Marion,
I think one good notebook to start with would be generate-figure-1.ipynb or _generate-figure-2.ipynb. Those should only require data from mlflow runs for the data processing step. Later on we will need to look at _test_globalcontrol.ipynb but this requires to run a test step (the notebook does not run the test, it is only to produce figures based on the output from the test step).
I would start with generate-figure-2.ipynb, this is probably the simplest one: it requires in the notebook to select an experiment (should be the one used in the data processing step) and a run_id from which to load the processed data, and the rest is just about making the plot using this data.
Thank you!
That helps me a bit further.
I just noticed that in generate_figure-2, at the bottom it calls plot_training_subdomains
, but we do not do any training with the data here, as you said. Function fails this line: https://github.com/m2lines/gz21_ocean_momentum/blob/f83a5c8f81cea6ffe84f38f12ae31689918b18e6/src/gz21_ocean_momentum/analysis/utils.py#L624 , potentially because there is only one data set, or because there is no training. Do you know which of the two it is?
Ah, I think I know where the issue is coming from. Basically we want to plot the "rectangular" subdomains used for training. The way this information is stored has changed, and can instead be accessed by reading the file training_subdomains.yaml.
As for generate-figure1.ipynb, if you want to run it, you will need two data runs: one run with the CO2 parameter equal to 0, and another one with the CO2 parameter equal to 1.
generate-figure-1.ipynb correspond to Figure 6 of the published paper.
generate-figure-2.ipynb file corresponds to Figure 1 of the paper.
I managed to reproduce this figure:
There are still a couple of crutches in the code which need cleaning up, some special cases which need catching, and some docs/instructions I will need to write in/for the notebook.
Also, I only checked whether this is the correct figure by comparing it visually with the (contorted) figure 1b in the paper.
This looks good, thanks @MarionBWeinzierl ! For now "eye checking" should be fine, we will compare numerical metrics in the inference / test step. As for the table, I don't think we need to worry about it, we can probably remove that from the code.
Is it correct that only figure 1b is plotted, or should 1a also be plotted, @arthurBarthe ?
I created #82 for this issue as a draft PR which tracks the branch.
@arthurBarthe , in notebook 'generate-paper-figure-6', (which was,in the above comment before the renaming, 'generate-figure-1'), I can either get by handing the script the version is CO2=1 first and CO2=0 second, or by doing it the other way around.
Comparing with figure 6a from the paper, I think that the second version is right, but I am not sure.
I also get a runtime warning: gz21_ocean_momentum/venv/lib/python3.11/site-packages/dask/array/numpy_compat.py:43: RuntimeWarning: invalid value encountered in divide x = np.divide(x1, x2, out)
(no info on where in the code), but it seems not to affect the run
@arthurBarthe , I am now starting to look at 'test-global-control'. You said that a test step has to be run before that. What do you mean by that - is that an inference run?
Yes it should be CO2=0 first and CO2=1 second. As for the inference step, I am working on it: I run the data step and the training step, so now I should be able to try and run the inference step.
Quick question, in order to produce those figures, what is the size of the data you are using along the time dimension?
Quick question, in order to produce those figures, what is the size of the data you are using along the time dimension?
I am using the call from the README, which has ntimes=100
Ah, we should be using the whole dataset, which I believe has more or less 3000 time points. I ran the data step on the HPC here with that parameter, with 4 nodes (each 8GB memory) and it ran in 30 minutes. This should explain the remaining difference with the figure from the paper. I should also update the readme accordingly.
We might not need to update the repository readme, as this is a sufficiently small problem to run on a laptop and a nice starter problem.
Rather, we should add a comment in the notebook, or maybe better the readme in the notebook folder, and define there which calls to run to create the data for the paper figures.
This would be the best place: https://github.com/m2lines/gz21_ocean_momentum/blob/notebooks-cleanup/examples/jupyter-notebooks/README.md
As for the inference step, I am working on it: I run the data step and the training step, so now I should be able to try and run the inference step.
Is that then all the data which is need for test-global-control
?
Which figure from the paper does that notebook generate?
We need data from the inference step, where we run the trained neural network (from the train step) on both data from CO2=0 and CO2=1 datasets. Then that should be it yes, and it should produce figures 4, 5, 7 I believe, depending on which run ids we request in the notebook.
I added the information about which data to use etc to the jupyter notebook readme on the notebooks-cleanup
branch.
I fixed the test_global_control script (will need a bit of clean-up now).
These are the plots run with a reduced set with ntimes=100: plots.zip
Due to the reduced ntimes, I needed to also adapt some parameters in the plots for this.
@arthurBarthe , could you have a look and see whether they make roughly sense?
Also @arthurBarthe , can I delete the parts that say "IGNORE THIS" in test_global_control?
Another question @arthurBarthe : You said I need the data for CO2=1 and CO2=0 for the test_global_control script. However, it is just asking for data once. Does that mean that, for reproducing the paper figures, I have to run this notebook twice, once with each setting?
Also, in the plot I have currently set the "merge" parameter to "none", and it did not ask for the separate data and inference steps. Do I need to have separately named experiments, so that I hand in those as tuples in the "merge" parameter, in order for the rest to run through as expected?
@arthurBarthe , how does the data need to be organised for test_global_control?
I noticed that in the pushed notebook on main, the experiment selected is called "data-global". Does that mean that for test_global_control the parameter global in the running of the data script should be set to 1?
OK, we have established that the notebook has to be run twice, once with CO2=1 and once with CO2=1. Global control can stay zero. I am updating the readme accordingly.
Various scripts generate the figures. We should provide an explanation of how to regenerate the key figures from the paper, referring to them by figure name. This could be separated out into a separate script or notebook, as long as its clear from the README where to go to find this.