noaa-ocs-modeling / EnsemblePerturbation

perturbation of coupled model input over a space of input variables
https://ensembleperturbation.readthedocs.io
Creative Commons Zero v1.0 Universal

Bayesian analysis for calibration of input parameters #88

Open SorooshMani-NOAA opened 1 year ago

SorooshMani-NOAA commented 1 year ago

There's a package called PyMC3 that can be used for Bayesian analysis.

There's a good "hacker book" available at: https://nbviewer.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Prologue/Prologue.ipynb
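For context, a very rough sketch of the kind of calibration PyMC3 enables (not code from this repo; the linear surrogate, data shapes, and names below are placeholders): put a prior on the perturbed input parameters, predict the observed quantities with a surrogate, and condition on observations.

import numpy as np
import pymc3 as pm

rng = np.random.default_rng(0)
n_params, n_obs = 4, 30
A = rng.normal(size=(n_obs, n_params))   # placeholder linear surrogate of the model output
observed = rng.normal(size=n_obs)        # placeholder observations (e.g. HWMs)

with pm.Model():
    # prior on the perturbed input parameters
    theta = pm.Uniform('theta', lower=-1.0, upper=1.0, shape=n_params)
    # surrogate prediction at the observation locations
    mu = pm.math.dot(A, theta)
    # likelihood with an assumed observation-error scale
    pm.Normal('obs', mu=mu, sigma=0.1, observed=observed)
    trace = pm.sample(1000, tune=1000, chains=2, cores=1, random_seed=0)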

Initial discussion:

SorooshMani-NOAA commented 1 year ago

@WPringle based on the discussion we had today, we only need to run best track for a couple of storms, with a couple of different ensemble sizes using Korobov sampling, right?

WPringle commented 1 year ago

@SorooshMani-NOAA because it is best track, we should change how we do the ensemble perturbation. I think we need to work out that strategy first, then run the ensemble, then check that the ensemble results encompass the observations.

This paper, which also looks at best track, is relevant: Sochala, P., Chen, C., Dawson, C., & Iskandarani, M. (2020). A polynomial chaos framework for probabilistic predictions of storm surge events. Computational Geosciences, 24, 109–128. https://doi.org/10.1007/s10596-019-09898-5

See section 4.2. There they treat the errors as uniform random variables with an amplification factor that we can tune. Check out Figures 5 and 6.

I think we should follow their methodology for doing the track and intensity perturbations.
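As a sketch of that idea (my reading of their section 4.2, not the paper's actual code): sample uniform errors around the best-track values and scale them with a tunable amplification factor. All names and numbers below are illustrative.

import numpy as np

rng = np.random.default_rng(42)

def perturb_best_track(best_track_values, nominal_errors, amplification=1.0, n_members=40):
    """Draw ensemble members as best track + amplified uniform errors."""
    # one uniform draw in [-1, 1] per member and per along-track point
    draws = rng.uniform(-1.0, 1.0, size=(n_members, len(best_track_values)))
    return best_track_values + amplification * nominal_errors * draws

# e.g. perturb a hypothetical central-pressure time series with a 5 hPa nominal error
members = perturb_best_track(np.full(48, 960.0), np.full(48, 5.0), amplification=1.5)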

WPringle commented 1 year ago

@SorooshMani-NOAA You can see they also perturb Manning's roughness.

SorooshMani-NOAA commented 1 year ago

Just for the sake of completeness of this discussion: we discussed how to generate ensembles and eventually decided to just run ensembles of different storms with different sizes for now. This is due to time constraints on the available computational resources.

Later we'll explore other parameter perturbations.

SorooshMani-NOAA commented 1 year ago

@WPringle have you tried running the KL-NN code on the sample data? During our meeting (and also in the slides) it says the data is time dependent, but looking at ydata.txt and the code that was shared with us, it seems it's using data of shape ensemble x mesh-node. Do you know whether the plots in the slides were generated using the same code or not?

I haven't tried running it yet. I was just reviewing the slides and the code.

WPringle commented 1 year ago

I did run the code using the sample data and the plots on the slides are the result of that execution.

I wouldn't worry about whether it is "x" or "t"; it is somewhat arbitrary, and the methodology can handle spatiotemporal data in theory. But I am pretty sure "x" refers to a time dimension here, based on how it looks.

SorooshMani-NOAA commented 1 year ago

I see, thank you

SorooshMani-NOAA commented 1 year ago

@WPringle I forgot to also ask: have you tried it on an ensemble too? Is there any issue I need to be aware of if I want to run this on the combined maxelev ensemble dataset?

WPringle commented 1 year ago

No, I did not. Just be aware of the dimensions of the data (number of parameters, number of nodes, and number of ensemble members) and how that fits into the code. It should be able to run through to the part where it requires the observations.

SorooshMani-NOAA commented 1 year ago

OK, thank you

SorooshMani-NOAA commented 1 year ago

@WPringle I'm testing the KL-NN (without the MCMC part) on the same ensemble I used with your KLPC. I subsetted the maxelev dataset using the subsetting part of your KLPC script (from examples/karhunenloeve_polynomialchaos/klpc_wetonly.py). Although the dataset now fits into memory, KLE.build still takes a long time. Would it be OK if we replace the KL part of Khachik's script with the same PCA that you were using in the following code? https://github.com/noaa-ocs-modeling/EnsemblePerturbation/blob/33141b8c4586c22a6e3923d6f3fff7ed95c4b9b3/ensembleperturbation/uncertainty_quantification/karhunen_loeve_expansion.py#L39

Out of curiosity, is there any reason behind using KLE/PCA for reducing dimensions? Why not use other methods?

WPringle commented 1 year ago

Yes, please use the PCA method we use in EnsemblePerturbation

What other methods are there? For the KLPC we have an analytical solution to convert the KL fit surrogate back to the full-dimension surrogate, which is at least one advantage I can think of...

SorooshMani-NOAA commented 1 year ago

Yes, please use the PCA method we use in EnsemblePerturbation

OK thanks!

What other methods are there?

I was talking in general; I know there are a bunch of dimension reduction techniques that might be better than PCA for certain applications, so I was wondering whether there's any specific reason. E.g.: https://towardsdatascience.com/11-dimensionality-reduction-techniques-you-should-know-in-2021-dcb9500d388b

... to convert the KL fit surrogate back to the full-dimension surrogate...

I see, I hadn't really thought about this part. It would still be interesting to see what other methods have to offer if we get the chance to explore them.

WPringle commented 1 year ago

Yeah, that's interesting! For sure explore them if you want to. We can have a meeting with Khachik sometime to discuss this; he might be able to say why we would prefer one type of method over another.

SorooshMani-NOAA commented 1 year ago

Copying the email for future reference:

As I was going over the KL-NN-MCMC code, thinking about how to use USGS data within the model, or later how to use the temporal data, I decided to focus more on the PCA part of the code and read more about it. I stumbled upon this not-so-new paper:

https://www.researchgate.net/publication/263138615_Principal_Component_Analysis_on_Spatial_Data_An_Overview

This is an overview of PCA use for spatial data. For the standard PCA that we are using, it defines different modes of operation, and none of them incorporates time, location, and attributes all at the same time.

Based on my understanding, in our case the node index is "location", each ensemble member is an "attribute", and the elevation output time in the time series is "time" (for later). We could also consider both the ensemble member and the time series as "time", but I'm not sure that makes a lot of sense. So right now, for the max-elev analysis, we are actually in Q-mode, but I don't know which mode we should use for the (later) time series analysis.

Another question I have is whether we can find a way to "train" the KL-NN on all the nodes, but use USGS data at ~270 HWM locations for the MCMC tuning. The obvious alternative approach is to train the KL-NN on the station locations and then do MCMC on the same station locations (noting that since we don't have a lot of computational time, we interpolate field outputs to the HWM locations for now!).

SorooshMani-NOAA commented 1 year ago

I tried using the sklearn PCA, and the KL-NN part goes through successfully now. I downloaded USGS data and selected all the locations that are within the polygon of the mesh. Since the stations were not available at the time of simulation, I'm interpolating results to the station locations for the analysis. Now I'm trying to fix NaN data and/or dimension mismatches to run the full KL-NN-MCMC.
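For reference, a minimal sketch of the PCA swap (maxelev here is a placeholder for the subsetted ensemble-by-node array, and the truncation to 20 modes is an assumption, not a value from the repo):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
maxelev = rng.normal(size=(40, 5000))          # placeholder for the subsetted (ensemble, node) data

pca = PCA(n_components=20)                     # assumed truncation; choose by explained variance
scores = pca.fit_transform(maxelev)            # reduced (ensemble, mode) data fed to the KL-NN
reconstructed = pca.inverse_transform(scores)  # back to the full (ensemble, node) field

print('explained variance captured:', pca.explained_variance_ratio_.sum())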

WPringle commented 1 year ago

@SorooshMani-NOAA Could you please share the code you have so far for the results we looked at yesterday with Khachik?

SorooshMani-NOAA commented 1 year ago

@WPringle I uploaded the code to a private repo for now and added you and Saeed as collaborators. Note that the code must run on an already-subsetted dataset. I tried to get it to work with the minimum changes possible.

SorooshMani-NOAA commented 1 year ago

This is the plot I get for the MCMC results if I discard the first half and pick every 10th sample: [image]

Each subplot shows one of the parameter values (Y axis) vs the sample index (X axis). It seems that gamma in the code plays a major role in determining how white-noise-like the parameter chains are; this is for gamma=0.01.

Update: these plots are not in the repo code yet!

SorooshMani-NOAA commented 1 year ago

I also increased the number of epochs from 2,000 to 20,000, and it definitely improved the NN: [images]

SorooshMani-NOAA commented 1 year ago

@WPringle if I want to train on the whole model output but run MCMC only on the USGS HWM stations, what should the non-station dimension values be when transforming observations using PCA? Should they be 0?

WPringle commented 1 year ago

@WPringle if I want to train on the whole model output but run MCMC only on the USGS HWM stations, what should the non-station dimension values be when transforming observations using PCA? Should they be 0?

@SorooshMani-NOAA That's a good question. I suppose 0 makes sense because in the matrix multiplication zeros will add nothing. If it doesn't work, then we probably need to do the MCMC on the full result rather than the dimensionally reduced result. First let's try zero and see...

Just to be clear, you are asking about what dummy value to put in obs_data at non-station locations, right?
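A hedged illustration of the zero-filling idea (all names, sizes, and data here are placeholders, not the repo's variables): fill the full-length observation vector with zeros, set the HWM values at the station nodes, and project it with the same fitted PCA.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_members, n_nodes, n_stations = 40, 5000, 270
ensemble = rng.normal(size=(n_members, n_nodes))                 # placeholder training field
pca = PCA(n_components=20).fit(ensemble)

station_nodes = rng.choice(n_nodes, n_stations, replace=False)   # placeholder HWM node indices
hwm_values = rng.uniform(0.5, 3.0, size=n_stations)              # placeholder HWM observations

obs_data = np.zeros(n_nodes)                                     # zeros at non-station locations
obs_data[station_nodes] = hwm_values
obs_scores = pca.transform(obs_data.reshape(1, -1))              # observation projected into KL/PC space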

SorooshMani-NOAA commented 1 year ago

Just to be clear, you are asking about what dummy value to put in obs_data at non-station locations right?

@WPringle, sorry for the late reply. Yes, that's what I meant. I'll try it and see how it goes.

SorooshMani-NOAA commented 1 year ago

On a separate note, I plotted the MCMC results vs obs as well as model vs obs, and it doesn't really look good. Note that I've already cut off HWM locations that are more than 2 meters above or more than 1 meter below the min/max of the HWM station results in the model, and it still doesn't look good. I think I need to actually rerun the simulations: [image] (where each thin bar is a station and the limits indicate the max/min results from all 30 members) and [image]

SorooshMani-NOAA commented 1 year ago

Next

SorooshMani-NOAA commented 1 year ago

This is a diagnostic plot for the best-track model vs obs: [image: bt_vs_obs_schism_n_adcirc]. I couldn't find a recent ADCIRC run, but this is a Florence simulation using ADCIRC from Zach's older files. It seems that the ADCIRC setup gives better results in this specific case. I also need to check my other runs; if the best track for all of them is not better, maybe I need to use a smaller mesh for the run.

WPringle commented 1 year ago

What's the reason for the different number of points? Maybe the mesh region? I think it would help to plot the same points.

SorooshMani-NOAA commented 1 year ago

I think it would help to plot the same points

Yeah, that'd make more sense. As you said, the difference is due to NaN values I get during interpolation, which is related to the meshed region.

SorooshMani-NOAA commented 1 year ago

This is the result. I think I need to look at my workflow ensemble results to see if they're any better; if not, I might just need to improve the mesh and rerun: [image: bt_vs_obs_schism_n_adcirc]

By the way, I forgot to include this one. It is the same fit plot, but without low-quality HWM data: [image: fit]

And these are the ensemble model results vs observations for high-quality HWM data: [image: model_vs_obs]. Update: in the last plot ignore the blue dots; they're there by mistake!

SorooshMani-NOAA commented 1 year ago

Using the full model plus interpolated data at the stations to train, and then only the stations for MCMC with 0 at the other nodes:

[images: loss_history_log, model_surr]

It actually looks worse! [image: fit]

[image: model_vs_obs]

Update: in the last plot ignore the blue dots; they're there by mistake!

SorooshMani-NOAA commented 1 year ago

The next steps (from before) are to:

SorooshMani-NOAA commented 1 year ago

The results of the two tasks in the previous comment were not very promising; I don't see any point in sharing the plots. One interesting thing I noticed is that the ensemble from the workflow seems to have an elevation bias compared to the CO-OPS observations. I'm investigating it by plotting time series. The strange part, however, is that the test set-up was not different from the workflow set-up other than the number of members and possibly the mesh (I'm not sure about the mesh), which shouldn't affect the best track, so how can a difference in best-track results be explained?!

SorooshMani-NOAA commented 1 year ago

After discussing the issues in a couple of meetings, the following tasks were defined:

Apart from that, due to the convergence issues with the NN

WPringle commented 1 year ago

@SorooshMani-NOAA About the NN size: it should actually be the mean of the input and output sizes. The input has size ndim and the output has size neig, which means you can set the number of neurons to (neig+ndim)/2 (read here). You may need one or two more to do well, perhaps. But note that the original size of 10 from Khachik was because ndim was 10 in his data. I also question whether three hidden layers are necessary; maybe two at most.
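A minimal sketch of that sizing rule, using scikit-learn's MLPRegressor for brevity (the actual KL-NN code may use a different framework, and the sizes here are examples only):

from sklearn.neural_network import MLPRegressor

ndim, neig = 4, 20                      # example input/output sizes, not the real ones
hidden = max(1, (ndim + neig) // 2)     # roughly the mean of input and output sizes

surrogate = MLPRegressor(
    hidden_layer_sizes=(hidden, hidden),  # two hidden layers at most, as suggested
    max_iter=20_000,
    random_state=0,
)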

SorooshMani-NOAA commented 1 year ago

All of the ensembles we ran during September 2022 used a version of SCHISM where parametric wind doesn't work correctly (commit 5282a27). This means all of the ensembles we currently have are of no use and we have to run a new set!

I'm adding NWM and the 48-hours-before-landfall perturbation, and we know that paramwind is fixed, so hopefully we'll get good-quality results this time!

SorooshMani-NOAA commented 1 year ago

There are still some issues with NWM in the ensemble set-up. I'll just run a couple of ensembles with the fixed parametric wind to have preliminary results for ensemble analysis.

SorooshMani-NOAA commented 1 year ago

For a new ensemble run for Florence, this is what I get (observations from high-quality coastal HWMs):

Tracks: [image: storm_tracks]

Observations vs best track: [image: bt_vs_obs]

Ensemble vs observations: [image: model_vs_obs]

Surrogate vs model: [image: model_surr]

Convergence of NN: [image: loss_history_log]

Observations and error bars of MCMC: [image: fit]

SorooshMani-NOAA commented 1 year ago

The result is not good for sure, but this time there seems to be a method to this madness! I mean there seems to be a bias in the MCMC error-bar results. I'm still not sure how to interpret it; what do you think @WPringle? Also, please let me know how to share the .nc files. The combined file for the 50-member ensemble (42-43 went through) is around 36 GB; the subset is ~10 GB.

Updated sizes

WPringle commented 1 year ago

I guess the bias mostly comes from the fact that the model results are generally underpredicted. But it's a bit weird; we would think the MAP crosses should be closest to the observations, yet instead they are lower down on the error bars. Could you add the MCMC error bars to the ensemble-vs-observation plot, so we can compare the original ensemble to the MCMC?

WPringle commented 1 year ago

@SorooshMani-NOAA I think we can use this function to transform Gaussian to uniform: https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.ndtr.html#scipy.special.ndtr

This will give you values in [0, 1], which you can then rescale to [-1, 1] by multiplying by 2 and subtracting 1.
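In code, the suggested transform would look roughly like this (the Gaussian samples below are placeholders):

import numpy as np
from scipy.special import ndtr

z = np.random.default_rng(0).standard_normal(1000)  # standard-normal samples
u01 = ndtr(z)            # Gaussian CDF maps them to [0, 1]
u = 2.0 * u01 - 1.0      # rescale to [-1, 1]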

WPringle commented 1 year ago

@SorooshMani-NOAA I put my Jupyter notebook and some modifications to mcmc.py and utils.py for using PCs on the Globus Ensemble research collection under KLPC_MCMC_William.

WPringle commented 1 year ago

@SorooshMani-NOAA Also check out https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html for trying to figure out the best hyperparameters for the ANN model (used by this paper: https://agupubs.onlinelibrary.wiley.com/doi/10.1029/2022JD037617).
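A hedged example of how GridSearchCV could be used to pick ANN hyperparameters (the grid, data shapes, and estimator choice below are illustrative, not from the repo):

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 4))    # placeholder: ensemble parameter samples
y = rng.normal(size=(40, 20))   # placeholder: KL/PC-space outputs

search = GridSearchCV(
    MLPRegressor(max_iter=5_000, random_state=0),
    param_grid={
        'hidden_layer_sizes': [(12,), (12, 12), (24, 24)],
        'alpha': [1e-4, 1e-3, 1e-2],   # L2 regularization strength
    },
    cv=5,
)
search.fit(X, y)
print(search.best_params_)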

SorooshMani-NOAA commented 1 year ago

@WPringle I think we missed one point during our discussion about Physical vs KL space:

When using polynomial chaos in physical space, it's clear what the input to the physical surrogate is, but in the NN case, even if the MCMC takes care of transforming back to physical space, we still don't know what to pass to the NN in the first place.

The NN only accepts KL-space input, so we still run into the same issue of transforming data from the observation points only into KL space before passing it to the NN. So we should find a way to "transform" the neural network to accept physical inputs. Does that make sense?

WPringle commented 1 year ago

No, I think it is the same, because the "input" to the surrogate is just the value of the set of parameters, while the length of the surrogate is either the number of eigenmodes or the number of physical nodes, depending on whether we are in KL space or physical space. Of course, we also need to provide the observed data and the corresponding indices in the physical surrogate that map to the data.



SorooshMani-NOAA commented 1 year ago

“input” to the surrogate is just the value of the set of parameters

Oh, I see my mistake ... thanks! I'll just need to inverse_transform the KL-space predictions from the NN and then take the ones at the obs locations for MCMC to use!
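That step would look roughly like this (placeholder shapes and names; pca stands in for the fitted PCA from the training stage):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
pca = PCA(n_components=20).fit(rng.normal(size=(40, 5000)))  # placeholder fitted PCA

nn_predictions = rng.normal(size=(100, 20))                  # placeholder NN output in KL space
obs_nodes = rng.choice(5000, 270, replace=False)             # placeholder HWM node indices

physical = pca.inverse_transform(nn_predictions)   # back to (samples, nodes) in physical space
at_obs = physical[:, obs_nodes]                     # keep only the observed locations for MCMC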

Apart from that, I'm not sure I'm using the term "cross validation" correctly. Based on the description at https://scikit-learn.org/stable/modules/grid_search.html, this term is used for hyperparameter tuning (e.g. the number of hidden layers or neurons in the case of an NN) and not model parameter selection. What we're really doing is using a train/test split for tuning parameters.

In the case of the PC model, the existing code has:

cross_validator = LeaveOneOut()
regression_model = ElasticNetCV(
    fit_intercept=False,
    cv=cross_validator,
    l1_ratio=0.5,
    selection='random',
    random_state=666,
)

Based on https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNetCV.html, my understanding is that cross validation is actually not used here, since l1_ratio is not a list of possible values.

WPringle commented 1 year ago

In the case of the PC model, the existing code has:

cross_validator = LeaveOneOut()
regression_model = ElasticNetCV(
    fit_intercept=False,
    cv=cross_validator,
    l1_ratio=0.5,
    selection='random',
    random_state=666,
)

Based on https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNetCV.html, my understanding is that cross validation is actually not used here, since l1_ratio is not a list of possible values.

Yep, so it is still doing the CV for the alpha parameter in the cost function, just not for l1_ratio, but yes, that could also be added as a list.
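For example, a sketch of the same snippet with l1_ratio included in the CV search (the candidate values are illustrative):

from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import LeaveOneOut

regression_model = ElasticNetCV(
    fit_intercept=False,
    cv=LeaveOneOut(),
    l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 1.0],  # now searched by CV along with alpha
    selection='random',
    random_state=666,
)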

SorooshMani-NOAA commented 1 year ago

doing the CV for the alpha

Oh, yeah, I missed that. In any case, what I'm trying to say is that unless we really want to find a good number of neurons or layers, a CV approach might not make much sense for the NN, and we can just use the internal validation it already has for choosing parameters.

WPringle commented 1 year ago

@SorooshMani-NOAA Hmm, I don't know about that. Maybe those CV functions aren't appropriate, but I think it's important to do CV so we reduce the overtraining problem.

SorooshMani-NOAA commented 1 year ago

There seem to be other methods to avoid overfitting, e.g. https://www.kdnuggets.com/2019/12/5-techniques-prevent-overfitting-neural-networks.html. In any case, these are the results using a validation input but without any further change to the NN code: [image: loss_history_log]. The best validation loss happens early in the epochs, so running more and more epochs doesn't really help, based on this graph.
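One common way to act on that observation is early stopping on a held-out validation split; a sketch with scikit-learn's MLPRegressor (the real KL-NN code may use a different framework, and the sizes are placeholders):

from sklearn.neural_network import MLPRegressor

surrogate = MLPRegressor(
    hidden_layer_sizes=(12, 12),
    max_iter=20_000,
    early_stopping=True,        # hold out a validation fraction internally
    validation_fraction=0.1,
    n_iter_no_change=50,        # stop once validation loss stops improving for 50 epochs
    random_state=0,
)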

And this is the observations-vs-ensemble plot (with dry-node extrapolation): [image: model_vs_obs]

The results from the SCHISM ensemble also improved by using the physical-space surrogate: [image: fit]

SorooshMani-NOAA commented 1 year ago

Action items as of 2/21/23 meeting:

Soroosh:

William:

What do we need for finalizing/publishing?