noaa-ocs-modeling / EnsemblePerturbation

perturbation of coupled model input over a space of input variables
https://ensembleperturbation.readthedocs.io
Creative Commons Zero v1.0 Universal

Consider timeseries for building the surrogate model #108

Open SorooshMani-NOAA opened 11 months ago

SorooshMani-NOAA commented 11 months ago

Currently only max water elevation is used to train the surrogate model. We'd like to consider the whole timeseries to see how it affects the surrogate output.

Tasks:

@saeed-moghimi-noaa @WPringle @SorooshMani-NOAA

SorooshMani-NOAA commented 6 months ago

@FariborzDaneshvar-NOAA since you started exploring this item, can you please either link an existing ticket or just use this ticket to document your progress and impediments (like https://github.com/noaa-ocs-modeling/EnsemblePerturbation/issues/128)?

FariborzDaneshvar-NOAA commented 6 months ago

With the stacking suggestion in https://github.com/noaa-ocs-modeling/EnsemblePerturbation/issues/129#issuecomment-1885667131, I was able to execute the subset_dataset() function with stacked time & node! But converting the KL surrogate model to the overall surrogate for each node (execution of the surrogate_from_karhunen_loeve() function) failed with a MemoryError!

One suggestion was to use a chunk of time steps. I will post updates on that here.

FariborzDaneshvar-NOAA commented 6 months ago

Building surrogate model for the first 100 time steps:

# select the first 100 time steps (2018-08-30 13:00 through 2018-09-03 16:00)
time_chunk = elev_timeseries.sel(time=slice("2018-08-30T13:00:00.000000000", "2018-09-03T16:00:00.000000000"))
# flatten (time, node) into a single time-major dimension named 'node'
time_chunk_stack = time_chunk.rename(
    nSCHISM_hgrid_node='node'
).stack(
    stacked=('time', 'node'), create_index=False
).swap_dims(
    stacked='node'
)
subset = subset_dataset(ds=time_chunk_stack, ...)
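As a sanity check, the stacking above can be exercised on a tiny synthetic array (the dimension sizes and coordinate values below are made up, not the real SCHISM output): the stacked axis should have n_time * n_node entries, time-major, with each time step's nodes contiguous.

```python
import numpy as np
import xarray as xr

# synthetic stand-in for the elevation output: 4 time steps x 3 mesh nodes
elev = xr.DataArray(
    np.arange(12.0).reshape(4, 3),
    dims=('time', 'nSCHISM_hgrid_node'),
    coords={'time': [0, 1, 2, 3], 'nSCHISM_hgrid_node': [10, 11, 12]},
)

# same stacking pattern as above
stacked = elev.rename(
    nSCHISM_hgrid_node='node'
).stack(
    stacked=('time', 'node'), create_index=False
).swap_dims(
    stacked='node'
)

print(stacked.sizes['node'])  # 12 == 4 time steps * 3 nodes
print(stacked.values[:3])     # first time step across all nodes: [0. 1. 2.]
```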
It went through, and here are the plots:
[figures: KL eigenvalues, KL fit, KL-surrogate fit, validation boxplot, sensitivities, model vs. surrogate validation (validation_vortex_4_variable_korobov_1)]

These results look weird, and to me the KL fit didn't work correctly! One possibility is that the first 100 time steps used here are long before landfall, so there might be minimal variation between them. It also reveals the issue in the plotting function that I mentioned earlier in https://github.com/noaa-ocs-modeling/EnsemblePerturbation/issues/132

On top of that, I couldn't make percentile and probability plots due to a MemoryError: Unable to allocate 1.15 TiB for an array with shape (15772912, 10000) and data type float64
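The requested allocation is consistent with the error message: a dense float64 array with one row per stacked time-node point and one column per sample works out to roughly 1.15 TiB.

```python
# the failed allocation from the error message:
# 15,772,912 stacked time-node rows x 10,000 samples x 8 bytes per float64
n_bytes = 15_772_912 * 10_000 * 8
tib = n_bytes / 2**40
print(f'{tib:.2f} TiB')  # 1.15 TiB
```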

FariborzDaneshvar-NOAA commented 6 months ago

I also tried opening subset.nc with dask (chunks='auto'), but it didn't change the outcome: I still get the same MemoryError for the percentile and probability plots! Interestingly, though, the along-track sensitivity plots were different (see attached image). @SorooshMani-NOAA how might that be possible?!

SorooshMani-NOAA commented 6 months ago

@FariborzDaneshvar-NOAA about the memory issue: the problem is that the function you showed me the other day calls a numpy function directly, which (as far as I understand) means it pulls all values into memory before executing the function. So you also need to change the function where the numpy method is called.
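A minimal sketch of the kind of change meant here (using a mean as a stand-in for whatever reduction the plotting function actually performs; the real call site in EnsemblePerturbation may differ): calling numpy on a chunked array materializes everything first, while the dask.array counterpart plans the reduction chunk-by-chunk and only materializes the small result.

```python
import numpy as np
import dask.array as da

rng = np.random.default_rng(0)
values = rng.standard_normal((1_000, 200))

# eager: converting to a plain numpy array pulls all values into memory
# before the reduction runs
eager = np.mean(np.asarray(values), axis=0)

# lazy: dask computes the same reduction one chunk at a time, so only a
# chunk (plus the small result) needs to be resident at once
lazy = da.mean(da.from_array(values, chunks=(100, 200)), axis=0).compute()

assert np.allclose(eager, lazy)
```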

I'm not sure what is happening in the plots. Are you sure that mapping back to physical space is done correctly? Since we have a combined time-node dimension where neither times nor nodes are necessarily aligned, we have to be very careful when reshaping.
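On the mapping back: because the stack was ('time', 'node') with time as the outer index, a flat result of length n_time * n_node can be restored with a plain time-major reshape. A sketch (n_time and n_node are example values, not the real chunk sizes):

```python
import numpy as np

n_time, n_node = 100, 1_000          # sizes of the original time chunk (example values)
flat = np.arange(n_time * n_node)    # stand-in for a result on the stacked dimension

# stack(stacked=('time', 'node')) iterates node fastest, so row i of the
# reshaped array is time step i across all nodes
by_time_node = flat.reshape(n_time, n_node)

assert (by_time_node[0] == flat[:n_node]).all()
```

Reshaping in the transposed order, reshape(n_node, n_time), would run without error but silently scramble the values, which is exactly the kind of mistake that could produce odd-looking plots.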

I'm not sure if the plots we get are actually meaningful!

FariborzDaneshvar-NOAA commented 6 months ago

@SorooshMani-NOAA thanks for your comment; you brought up a good point about the results! I didn't reshape back to time/node, which might explain these plots, but it's not clear to me at which step the reshape should happen!

This new memory issue is different from the one I mentioned before (the numpy function in the surrogate expansion, when I used the entire set of time steps), but you are right, it should be addressed separately.