noaa-ocs-modeling / EnsemblePerturbation

perturbation of coupled model input over a space of input variables
https://ensembleperturbation.readthedocs.io
Creative Commons Zero v1.0 Universal

Performance of subsetting perturbation timeseries dataset #129

Closed SorooshMani-NOAA closed 8 months ago

SorooshMani-NOAA commented 9 months ago

The combined dataset for all perturbation runs is a 3D dataset with dimensions (run, node, and time). To process it using the existing KLPC method, we first stack the data into 2D (as shown in the borrowed snapshot below).

[image: borrowed snapshot of the 3D-to-2D stacking of the perturbation dataset]

Passing this 2D dataset to the subsetting function fails to produce results, even when we take just 2 time steps.

Ticket created from https://github.com/noaa-ocs-modeling/SurgeTeamCoordination/issues/172#issuecomment-1876292717 Test code at https://github.com/noaa-ocs-modeling/SurgeTeamCoordination/issues/172#issuecomment-1877314968
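
For reference, a minimal sketch of the flattening step. The variable name elevation and the array sizes here are made up for illustration; only the nSCHISM_hgrid_node dimension and the stacking call mirror the actual code:

import numpy as np
import xarray as xr

# hypothetical combined output with dims (run, time, node)
ds = xr.Dataset(
    {'elevation': (('run', 'time', 'nSCHISM_hgrid_node'),
                   np.random.rand(4, 5, 6))},
    coords={'time': np.arange(5), 'nSCHISM_hgrid_node': np.arange(6)},
)

# flatten (time, node) into one dimension so KLPC sees a 2D (run, node) array
stacked = ds.stack(node=('time', 'nSCHISM_hgrid_node'))
print(stacked['elevation'].shape)  # (4, 30)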

SorooshMani-NOAA commented 9 months ago

The issue seems to be related to how the stacking operation works. This is how we currently stack:

subset_time_stack = subset_time_chunk.stack(node=('time','nSCHISM_hgrid_node'))

By default, stacking the two dims creates a multi-index dimension that refers back to the original stacked dimensions, in this case tuples of (time, schism_index): https://docs.xarray.dev/en/latest/generated/xarray.DataArray.stack.html
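
A tiny standalone illustration of what that default index looks like (toy sizes):

import numpy as np
import xarray as xr

da = xr.DataArray(
    np.zeros((2, 3)), dims=('time', 'nSCHISM_hgrid_node'),
    coords={'time': np.arange(2), 'nSCHISM_hgrid_node': np.arange(3)},
)
print(da.stack(node=('time', 'nSCHISM_hgrid_node')).indexes['node'].tolist())
# [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]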

During the subsetting call on this stacked dataset, operations on the node dimension behave differently from the normal case, where its values are plain numbers (they are now tuples of dates and numbers).

Specifically, the performance bottleneck seems to be in processing the element variable of the dataset, due to the isin check. The right fix (needs testing) is to set the stacked coordinate to simply equal the repeated node array. We can also explore what happens if we remove element from the dataset altogether (i.e., is this the only bottleneck?).
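
A rough, machine-dependent way to see why tuple-valued coordinates hurt the isin check. This is not the library's actual code path, just an illustration of object-dtype membership tests being slow compared to integer ones:

import time
import numpy as np
import pandas as pd

n_time, n_node = 10, 50_000

# coordinate as plain repeated integers (the proposed fix)
int_coord = np.tile(np.arange(n_node), n_time)

# coordinate as (time, node) tuples (what the default MultiIndex stack yields)
tuple_coord = np.asarray(pd.MultiIndex.from_product(
    [pd.date_range('2024-01-01', periods=n_time), np.arange(n_node)]
).values)

t0 = time.perf_counter()
np.isin(int_coord, np.arange(1_000))
print(f'int isin:   {time.perf_counter() - t0:.3f} s')

t0 = time.perf_counter()
np.isin(tuple_coord, tuple_coord[:1_000])
print(f'tuple isin: {time.perf_counter() - t0:.3f} s')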

SorooshMani-NOAA commented 9 months ago

Example of how to do the stacking:

import numpy as np
import xarray as xr

ds = xr.Dataset()
ds = ds.assign_coords(node=np.array([1, 2, 3, 1, 2, 3]), time=np.arange(7))
ds['v'] = (('node', 'time'), np.arange(42).reshape(6, 7))
# flatten without building a MultiIndex, then make the plain node values the dim
ds = ds.stack(stacked=('time', 'node'), create_index=False).swap_dims(stacked='node')

Still, since we want to reshape the whole thing back at some point, I think we need to make sure we don't drop any of the times(?). Either that, or when we reshape back, we can use the still-existing time vector, which is now dependent on node.
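
A toy round trip along those lines, leaning on the retained time coordinate. It assumes unique node ids and a single variable, so it's a sketch rather than the eventual fix:

import numpy as np
import xarray as xr

ds = xr.Dataset(
    {'v': (('node', 'time'), np.arange(12).reshape(3, 4))},
    coords={'node': [10, 20, 30], 'time': np.arange(4)},
)
stacked = ds.stack(stacked=('time', 'node'), create_index=False).swap_dims(stacked='node')

# the 1D 'time' coord now rides along 'node' and tells us how to unflatten
n_node = np.unique(stacked['node'].values).size   # unique node ids assumed
n_time = stacked.sizes['node'] // n_node
times = stacked['time'].values[::n_node]          # one time per block of nodes

restored = xr.Dataset(
    {'v': (('time', 'node'), stacked['v'].values.reshape(n_time, n_node))},
    coords={'time': times, 'node': stacked['node'].values[:n_node]},
)
assert restored['v'].transpose('node', 'time').equals(ds['v'])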

Current working code:

subset_time_stack = subset_time_chunk.rename(
    nSCHISM_hgrid_node='node'
).stack(
    stacked=('time', 'node'), create_index=False
).swap_dims(
    stacked='node'
)
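
Sanity check on a toy dataset (the shapes and the node coordinate are made up; only the chained calls mirror the code above):

import numpy as np
import xarray as xr

subset_time_chunk = xr.Dataset(
    {'elevation': (('run', 'time', 'nSCHISM_hgrid_node'),
                   np.random.rand(2, 3, 4))},
    coords={'time': np.arange(3), 'nSCHISM_hgrid_node': np.arange(4)},
)

subset_time_stack = subset_time_chunk.rename(
    nSCHISM_hgrid_node='node'
).stack(
    stacked=('time', 'node'), create_index=False
).swap_dims(
    stacked='node'
)

print(subset_time_stack.sizes)           # {'run': 2, 'node': 12}
print(subset_time_stack['node'].values)  # plain integers, tiled once per time step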
FariborzDaneshvar-NOAA commented 8 months ago

Thanks @SorooshMani-NOAA, it looks like this approach is working. Here are the execution times of the subset_dataset() function for a timeseries subset based on nodes that will be wet at least once (223,482 out of 701,988):

Execution of the subset_dataset() function for all 701,988 nodes (instead of the 223,482 nodes from the max-elevation subset) completed in ~26 min, but with warning messages about large graph sizes.

Do you recommend conducting the subsetting in two steps (first based on max elevation, then on the timeseries), or in a single step?

SorooshMani-NOAA commented 8 months ago

@FariborzDaneshvar-NOAA based on our discussion yesterday, do you think we should keep this open or can we close it?

FariborzDaneshvar-NOAA commented 8 months ago

@SorooshMani-NOAA The subsetting part is working fine, so I guess you can close this ticket. The memory issue for surrogate expansion can be addressed separately.

SorooshMani-NOAA commented 8 months ago

Let's address that separately: this ticket was about being able to subset at all, while that one is about analyzing the data.