The issue seems to be related to how the stacking operation works. This is how we currently stack:
```python
subset_time_stack = subset_time_chunk.stack(node=('time', 'nSCHISM_hgrid_node'))
```
The stacking operation on the two dims, by default, creates a multi-index dimension referring to the original stacked dimensions: in this case tuples of (time, schism_index). See https://docs.xarray.dev/en/latest/generated/xarray.DataArray.stack.html
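For illustration, a minimal sketch (not from the project code) showing that the default stack builds a MultiIndex whose entries are (time, node) tuples:

```python
import numpy as np
import xarray as xr

# Tiny stand-in for the SCHISM output: 2 time steps, 3 nodes.
ds = xr.Dataset(
    {'v': (('time', 'nSCHISM_hgrid_node'), np.arange(6).reshape(2, 3))},
    coords={'time': [0, 1], 'nSCHISM_hgrid_node': [10, 11, 12]},
)
stacked = ds.stack(node=('time', 'nSCHISM_hgrid_node'))
# The new 'node' index holds tuples like (0, 10), (0, 11), ... rather than
# plain integer node ids.
print(stacked.indexes['node'][:3])
```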
During the subsetting call on this stacked dataset, operations on the `node` dimension behave differently from the normal case where the coordinate is plain numbers (now it is tuples of dates and numbers). Specifically, the performance bottleneck seems to be during processing of the `element` variable in the dataset, due to the `isin` check. The right fix (needs testing) is to set the stacked coordinate to simply be the repeated nodes array. We can also explore what happens if we remove `element` from the dataset altogether (i.e. is this the only bottleneck?).
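For context, here is a hypothetical sketch of the kind of `isin`-based element filter meant above; the variable and dimension names (`element`, `nele`) are illustrative assumptions, not the actual subsetting code:

```python
import numpy as np
import xarray as xr

def filter_elements(ds: xr.Dataset, kept_nodes: np.ndarray) -> xr.Dataset:
    # Keep only elements whose vertices are all in the retained node set.
    # `element` is assumed to be an (nele, 3) connectivity table of node ids;
    # this membership test is what slows down when the node coordinate holds
    # (time, node) tuples instead of plain integers.
    mask = np.isin(ds['element'].values, kept_nodes).all(axis=1)
    return ds.isel(nele=mask)
```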
Example of how to do the stacking:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset()
ds = ds.assign_coords(node=np.array([1, 2, 3, 1, 2, 3]), time=np.arange(7))
ds['v'] = (('node', 'time'), np.arange(42).reshape(6, 7))
# Stack without a MultiIndex, then use the repeated node ids as the dimension.
ds = ds.stack(stacked=('time', 'node'), create_index=False).swap_dims(stacked='node')
```
Still, since we want to reshape the whole thing back at some point, maybe we need to make sure we don't drop any of the times(?). Either that, or when we reshape back, we can use the still-existing `time` vector, which is now dependent on `node`.
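One possible way to reshape back, as a minimal sketch against the toy stacked dataset above (the helper name is mine; it assumes the `stacked=('time', 'node')` order, i.e. node varies fastest, times are sorted, and the node order is the same within every time block):

```python
import numpy as np
import xarray as xr

def unstack_to_2d(stacked: xr.Dataset, var: str = 'v') -> xr.DataArray:
    # Recover unique time steps from the broadcast `time` coordinate and the
    # per-time node ids from the first block of the repeated `node` coordinate.
    times = np.unique(stacked['time'].values)
    n_nodes = stacked.sizes['node'] // times.size
    node_ids = stacked['node'].values[:n_nodes]
    data = stacked[var].values.reshape(times.size, n_nodes)
    return xr.DataArray(
        data, dims=('time', 'node'),
        coords={'time': times, 'node': node_ids}, name=var,
    )

restored = unstack_to_2d(ds)  # back to a (time, node) layout
```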
Current working code:

```python
# Rename the node dimension, stack without a MultiIndex, and swap dims so the
# repeated node ids become the dimension of the stacked axis.
subset_time_stack = subset_time_chunk.rename(
    nSCHISM_hgrid_node='node'
).stack(
    stacked=('time', 'node'), create_index=False
).swap_dims(
    stacked='node'
)
```
Thanks @SorooshMani-NOAA, it looks like this approach is working.
Here are the execution times of the `subset_dataset()` function for a subset of the timeseries based on nodes that will be wet at least once (223,482 out of 701,988):
Execution of the `subset_dataset()` function for all 701,988 nodes (instead of the 223,482 nodes from the max-elevation subset) completed in ~26 min., but with warning messages regarding large graph sizes.
Do you recommend conducting a two-step subsetting (first based on max elevation, then on the timeseries), or a single step?
@FariborzDaneshvar-NOAA based on our discussion yesterday, do you think we should keep this open or can we close it?
@SorooshMani-NOAA The subsetting part is working fine, so I guess you can close this ticket. The memory issue for surrogate expansion can be addressed separately.
Let's address that separately; this ticket was about being able to subset, while that one is about analyzing the data.
The combined dataset for all perturbation runs is a 3D dataset with dimensions (run, member no., and time). In order to process it using the existing KLPC method, we first stack the data into 2D (as shown in the borrowed snapshot below).
Passing this 2D dataset to the subsetting function fails to generate results, even if we take just 2 steps.
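For reference, a minimal sketch of that kind of 3D-to-2D stacking, using the same `create_index=False` pattern as above; the dimension and variable names here (`run`, `member`, `time`, `elevation`, `sample`) are illustrative assumptions, not the actual combined dataset:

```python
import numpy as np
import xarray as xr

# Hypothetical combined ensemble output: one value per (run, member, time).
combined = xr.Dataset(
    {'elevation': (('run', 'member', 'time'), np.random.rand(4, 5, 3))},
    coords={'run': np.arange(4), 'member': np.arange(5), 'time': np.arange(3)},
)
# Collapse two of the three dims into a single 'sample' dim, keeping 'run',
# so the result is 2D without building a tuple-valued MultiIndex.
stacked_2d = combined.stack(sample=('member', 'time'), create_index=False)
print(stacked_2d['elevation'].dims)  # ('run', 'sample')
```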
Ticket created from https://github.com/noaa-ocs-modeling/SurgeTeamCoordination/issues/172#issuecomment-1876292717. Test code at https://github.com/noaa-ocs-modeling/SurgeTeamCoordination/issues/172#issuecomment-1877314968.