Large netCDF files - Githubissues

JuliaKukulies commented 5 years ago

I am wondering how to deal with very large datasets. Say you want to perform the tracking on 30-min time slots for several years. What method would you suggest for the tracking? Since you would get a Memory Error to save all the data to one iris.cube, you could split up the data into several iris.cubes (e.g. by month or year). However, in this case I guess that the algorithm would fail to track cloud systems at the boundaries (e.g. one that persists from the last day of one data cube to the first day of the next data cube). So what is the best solution?

w-k-jones commented 5 years ago

Chunking is definitely the best solution for this. With the current methods implemented in tobac chunking by time for the feature detection and segmentation poses no problems. However, as you mention this would be problematic for the actual tracking as information won't be propagated over the boundaries.

I think the best approach would be to chunk the data for feature detection and segmentation (e.g.month by month), as this is the most data intensive, and then recombine the resulting dataframes to produce one input to the tracking function for the full time period. Note that you would need to update the frame numbers in the dataframes to make sure that they're sequential.

JuliaKukulies commented 5 years ago

Thanks a lot for this useful hint! Your suggestion seems to work fine. I was worried first, because the function for linking the trajectories takes _fieldin as an input parameter. But with a look at the function, it seems enough to recombine the dataframes for features and segmentation and not for the actual data.

mheikenfeld commented 5 years ago

The field_in parameter is not actually used at the moment, so as you said, things should work fine by just recombining the dataframes from the feature detection. The input field (or in the end rather only it's dimensions) were used in the extrapolation/gap filling of the found trajectories, but that part is now commented out. I will try to clean up that function over the next days!

JuliaKukulies commented 5 years ago

Here is a notebook, which provides an example how to use tobac for a large high resolution precipitation dataset:

https://github.com/JuliaKukulies/mcs_tracking/blob/master/GPM_IMERG/example_precip_tracking_large_dataset.ipynb

and a bash script, in order to preprocess high resolution input data from the NASA GES DISC platform, in order to use it with tobac (see more detailed description in notebook):

https://github.com/JuliaKukulies/mcs_tracking/blob/master/GPM_IMERG/tobac_prepare.sh

JuliaKukulies commented 5 years ago

As an additional remark:

Some analysis functions could probably easily be modified, in order to expand their applicability for chunked data. Many functions take, for instance, the segmentation mask as input parameter. While the feature dataframes for data chunks by time can easily be merged into one large dataframe (like you suggested), it is more difficult to merge high-resolution netcdffiles into one large iris cube. This means that the best way of using the analysis functions is to perform the analysis, just as the feature detection and segmentation, by month/day.

However, one idea would be to modify functions which have both the segmentation/feature/cell mask and feature or trajectory dataframes as input parameters in the way that the time span of the dataframe must not coincide with the time span of the mask (as it is now). This could facilitate the analysis by avoiding to split and recombine the dataframes after each analysis step.

To be more specific, this could be a simple modification for _tobac.analysis.calculatearea, which allows the area calculation for only parts of the feature dataframe, whereas the mask is given as monthly input:


def calculate_area(features,mask,method_area=None):
    from tobac.utils import mask_features_surface,mask_features
    from iris import Constraint
    from iris.analysis.cartography import area_weights

    features['area']=np.nan

    mask_coords=[coord.name() for coord in mask.coords()]
    if method_area is None:
        if ('projection_x_coordinate' in mask_coords) and ('projection_y_coordinate' in mask_coords):
            method_area='xy'
        elif ('latitude' in mask_coords) and ('longitude' in mask_coords):
                method_area='latlon'
        else:
            raise ValueError('either latitude/longitude or projection_x_coordinate/projection_y_coordinate have to be present to calculate distances')
    logging.debug('calculating area using method '+ method_area)
    if method_area=='xy':
        if not (mask.coord('projection_x_coordinate').has_bounds() and mask.coord('projection_y_coordinate').has_bounds()):
            mask.coord('projection_x_coordinate').guess_bounds()
            mask.coord('projection_y_coordinate').guess_bounds()
        area=np.outer(np.diff(mask.coord('projection_x_coordinate').bounds,axis=1),np.diff(mask.coord('projection_y_coordinate').bounds,axis=1))
    elif method_area=='latlon':
        if (mask.coord('latitude').ndim==1) and (mask.coord('latitude').ndim==1):
            if not (mask.coord('latitude').has_bounds() and mask.coord('longitude').has_bounds()):
                mask.coord('latitude').guess_bounds()
                mask.coord('longitude').guess_bounds()
            area=area_weights(mask,normalize=False)
        elif mask.coord('latitude').ndim==2 and mask.coord('longitude').ndim==2:
            raise ValueError('2D latitude/longitude coordinates not supported yet')
            # area=calculate_areas_2Dlatlon(mask.coord('latitude'),mask.coord('longitude'))
        else:
            raise ValueError('latitude/longitude coordinate shape not supported')
    else:
        raise ValueError('method undefined')

    # get start and end time steps from input mask 
    time_steps = mask.coord('time')
    start_date= time_steps.units.num2date(tiem_steps.points[0])
    end_date = time_steps.units.num2date(time_steps.points[-1])

    time_mask = (features['time'] > start_date) & (features['time'] <= end_date)
    feat_select = features.loc[time_mask]

    for time_i,features_i in feat_select.groupby('time'):
        logging.debug('timestep:'+ str(time_i))
        constraint_time = Constraint(time=time_i)
        mask_i=mask.extract(constraint_time)

        for i in features_i.index: 
            if len(mask_i.shape)==3:
                mask_i_surface = mask_features_surface(mask_i, features_i.loc[i,'feature'], z_coord='model_level_number')
            elif len(mask_i.shape)==2:            
                mask_i_surface=mask_features(mask_i,features_i.loc[i,'feature'])
            area_feature=np.sum(area*(mask_i_surface.data>0))
            features.at[i,'area']=area_feature

    return features

JuliaKukulies commented 2 years ago

I am thinking about closing this issue if you do not have any objections @w-k-jones @freemansw1. My main questions has been answered and the speed up of the feature detection and tracking should also solve parts of this (not the memory issue, but it makes it at least possible to apply both on chunked bits of very large high-resolution datasets). The suggestion I made here on the analysis function seems not very relevant anymore. If at all, I can keep this point in my head as we proceed with the discussion in #146.

freemansw1 commented 2 years ago

This is actually something that I'm hoping to address with further revisions to the documentation, either as an addendum to #150 or as a separate PR, so I'm inclined to keep this open? You and I have both independently had this issue, so I think it's worth attacking in the docs somewhere.

I agree with your thoughts on the analysis files!

JuliaKukulies commented 2 years ago

Ah OK, that is actually a good point to address this in the documentation! Will keep it open until we have solved that then :)

freemansw1 commented 2 years ago

We can close this with #186 being merged!

tobac-project / tobac

Large netCDF files #17