Chunking is definitely the best solution for this. With the current methods implemented in tobac, chunking by time poses no problems for the feature detection and segmentation. However, as you mention, this would be problematic for the actual tracking, as information won't be propagated across the chunk boundaries.
I think the best approach would be to chunk the data for feature detection and segmentation (e.g. month by month), as these are the most data-intensive steps, and then recombine the resulting dataframes to produce one input to the tracking function for the full time period (see the sketch below). Note that you would need to update the frame numbers in the dataframes to make sure that they're sequential.
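For concreteness, a minimal sketch of that workflow could look like the following. It assumes the standard tobac calls for feature detection (`feature_detection_multithreshold`) and tracking (`linking_trackpy`); `load_month`, `months`, `dxy`, `dt`, `threshold` and `v_max` are placeholders for your own data loader and parameters, not part of tobac.

```python
# Sketch of the chunked workflow: run feature detection chunk by chunk, shift
# the frame numbers so they stay sequential, concatenate the dataframes, and
# run the tracking once over the full period.
import pandas as pd
import tobac

feature_dfs = []
frame_offset = 0
for month in months:                      # placeholder, e.g. list of (year, month) tuples
    cube = load_month(month)              # hypothetical loader returning one monthly iris cube
    feats = tobac.feature_detection_multithreshold(cube, dxy, threshold=threshold)
    feats['frame'] += frame_offset        # keep frame numbers sequential across chunks
    frame_offset = feats['frame'].max() + 1
    feature_dfs.append(feats)

features_all = pd.concat(feature_dfs, ignore_index=True)

# field_in is not actually used by the tracking at the moment (see the
# discussion below), so the combined dataframe is sufficient as input.
tracks = tobac.linking_trackpy(features_all, None, dt=dt, dxy=dxy, v_max=v_max)
```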
Thanks a lot for this useful hint! Your suggestion seems to work fine. I was worried at first, because the function for linking the trajectories takes field_in as an input parameter. But after a look at the function, it seems sufficient to recombine the dataframes for features and segmentation, not the actual data.
The field_in parameter is not actually used at the moment, so, as you said, things should work fine by just recombining the dataframes from the feature detection. The input field (or in the end rather only its dimensions) was used in the extrapolation/gap filling of the found trajectories, but that part is now commented out. I will try to clean up that function over the next few days!
Here is a notebook that provides an example of how to use tobac for a large, high-resolution precipitation dataset:
and a bash script to preprocess high-resolution input data from the NASA GES DISC platform for use with tobac (see the more detailed description in the notebook):
https://github.com/JuliaKukulies/mcs_tracking/blob/master/GPM_IMERG/tobac_prepare.sh
As an additional remark:
Some analysis functions could probably be modified quite easily to extend their applicability to chunked data. Many functions take, for instance, the segmentation mask as an input parameter. While the feature dataframes for time chunks can easily be merged into one large dataframe (as you suggested), it is more difficult to merge high-resolution netCDF files into one large iris cube. This means that the best way of using the analysis functions is to perform the analysis by month/day, just like the feature detection and segmentation.
However, one idea would be to modify the functions that take both the segmentation/feature/cell mask and a feature or trajectory dataframe as input parameters, so that the time span of the dataframe no longer has to coincide with the time span of the mask (as it does now). This would simplify the analysis by avoiding splitting and recombining the dataframes after each analysis step.
To be more specific, this could be a simple modification of tobac.analysis.calculate_area, which allows the area calculation for only the part of the feature dataframe covered by the mask, while the mask is given as monthly input:
def calculate_area(features, mask, method_area=None):
    # numpy and logging are module-level imports in tobac; included here so
    # the snippet stands alone
    import numpy as np
    import logging
    from tobac.utils import mask_features_surface, mask_features
    from iris import Constraint
    from iris.analysis.cartography import area_weights

    features['area'] = np.nan

    # determine the area calculation method from the available coordinates
    mask_coords = [coord.name() for coord in mask.coords()]
    if method_area is None:
        if ('projection_x_coordinate' in mask_coords) and ('projection_y_coordinate' in mask_coords):
            method_area = 'xy'
        elif ('latitude' in mask_coords) and ('longitude' in mask_coords):
            method_area = 'latlon'
        else:
            raise ValueError('either latitude/longitude or projection_x_coordinate/projection_y_coordinate have to be present to calculate distances')
    logging.debug('calculating area using method ' + method_area)

    if method_area == 'xy':
        if not (mask.coord('projection_x_coordinate').has_bounds()
                and mask.coord('projection_y_coordinate').has_bounds()):
            mask.coord('projection_x_coordinate').guess_bounds()
            mask.coord('projection_y_coordinate').guess_bounds()
        area = np.outer(np.diff(mask.coord('projection_x_coordinate').bounds, axis=1),
                        np.diff(mask.coord('projection_y_coordinate').bounds, axis=1))
    elif method_area == 'latlon':
        if (mask.coord('latitude').ndim == 1) and (mask.coord('longitude').ndim == 1):
            if not (mask.coord('latitude').has_bounds() and mask.coord('longitude').has_bounds()):
                mask.coord('latitude').guess_bounds()
                mask.coord('longitude').guess_bounds()
            area = area_weights(mask, normalize=False)
        elif mask.coord('latitude').ndim == 2 and mask.coord('longitude').ndim == 2:
            raise ValueError('2D latitude/longitude coordinates not supported yet')
            # area = calculate_areas_2Dlatlon(mask.coord('latitude'), mask.coord('longitude'))
        else:
            raise ValueError('latitude/longitude coordinate shape not supported')
    else:
        raise ValueError('method undefined')

    # get start and end time steps from the input mask and restrict the
    # feature dataframe to that period
    time_steps = mask.coord('time')
    start_date = time_steps.units.num2date(time_steps.points[0])
    end_date = time_steps.units.num2date(time_steps.points[-1])
    time_mask = (features['time'] > start_date) & (features['time'] <= end_date)
    feat_select = features.loc[time_mask]

    for time_i, features_i in feat_select.groupby('time'):
        logging.debug('timestep: ' + str(time_i))
        constraint_time = Constraint(time=time_i)
        mask_i = mask.extract(constraint_time)
        for i in features_i.index:
            if len(mask_i.shape) == 3:
                mask_i_surface = mask_features_surface(mask_i, features_i.loc[i, 'feature'],
                                                       z_coord='model_level_number')
            elif len(mask_i.shape) == 2:
                mask_i_surface = mask_features(mask_i, features_i.loc[i, 'feature'])
            area_feature = np.sum(area * (mask_i_surface.data > 0))
            features.at[i, 'area'] = area_feature

    return features
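For illustration, usage could then look something like the sketch below: the same full-period feature dataframe is passed to every call, the mask is supplied month by month, and only the rows whose timestamps fall inside each mask's time range are filled in. `months` and `load_monthly_mask` are hypothetical placeholders for your own chunk list and mask loader.

```python
# Hypothetical usage of the modified calculate_area with monthly masks:
for month in months:                          # placeholder list of monthly chunks
    mask_month = load_monthly_mask(month)     # hypothetical loader for one monthly segmentation mask
    features = calculate_area(features, mask_month)
```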
I am thinking about closing this issue if you do not have any objections @w-k-jones @freemansw1. My main question has been answered, and the speed-up of the feature detection and tracking should also solve part of this (not the memory issue, but it at least makes it possible to apply both to chunked bits of very large high-resolution datasets). The suggestion I made here on the analysis functions no longer seems very relevant. If anything, I can keep this point in mind as we proceed with the discussion in #146.
This is actually something that I'm hoping to address with further revisions to the documentation, either as an addendum to #150 or as a separate PR, so I'm inclined to keep this open? You and I have both independently had this issue, so I think it's worth attacking in the docs somewhere.
I agree with your thoughts on the analysis files!
Ah OK, addressing this in the documentation is actually a good idea! I will keep this open until we have solved that, then :)
We can close this with #186 being merged!
I am wondering how to deal with very large datasets. Say you want to perform the tracking on 30-min time steps for several years. What method would you suggest for the tracking? Since you would get a memory error if you saved all the data to one iris cube, you could split the data into several iris cubes (e.g. by month or year). However, in that case I guess the algorithm would fail to track cloud systems at the boundaries (e.g. one that persists from the last day of one data cube to the first day of the next). So what is the best solution?