watem-sedem / rfactor

R-factor
https://watem-sedem.github.io/rfactor/
GNU Lesser General Public License v3.0
6 stars 2 forks source link

WIP: changes in maximum intensity #68

Closed daarende closed 1 year ago

daarende commented 1 year ago

Differences Handling ‘unknown Values’ in Maximum_intesity

The difference between the two different Maximum_intensity functions (Maximum_Intensity_Matlab_Clone and Maximum_Intensity (FLUVES)) mostly lies in the difference of how the handle ‘unknown values. ‘Unknown Values’, here meaning: values which where potentially lost due the format of the data set, namely a non-zero time series, occur when in the original dataset has a NAN-value for a particular timestep or if the measured rain was 0.00 mm in a particular timestep. This removal of timesteps, creates an uncertainty about what happened in removed timesteps, since we are unable to know if a removed value was NAN or a 0.00mm measurement. In the two methods, distinct assumptions are made, leading to a divergent result in the maximum intensity (and therefore in the R-Factor, as well). The Matlab_Clone function assumes that ‘unknown values’ are erratic values and thus interpolates between the closest measured rain values in order to get a rain value for that timestep. The FLUVES method, on the other hand, assumes that all ‘unknown values’ are 0.00mm values, and uses this in the calculations of the maximum intensity.

Although this is the common distinction between the two methods, this interpolation does not cause the biggest difference between the two calculated maximum intensities. The Matlab_Clone function has some particularities, which lead to different values, and could be problematic, especially for high rain intensities in short periods of time. When the duration of a rainfall event is shorter than 30 min, a special condition has been implemented in the Matlab_clone code, namely only the rainfall of the first timestep is considered in order to calculate the maximum intensity for the rainfall event. There are 2 errors in the code of this condition, however. Firstly, the time interval is wrongly defined: (timestamps[-1] - timestamps[0] <= 30, should be: < 30, since a timestep contains 10 min i.e. timestep 10 to 40 contain 40 minutes of rainfall data, however it would be considered by the criterium). Secondly, there is a *2 too many in the formula, this causes extreme events to be overexaggerated, which leads to the biggest differences in both methods. The implementation of this *2 is probably taken from Wischmeier & Smith (1978) which suggest that for the rain intensity for events with a shorther duration than 30min, the rainfall depth should be multiplied by 2. The coding implementation here leads to a multiplication by 4, so the first *2 can be deleted. The initial implementation also only considers the rainfall of the first 10min interval, which should thus be changed to the sum of all the rainfall in the rain event shorter than 30 min (therefor: np.sum(rain)).

In addition, in the for-loop of the Matlab_clone function, the iteration is only done up to the penultimate measurement (for idx in range(len(df)-1)). This leads to the omission of the last measurement, which is problematic when this value is the highest value of rainfall event and a hiatus occurs just before this measurement. This is the case, because this high value would only be considered in the interpolation of the penultimate timestep, but not for the final timestep itself, which could still be the highest intensity, especially if the last measurement value is high. This then leads to an underestimation of the maximum intensity of the particular rainfall event. The error could be, at least to my knowledge, be due to the translation from MatLab to Python code, since the indexing is different in both languages (1 and 0, resp.) This could, however, also have been included purposely, but why I do not see clearly. Does anyone know more about this?

I added the changes to the GIT-code and added an explicit rounding (to eliminate rounding differences in both methods). All mentioned changes have reduced the difference between both methods in maximum error per rainfall event from 70.02 to 6.36, the minimum error per rainfall event (highest negative error) form -33.34 to 0.00 and the total absolute error from 4808.09 to 646.08 for all rainfall events up to 2021.

The final difference (646.08) can than be considered as the real difference between both methods (namely the interpolation over unknown values in rain event in the Matlab_Clone). If these changes seem reasonable to you, the Matlab_Clone should be implemented as this. It would maybe be interesting to implement these changes in a new function and keeping Matlab_Clone unchanged, I however have not been able to manipulate the script in such manner.

github-actions[bot] commented 1 year ago

Binder :point_left: Launch a binder notebook on this branch

Sachagobeyn commented 1 year ago

@daarende @SethCallewaert : I'll review as soon as I can :)

Sachagobeyn commented 1 year ago

Hi @SethCallewaert ,

Differences Handling ‘unknown Values’ in Maximum_intesity

The difference between the two different Maximum_intensity functions (Maximum_Intensity_Matlab_Clone and Maximum_Intensity (FLUVES)) mostly lies in the difference of how the handle ‘unknown values. ‘Unknown Values’, here meaning: values which where potentially lost due the format of the data set, namely a non-zero time series, occur when in the original dataset has a NAN-value for a particular timestep or if the measured rain was 0.00 mm in a particular timestep. This removal of timesteps, creates an uncertainty about what happened in removed timesteps, since we are unable to know if a removed value was NAN or a 0.00mm measurement. In the two methods, distinct assumptions are made, leading to a divergent result in the maximum intensity (and therefore in the R-Factor, as well). The Matlab_Clone function assumes that ‘unknown values’ are erratic values and thus interpolates between the closest measured rain values in order to get a rain value for that timestep. The FLUVES method, on the other hand, assumes that all ‘unknown values’ are 0.00mm values, and uses this in the calculations of the maximum intensity.

From this description I read that we will handle NaN and 0 the same way?

Although this is the common distinction between the two methods, this interpolation does not cause the biggest difference between the two calculated maximum intensities. The Matlab_Clone function has some particularities, which lead to different values, and could be problematic, especially for high rain intensities in short periods of time. When the duration of a rainfall event is shorter than 30 min, a special condition has been implemented in the Matlab_clone code, namely only the rainfall of the first timestep is considered in order to calculate the maximum intensity for the rainfall event. There are 2 errors in the code of this condition, however. Firstly, the time interval is wrongly defined: (timestamps[-1] - timestamps[0] <= 30, should be: < 30, since a timestep contains 10 min i.e. timestep 10 to 40 contain 40 minutes of rainfall data, however it would be considered by the criterium). Secondly, there is a *2 too many in the formula, this causes extreme events to be overexaggerated, which leads to the biggest differences in both methods. The implementation of this *2 is probably taken from Wischmeier & Smith (1978) which suggest that for the rain intensity for events with a shorther duration than 30min, the rainfall depth should be multiplied by 2. The coding implementation here leads to a multiplication by 4, so the first *2 can be deleted. The initial implementation also only considers the rainfall of the first 10min interval, which should thus be changed to the sum of all the rainfall in the rain event shorter than 30 min (therefor: np.sum(rain)).

I read from this that we should consider the currrent (0') and last two (-10', -20') records to compute the exception, and in addition drop the '2'? Ok, for me to drop '2', agree for only considering 0', -10' and -20'.

In addition, in the for-loop of the Matlab_clone function, the iteration is only done up to the penultimate measurement (for idx in range(len(df)-1)). This leads to the omission of the last measurement, which is problematic when this value is the highest value of rainfall event and a hiatus occurs just before this measurement. This is the case, because this high value would only be considered in the interpolation of the penultimate timestep, but not for the final timestep itself, which could still be the highest intensity, especially if the last measurement value is high. This then leads to an underestimation of the maximum intensity of the particular rainfall event. The error could be, at least to my knowledge, be due to the translation from MatLab to Python code, since the indexing is different in both languages (1 and 0, resp.) This could, however, also have been included purposely, but why I do not see clearly. Does anyone know more about this?

I did a little test (reproducing https://github.com/cn-ws/rfactor/blob/master/src/rfactor/matlab/core.m#L36, https://github.com/cn-ws/rfactor/blob/master/src/rfactor/matlab/core.m#L40 and https://github.com/cn-ws/rfactor/blob/master/src/rfactor/matlab/core.m#L120), Matlab:

a = [1, 2, 3, 4];
m = size(a);
m = m(2);

for i=1:m
    disp(a(i))
end

Output:

     1
     2
     3
     4

Python (reproducting https://github.com/cn-ws/rfactor/blob/master/src/rfactor/rfactor.py#L73):

df = [1, 2, 3, 4]
m = len(df)

for i in range(m-1):
  print(df[i])

Output:

     1
     2
     3

So indeed range(m) is in line with Matlab.

I added the changes to the GIT-code and added an explicit rounding (to eliminate rounding differences in both methods). All mentioned changes have reduced the difference between both methods in maximum error per rainfall event from 70.02 to 6.36, the minimum error per rainfall event (highest negative error) form -33.34 to 0.00 and the total absolute error from 4808.09 to 646.08 for all rainfall events up to 2021.

Ok, let's discuss on teams

The final difference (646.08) can than be considered as the real difference between both methods (namely the interpolation over unknown values in rain event in the Matlab_Clone). If these changes seem reasonable to you, the Matlab_Clone should be implemented as this. It would maybe be interesting to implement these changes in a new function and keeping Matlab_Clone unchanged, I however have not been able to manipulate the script in such manner.

@SethCallewaert : please name you branches accordingly