mannkendall / Matlab

Matlab implementation of the code.
BSD 3-Clause "New" or "Revised" License

Replace hist() with histcounts() #3

Closed fpavogt closed 3 years ago

fpavogt commented 4 years ago

I realized that Matlab's hist() does not behave at all like Python's np.histogram(), which causes havoc in the Python nb_tie().

I cannot reproduce hist() in Python, because it is not very well documented. It is also apparently unstable. Following these recommendations, I would suggest the following changes to the Matlab code:

  1. use histcounts() instead of hist() in nb_tie()
  2. specify the bin edges explicitly (rather than simply a bin count)

Point 2 is important to make sure I can code the same "rules" for the inclusion/exclusion of edges as in Matlab (illustrated below). What do you think @mcollaudcoen ?
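
For reference, these are the edge rules of Python's np.histogram() that point 2 refers to (a minimal numpy illustration, not part of the package code): every bin is half-open [left, right), except the last one, which also includes its right edge.

import numpy as np

# Bins are [2, 4) and [4, 6]; the last bin is closed on the right.
counts, edges = np.histogram([2.0, 4.0, 6.0], bins=[2.0, 4.0, 6.0])
print(counts)  # [1 2] -> 2.0 falls in the first bin, 4.0 and 6.0 in the last
print(edges)   # [2. 4. 6.]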

mcollaudcoen commented 4 years ago

I propose to change to

t2=histcounts(real(data),[nanmin(data):resolution:nanmax(data)])';

For Nb_tie_test1_out = Nb_tie(Nb_tie_test1_in, 2); the result is [1;1;1;1;1;0;0;2;2;1]. Is it similar to Python ?

fpavogt commented 4 years ago

Not 100%. Python returns: 1, 1, 1, 1, 1, 0, 0, 2, 2, 1, 1. Matlab seems to be missing a number, no ?

The data contains 11 valid values: 2., 4., 7.5, nan, 9.5, 17., 11.5, 17., 18.5, 21., 23.5, 19. But summing the Matlab counts, I only get to 10 ... ? I think it is related to the last bin (Matlab seems to assign a dedicated bin to the upper bound, but if the upper bound is a floating point number, it may not get included because of floating point errors ...).
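
Spelled out with the counts quoted above (plain Python, just to make the bookkeeping explicit):

python_counts = [1, 1, 1, 1, 1, 0, 0, 2, 2, 1, 1]  # what Python returns
matlab_counts = [1, 1, 1, 1, 1, 0, 0, 2, 2, 1]     # what Matlab returned above
print(sum(python_counts), sum(matlab_counts))      # 11 vs 10: one value is lost in Matlab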

What are the bins that you use ? I.e. what is the content of [nanmin(data):resolution:nanmax(data)] ?

fpavogt commented 3 years ago

I've tweaked the Python code as follows:

# Compute the bins using linspace (exact) rather than arange (subject to floating point errors).
nbins = int((np.nanmax(data)-np.nanmin(data))//resolution + 1)
bins = np.linspace(np.nanmin(data), np.nanmin(data) + nbins * resolution, num=nbins + 1)

# Then compute the number of elements in each bin.
return np.histogram(data, bins=bins)[0]

Digging further, it appears that np.histogram() and histcounts() use the same boundary conditions. So the main issue becomes floating point errors when setting the bin edges ... hence the linspace() approach implemented here.
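
For completeness, here is that snippet wrapped into a self-contained function (the name nb_tie_sketch is illustrative only, not the package's actual nb_tie(); NaNs are simply ignored by np.histogram() when explicit bin edges are given):

import numpy as np

def nb_tie_sketch(data, resolution):
    data = np.asarray(data, dtype=float)
    # Compute the bins using linspace (exact) rather than arange
    # (whose length and last edge are subject to floating point errors).
    nbins = int((np.nanmax(data) - np.nanmin(data)) // resolution + 1)
    bins = np.linspace(np.nanmin(data), np.nanmin(data) + nbins * resolution, num=nbins + 1)
    # Number of elements in each bin; NaNs fall into no bin.
    return np.histogram(data, bins=bins)[0]

# The test series quoted earlier, with resolution 2:
demo = [2., 4., 7.5, np.nan, 9.5, 17., 11.5, 17., 18.5, 21., 23.5, 19.]
print(nb_tie_sketch(demo, 2))  # [1 1 1 1 1 0 0 2 2 1 1], matching the Python result above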

mcollaudcoen commented 3 years ago

OK, please give me the results (edges of the bins, number of bins, and final t vector) for the test example @fpavogt

mcollaudcoen commented 3 years ago

It's OK, Matlab and Python produce similar results for Nb_tie.

abigmo commented 3 years ago

Hello,

pardon me but I have to reopen this issue :-D

The solution proposed by Frederic is not working in R, or at least it seems so. Using test data 1, with Matlab (2019b) I get: 2 0 1 1 1 0 2 2 1 1

(BTW this is different from the output file in the test_data folder)

In R I get: 2 0 1 1 1 0 0 2 2 1 1

I tested another histogram function in R, but I get the same result. Pardon me, but I am a little lost... BTW, the result in R was the same even before the Python-based workaround.

In R I can set:

  • a vector giving the breakpoints between histogram cells,
  • a function to compute the vector of breakpoints,

or, as a suggestion only, I can set:

  • a single number giving the number of cells for the histogram,
  • a character string naming an algorithm to compute the number of cells (see 'Details'),
  • a function to compute the number of cells.

These last three are not an option: they are not stable.

Maybe histcounts in Matlab, which allows setting the number of bins explicitly, is a good option :-)

Thanks, Alessandro

mcollaudcoen commented 3 years ago

Hi,

I will try to have a look again at my computer, but it is not possible before this afternoon. I also had some pending points regarding the test variables, but I put my time into correcting the paper first. If I remember well, the R result with 11 bins is right, and it is also the Matlab result if the present code is applied. I can sit in front of my computer this afternoon to send you a complete answer, but not now. Cheers, Martine


mcollaudcoen commented 3 years ago

I don't know (or don't remember, sorry) what data 1 is. Do you use the tests given in mannkendall/test_data ? What we did is define the bins clearly (see the description by Frederic on the 28th of September).

The corresponding Matlab code is right on the develop_mco branch, but I just noticed that it is wrong on the master branch. The code is:

 step = floor(interval/resolution) + 1;
 % if the time series is too big and generates a memory error,
 % the number of bins can be decreased, usually with low impact on the Mann-Kendall results
 t = histcounts(real(data), 'BinLimits', [nanmin(data), nanmin(data)+step*resolution], 'BinWidth', resolution)';

Does it help ?
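
For cross-checking against the Python side, here is a rough numpy transcription of that call (an illustration only; histcounts_like is a hypothetical name, interval is assumed to be nanmax(data) - nanmin(data) as in the surrounding Matlab code, and the limits are assumed to span an exact multiple of the resolution):

import numpy as np

def histcounts_like(data, interval, resolution):
    data = np.asarray(data, dtype=float)
    step = int(np.floor(interval / resolution)) + 1
    # Explicit edges from nanmin(data) to nanmin(data) + step*resolution,
    # mirroring 'BinLimits'/'BinWidth'. Both histcounts() and np.histogram()
    # use half-open bins [left, right), with the last bin closed on the right.
    edges = np.nanmin(data) + resolution * np.arange(step + 1)
    return np.histogram(data, bins=edges)[0]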

abigmo commented 3 years ago

I was using the develop branch, but I downloaded it right before this change. Now I am working on the latest version.

Thanks to the R magic I can decide whether the histogram cells are right-closed (left-open) or not. The default in R is right-closed (differently from histcounts in Matlab), so that was one of the problems.

So now it's OK for the first dataset (the Matlab behaviour is reproduced), but I still have trouble with dataset 2.

With the Python-based workaround I get the same result as with R. The edges of the bins for hist in R and histcounts in Matlab are the same, so the difference comes from the behaviour of the histogram commands themselves.

It would be good to know why R assigns some data points to the i-th bin, while Matlab assigns them to the (i+1)-th bin. We have 3 decimal digits, so this cannot be the reason. From a rough check, R on my system moves a data point to the next bin when it exceeds the (closed) right edge by more than 1E-6.

Now I move to something else, before my brain melts. I hope I won't have to write the histcounts command from scratch in R to replicate it :-D

The difference is small: the "Tukey five numbers" (min, 25th percentile, 50th, 75th, max) are -10, 0, 0, 0, 12, producing a negligible difference in the final result.

I'll double-check, but AFAIR the computation of the acf coefficients is also not exactly the same between R and Matlab.

Please note that in the test data the Nb_tie outputs are duplicated: the txt file is for the old version, the csv for the new one. I surely have the admin permission to fix it... but... I don't know how to do it :-D

Alessandro

mcollaudcoen commented 3 years ago

So it seems that R and Python give similar results, but not Matlab. If I understand you well, the difference occurs when it is higher than 1e-6, so it could be a rounding problem in Matlab ? Anyhow, this is not a serious problem, since the number of ties is not that important in the final results. I would really go on with the rest of the functions; this can probably be solved after the publication of the code if needed. If the problem is in Matlab, I will try to solve it, but after my holiday, since it is really not critical ! --> you don't have to write histcounts from scratch; I would rather prove to you that this difference is not relevant.

fpavogt commented 3 years ago

Hi @abigmo. This is weird, because Python gives me the same result as Matlab for both Nb_test1_in.csv and Nb_test2_in.csv. Could you tell me which bins you have (and how many) ?

For Nb_test1_in.csv:

For Nb_test2_in.csv:

It doesn't make sense that only one case is consistent. One thing to note is that I am using the .csv files for checking the results (there are two .txt files in the test-data repo that need to be removed before merging).

Can you confirm that you get the same bin counts and bin edges as listed here ?

abigmo commented 3 years ago

I agree, this behaviour is very odd.

This R code (the same as the Matlab one)

step <- floor(interval / resolution) + 1
bins <- seq(from = m, to = m + step * resolution, by = resolution)

with the 1st test case I have

$breaks
 [1]  2  4  6  8 10 12 14 16 18 20 22 24

$counts
 [1] 1 1 1 1 1 0 0 2 2 1 1

For the second test case there are 2914 breaks (the edges):

 t$breaks
   [1] -1.265 -1.255 -1.245 -1.235 -1.225 -1.215 -1.205 -1.195 -1.185 -1.175
  [11] -1.165 -1.155 -1.145 -1.135 -1.125 -1.115 -1.105 -1.095 -1.085 -1.075
  [21] -1.065 -1.055 -1.045 -1.035 -1.025 -1.015 -1.005 -0.995 -0.985 -0.975
  [31] -0.965 -0.955 -0.945 -0.935 -0.925 -0.915 -0.905 -0.895 -0.885 -0.875
  [41] -0.865 -0.855 -0.845 -0.835 -0.825 -0.815 -0.805 -0.795 -0.785 -0.775
  [51] -0.765 -0.755 -0.745 -0.735 -0.725 -0.715 -0.705 -0.695 -0.685 -0.675
  [61] -0.665 -0.655 -0.645 -0.635 -0.625 -0.615 -0.605 -0.595 -0.585 -0.575
  [71] -0.565 -0.555 -0.545 -0.535 -0.525 -0.515 -0.505 -0.495 -0.485 -0.475
...
[2891] 27.635 27.645 27.655 27.665 27.675 27.685 27.695 27.705 27.715 27.725
[2901] 27.735 27.745 27.755 27.765 27.775 27.785 27.795 27.805 27.815 27.825
[2911] 27.835 27.845 27.855 27.865

and 2913 bin counts:

t$counts
   [1]  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0
  [25]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  [49]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  [73]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  [97]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 [121]  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 [145]  0  0  0  0  1  0  0  1  1  1  0  0  1  1  0  0  0  0  1  2  1  3  5  5
 [169]  2  1  4  3  3  3  3  4  7  6  7  4  7 11 12  4 10  3  8 12  9  8  7  8
 [193] 10  9  4 11  8  7  9 15  7  8 10  9  7 11 13 14 12 18 10 10 11  8 13  9
 [217]  9 17 14 16 13  9 11 12 11 12  8 14  7  9 11 11 13 21  7 12  8 12 21 12
...
[2713]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
[2737]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
[2761]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
[2785]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
[2809]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
[2833]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
[2857]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
[2881]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
[2905]  0  0  0  0  0  0  0  0  1

The python-based workaround is

nbins  <- as.integer( floor((M - m) / resolution) + 1)
bins2 <- seq(from = m, to = m + step * resolution, length.out = nbins + 1)

and nbins and bins2 are the same as in the Matlab-like version.

It's a little weird indeed. Anyway, the difference with the reference output of, e.g., Prewhite, is 0 if defined as a percentage (R - Mat)/Mat rounded to the 5th digit. In absolute value, the difference is roughly between 10E-10 and 10E-16.

TTYL, Alessandro

fpavogt commented 3 years ago

Ok, I found the problem. It's indeed related to floating point accuracy. All the misbehaving bins are associated with a data point that is the same as a bin edge. The first mismatch in the example above is bin 147 with edges [0.205, 0.215).

The data contains the number 0.215, which should belong to bin 148 with edges [0.215, 0.225) (all bins are semi-open on the right). But in my case, it falls into bin 147 because, when printed in full, the edges are [0.20500000000000007, 0.21500000000000008). That is, floating point errors make the bin edge slightly larger than it should be.

So the real question we have to answer to fix this mismatch is: can we implement a histogram algorithm that is robust to floating point errors ?
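
To make the diagnosis concrete, here is the comparison that goes wrong, using the two edge values quoted above (plain Python, nothing package-specific):

# The literal data value and the bin edge as it actually came out of the bin construction:
x = 0.215
edge = 0.21500000000000008

print(x < edge)   # True: x sits just below the drifted edge ...
print(edge - x)   # ... by a few ULPs of 0.215, so it is filed into the bin below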

abigmo commented 3 years ago

Wow, I thought it was not a floating point error... So if this is the case, I guess it is also operating system dependent, software dependent and hardware dependent, isn't it?

I guess it is quite beyond our scope :-D


fpavogt commented 3 years ago

Yes to all that. Trying to do the same thing with 3 different languages probably makes it more visible. But ultimately the problem here is with the concept of the histogram itself: it requires comparing numbers directly, and that's a straight road to floating point errors ... but if you have only one code base, you'd never see them ...

Still, I wonder if we could do something to improve the situation ... we can't fix the problem ... but maybe make the code more robust ? Something for the long term, evidently.
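
One possible direction, sketched here as an idea only (not the implemented mannkendall behaviour; the names robust_bin_counts and tol are hypothetical): skip the float bin edges altogether, compute an integer bin index per value, and nudge it by a tolerance far smaller than the data's own resolution, so that a value sitting on a nominal edge always lands in the bin to its right regardless of rounding noise.

import numpy as np

def robust_bin_counts(data, resolution, tol=1e-9):
    # Keep only the valid (non-NaN) values.
    data = np.asarray(data, dtype=float)
    valid = data[~np.isnan(data)]
    # Integer bin index per value, with a tiny nudge: rounding noise is of the
    # order of 1e-14 in index units, while genuine sub-resolution offsets in
    # the data are far larger, so the nudge only rescues edge-sitting values.
    idx = np.floor((valid - valid.min()) / resolution + tol).astype(int)
    # Count occurrences per index; index k corresponds to the nominal bin
    # [min + k*resolution, min + (k+1)*resolution).
    return np.bincount(idx)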