pydata / bottleneck

Fast NumPy array functions written in C
BSD 2-Clause "Simplified" License
1.04k stars, 101 forks

Unexpected behavior for bn.move_std with float32 array #164

Open d-chambers opened 7 years ago

d-chambers commented 7 years ago

Hello,

A colleague and I have noticed that calling move_std on a large float32 array can return some strange results, mainly zeros where the std of the window is not 0. Moreover, if move_std is recomputed on just the window of interest, it returns the correct result. When the array is cast to dtype np.float64 it seems to work fine. Is float32 not supported? If so, the operation should probably raise an exception, or at least a warning.

I have attached a jupyter notebook and the data demonstrating the issue. Basically:

import bottleneck as bn
import numpy as np
import pandas as pd
# read data
input_arr = np.load('PXZ.EHZ.npy')
std = bn.move_std(input_arr, 121)
zeros = np.where(std == 0)
assert not len(zeros[0]) 

fails because zeros were found in the std array, but

input_arr = np.load('PXZ.EHZ.npy').astype(np.float64)
std = bn.move_std(input_arr, 121)
zeros = np.where(std == 0)
assert not len(zeros[0]) 

does not fail as there are no zeros in the std array.
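For context, the "recompute just the window of interest" check works because the window feeding bn.move_std(arr, w)[i] is the trailing slice arr[i-w+1 : i+1], whose std can be computed directly with NumPy. An illustrative sketch on synthetic data (not the attached PXZ.EHZ.npy):

```python
import numpy as np

# For any suspect index i, recompute the std of the exact window that
# bn.move_std(arr, w)[i] would have seen: the trailing slice of length w.
w = 121
arr = np.random.rand(10_000).astype(np.float32)
i = 5_000
window = arr[i - w + 1 : i + 1]
direct_std = window.std()           # two-pass computation on just the window
print(window.size, direct_std > 0)  # 121 True
```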

bad_move.zip

kwgoodman commented 7 years ago

Hi there.

Can you paste a small segment of the big array? Perhaps 200 elements that contain a 121 element window that gives a std of zero.

d-chambers commented 7 years ago

Sure. What is interesting, though, as you can see in the notebook, is that if a smaller window is used the result is calculated more or less correctly. So when I perform the calculation on the window posted below it doesn't return any 0s.

-2.5284750462 -6.2919397354 -1.3226305246 -4.7851758003 -3.4860665798 -10.1597480774 -0.3701832891 13.9742021561 6.5433959961 9.634223938 3.0750653744 -12.6853837967 -16.4847431183 -4.0147423744 7.9780387878 2.9790670872 6.7037324905 6.0012392998 -2.1859781742 -0.5457134247 -2.6305334568 -5.8820185661 -6.7868590355 -5.2417421341 7.3230443001 14.1062440872 13.3010320663 -7.6624073982 -19.7685203552 1.7452163696 0.0990015566 -2.9588396549 6.7365036011 4.5946087837 1.9612246752 -2.0452444553 -0.4105802476 0.5827489495 -5.1045341492 0.777156651 3.9382896423 -7.2001643181 -5.1068882942 6.7578649521 4.9806294441 3.4133441448 5.1996588707 -1.8428272009 -7.0068020821 -9.3145847321 -5.6427431107 3.9759426117 5.8565983772 7.0034828186 5.9576721191 -1.3350609541 -4.2324938774 -1.5859234333 -2.2288668156 -4.2464022636 -5.2180476189 -1.6682722569 5.5020709038 10.4394388199 7.052025795 -4.0992021561 -6.9343070984 -2.9708602428 0.5221287012 -0.689814508 -4.2820692062 0.5706662536 7.7491393089 0.2529479265 -1.6038390398 4.6071925163 0.3350660801 -4.9038901329 -2.1772441864 0.3203429878 -5.3912835121 0.9395498037 4.5799665451 2.1893494129 4.1454181671 -0.318448782 -2.8871643543 -4.2155075073 -0.9479162693 1.1749888659 2.3553519249 2.1778562069 -1.5604808331 1.935454011 -0.0063718222 -7.5537371635 -7.497610569 6.4680433273 14.7130641937 4.2097949982 -10.5383806229 -11.7676992416 -0.9247381091 5.4548268318 12.6308956146 5.8996295929 -11.7950487137 -11.125579834 1.5490256548 6.2535657883 -0.1072820574 0.792358458 7.2746787071 1.5382611752 -8.3699884415 -4.8096442223 0.2626971006 -1.2519851923 3.4611811638 5.8206295967 2.9740099907 0.1990164816 -6.9784965515 -7.3488740921 2.0996313095 4.6389508247 5.0002398491 -2.8655395508 -3.1748898029 3.862085104 -2.5001163483 -1.1817802191 6.3094935417 -0.5251377225 -10.1771764755 3.0875940323 3.9778904915 -2.3620967865 3.7947404385 2.8270330429 0.4795057774 -8.6607837677 -7.7135744095 6.3234648705 4.7404131889 1.6732549667 5.7153167725 
-1.5910204649 -9.7333593369 -4.3945922852 4.1555552483 4.5729961395 2.9083378315 2.3797662258 -1.7646039724 -6.0924057961 -4.2171339989 1.7789765596 5.4465098381 0.6530832052 -3.2589774132 -0.5568723679 2.4662780762 8.0061149597 -0.5557681322 -15.4680337906 -3.8965210915 7.6350045204 4.0393557549 4.1909799576 3.2838480473 -1.5727992058 -4.0110783577 -3.5144693851 -8.7388019562 -3.2404100895 10.052942276 9.6923313141 5.9719820023 -4.6670451164 -7.6512498856 -1.4481172562 -3.516307354 -0.3272907436 2.1782839298 4.4382796288 4.0346212387 -5.1898503304 -9.1939210892 2.59324646 11.3465366364 4.2283205986 -3.2852146626 -7.3183469772 -1.5758877993 5.887509346 -0.4564013779 -6.5575261116 -4.3141489029 0.2136210501 3.84120965 5.5500483513 10.3696727753 4.7897620201 -9.656493187 -12.9593143463 -0.4358349442 2.6325674057 -8.4255027771 -0.7223656178 12.4292955399 16.2506198883 8.3799123764 -9.7949790955 -16.6930789948 -12.6975183487 -4.5630512238 3.2427294254 10.6418161392 14.281580925 10.4933614731 6.1766161919 -12.498128891 -26.3931293488 -5.6738448143 10.4533243179 2.6849167347 3.6418528557 13.0387125015 4.4636654854 -8.4340057373 -5.8683085442 -5.9987282753 -9.5630569458 5.0076861382 15.0639715195 6.5557694435 1.4748768806 -5.5807547569 -15.1983718872 -9.2029418945 8.4750299454 11.1006193161 2.8714213371 6.2090907097 1.6388434172 -12.5529975891 -8.2857303619 -7.8802065849 -3.5799911022 14.1643867493 14.4701957703 10.8143577576 1.8028203249 -12.4945497513 -18.0881004333 -11.1219739914 -2.3140103817 5.5928707123 20.4235877991 13.0771608353 -2.1964263916 -1.1900795698 -1.5465049744 -14.0739383698 -16.6487941742 0.1947135627 5.295147419 5.7168540955 12.7346935272 12.4139919281 -0.8161699772 -9.9928359985 -9.2182407379 -6.0655498505 -0.3915526271 2.9949274063 2.920861721 -0.1381985247 6.3467025757 4.5117349625 -2.6839256287 -2.6819007397 -5.9622364044 -4.4439377785 2.3083894253 4.4756808281 -0.7431480885 5.8435502052 4.8975563049 -4.7688260078 -1.6921800375 -5.5173192024 
-6.3787221909 5.0028986931 1.7507245541 2.7408602238 8.6380777359 -0.7344329953 -10.08979702 -10.8039121628 7.0985417366 11.0808601379 -5.0774583817 1.0460007191 8.5950498581 -0.1733292639 -8.4080076218 -12.6891260147 -2.3999023438 7.407122612 5.8537073135 1.7318782806 5.0176405907 6.2909750938 -3.8044655323 -6.6019325256 -9.0147886276 -2.2527375221 1.9441357851 -2.5591933727 6.0285921097 11.2414560318 4.0100588799 -4.4588136673 -3.5896859169 -4.2076358795 -6.7087316513 -2.4861242771 6.6369481087 12.6313762665 3.2913250923 -9.1072063446 -11.2915477753 -4.3825678819 10.3941774368 10.9443969727 0.434533149 -3.4727339745 -12.7594032288 -16.9814491272 -3.2974574566 7.6645927429 4.2998857498 7.9698944092 17.899154663125 4.2738776207 2.6338267326 -0.0810393319 10.3941774368 10.9443969727 0.434533149 -3.4727339745 -12.7594032288 -16.9814491272 -3.2974574566 7.6645927429 4.2998857498 7.9698944092 17.8991546631

kwgoodman commented 7 years ago

I'm trying to get a small example that I can easily play with. So if you have something along the lines of

In [1]: a = np.random.rand(20)
In [2]: bn.move_std(a, 5)
Out[2]: 
array([        nan,         nan,         nan,         nan,  0.29160087,
        0.22065454,  0.21211704,  0.22380035,  0.15067442,  0.15432015,
        0.16127856,  0.15335015,  0.23545563,  0.25747258,  0.24552778,
        0.24936578,  0.24384096,  0.19625498,  0.22205896,  0.25267315])
In [3]: a
Out[3]: 
array([ 0.08338058,  0.24385437,  0.57172368,  0.9237718 ,  0.56234984,
        0.45308886,  0.27521289,  0.73293583,  0.56480881,  0.60684832,
        0.39607517,  0.84789734,  0.13896913,  0.22843558,  0.46661061,
        0.53213284,  0.83079184,  0.41385475,  0.13899303,  0.17730305])

Then I can copy and paste and play.

kwgoodman commented 7 years ago

Oh, wait, so the array snippet you gave doesn't reproduce the problem? Yeah, there is path dependence, so the starting point can matter.

d-chambers commented 7 years ago

Unfortunately I haven't been able to reproduce it on a small dataset.

kwgoodman commented 7 years ago

I can reproduce the zero std with float32. First create an array:

In [1]: rs = np.random.RandomState(10)
In [2]: a = rs.rand(100000000)
In [3]: a[a < 0.05] = 100

No problems with float64:

In [4]: np.where(bn.move_std(a, 121) == 0)
Out[4]: (array([], dtype=int64),)

But problems with float32:

In [5]: np.where(bn.move_std(a.astype(np.float32), 121) == 0)
Out[5]: 
(array([16085106, 16085107, 16085119, 17870732, 17870737, 17870739,
        18208491, 25424792, 25445994, 25445995, 25445997, 25445998,
        25548672, 25548674, 25548679, 25548680, 25636840, 25636869,
        25654926, 27905892, 27905893, 27905898, 27905899, 28106880,
        28106881, 28106882, 28106885, 28106889, 28106890, 28106891,
        28106895, 28106896, 28106897, 28106898, 28283068, 32105056,
        32105057, 32105087, 32448657, 32448665, 33170366, 33170367,
        33170370, 33170371, 33170373, 33170376, 33170377]),)

Difference in std:

In [6]: bn.move_std(a, 121)[16085106]
Out[6]: 0.24272340779335716
In [7]: bn.move_std(a.astype(np.float32), 121)[16085106]
Out[7]: 0.0

It might help if I could create a smaller example. But right now I've got a headache from caulking the bathroom sink. That stuff is vile.

d-chambers commented 7 years ago

OK, judging by In [3] it looks like it has something to do with the spikes in the data? Interesting.

kwgoodman commented 7 years ago

I suspect that all this is related to https://github.com/kwgoodman/bottleneck/issues/99 and the corresponding PR.

I found a smaller example:

In [4]: for i in range(1000000):
   ...:     rs = np.random.RandomState(i)
   ...:     a = rs.rand(10)
   ...:     a[a < rs.rand()] = 100
   ...:     w32 = np.where(bn.move_std(a.astype(np.float32), 3) == 0)[0]
   ...:     w64 = np.where(bn.move_std(a, 3) == 0)[0]
   ...:     if w32.size > w64.size:
   ...:         print(i)
   ...:         print(a)
   ...:         print(w32)
   ...:         print(w64)
   ...:         break
   ...:     
1
[ 100.            0.72032449  100.          100.          100.          100.
  100.          100.          100.            0.53881673]
[4 5 6 7 8]
[]
In [5]: a
Out[5]: 
array([ 100.        ,    0.72032449,  100.        ,  100.        ,
        100.        ,  100.        ,  100.        ,  100.        ,
        100.        ,    0.53881673])
In [6]: bn.move_std(a.astype(np.float64),window=3, min_count=1)
Out[6]: 
array([  0.00000000e+00,   4.96398378e+01,   4.68008879e+01,
         4.68008879e+01,   5.50604123e-07,   5.50604123e-07,
         5.50604123e-07,   5.50604123e-07,   5.50604123e-07,
         4.68864514e+01])
In [7]: bn.move_std(a.astype(np.float32),window=3, min_count=1)
Out[7]: 
array([  0.        ,  49.63983917,  46.80088806,  46.80088806,
         0.        ,   0.        ,   0.        ,   0.        ,
         0.        ,  46.88644791], dtype=float32)

The zeros probably come from this line in the move_std C code:

                if (assqdm < 0) {
                    assqdm = 0;
                }

which I think is to prevent negative std due to round-off error.
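To make that concrete, here is a simplified Python sketch of a running-sums moving variance with the same clamp (an illustration of the suspected mechanism, not the actual C implementation, which uses a delta/amean/assqdm update). Once a large value has passed through the window, the round-off left behind in the float32 sums can push the computed variance negative, and the clamp then turns it into an exact zero:

```python
import numpy as np

def move_var_clamped(a, window, dtype):
    """Moving variance from running sums, with the round-off clamp (sketch)."""
    a = np.asarray(a, dtype=dtype)
    out = np.full(a.shape, np.nan, dtype=dtype)
    s = a.dtype.type(0)   # running sum of the window
    s2 = a.dtype.type(0)  # running sum of squares
    for i in range(a.size):
        s = s + a[i]
        s2 = s2 + a[i] * a[i]
        if i >= window:
            s = s - a[i - window]
            s2 = s2 - a[i - window] * a[i - window]
        if i >= window - 1:
            var = s2 / window - (s / window) ** 2
            if var < 0:  # the clamp from the C code: round-off -> exact 0
                var = a.dtype.type(0)
            out[i] = var
    return out

a = np.array([0.5, 10000.0, 0.5, 0.25, 0.75])
print(move_var_clamped(a, 3, np.float32)[-1])  # 0.0 (true variance ~0.0417)
print(move_var_clamped(a, 3, np.float64)[-1])  # ~0.0416667
```

In float32 the 10000 spike leaves a residual of -0.25 in the computed variance of the final window [0.5, 0.25, 0.75], so the clamp reports exactly 0; in float64 the same window comes out correct.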

kwgoodman commented 7 years ago

The change was made to handle this situation:

In [10]: a = np.array([1, 2, 3]) + 1e9
In [11]: bn.move_std(a, 2)
Out[11]: array([ nan,  0.5,  0.5])

That did not work before the change.
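The hazard being guarded against is catastrophic cancellation in the one-pass E[x^2] - E[x]^2 formula when the data sits on a large offset. A small NumPy illustration of why values like these need special care:

```python
import numpy as np

# Variance of [1, 2, 3] shifted by 1e9; the true variance is 2/3.
a = np.array([1.0, 2.0, 3.0]) + 1e9

# One-pass formula: both terms are ~1e18 and agree to ~18 digits, but
# float64 carries only ~16, so the subtraction cancels everything.
naive = (a ** 2).mean() - a.mean() ** 2
print(naive)   # 0.0

# Two-pass formula: subtracting the mean first keeps the squares small.
stable = ((a - a.mean()) ** 2).mean()
print(stable)  # 0.6666666666666666
```

A moving-window kernel that keeps running sums inherits the one-pass behavior, which is why slightly negative intermediate variances appear and need clamping.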

matteodefelice commented 7 years ago

This issue is possibly related to this one: https://github.com/pydata/xarray/issues/1346

adament commented 7 years ago

@kwgoodman I am not sure about the smaller example presented in your next-to-last post. If I understand the code correctly, you run a 3-element window over data that contains 7 repeated 100s. There should therefore be 5 windows consisting entirely of 100s, each with a sample standard deviation of 0. In this case the float64 example seems to suffer from some residual value from the earlier computations carrying over into the later windows, and thus does not give exactly 0.0. I am by no means an expert on floating point, but 5.50604123e-07**2 ≅ 3.0e-13, which I think is at the edge of 64-bit floating-point precision, and thus probably as good as can be expected from the variance computation.

But a fix for the real issue in the earlier example could be to do the calculations in higher precision, i.e. using float64 for delta, amean and assqdm, and only casting the output to float32? This would probably lead to a slowdown for float32 computations.
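As a rough illustration of that idea (a hypothetical Python sketch, not the actual C code): the input and output stay float32, but the running sums are held in float64:

```python
import numpy as np

def move_std_promoted(a, window):
    """Moving std over a float32 array with float64 accumulators (sketch)."""
    a32 = np.asarray(a, dtype=np.float32)
    out = np.full(a32.shape, np.nan, dtype=np.float32)
    s = 0.0   # running sum, held in float64
    s2 = 0.0  # running sum of squares, held in float64
    for i in range(a32.size):
        x = float(a32[i])               # promote each element once
        s += x
        s2 += x * x
        if i >= window:
            old = float(a32[i - window])
            s -= old
            s2 -= old * old
        if i >= window - 1:
            var = s2 / window - (s / window) ** 2
            out[i] = np.sqrt(max(var, 0.0))  # same clamp, far less often hit
    return out

a = np.array([100.0, 0.72032449] + [100.0] * 7 + [0.53881673])
print(move_std_promoted(a, 3))
```

On the 10-element spike example from earlier in the thread, this tracks the exact per-window standard deviation to within float32 tolerance.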

kwgoodman commented 7 years ago

@adament Keeping the intermediate values in float64 sounds like a great idea. I'll keep it on my list of things to test.

kwgoodman commented 7 years ago

@adament In order to test the idea of using float64 intermediate values, I need a unit test that demonstrates the problem. Does anyone have one? It may take me a while to get around to testing this so anyone is welcome to give it a try.

adament commented 7 years ago

I think this is a small illustrative example (though not minimal):

a = np.empty(10)
a[:5] = 1.0e9
a[5::2] = 1.0
a[6::2] = 0.0
bn.move_std(a, 3)
bn.move_std(a.astype(np.float32), 3)

In this case neither the 64-bit nor the 32-bit floating-point computation is correct in the end, as too large a residual from the earlier large values remains.
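That can be seen directly with float64 running sums over this array (a sketch of the mechanism, not bottleneck's actual code): with values of 1e9 the squares are around 1e18, and near 3e18 the spacing between adjacent float64 values is 512, so adding 1.0 to the sum of squares is lost entirely:

```python
import numpy as np

# The array above: five 1e9s followed by alternating 1.0 / 0.0.
a = np.empty(10)
a[:5] = 1.0e9
a[5::2] = 1.0
a[6::2] = 0.0

# Float64 running sums over a window of 3, as a moving-window kernel keeps them.
window = 3
s = s2 = 0.0
for i in range(a.size):
    s += a[i]
    s2 += a[i] * a[i]       # adding 1.0 to ~3e18 is a no-op in float64
    if i >= window:
        s -= a[i - window]
        s2 -= a[i - window] ** 2

# Final window is [1.0, 0.0, 1.0]: true variance is 2/9, but the running
# sum of squares has lost every small contribution.
var = s2 / window - (s / window) ** 2
print(s2)   # 0.0 -- the 1.0 contributions never made it into the sum
print(var)  # negative; a kernel with the round-off clamp would report 0
```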

kwgoodman commented 7 years ago

If it works for neither 32 nor 64 bit then I don't think I can use it for a unit test (?)

adament commented 7 years ago

@kwgoodman I am sorry for the late reply, you can fix the test case such that it works for the 64-bit case by making it a bit less extreme:

a = np.empty(10)
a[:5] = 1.0e6
a[5::2] = 1.0
a[6::2] = 0.0
bn.move_std(a, 3)
bn.move_std(a.astype(np.float32), 3)
bn.move_std(a, 3) - np.asarray([np.nan, np.nan] + [a[i:i+3].std() for i in range(8)])

Then the error of the 64-bit computation compared to the non-rolling computation is on the order of 1e-5, which is about 4 digits of agreement. I suspect it will be hard to improve significantly on this while retaining a moving-window method of computation that accumulates round-off error.

jkadbear commented 2 years ago

@adament In order to test the idea of using float64 intermediate values, I need a unit test that demonstrates the problem. Does anyone have one? It may take me a while to get around to testing this so anyone is welcome to give it a try.

Is the idea of "using float64 intermediate values" still on the TODO list? I have tested this idea and it actually mitigates the precision problem when using float32, and it only requires a few lines of code changes. For example:

https://github.com/pydata/bottleneck/blob/b8f4c52dd5a4cf26a58c9906bc823b8838ee5bac/bottleneck/src/move_template.c#L159

Just change it to npy_float64 asum, ai, aold, count_inv;.

rdbisme commented 2 years ago


Hello @jkadbear, more information on the current state of the project can be found in the latest messages here: #388. TL;DR: the state of the master branch was partially incomplete, so we reverted to what it was at the latest release and applied some bugfixes, especially for the build, in order to make the package installable again with new versions of Python. The previous state is in the develop branch.

PR accepted for this feature :)

jkadbear commented 2 years ago

PR is made. https://github.com/pydata/bottleneck/pull/407.