Closed yqiuu closed 4 years ago
Thanks for reporting the issue. Are the pair-counts different between `DD` and `xi` for the same precision? Or are the pair-counts different if the precision is different? The second case is expected; the first one is not.
@yqiuu Could you please attach the differences that you see in the pair-counts?
The pair-counts are different between runs when using `float`. With the above arrays, if I run

import Corrfunc
r_bins = np.logspace(0, 1., 20)
for i in range(10):
    print(Corrfunc.theory.DD(1, 4, r_bins, x, y, y)['npairs'])

I got:

[ 0 2 0 0 0 0 0 0 2 0 2 4 4 6 6 10 6 8 14]
[ 0 0 0 2 0 0 0 0 2 0 2 4 4 6 6 10 6 8 14]
[ 0 2 0 0 0 0 0 0 2 0 2 4 4 6 6 10 6 8 14]
[ 0 0 0 2 0 0 0 0 2 0 2 4 4 6 6 10 6 8 14]
[ 0 2 0 0 0 0 0 0 2 0 2 4 4 6 6 10 6 8 14]
[ 0 0 0 2 0 0 0 0 2 0 2 4 4 6 6 10 6 8 14]
[ 0 2 0 0 0 0 0 0 2 0 2 4 4 6 6 10 6 8 14]
[ 0 0 0 2 0 0 0 0 2 0 2 4 4 6 6 10 6 8 14]
[ 0 2 0 0 0 0 0 0 2 0 2 4 4 6 6 10 6 8 14]
[ 0 0 0 2 0 0 0 0 2 0 2 4 4 6 6 10 6 8 14]
This happens by chance; sometimes it gives the same results. The difference seems to appear only in the small-scale bins.
I can't reproduce this. Even with 10K iterations, I always get:
[ 0 0 0 0 0 0 0 0 2 0 2 4 4 6 6 10 6 8 14]
(no floating "2" in the small bins).
@yqiuu Ozstar is an AVX-512 machine, right? I currently don't have access to my AVX-512 machines, so I'm testing on AVX. Can you try `isa="avx"` or `isa="fallback"`? And, separately, `nthreads=1`?
@yqiuu Is the repeating `y` for the 3rd coordinate intended, or should that be `z`?
It should be `z`. However, if this is set to `z`, the test arrays are not suitable for showing the problem, because the small-scale bins are always zero. Sorry for the confusion.
I have tried the following:
import numpy as np
x = np.array([154.24363377, 71.10038029, 96.5765405 , 48.12103686,
82.58892783, 78.97697358, 93.87293244, 63.51212846,
124.85204405, 57.94164262, 107.20370478, 124.08584903,
50.69274732, 68.56647771, 23.08430836, 86.27321003,
44.70834559, 14.66498218, 87.3814769 , 78.25190344,
135.23040918, 147.84211733, 55.15903248, 151.56787312,
101.15769132, 21.08061626, 76.57928182, 38.66539445,
98.63250943, 108.9545778 , 107.37573062, 63.26940503,
69.84954268, 70.53042064, 115.75825873, 112.72305389,
109.16422456, 15.55010734, 18.56082778, 94.46919022,
96.822737 , 135.86592101, 79.08883352, 39.43899245,
21.01567216, 24.08500134, 47.45628488, 105.42902743,
117.67873535, 71.29574062, 126.14864194, 83.56938351,
44.29643485, 150.95074749, 20.56340634, 52.90839107,
92.50084288, 20.91060155, 106.00771001, 130.98488988,
62.67192188, 108.37369926, 152.26563635, 143.28643325,
150.79206123, 132.71580348, 103.52156827, 74.99863632,
24.72138599, 114.76770511, 38.8045429 , 55.22412623,
134.21451835, 87.43394324, 143.16259552, 89.53111289,
108.47361749, 117.80486046, 151.3227361 , 1.21481224,
35.51477847, 46.47684359, 105.96224277, 106.17077147,
114.7488886 , 75.39078277, 83.82819297, 78.95748582,
68.03988315, 152.67753467, 146.53283589, 61.52860312,
111.18404188, 53.12718738, 117.12352032, 29.46045605,
94.8644012 , 129.49183688, 143.80630815, 100.70286954], dtype='f4')
y = np.array([ 74.20041724, 112.23414236, 4.19799531, 104.6231374 ,
73.94797063, 93.25782718, 70.73451261, 122.47726835,
54.94785838, 23.08725315, 21.7301481 , 5.32000165,
35.49688556, 22.61295084, 152.98133789, 138.92512839,
132.67690453, 130.5950394 , 25.55602135, 139.82615298,
79.5565082 , 119.16408847, 25.03003374, 145.26494514,
26.221811 , 104.56873926, 35.02368834, 117.47734797,
39.58858194, 71.81908595, 74.23309012, 116.64696636,
125.22635326, 82.78500574, 108.8985576 , 58.73353201,
83.65154655, 102.3107715 , 40.00122087, 20.7800799 ,
23.45036569, 79.96601625, 92.99591359, 64.07818137,
133.70896864, 55.05638139, 30.24035837, 86.75711198,
139.31182043, 41.49271995, 50.41986821, 84.86378206,
65.69877637, 145.74277115, 150.929154 , 115.92575893,
42.84789328, 104.72797859, 57.45049599, 54.32509986,
87.8640008 , 14.47864149, 144.88166787, 63.38974637,
153.43910595, 70.12707422, 95.95983099, 92.51392252,
49.56774198, 144.2036191 , 72.90552511, 98.62399015,
72.67972087, 25.81617475, 87.16477047, 145.10919636,
53.73138469, 117.54595522, 78.81040562, 17.78860576,
103.25634548, 107.38304925, 92.97102858, 7.27206935,
20.29002578, 94.17857418, 16.11068757, 28.75440891,
91.66769992, 144.48011626, 126.51270665, 48.11642645,
35.06381179, 72.19645013, 38.54956658, 149.63526264,
22.10689507, 61.5182437 , 65.87864166, 30.68701823], dtype='f4')
z = np.array([146.21647577, 61.08008498, 97.13488482, 127.15938951,
55.6723131 , 89.64574052, 47.93942781, 44.56659984,
117.26142093, 136.5905654 , 64.01751132, 82.70478123,
150.79237621, 130.89496349, 2.55095738, 44.55404248,
100.39455657, 4.54609139, 63.88983754, 29.80492839,
97.41935009, 53.45799508, 144.58768173, 114.88320066,
17.4472339 , 4.82534698, 53.29331451, 12.6831773 ,
90.26226647, 92.76079878, 92.05478613, 63.06079921,
55.56329305, 71.22088986, 0.81801217, 81.84834028,
92.78697394, 68.13169696, 126.38855318, 52.0838586 ,
144.41369092, 96.69167869, 90.09059216, 32.97819875,
140.87342901, 97.70859266, 16.88023473, 32.49728365,
81.79113339, 76.16232256, 89.93757267, 133.09564599,
28.82538725, 113.22780013, 11.54381614, 26.69260803,
31.57219653, 58.21263806, 44.81594301, 91.15262872,
79.6094962 , 15.29534428, 115.83037606, 105.65402212,
27.31134528, 83.21215073, 33.9846585 , 89.36345496,
126.5466004 , 139.75887164, 8.89474904, 94.42065605,
99.4874119 , 63.88485714, 91.53668585, 39.02256202,
32.51451294, 109.6806885 , 138.28341225, 144.34144986,
20.5048495 , 121.70327105, 22.15312068, 19.46025113,
134.06281506, 122.78666854, 46.43533964, 108.87652769,
28.09200362, 116.51327568, 91.05322106, 144.52792024,
38.17852291, 131.09903518, 48.63438798, 147.10313919,
47.86855971, 86.28621964, 105.10498981, 51.34929809], dtype='f4')
import Corrfunc
r_bins = np.logspace(0, 1., 20)
print("isa='avx'")
for i in range(10):
    print(Corrfunc.theory.DD(1, 4, r_bins, x, y, z, isa='avx')['npairs'])
print("isa='fallback'")
for i in range(10):
    print(Corrfunc.theory.DD(1, 4, r_bins, x, y, z, isa='fallback')['npairs'])
print("nthreads=1")
for i in range(10):
    print(Corrfunc.theory.DD(1, 1, r_bins, x, y, z)['npairs'])
The output I get is:

isa='avx'
[ 0 10 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 2 2 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 0 10 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 2 2 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 0 10 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 2 2 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 0 10 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 2 2 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 0 10 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 2 2 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
isa='fallback'
[ 0 10 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 2 2 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 0 10 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 2 2 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 0 10 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 2 2 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 0 10 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 2 2 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 0 10 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 2 2 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
nthreads=1
[ 0 10 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 2 2 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 0 10 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 0 0 0 10 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 0 0 0 10 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 2 2 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 2 2 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 2 2 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 2 2 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 2 2 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]

The results of each run are different.
Okay, so it does not appear to be a multithreading or instruction-set issue. Can you also try `enable_min_sep_opt=False`, and separately `copy_particles=False`? I can't imagine why either of these would be platform-dependent though...
If neither of those works, could you try building the latest version of Corrfunc from Github? It's usually pretty straightforward: https://github.com/manodeep/Corrfunc/#preferred-install-method. Just make sure that it's not finding your old pip version, perhaps by creating a new conda environment.
@manodeep, have you been able to reproduce this? I am still unable to. Could you test it on Ozstar?
@yqiuu I cannot reproduce this on ozstar.
import numpy as np
x = np.array([154.24363377, 71.10038029, 96.5765405 , 48.12103686,
82.58892783, 78.97697358, 93.87293244, 63.51212846,
124.85204405, 57.94164262, 107.20370478, 124.08584903,
50.69274732, 68.56647771, 23.08430836, 86.27321003,
44.70834559, 14.66498218, 87.3814769 , 78.25190344,
135.23040918, 147.84211733, 55.15903248, 151.56787312,
101.15769132, 21.08061626, 76.57928182, 38.66539445,
98.63250943, 108.9545778 , 107.37573062, 63.26940503,
69.84954268, 70.53042064, 115.75825873, 112.72305389,
109.16422456, 15.55010734, 18.56082778, 94.46919022,
96.822737 , 135.86592101, 79.08883352, 39.43899245,
21.01567216, 24.08500134, 47.45628488, 105.42902743,
117.67873535, 71.29574062, 126.14864194, 83.56938351,
44.29643485, 150.95074749, 20.56340634, 52.90839107,
92.50084288, 20.91060155, 106.00771001, 130.98488988,
62.67192188, 108.37369926, 152.26563635, 143.28643325,
150.79206123, 132.71580348, 103.52156827, 74.99863632,
24.72138599, 114.76770511, 38.8045429 , 55.22412623,
134.21451835, 87.43394324, 143.16259552, 89.53111289,
108.47361749, 117.80486046, 151.3227361 , 1.21481224,
35.51477847, 46.47684359, 105.96224277, 106.17077147,
114.7488886 , 75.39078277, 83.82819297, 78.95748582,
68.03988315, 152.67753467, 146.53283589, 61.52860312,
111.18404188, 53.12718738, 117.12352032, 29.46045605,
94.8644012 , 129.49183688, 143.80630815, 100.70286954], dtype='f4')
y = np.array([ 74.20041724, 112.23414236, 4.19799531, 104.6231374 ,
73.94797063, 93.25782718, 70.73451261, 122.47726835,
54.94785838, 23.08725315, 21.7301481 , 5.32000165,
35.49688556, 22.61295084, 152.98133789, 138.92512839,
132.67690453, 130.5950394 , 25.55602135, 139.82615298,
79.5565082 , 119.16408847, 25.03003374, 145.26494514,
26.221811 , 104.56873926, 35.02368834, 117.47734797,
39.58858194, 71.81908595, 74.23309012, 116.64696636,
125.22635326, 82.78500574, 108.8985576 , 58.73353201,
83.65154655, 102.3107715 , 40.00122087, 20.7800799 ,
23.45036569, 79.96601625, 92.99591359, 64.07818137,
133.70896864, 55.05638139, 30.24035837, 86.75711198,
139.31182043, 41.49271995, 50.41986821, 84.86378206,
65.69877637, 145.74277115, 150.929154 , 115.92575893,
42.84789328, 104.72797859, 57.45049599, 54.32509986,
87.8640008 , 14.47864149, 144.88166787, 63.38974637,
153.43910595, 70.12707422, 95.95983099, 92.51392252,
49.56774198, 144.2036191 , 72.90552511, 98.62399015,
72.67972087, 25.81617475, 87.16477047, 145.10919636,
53.73138469, 117.54595522, 78.81040562, 17.78860576,
103.25634548, 107.38304925, 92.97102858, 7.27206935,
20.29002578, 94.17857418, 16.11068757, 28.75440891,
91.66769992, 144.48011626, 126.51270665, 48.11642645,
35.06381179, 72.19645013, 38.54956658, 149.63526264,
22.10689507, 61.5182437 , 65.87864166, 30.68701823], dtype='f4')
import Corrfunc
r_bins = np.logspace(0, 1., 20)
nthreads = 4
autocorr = 1
isas = ['avx512f', 'avx', 'fallback']
for isa in isas:
    print("isa={}".format(isa))
    for i in range(10):
        print(Corrfunc.theory.DD(autocorr, nthreads, r_bins, x, y, y,
                                 isa=isa)['npairs'])
nthreads = 1
print("nthreads= {}".format(nthreads))
for i in range(10):
    print(Corrfunc.theory.DD(autocorr, nthreads, r_bins, x, y, y)['npairs'])
The output is:
[~/temp/test_yisheng_bug @farnarkle1] python test_script.py
isa=avx512f
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
isa=avx
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
isa=fallback
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
nthreads= 1
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
@yqiuu I will also note that the results I get are different from what you get.
Modules loaded on ozstar:
[~/temp/test_yisheng_bug @farnarkle1] ml
Currently Loaded Modules:
1) slurm/.latest (H,S) 3) gcccore/7.3.0 5) gcc/7.3.0 7) openblas/0.2.20 9) scalapack/2.0.2-openblas-0.2.20 11) python/3.6.4 13) gsl/2.5
2) nvidia/.latest (H,S) 4) binutils/2.30 6) openmpi/3.0.0 8) fftw/3.3.7 10) sqlite/3.21.0 12) numpy/1.14.2-python-3.6.4
[~/temp/test_yisheng_bug @farnarkle1] python
Python 3.6.4 (default, Mar 26 2018, 15:30:40)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import Corrfunc
>>> print(Corrfunc.__version__)
2.3.0
@manodeep In the second example, I use `z` instead of `y`. Otherwise, I get the same results as you.
However, I still have the "different results" problem. I need to think about this.
@yqiuu Thanks for letting me know.
It must be that something is wrongly set in your environment. My recommendation would be the following steps:

ml purge
unset PYTHONPATH
ml numpy/1.14.2-python-3.6.4 gsl/2.5
python -m pip install Corrfunc --user
python test_script.py

(note that this will install Corrfunc v2.3.1)
@yqiuu I can now reproduce this behaviour on ozstar
@yqiuu Upon further digging, it looks like a combination of `AVX512F` and `gcc7.3.0` is causing the non-determinism. This issue might take a while to solve, but for now I would recommend using `icc2018` to compile Corrfunc.
For reference, I could not reproduce the issue on these platforms + toolchains:
- gcc7.3
- gcc7.3
- icc2018
- gcc7.3, but with AVX512F disabled

If you do need to stick to gcc7.3, then you can disable AVX512F by adding `-mno-avx512f` in the `common.mk` file in the root directory, running `make distclean`, and then installing.
@lgarrison Do you have any ideas on how we could work around this compiler issue, or at least warn the users with AVX512F kernels compiled with gcc7.3?
Is it fair to say that we don't know it's a compiler bug yet? It could just be a race condition (or use of uninitialized data, etc) with a lower reproduction rate with other compilers. Maybe the best workaround would be to detect the compiler version in the Makefile and issue a warning or disable AVX512F appropriately.
Speaking of uninitialized data, we've never gotten valgrind to work with AVX512, have we? Maybe we could try Intel Inspector?
You make a good point - I was simply going with the fact that the code works with `icc`. There is still no valgrind for avx512 (open issue here).
What is Intel Inspector?
It's what came up when I googled "intel valgrind alternative"! I've never used it myself. This old article suggests it has Valgrind-like functionality though: https://software.intel.com/en-us/articles/transitioning-from-valgrind-tools-to-intel-inspector-xe
And since it's an Intel tool I would hope that it would have full AVX-512 support.
Could not get Inspector to work on ozstar, but it does look promising.
Btw, I now remember why I think this is a compiler bug -- the issue is in all the AVX512F/AVX/SSE kernels when AVX512F is enabled, but disappears from the AVX/SSE kernels when AVX512F is disabled. And we have tested the AVX/SSE kernels for the use of uninitialised data, and I confirmed by running valgrind on this specific example on the sstar node. Running with all of the `-fsanitize` options on the AVX512F machine did not produce any errors, but the error itself might be beyond the purview of `fsanitize` anyway.
I also played around a bit with the data yesterday - it turns out that whether the issue occurs (or, more accurately, how frequently it occurs) is sensitive to the total number of bins. Also, the garbage values depend on `Rmin`, i.e., the incorrect values produced depend on the correct result. I will also note that I have intermittent difficulty producing the wrong result from the shell - I have a shell script that will frequently run ~10^5 times without detecting any mismatch between consecutive runs.
All that is to say, regardless of how/where this is coming from, it is a weird bug.
That does sound more like a compiler bug! Have you tried disabling AVX512 for all compilation units but the kernel itself? That might narrow down the possible sources considerably.
I might do the opposite for starters: only disable AVX512F everywhere except within the kernels file and see what happens. And then try what you suggested.
The bottleneck is that I am unsure how to push and pop compile options. Will give it a shot.
I was able to reproduce the bug and track it to the optimization flag `-fvect-cost-model` added by `-O3`. From https://gcc.gnu.org/onlinedocs/gcc-7.4.0/gcc/Optimize-Options.html#Optimize-Options:

-fvect-cost-model=model
    Alter the cost model used for vectorization. The model argument should be one of ‘unlimited’, ‘dynamic’ or ‘cheap’. With the ‘unlimited’ model the vectorized code-path is assumed to be profitable while with the ‘dynamic’ model a runtime check guards the vectorized code-path to enable it only for iteration counts that will likely execute faster than when executing the original scalar loop. The ‘cheap’ model disables vectorization of loops where doing so would be cost prohibitive for example due to required runtime checks for data dependence or alignment but otherwise is equal to the ‘dynamic’ model. The default cost model depends on other optimization flags and is either ‘dynamic’ or ‘cheap’.

Using `-O3 -fvect-cost-model=cheap` seems to fix the problem.
I've found two mentions of a similar bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?format=multiple&id=78807 and https://github.com/godotengine/godot/issues/4623, both of which point the finger at GCC not respecting alignment. This seems plausible, given that the issue only appears when we introduce single precision arrays.
I suspect we could dig deeper and find the bugged loop and make a minimal reproducer, but in the meantime we should push a fix soon.
@manodeep, can you confirm the fix on your end? @yqiuu, it would be great if you could test it too. Just add `-fvect-cost-model=cheap` to this line: https://github.com/manodeep/Corrfunc/blob/master/common.mk#L205.
Thanks @lgarrison. I am curious - how did you track this down?
Given that even the floating-point version works frequently, alignment seems likely to be the culprit. But even when the results are wrong, they are always the same wrong values. Does this mean that what matters is the alignment of the `npairs` arrays?
First, I found that `copy_particles=False` led to a higher reproduction rate (nearly 100%). Then, since we thought it could be a compiler bug, I backed off to `-O0` and saw it didn't break, so I increased the optimization level until it did. The reproduction rate went from 0 to 100% between `-O2` and `-O3`, so from there I just had to try all the `-O3` flags until one broke it.
Maybe the `npairs` alignment is the problem? But it should always be 8-byte aligned, while the floating-point arrays will be 4-byte aligned half the time (I hope!). And since the code doesn't break with 8-byte floating-point arrays, I'd guess the 4-byte alignment was the culprit.
But whatever loop/code/array is at fault must be basically identical between all kernels (or outside the kernels?), because the behavior is the same for all `isa`.
So, do we add code to `common.mk` to parse the GCC version and add this flag as necessary? Or do we try to figure out what code pattern triggers it and avoid writing it?
Also just noting that this bug is present in GCC 7.4 as well.
If `copy_particles=False` leads to a higher failure rate, then the alignment of the input arrays might matter. In your script, will you please print out the memory addresses of the `X1/Y1/Z1/npairs` arrays, so that we can see whether the bug has anything to do with alignment? (My internet connection in India is too unstable currently for productive work.)
In the short term, perhaps add `-fvect-cost-model=cheap` into `common.mk` while we figure out a workaround in the code (assuming we can pinpoint the problem-causing section).
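A quick way to do that address check from the Python side is sketched below. This only inspects the user-facing NumPy arrays (the internal X1/Y1/Z1 copies live inside the C extension and are not directly visible), and `align_offset` is a hypothetical helper name, not part of Corrfunc:

```python
import numpy as np

def align_offset(arr, boundary=64):
    """Offset of the array's data pointer from a `boundary`-byte boundary."""
    return arr.ctypes.data % boundary

rng = np.random.RandomState(42)
x = rng.uniform(0, 150, 100).astype(np.float32)
y = rng.uniform(0, 150, 100).astype(np.float32)
z = rng.uniform(0, 150, 100).astype(np.float32)

for name, arr in (("x", x), ("y", y), ("z", z)):
    print("{}: address=0x{:x}, offset from 64 bytes = {}".format(
        name, arr.ctypes.data, align_offset(arr)))
```

Running this a few times shows whether the allocator hands back different alignments from run to run, which is one way run-to-run non-determinism could sneak in.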
@lgarrison I can confirm that adding the `-fvect-cost-model=cheap` seems to fix the problem on the ozstar platform (where the issue was originally detected).
The cause appears to be that the `rupp_sqr` array is not computed correctly. I.e. this loop goes in with correct `rupp` values but sometimes comes out with the first few `rupp_sqr` elements wrong (zero, NaN, or garbage):

for(int i=0; i < nrpbin; i++) {
    rupp_sqr[i] = rupp[i]*rupp[i];
}

Source is: https://github.com/manodeep/Corrfunc/blob/master/theory/DD/countpairs_impl.c.src#L394-L396
Note that `rupp` is double and `rupp_sqr` is `DOUBLE` (i.e. float). There are a couple of `rupp`/`rupp_sqr` 64-byte misalignment values that seem to break it, like 48/48; and some that seem to work, like 16/48. But the exact pattern is unclear.
Seems easy enough to reproduce in Corrfunc, but it doesn't work in a minimal reproducer (with all these alignments and types forced by hand). I'll keep trying.
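For what it's worth, the "alignments forced by hand" part is easy to mimic from NumPy too. Below is a generic sketch (the `misaligned` helper is hypothetical, not part of Corrfunc or the C reproducer) that produces arrays at a chosen offset from a 64-byte boundary, e.g. the 48/48 combination mentioned above:

```python
import numpy as np

def misaligned(n, dtype, offset, boundary=64):
    """Return a length-n array whose data pointer sits `offset` bytes
    past a `boundary`-byte alignment boundary."""
    itemsize = np.dtype(dtype).itemsize
    # Over-allocate a byte buffer so we can pick any starting offset.
    buf = np.zeros(n * itemsize + boundary + offset, dtype=np.uint8)
    # Distance from buf's base address to the next boundary, plus the offset.
    start = (-buf.ctypes.data) % boundary + offset
    view = buf[start:start + n * itemsize].view(dtype)
    assert view.ctypes.data % boundary == offset
    return view

rupp = misaligned(20, np.float64, 48)      # double bins, 48 bytes past a 64-byte line
rupp_sqr = misaligned(20, np.float32, 48)  # float destination, same misalignment
print(rupp.ctypes.data % 64, rupp_sqr.ctypes.data % 64)  # 48 48
```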
The loop does work if `rupp` is `DOUBLE`. But it's nice to preserve the full double value until after the multiplication.
And I've checked that we're allocating `rupp` memory correctly in setup_bins (at least by reading the code).
Okay, I have a reproducer! Believe it or not, even a simple loop that copies the elements (no squaring) fails. I've linked the example, can you run it and let me know if it also fails for you? If so, I will submit a GCC bug report.
https://gist.github.com/lgarrison/5ebb747d3f1126319fd3744f925298e3
@lgarrison That's fantastic! And very disconcerting that such a simple piece of code can trigger the bug!
That said, I copy-pasted the code in the gist on ozstar, but could not reproduce the bug :(
[~/temp/test_yisheng_bug @farnarkle2] ml
Currently Loaded Modules:
1) nvidia/.latest (H,S) 4) binutils/2.30 7) openmpi/3.0.0 10) git/2.18.0 13) scalapack/2.0.2-openblas-0.2.20 16) numpy/1.14.2-python-3.6.4
2) slurm/.latest (H,S) 5) gcc/7.3.0 8) szip/2.1.1 11) openblas/0.2.20 14) sqlite/3.21.0
3) gcccore/7.3.0 6) gsl/2.5 9) hdf5/1.10.1 12) fftw/3.3.7 15) python/3.6.4
[~/temp/test_yisheng_bug @farnarkle2] gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/apps/skylake/software/core/gcccore/7.3.0/libexec/gcc/x86_64-pc-linux-gnu/7.3.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../configure --enable-languages=c,c++,fortran --enable-lto --enable-checking=release --disable-multilib --enable-shared=yes --enable-static=yes --enable-threads=posix --enable-gold=default --enable-plugins --enable-ld --with-plugin-ld=ld.gold --prefix=/apps/skylake/software/core/gcccore/7.3.0 --with-local-prefix=/apps/skylake/software/core/gcccore/7.3.0 --enable-bootstrap --with-isl=/dev/shm/easybuild/GCCcore/7.3.0/dummy-/gcc-7.3.0/stage2_stuff
Thread model: posix
gcc version 7.3.0 (GCC)
[~/temp/test_yisheng_bug @farnarkle2] gcc -mavx512f -O3 -fPIC corrfunc_bug_gh193.c -o corrfunc_bug_gh193
[~/temp/test_yisheng_bug @farnarkle2] ./corrfunc_bug_gh193
loaded r (offset 16 from 64 bytes): 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
loaded r_float (offset 32 from 64 bytes): 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
No bug.
I was trying to diagnose the errant code produced but got lost in the sea of assembly. Link to godbolt here
@lgarrison Will you please change the declaration to

DOUBLE rupp_sqr[nrpbin] __attribute__((aligned(64)));

and see if the bug still occurs? Maybe we could also try `posix_memalign` for `rupp`.
The other option is to replace the stack array with a malloc'ed array for `rupp_sqr`, and see if that makes any difference.
Ahh, too bad it won't reproduce on ozstar! That's Skylake, right? Unfortunately, I cannot get Corrfunc to break on Skylake, only Cascade Lake.
I also note that the alignment values are different on Skylake; it's probably related. In earlier iterations of the bug, I did try using `posix_memalign` to force alignments that I had seen fail. But it didn't work; I suspect that (mis-)alignment is a necessary but not sufficient condition.
Forcing the stack alignment of `rupp_sqr` does seem to be one way to make the bug go away on Cascade Lake. But many other things (like removing unexecuted lines of code) also make the bug go away! Still, it may be our best option for working around the bug.
@manodeep, since you can get Corrfunc to break on ozstar, do you think it is worth it to try to make a reproducer there? That would strengthen the GCC bug report. On the other hand, the bug is pretty finicky, so it may not be worth the time to whittle down the code.
@lgarrison Yup, ozstar has Skylake CPUs - this bug is too finicky! At least you have the reproducer on Cascade Lake - maybe that will be sufficient. Are you okay with submitting the bug report?
Yes, except that I tried it on a different Cascade Lake machine and it won't trigger. But maybe I'll file anyway and see if someone else can reproduce...
While preparing the GCC bug report, I found a report of a similar sounding issue: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91837
The root cause of which is this bug in the assembler in binutils 2.30: https://sourceware.org/bugzilla/show_bug.cgi?id=23465
My Cascade Lake platform is running binutils 2.30 and I can reproduce the simple assembly example that demonstrates the bug. So I'm optimistic that upgrading binutils may fix the Corrfunc bug! But now I have to figure out how to upgrade binutils...
In the meantime, could you check the binutils version on Ozstar (`as --version`)?
Okay, I built binutils 2.32 from source and told GCC to use it with `gcc -B`. It fixes the bug both in the reproducer and in Corrfunc proper!
So... mitigations? Unfortunately, my reading of the `as` bug is that it's rather general, meaning we could probably work around it in this one loop, but there's no guarantee it wouldn't pop up in another place in the code. But I also think it only happens for AVX-512 (the assembly example in the second link above only produces the wrong answer for `%zmm0`, not `%ymm0`). So maybe we check the binutils version at compile time and only emit AVX[2] code if it's a bad version?
The output on ozstar:
[~/temp/test_yisheng_bug @farnarkle2] as --version
GNU assembler (GNU Binutils) 2.30
Copyright (C) 2018 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or later.
This program has absolutely no warranty.
This assembler was configured for a target of `x86_64-pc-linux-gnu'.
[~/temp/test_yisheng_bug @farnarkle2]
So the assembler is v2.30. I will put in a request to update the binutils on ozstar, if possible. In the meantime, perhaps we can issue a patch to disable `AVX512F` while compiling with gcc and with `as <= 2.30`. @lgarrison What do you think?
I will double check, but I think <=2.29 and >=2.31 might be safe. The bug might only be present in that one version. I'll run some tests.
Based on the binutils changelog and my tests, 2.30 and 2.31 have the bug, and <=2.29.1 and >=2.32 are safe. I can work on a patch to disable AVX512 for the bugged versions.
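The version gate could then look something like the Python sketch below (`gas_version` and `avx512_safe` are hypothetical helper names; the real check would live in `common.mk`):

```python
import re

# Per the tests above: GAS 2.30 and 2.31 mis-assemble certain AVX-512
# instructions, while <= 2.29.1 and >= 2.32 appear safe.
BUGGY_GAS = {(2, 30), (2, 31)}

def gas_version(text):
    """Parse (major, minor) out of `as --version` output; None if not GAS."""
    m = re.search(r"GNU Binutils\) (\d+)\.(\d+)", text)
    return (int(m.group(1)), int(m.group(2))) if m else None

def avx512_safe(version_text):
    """True only when the assembler is GAS of a known-good version."""
    v = gas_version(version_text)
    return v is not None and v not in BUGGY_GAS

sample = "GNU assembler (GNU Binutils) 2.30"  # as reported on ozstar above
print(gas_version(sample), avx512_safe(sample))  # (2, 30) False
```

In practice the text would come from actually running `as --version` (e.g. via `subprocess`), and a `None` result would mean "not GAS, assume safe or probe further".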
It makes sense to warn the user about this when building from source, but what about when installing via pip? Usually the pip installation is silent.
This is only for gcc `as`, right? Is the clang `as` also affected?
This bug is strictly for GAS (the GNU Assembler).
Cool! The fix would be to simply check (within `common.mk`) if the assembler is one of the two buggy GAS versions. If so, we need to add `-mno-avx512f` to `CFLAGS`. This fix will work for both source installs and pip installs.
Yep, I agree that would fix both installs. But how loudly do we want to warn the users that the AVX-512 kernels will not be available? It could be quite confusing to someone who knows they have an AVX-512 capable machine!
Also, apparently `icc` uses GAS under the hood! Not sure what the best way to universally detect the underlying `as` is. It actually might be to compile a simple test and check if it produces bad assembly...
Following the instructions here:

$(CC) -xc -c /dev/null -Wa,-v -o/dev/null 2>&1 | grep -c -E "GNU assembler version (2\.30|2\.31)"

This line only works for real `gcc` compilers, but will also work when the user specifies the full path to a (gcc) compiler. In that case, the assembler (`as`) resolved by PATH might not be the same `as` that would be invoked by the compiler. This check is only necessary if `-march=native` enables `AVX512F`.
`common.mk` just keeps getting more complex! We might need a better strategy to keep things organised.
(Fixed extraneous bracket at the end)
@yqiuu This should now be fixed. I am going to update the PyPI version sometime in the next week, if you are happy to wait for that to get the updated Corrfunc version. Otherwise, you can install from source through the `git clone` route. Note that with this update you will not be able to use the faster AVX512F instructions on ozstar.
I have also requested a separate toolchain with `gcc8` and `binutils >= 2.32` on ozstar. If that becomes available, then you will be able to use the `AVX512` capabilities on ozstar.
Or use clang, if that's on ozstar!
General information
Issue description
I am using `Corrfunc.theory.xi` or `Corrfunc.theory.DD` to compute the pair counts with the same inputs. However, I find that the results will be different if the data type of the input arrays is `float`.
Expected behavior
The results should be the same for every run.
Minimal failing example