Closed yqiuu closed 4 years ago
Thanks for reporting the issue. Are the pair-counts different between `DD` and `xi` for the same precision? Or are the pair-counts different if the precision is different? The second case is expected; the first one is not.
@yqiuu Could you please attach the differences that you see in the pair-counts?
The pair-counts are different between runs when using `float`. With the above arrays, if I run

import Corrfunc
r_bins = np.logspace(0, 1., 20)
for i in range(10):
    print(Corrfunc.theory.DD(1, 4, r_bins, x, y, y)['npairs'])

I got:

[ 0 2 0 0 0 0 0 0 2 0 2 4 4 6 6 10 6 8 14]
[ 0 0 0 2 0 0 0 0 2 0 2 4 4 6 6 10 6 8 14]
[ 0 2 0 0 0 0 0 0 2 0 2 4 4 6 6 10 6 8 14]
[ 0 0 0 2 0 0 0 0 2 0 2 4 4 6 6 10 6 8 14]
[ 0 2 0 0 0 0 0 0 2 0 2 4 4 6 6 10 6 8 14]
[ 0 0 0 2 0 0 0 0 2 0 2 4 4 6 6 10 6 8 14]
[ 0 2 0 0 0 0 0 0 2 0 2 4 4 6 6 10 6 8 14]
[ 0 0 0 2 0 0 0 0 2 0 2 4 4 6 6 10 6 8 14]
[ 0 2 0 0 0 0 0 0 2 0 2 4 4 6 6 10 6 8 14]
[ 0 0 0 2 0 0 0 0 2 0 2 4 4 6 6 10 6 8 14]
This happens by chance; sometimes it gives the same results. The difference seems to appear only in the small-scale bins.
I can't reproduce this. Even with 10K iterations, I always get:
[ 0 0 0 0 0 0 0 0 2 0 2 4 4 6 6 10 6 8 14]
(no floating "2" in the small bins).
@yqiuu Ozstar is an AVX-512 machine, right? I currently don't have access to my AVX-512 machines, so I'm testing on AVX. Can you try `isa="avx"` or `isa="fallback"`? And, separately, `nthreads=1`?
@yqiuu Is the repeating `y` for the 3rd coordinate intended, or should that be `z`?
It should be `z`. However, if this is set to `z`, the test arrays are not suitable for showing the problem, because the small-scale bins are always zero. Sorry for the confusion.
I have tried the following:
import numpy as np
x = np.array([154.24363377, 71.10038029, 96.5765405 , 48.12103686,
82.58892783, 78.97697358, 93.87293244, 63.51212846,
124.85204405, 57.94164262, 107.20370478, 124.08584903,
50.69274732, 68.56647771, 23.08430836, 86.27321003,
44.70834559, 14.66498218, 87.3814769 , 78.25190344,
135.23040918, 147.84211733, 55.15903248, 151.56787312,
101.15769132, 21.08061626, 76.57928182, 38.66539445,
98.63250943, 108.9545778 , 107.37573062, 63.26940503,
69.84954268, 70.53042064, 115.75825873, 112.72305389,
109.16422456, 15.55010734, 18.56082778, 94.46919022,
96.822737 , 135.86592101, 79.08883352, 39.43899245,
21.01567216, 24.08500134, 47.45628488, 105.42902743,
117.67873535, 71.29574062, 126.14864194, 83.56938351,
44.29643485, 150.95074749, 20.56340634, 52.90839107,
92.50084288, 20.91060155, 106.00771001, 130.98488988,
62.67192188, 108.37369926, 152.26563635, 143.28643325,
150.79206123, 132.71580348, 103.52156827, 74.99863632,
24.72138599, 114.76770511, 38.8045429 , 55.22412623,
134.21451835, 87.43394324, 143.16259552, 89.53111289,
108.47361749, 117.80486046, 151.3227361 , 1.21481224,
35.51477847, 46.47684359, 105.96224277, 106.17077147,
114.7488886 , 75.39078277, 83.82819297, 78.95748582,
68.03988315, 152.67753467, 146.53283589, 61.52860312,
111.18404188, 53.12718738, 117.12352032, 29.46045605,
94.8644012 , 129.49183688, 143.80630815, 100.70286954], dtype='f4')
y = np.array([ 74.20041724, 112.23414236, 4.19799531, 104.6231374 ,
73.94797063, 93.25782718, 70.73451261, 122.47726835,
54.94785838, 23.08725315, 21.7301481 , 5.32000165,
35.49688556, 22.61295084, 152.98133789, 138.92512839,
132.67690453, 130.5950394 , 25.55602135, 139.82615298,
79.5565082 , 119.16408847, 25.03003374, 145.26494514,
26.221811 , 104.56873926, 35.02368834, 117.47734797,
39.58858194, 71.81908595, 74.23309012, 116.64696636,
125.22635326, 82.78500574, 108.8985576 , 58.73353201,
83.65154655, 102.3107715 , 40.00122087, 20.7800799 ,
23.45036569, 79.96601625, 92.99591359, 64.07818137,
133.70896864, 55.05638139, 30.24035837, 86.75711198,
139.31182043, 41.49271995, 50.41986821, 84.86378206,
65.69877637, 145.74277115, 150.929154 , 115.92575893,
42.84789328, 104.72797859, 57.45049599, 54.32509986,
87.8640008 , 14.47864149, 144.88166787, 63.38974637,
153.43910595, 70.12707422, 95.95983099, 92.51392252,
49.56774198, 144.2036191 , 72.90552511, 98.62399015,
72.67972087, 25.81617475, 87.16477047, 145.10919636,
53.73138469, 117.54595522, 78.81040562, 17.78860576,
103.25634548, 107.38304925, 92.97102858, 7.27206935,
20.29002578, 94.17857418, 16.11068757, 28.75440891,
91.66769992, 144.48011626, 126.51270665, 48.11642645,
35.06381179, 72.19645013, 38.54956658, 149.63526264,
22.10689507, 61.5182437 , 65.87864166, 30.68701823], dtype='f4')
z = np.array([146.21647577, 61.08008498, 97.13488482, 127.15938951,
55.6723131 , 89.64574052, 47.93942781, 44.56659984,
117.26142093, 136.5905654 , 64.01751132, 82.70478123,
150.79237621, 130.89496349, 2.55095738, 44.55404248,
100.39455657, 4.54609139, 63.88983754, 29.80492839,
97.41935009, 53.45799508, 144.58768173, 114.88320066,
17.4472339 , 4.82534698, 53.29331451, 12.6831773 ,
90.26226647, 92.76079878, 92.05478613, 63.06079921,
55.56329305, 71.22088986, 0.81801217, 81.84834028,
92.78697394, 68.13169696, 126.38855318, 52.0838586 ,
144.41369092, 96.69167869, 90.09059216, 32.97819875,
140.87342901, 97.70859266, 16.88023473, 32.49728365,
81.79113339, 76.16232256, 89.93757267, 133.09564599,
28.82538725, 113.22780013, 11.54381614, 26.69260803,
31.57219653, 58.21263806, 44.81594301, 91.15262872,
79.6094962 , 15.29534428, 115.83037606, 105.65402212,
27.31134528, 83.21215073, 33.9846585 , 89.36345496,
126.5466004 , 139.75887164, 8.89474904, 94.42065605,
99.4874119 , 63.88485714, 91.53668585, 39.02256202,
32.51451294, 109.6806885 , 138.28341225, 144.34144986,
20.5048495 , 121.70327105, 22.15312068, 19.46025113,
134.06281506, 122.78666854, 46.43533964, 108.87652769,
28.09200362, 116.51327568, 91.05322106, 144.52792024,
38.17852291, 131.09903518, 48.63438798, 147.10313919,
47.86855971, 86.28621964, 105.10498981, 51.34929809], dtype='f4')
import Corrfunc
r_bins = np.logspace(0, 1., 20)
print("isa='avx'")
for i in range(10):
    print(Corrfunc.theory.DD(1, 4, r_bins, x, y, z, isa='avx')['npairs'])
print("isa='fallback'")
for i in range(10):
    print(Corrfunc.theory.DD(1, 4, r_bins, x, y, z, isa='fallback')['npairs'])
print("nthreads=1")
for i in range(10):
    print(Corrfunc.theory.DD(1, 1, r_bins, x, y, z)['npairs'])
The output I get is:

isa='avx'
[ 0 10 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 2 2 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 0 10 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 2 2 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 0 10 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 2 2 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 0 10 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 2 2 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 0 10 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 2 2 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
isa='fallback'
[ 0 10 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 2 2 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 0 10 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 2 2 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 0 10 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 2 2 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 0 10 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 2 2 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 0 10 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 2 2 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
nthreads=1
[ 0 10 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 2 2 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 0 10 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 0 0 0 10 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 0 0 0 10 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 2 2 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 2 2 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 2 2 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 2 2 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]
[ 2 2 0 0 2 0 2 2 2 2 0 6 2 0 0 6 2 8 16]

The results of each run are different.
Okay, so it does not appear to be a multithreading or instruction-set issue. Can you also try `enable_min_sep_opt=False`, and separately `copy_particles=False`? I can't imagine why either of these would be platform-dependent though...
If neither of those works, could you try building the latest version of Corrfunc from Github? It's usually pretty straightforward: https://github.com/manodeep/Corrfunc/#preferred-install-method. Just make sure that it's not finding your old pip version, perhaps by creating a new conda environment.
@manodeep, have you been able to reproduce this? I am still unable to. Could you test it on Ozstar?
@yqiuu I cannot reproduce this on ozstar.
import numpy as np
x = np.array([154.24363377, 71.10038029, 96.5765405 , 48.12103686,
82.58892783, 78.97697358, 93.87293244, 63.51212846,
124.85204405, 57.94164262, 107.20370478, 124.08584903,
50.69274732, 68.56647771, 23.08430836, 86.27321003,
44.70834559, 14.66498218, 87.3814769 , 78.25190344,
135.23040918, 147.84211733, 55.15903248, 151.56787312,
101.15769132, 21.08061626, 76.57928182, 38.66539445,
98.63250943, 108.9545778 , 107.37573062, 63.26940503,
69.84954268, 70.53042064, 115.75825873, 112.72305389,
109.16422456, 15.55010734, 18.56082778, 94.46919022,
96.822737 , 135.86592101, 79.08883352, 39.43899245,
21.01567216, 24.08500134, 47.45628488, 105.42902743,
117.67873535, 71.29574062, 126.14864194, 83.56938351,
44.29643485, 150.95074749, 20.56340634, 52.90839107,
92.50084288, 20.91060155, 106.00771001, 130.98488988,
62.67192188, 108.37369926, 152.26563635, 143.28643325,
150.79206123, 132.71580348, 103.52156827, 74.99863632,
24.72138599, 114.76770511, 38.8045429 , 55.22412623,
134.21451835, 87.43394324, 143.16259552, 89.53111289,
108.47361749, 117.80486046, 151.3227361 , 1.21481224,
35.51477847, 46.47684359, 105.96224277, 106.17077147,
114.7488886 , 75.39078277, 83.82819297, 78.95748582,
68.03988315, 152.67753467, 146.53283589, 61.52860312,
111.18404188, 53.12718738, 117.12352032, 29.46045605,
94.8644012 , 129.49183688, 143.80630815, 100.70286954], dtype='f4')
y = np.array([ 74.20041724, 112.23414236, 4.19799531, 104.6231374 ,
73.94797063, 93.25782718, 70.73451261, 122.47726835,
54.94785838, 23.08725315, 21.7301481 , 5.32000165,
35.49688556, 22.61295084, 152.98133789, 138.92512839,
132.67690453, 130.5950394 , 25.55602135, 139.82615298,
79.5565082 , 119.16408847, 25.03003374, 145.26494514,
26.221811 , 104.56873926, 35.02368834, 117.47734797,
39.58858194, 71.81908595, 74.23309012, 116.64696636,
125.22635326, 82.78500574, 108.8985576 , 58.73353201,
83.65154655, 102.3107715 , 40.00122087, 20.7800799 ,
23.45036569, 79.96601625, 92.99591359, 64.07818137,
133.70896864, 55.05638139, 30.24035837, 86.75711198,
139.31182043, 41.49271995, 50.41986821, 84.86378206,
65.69877637, 145.74277115, 150.929154 , 115.92575893,
42.84789328, 104.72797859, 57.45049599, 54.32509986,
87.8640008 , 14.47864149, 144.88166787, 63.38974637,
153.43910595, 70.12707422, 95.95983099, 92.51392252,
49.56774198, 144.2036191 , 72.90552511, 98.62399015,
72.67972087, 25.81617475, 87.16477047, 145.10919636,
53.73138469, 117.54595522, 78.81040562, 17.78860576,
103.25634548, 107.38304925, 92.97102858, 7.27206935,
20.29002578, 94.17857418, 16.11068757, 28.75440891,
91.66769992, 144.48011626, 126.51270665, 48.11642645,
35.06381179, 72.19645013, 38.54956658, 149.63526264,
22.10689507, 61.5182437 , 65.87864166, 30.68701823], dtype='f4')
import Corrfunc
r_bins = np.logspace(0, 1., 20)
nthreads = 4
autocorr = 1
isas = ['avx512f', 'avx', 'fallback']
for isa in isas:
    print("isa={}".format(isa))
    for i in range(10):
        print(Corrfunc.theory.DD(autocorr, nthreads, r_bins, x, y, y,
                                 isa=isa)['npairs'])
nthreads = 1
print("nthreads= {}".format(nthreads))
for i in range(10):
    print(Corrfunc.theory.DD(autocorr, nthreads, r_bins, x, y, y)['npairs'])
The output is:
[~/temp/test_yisheng_bug @farnarkle1] python test_script.py
isa=avx512f
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
isa=avx
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
isa=fallback
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
nthreads= 1
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
[ 0 0 0 2 2 2 0 4 2 0 4 16 2 4 6 10 24 18 22]
@yqiuu I will also note that the results I get are different from what you get.
Modules loaded on ozstar:
[~/temp/test_yisheng_bug @farnarkle1] ml
Currently Loaded Modules:
1) slurm/.latest (H,S) 3) gcccore/7.3.0 5) gcc/7.3.0 7) openblas/0.2.20 9) scalapack/2.0.2-openblas-0.2.20 11) python/3.6.4 13) gsl/2.5
2) nvidia/.latest (H,S) 4) binutils/2.30 6) openmpi/3.0.0 8) fftw/3.3.7 10) sqlite/3.21.0 12) numpy/1.14.2-python-3.6.4
[~/temp/test_yisheng_bug @farnarkle1] python
Python 3.6.4 (default, Mar 26 2018, 15:30:40)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import Corrfunc
>>> print(Corrfunc.__version__)
2.3.0
@manodeep In the second example, I use `z` instead of `y`. Otherwise, I get the same results as you.
However, I still have the "different results" problem. I need to think about this.
@yqiuu Thanks for letting me know.
It must be that something is wrongly set in your environment. My recommendation would be the following steps:

ml purge
unset PYTHONPATH
ml numpy/1.14.2-python-3.6.4 gsl/2.5
python -m pip install Corrfunc --user
python test_script.py

(note that this will install Corrfunc v2.3.1)
@yqiuu I can now reproduce this behaviour on ozstar
@yqiuu Upon further digging, it looks like a combination of `AVX512F` and `gcc7.3.0` is causing the non-determinism. This issue might take a while to solve, but for now I would recommend using `icc2018` to compile Corrfunc.
For reference, I could not reproduce the issue on these platforms + toolchains:
- gcc7.3
- gcc7.3
- icc2018
- gcc7.3, but with AVX512F disabled

If you do need to stick to gcc7.3, then you can disable AVX512F by adding `-mno-avx512f` in the `common.mk` file in the root directory, running `make distclean`, and then installing.
@lgarrison Do you have any ideas on how we could work around this compiler issue, or at least warn the users with AVX512F kernels compiled with gcc7.3?
Is it fair to say that we don't know it's a compiler bug yet? It could just be a race condition (or use of uninitialized data, etc) with a lower reproduction rate with other compilers. Maybe the best workaround would be to detect the compiler version in the Makefile and issue a warning or disable AVX512F appropriately.
Speaking of uninitialized data, we've never gotten valgrind to work with AVX512, have we? Maybe we could try Intel Inspector?
You make a good point - I was simply going with the fact that the code works with `icc`. There is still no valgrind for avx512 (open issue here).
What is Intel Inspector?
It's what came up when I googled "intel valgrind alternative"! I've never used it myself. This old article suggests it has Valgrind-like functionality though: https://software.intel.com/en-us/articles/transitioning-from-valgrind-tools-to-intel-inspector-xe
And since it's an Intel tool I would hope that it would have full AVX-512 support.
Could not get Inspector to work on ozstar, but it does look promising.
Btw, I now remember why I think this is a compiler bug -- the issue is in all the AVX512F/AVX/SSE kernels when AVX512F is enabled, but disappears from the AVX/SSE kernels when AVX512F is disabled. And we have tested the AVX/SSE kernels for the use of uninitialised data, and I confirmed by running valgrind on this specific example on the sstar node. Running with all of the `-fsanitize` options on the AVX512F machine did not produce any errors, but the error itself might be beyond the purview of `fsanitize` anyway.
I also played around a bit with the data yesterday - it turns out that whether the issue occurs (or, more accurately, how frequently it occurs) is sensitive to the total number of bins. Also, the garbage values depend on `Rmin`, i.e., the incorrect values produced depend on the correct result. I will also note that I have intermittent difficulty producing the wrong result from the shell - I have a shell script that will frequently run ~10^5 times without detecting any mismatch between consecutive runs.
All that is to say, regardless of how/where this is coming from, it is a weird bug.
That does sound more like a compiler bug! Have you tried disabling AVX512 for all compilation units but the kernel itself? That might narrow down the possible sources considerably.
I might do the opposite for starters: only disable AVX512F everywhere except within the kernels file and see what happens. And then try what you suggested.
The bottleneck is that I am unsure how to push and pop compile options. Will give it a shot.
I was able to reproduce the bug and track it to the optimization flag `-fvect-cost-model` added by `-O3`. From https://gcc.gnu.org/onlinedocs/gcc-7.4.0/gcc/Optimize-Options.html#Optimize-Options:

-fvect-cost-model=model
    Alter the cost model used for vectorization. The model argument should be one of ‘unlimited’, ‘dynamic’ or ‘cheap’. With the ‘unlimited’ model the vectorized code-path is assumed to be profitable while with the ‘dynamic’ model a runtime check guards the vectorized code-path to enable it only for iteration counts that will likely execute faster than when executing the original scalar loop. The ‘cheap’ model disables vectorization of loops where doing so would be cost prohibitive for example due to required runtime checks for data dependence or alignment but otherwise is equal to the ‘dynamic’ model. The default cost model depends on other optimization flags and is either ‘dynamic’ or ‘cheap’.

Using `-O3 -fvect-cost-model=cheap` seems to fix the problem.
I've found two mentions of a similar bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?format=multiple&id=78807 and https://github.com/godotengine/godot/issues/4623, both of which point the finger at GCC not respecting alignment. This seems plausible, given that the issue only appears when we introduce single precision arrays.
I suspect we could dig deeper and find the bugged loop and make a minimal reproducer, but in the meantime we should push a fix soon.
@manodeep, can you confirm the fix on your end? @yqiuu, it would be great if you could test it too. Just add `-fvect-cost-model=cheap` to this line: https://github.com/manodeep/Corrfunc/blob/master/common.mk#L205.
Thanks @lgarrison. I am curious - how did you track this down?
Given that even the floating-point version works frequently, alignment seems likely to be the culprit. But even when the results are wrong, they are always the same wrong values. Does this mean that what matters is the alignment of the `npairs` arrays?
First, I found that `copy_particles=False` led to a higher reproduction rate (nearly 100%). Then, since we thought it could be a compiler bug, I backed off to `-O0` and saw it didn't break, so I increased the optimization level until it did. The reproduction rate went from 0 to 100% between `-O2` and `-O3`, so from there I just had to try all the `-O3` flags until one broke it.
Maybe the `npairs` alignment is the problem? But it should always be 8-byte aligned, while the floating-point arrays will be 4-byte aligned half the time (I hope!). And since the code doesn't break with 8-byte floating-point arrays, I'd guess the 4-byte alignment was the culprit.
But whatever loop/code/array is at fault must be basically identical between all kernels (or outside the kernels?), because the behavior is the same for all `isa`.
So, do we add code to `common.mk` to parse the GCC version and add this flag as necessary? Or do we try to figure out what code pattern triggers it and avoid writing it?
Also just noting that this bug is present in GCC 7.4 as well.
If `copy_particles=False` leads to a higher failure rate, then the alignment of the input arrays might matter. In your script, will you please print out the memory addresses of the `X1/Y1/Z1/npairs` arrays, so that we can see whether the bug has anything to do with alignment? (My internet connection in India is too unstable currently for productive work.)
In the short term, perhaps add `-fvect-cost-model=cheap` into `common.mk` while we figure out a workaround in the code (assuming we can pinpoint the problem-causing section).
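A quick way to do that address check from the Python side is sketched below. This only inspects the user-facing NumPy arrays (the internal X1/Y1/Z1 copies live inside the C extension and are not directly visible), and `align_offset` is a hypothetical helper name, not part of Corrfunc:

```python
import numpy as np

def align_offset(arr, boundary=64):
    """Offset of the array's data pointer from a `boundary`-byte boundary."""
    return arr.ctypes.data % boundary

rng = np.random.RandomState(42)
x = rng.uniform(0, 150, 100).astype(np.float32)
y = rng.uniform(0, 150, 100).astype(np.float32)
z = rng.uniform(0, 150, 100).astype(np.float32)

for name, arr in (("x", x), ("y", y), ("z", z)):
    print("{}: address=0x{:x}, offset from 64 bytes = {}".format(
        name, arr.ctypes.data, align_offset(arr)))
```

Running this a few times shows whether the allocator hands back different alignments from run to run, which is one way run-to-run non-determinism could sneak in.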
@lgarrison I can confirm that adding the `-fvect-cost-model=cheap` seems to fix the problem on the ozstar platform (where the issue was originally detected).
The cause appears to be that the `rupp_sqr` array is not computed correctly. I.e. this loop goes in with correct `rupp` values but sometimes comes out with the first few `rupp_sqr` elements wrong (zero, NaN, or garbage):

for(int i=0; i < nrpbin; i++) {
    rupp_sqr[i] = rupp[i]*rupp[i];
}

Source is: https://github.com/manodeep/Corrfunc/blob/master/theory/DD/countpairs_impl.c.src#L394-L396
Note that `rupp` is double and `rupp_sqr` is `DOUBLE` (i.e. float). There are a couple of `rupp`/`rupp_sqr` 64-byte misalignment values that seem to break it, like 48/48; and some that seem to work, like 16/48. But the exact pattern is unclear.
Seems easy enough to reproduce in Corrfunc, but it doesn't work in a minimal reproducer (with all these alignments and types forced by hand). I'll keep trying.
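For what it's worth, the "alignments forced by hand" part is easy to mimic from NumPy too. Below is a generic sketch (the `misaligned` helper is hypothetical, not part of Corrfunc or the C reproducer) that produces arrays at a chosen offset from a 64-byte boundary, e.g. the 48/48 combination mentioned above:

```python
import numpy as np

def misaligned(n, dtype, offset, boundary=64):
    """Return a length-n array whose data pointer sits `offset` bytes
    past a `boundary`-byte alignment boundary."""
    itemsize = np.dtype(dtype).itemsize
    # Over-allocate a byte buffer so we can pick any starting offset.
    buf = np.zeros(n * itemsize + boundary + offset, dtype=np.uint8)
    # Distance from buf's base address to the next boundary, plus the offset.
    start = (-buf.ctypes.data) % boundary + offset
    view = buf[start:start + n * itemsize].view(dtype)
    assert view.ctypes.data % boundary == offset
    return view

rupp = misaligned(20, np.float64, 48)      # double bins, 48 bytes past a 64-byte line
rupp_sqr = misaligned(20, np.float32, 48)  # float destination, same misalignment
print(rupp.ctypes.data % 64, rupp_sqr.ctypes.data % 64)  # 48 48
```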
The loop does work if `rupp` is `DOUBLE`. But it's nice to preserve the full double value until after the multiplication.
And I've checked that we're allocating `rupp` memory correctly in setup_bins (at least by reading the code).
Okay, I have a reproducer! Believe it or not, even a simple loop that copies the elements (no squaring) fails. I've linked the example, can you run it and let me know if it also fails for you? If so, I will submit a GCC bug report.
https://gist.github.com/lgarrison/5ebb747d3f1126319fd3744f925298e3
@lgarrison That's fantastic! And very disconcerting that such a simple piece of code can trigger the bug!
That said, I copy-pasted the code in the gist on ozstar, but could not reproduce the bug :(
[~/temp/test_yisheng_bug @farnarkle2] ml
Currently Loaded Modules:
1) nvidia/.latest (H,S) 4) binutils/2.30 7) openmpi/3.0.0 10) git/2.18.0 13) scalapack/2.0.2-openblas-0.2.20 16) numpy/1.14.2-python-3.6.4
2) slurm/.latest (H,S) 5) gcc/7.3.0 8) szip/2.1.1 11) openblas/0.2.20 14) sqlite/3.21.0
3) gcccore/7.3.0 6) gsl/2.5 9) hdf5/1.10.1 12) fftw/3.3.7 15) python/3.6.4
[~/temp/test_yisheng_bug @farnarkle2] gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/apps/skylake/software/core/gcccore/7.3.0/libexec/gcc/x86_64-pc-linux-gnu/7.3.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../configure --enable-languages=c,c++,fortran --enable-lto --enable-checking=release --disable-multilib --enable-shared=yes --enable-static=yes --enable-threads=posix --enable-gold=default --enable-plugins --enable-ld --with-plugin-ld=ld.gold --prefix=/apps/skylake/software/core/gcccore/7.3.0 --with-local-prefix=/apps/skylake/software/core/gcccore/7.3.0 --enable-bootstrap --with-isl=/dev/shm/easybuild/GCCcore/7.3.0/dummy-/gcc-7.3.0/stage2_stuff
Thread model: posix
gcc version 7.3.0 (GCC)
[~/temp/test_yisheng_bug @farnarkle2] gcc -mavx512f -O3 -fPIC corrfunc_bug_gh193.c -o corrfunc_bug_gh193
[~/temp/test_yisheng_bug @farnarkle2] ./corrfunc_bug_gh193
loaded r (offset 16 from 64 bytes): 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
loaded r_float (offset 32 from 64 bytes): 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
No bug.
I was trying to diagnose the errant code produced but got lost in the sea of assembly. Link to godbolt here
@lgarrison Will you please change the declaration to

DOUBLE rupp_sqr[nrpbin] __attribute__((aligned(64)));

and see if the bug still occurs? Maybe we could also try `posix_memalign` for `rupp`.
The other option is to replace the stack array with a malloc'ed array for `rupp_sqr`, and see if that makes any difference.
Ahh, too bad it won't reproduce on ozstar! That's Skylake, right? Unfortunately, I cannot get Corrfunc to break on Skylake, only Cascade Lake.
I also note that the alignment values are different on Skylake; it's probably related. In earlier iterations of the bug, I did try using `posix_memalign` to force alignments that I had seen fail. But it didn't work; I suspect that (mis-)alignment is a necessary but not sufficient condition.
Forcing the stack alignment of `rupp_sqr` does seem to be one way to make the bug go away on Cascade Lake. But many other things (like removing unexecuted lines of code) also make the bug go away! Still, it may be our best option for working around the bug.
@manodeep, since you can get Corrfunc to break on ozstar, do you think it is worth it to try to make a reproducer there? That would strengthen the GCC bug report. On the other hand, the bug is pretty finicky, so it may not be worth the time to whittle down the code.
@lgarrison Yup, ozstar has Skylake CPUs - this bug is too finicky! At least you have the reproducer on Cascade Lake - maybe that will be sufficient. Are you okay with submitting the bug report?
Yes, except that I tried it on a different Cascade Lake machine and it won't trigger. But maybe I'll file anyway and see if someone else can reproduce...
While preparing the GCC bug report, I found a report of a similar sounding issue: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91837
The root cause of which is this bug in the assembler in binutils 2.30: https://sourceware.org/bugzilla/show_bug.cgi?id=23465
My Cascade Lake platform is running binutils 2.30 and I can reproduce the simple assembly example that demonstrates the bug. So I'm optimistic that upgrading binutils may fix the Corrfunc bug! But now I have to figure out how to upgrade binutils...
In the meantime, could you check the binutils version on Ozstar (`as --version`)?
Okay, I built binutils 2.32 from source and told GCC to use it with `gcc -B`. It fixes the bug both in the reproducer and in Corrfunc proper!
So... mitigations? Unfortunately, my reading of the `as` bug is that it's rather general, meaning we could probably work around it in this one loop, but there's no guarantee it wouldn't pop up in another place in the code. But I also think it only happens for AVX-512 (the assembly example in the second link above only produces the wrong answer for `%zmm0`, not `%ymm0`). So maybe we check the binutils version at compile time and only emit AVX[2] code if it's a bad version?
The output on ozstar:
[~/temp/test_yisheng_bug @farnarkle2] as --version
GNU assembler (GNU Binutils) 2.30
Copyright (C) 2018 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or later.
This program has absolutely no warranty.
This assembler was configured for a target of `x86_64-pc-linux-gnu'.
[~/temp/test_yisheng_bug @farnarkle2]
So the assembler is v2.30. I will put in a request to update the binutils on ozstar, if possible. In the meantime, perhaps we can issue a patch to disable `AVX512F` while compiling with gcc and with `as <= 2.30`. @lgarrison What do you think?
I will double check, but I think <=2.29 and >=2.31 might be safe. The bug might only be present in that one version. I'll run some tests.
Based on the binutils changelog and my tests, 2.30 and 2.31 have the bug, and <=2.29.1 and >=2.32 are safe. I can work on a patch to disable AVX512 for the bugged versions.
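The version gate could then look something like the Python sketch below (`gas_version` and `avx512_safe` are hypothetical helper names; the real check would live in `common.mk`):

```python
import re

# Per the tests above: GAS 2.30 and 2.31 mis-assemble certain AVX-512
# instructions, while <= 2.29.1 and >= 2.32 appear safe.
BUGGY_GAS = {(2, 30), (2, 31)}

def gas_version(text):
    """Parse (major, minor) out of `as --version` output; None if not GAS."""
    m = re.search(r"GNU Binutils\) (\d+)\.(\d+)", text)
    return (int(m.group(1)), int(m.group(2))) if m else None

def avx512_safe(version_text):
    """True only when the assembler is GAS of a known-good version."""
    v = gas_version(version_text)
    return v is not None and v not in BUGGY_GAS

sample = "GNU assembler (GNU Binutils) 2.30"  # as reported on ozstar above
print(gas_version(sample), avx512_safe(sample))  # (2, 30) False
```

In practice the text would come from actually running `as --version` (e.g. via `subprocess`), and a `None` result would mean "not GAS, assume safe or probe further".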
It makes sense to warn the user about this when building from source, but what about when installing via pip? Usually the pip installation is silent.
This is only for gcc `as`, right? Is the clang `as` also affected?
This bug is strictly for GAS (the GNU Assembler).
Cool! The fix would be to simply check (within `common.mk`) if the assembler is one of the two buggy GAS versions. If so, we need to add `-mno-avx512f` to `CFLAGS`. This fix will work for both source installs and pip installs.
Yep, I agree that would fix both installs. But how loudly do we want to warn the users that the AVX-512 kernels will not be available? It could be quite confusing to someone who knows they have an AVX-512 capable machine!
Also, apparently `icc` uses GAS under the hood! Not sure what the best way to universally detect the underlying `as` is. It actually might be to compile a simple test and check if it produces bad assembly...
Following the instructions here:

$(CC) -xc -c /dev/null -Wa,-v -o/dev/null 2>&1 | grep -c -E "GNU assembler version (2\.30|2\.31)"

This line only works for real `gcc` compilers, but will also work when the user specifies the full path to a (gcc) compiler. In that case, the assembler (`as`) resolved by PATH might not be the same `as` that would be invoked by the compiler. This check is only necessary if `-march=native` enables `AVX512F`.
`common.mk` just keeps getting more complex! We might need a better strategy to keep things organised.
(Fixed extraneous bracket at the end)
@yqiuu This should now be fixed. I am going to update the PyPI version sometime in the next week, if you are happy to wait for that to get the updated Corrfunc version. Otherwise, you can install from source through the `git clone` route. Note that with this update you will not be able to use the faster AVX512F instructions on ozstar.
I have also requested a separate toolchain with `gcc8` and `binutils >= 2.32` on ozstar. If that becomes available, then you will be able to use the `AVX512` capabilities on ozstar.
Or use clang, if that's on ozstar!
General information
Issue description
I am using `Corrfunc.theory.xi` or `Corrfunc.theory.DD` to compute the pair counts with the same inputs. However, I find that the results will be different if the data type of the input arrays is `float`.
Expected behavior
The results should be the same for every run.
Minimal failing example