Also, scipy.optimize.least_squares is much faster than scipy.optimize.leastsq when run with the default tolerances. So... there is a question of fitting precision versus speed here.
Are the results consistent on a given machine? That is, if you run it several times on one machine, do you get basically the same result? Or do the results vary as much within a machine as they do between machines? I would be surprised if scipy varied much between OSes or something, but I could see things being sensitive to potential randomness in the initialization or something.
From what I remember, I did put a good amount of work into the optimization parameters to get things to consistently converge. So I am definitely hesitant to change the optimization method without a lot of validation work that I don't have time for now.
Yes, the results are consistent on the same machine (I reran the demo data many times, so I am sure about this).
Currently, we have tested the data on Mac, Ubuntu, and Red Hat, and all three gave different results. I am sure that the input data to the Fit function are the same between machines, but I am not sure what goes wrong inside Fit or the other functions.
Replacing it with scipy.optimize.least_squares would introduce other issues later on... so yes, I agree that this part of the code is too critical for a quick modification without testing. However, did you ever see this kind of between-machine discrepancy while testing Telfit?
Thank you!
Hmm, that is odd and concerning. I don't recall seeing that in initial development, but I'm not sure I explicitly checked. That was also a decade ago, so I don't really remember details at that level. I guess the last hail mary would be: are the Python environments the same on the different computers? Especially the scipy version, I guess? Or is there a known scipy issue with scipy.optimize.leastsq? It seems like something people would have found in other contexts if that function gave different results on different OSes.
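For what it's worth, a minimal check along these lines on each machine would make the comparison concrete (nothing Telfit-specific here):

import numpy
import scipy

print("numpy :", numpy.__version__)
print("scipy :", scipy.__version__)
# Show which BLAS/LAPACK backend (MKL, OpenBLAS, Accelerate, ...) numpy was
# built against -- a common source of last-digit differences between machines.
numpy.show_config()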
Hi @kgullikson88 ,
Sorry for the late reply. Here are some of my findings after more digging. First, good news: after making sure all systems are on the same scipy and NumPy versions, I can get identical results from two machines running macOS 10.15.7 and 11.2.3 with the same Fortran compiler: GNU Fortran (Homebrew GCC 7.5.0_3) 7.5.0.
Identical here means that the h2o and X^2 values in the chisq_summary.dat output by examples/fit_example/Fit.py are the same down to the 10th digit.
However, the values between the Ubuntu and Mac machines are still different.
On Ubuntu:
h2o X^2
54 0.235288
54 0.235288
54 0.235288
55.2606176464 0.258027
53.8585871631 0.220858
55.0749262449 0.274078
53.5756656382 0.212992
54.703325034 0.25979
53.0106362097 0.199938
53.9608542816 0.222895
51.8794464668 0.183323
52.4733351045 0.190659
51.6553386305 0.179918
52.1784989922 0.193145
51.7132900214 0.180796
51.6868576119 0.181124
51.6584905597 0.179876
52.1826459104 0.189396
51.6521866944 0.180012
51.657188547 0.179998
51.6583603585 0.179888
51.6584775396 0.179968
51.6584892577 0.18013
51.6584904295 0.179876
On Mac:
h2o X^2
54 0.224346
54 0.224346
54 0.224346
54.891717562 0.266551
52.9307450765 0.2161
53.5850150648 0.21364
53.2878114005 0.205976
54.021484902 0.224917
52.5735224967 0.212621
53.0337941823 0.217462
53.2450652335 0.20502
53.9692383476 0.233988
53.162098951 0.2032
53.8678284146 0.220375
53.0053323898 0.200053
53.6761971659 0.215583
52.6917082273 0.213801
52.955406094 0.19945
53.6151633291 0.214001
52.9064245889 0.198321
53.5552826565 0.212461
52.8123522338 0.215056
52.8966683745 0.215796
52.9054453348 0.1983
53.5540854845 0.212455
52.903487989 0.215877
52.9052494839 0.215904
52.9054257497 0.198319
52.9054433763 0.215921
52.905445139 0.1983
52.9054452369 0.1983
52.9054452859 0.1983
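For reference, a minimal sketch of how the two summaries could be compared numerically; the filenames are hypothetical, and I am assuming whitespace-separated columns with a one-line header:

import numpy as np

ubuntu = np.loadtxt("chisq_summary_ubuntu.dat", skiprows=1)
mac = np.loadtxt("chisq_summary_mac.dat", skiprows=1)

# The two runs took different numbers of iterations, so compare shared steps.
n = min(len(ubuntu), len(mac))
print("max |delta h2o|:", np.max(np.abs(ubuntu[:n, 0] - mac[:n, 0])))
print("max |delta X^2|:", np.max(np.abs(ubuntu[:n, 1] - mac[:n, 1])))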
It caught my eye that with the same h2o value, 54, the two machines already give different X^2 values. After some tracking, I found that the discrepancy started at Line 707,
model = FittingUtilities.ReduceResolution(model, resolution)
in the GenerateModel function.
Results are identical between Ubuntu and Mac at Line 676,
model = self.Modeler.MakeModel(pressure, temperature, wavenum_start, wavenum_end, angle, h2o, co2, o3, n2o, co,
ch4, o2, no, so2, no2, nh3, hno3, lat=lat, alt=alt, wavegrid=None,
resolution=None, vac2air=air_wave_overwrite)
Printing model.x and model.y, we have:
On Ubuntu
[630.7348461410271 630.7349935640831 630.735140987139 ...
681.7274492307195 681.7275966537753 681.7277440768313]
[0.9999813437461853 0.9999809465631525 0.9999805829784827 ...
0.9998752856131503 0.9998762534948515 0.9998770952224731]
On Mac
[630.7348461410271 630.7349935640831 630.735140987139 ...
681.7274492307195 681.7275966537753 681.7277440768313]
[0.9999813437461853 0.9999809465631525 0.9999805829784827 ...
0.9998752856131503 0.9998762534948515 0.9998770952224731]
BUT the deviation started at Line 707,
model = FittingUtilities.ReduceResolution(model, resolution)
Printing model.x and model.y, we have:
On Ubuntu
[630.7348461410271 630.7349935640831 630.735140987139 ...
681.7274492307195 681.7275966537753 681.7277440768313]
[0.9999106645364759 0.9999124500210417 0.9999141916047274 ...
0.9999050645314573 0.9999069698621474 0.9999088371642161]
On Mac
[630.7348461410271 630.7349935640831 630.735140987139 ...
681.7274492307195 681.7275966537753 681.7277440768313]
[0.9999106645364759 0.9999124500210416 0.999914191604727 ...
0.9999050645314572 0.9999069698621474 0.9999088371642163]
The deviation carried on to Line 708,
model = FittingUtilities.RebinData(model, data.x)
Printing model.x and model.y, we have:
On Ubuntu
[650.9096 650.91289 650.91618 ... 661.91209 661.91415 661.91621]
[0.9986356206429381 0.9988175270366134 0.9989386076560012 ...
0.9997714869438686 0.9998380392219246 0.9998752451923993]
On Mac
[650.9096 650.91289 650.91618 ... 661.91209 661.91415 661.91621]
[0.9986356206429381 0.9988175270366138 0.9989386076560012 ...
0.9997714869438689 0.9998380392219246 0.9998752451923993]
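To put a number on the drift instead of eyeballing printed digits, a small helper like this could be used at each step (a sketch; model_ubuntu and model_mac are hypothetical stand-ins for the two machines' outputs):

import numpy as np

def max_rel_diff(a, b):
    # Largest elementwise relative difference between two arrays.
    a, b = np.asarray(a), np.asarray(b)
    return np.max(np.abs(a - b) / np.maximum(np.abs(a), np.abs(b)))

# e.g. max_rel_diff(model_ubuntu.y, model_mac.y) right after ReduceResolution
# comes out around 1e-16, i.e. a one-unit-in-the-last-place difference.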
As we call the GenerateModel function multiple times during the leastsq fitting... you can see how this ends up. Do you have any idea why the functions under FittingUtilities (or Cython) are trying to give us a hard time?
Thank you.
Updates:
By replacing
model = FittingUtilities.ReduceResolution(model, resolution)
with
model = FittingUtilities.ReduceResolution2(model, resolution)
I am able to get identical results until line 732:
data.cont = FittingUtilities.Continuum(data.x, resid, fitorder=self.continuum_fit_order, lowreject=2,
highreject=3)
The discrepancy shows up again because of line 144 in FittingUtilities.pyx,
fit = np.poly1d(np.polyfit(x2 - x2.mean(), y2, fitorder))
The fitted coefficients differ starting at around the 8th digit; an example:
array([ 4.8989581802489894e+05, -7.0047669989025399e+03,
-2.2587957736986473e+02, 1.5087547446587456e+02,
1.1385585807807672e+01, -1.0128377043463260e+01,
-1.0876553899647047e-01, 1.9202965875675754e-01])
vs
array([ 4.8989581803430786e+05, -7.0047669935640715e+03,
-2.2587957797811950e+02, 1.5087547496647005e+02,
1.1385585653610732e+01, -1.0128377142560113e+01,
-1.0876553165186138e-01, 1.9202966156868048e-01])
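For context, np.polyfit essentially builds a Vandermonde matrix and solves a linear least-squares problem through LAPACK; here is a rough sketch (not the exact numpy implementation) showing where the BLAS/LAPACK backend enters:

import numpy as np

def polyfit_sketch(x, y, deg):
    # Columns are x**deg, ..., x**1, x**0, matching np.polyfit's ordering.
    A = np.vander(x, deg + 1)
    # The LAPACK least-squares solve is where MKL vs OpenBLAS differences
    # can creep into the 8th digit of the coefficients.
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef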
I also tried
np.polynomial.polynomial.polyfit(x2 - np.mean(x2), y2, fitorder)
np.polynomial.polynomial.Polynomial.fit(x2 - np.mean(x2), y2, fitorder).convert().coef
instead of np.polyfit(x2 - np.mean(x2), y2, fitorder), but no luck there either...
Hmm... this is starting to seem like a difference in the underlying Fortran/C code that numpy is calling. Maybe something like using Intel MKL on one machine and OpenBLAS on the other? If you make some identical numpy arrays, does numpy.convolve give different results? How about scipy.signal.fftconvolve? If those are different, I don't think there is much we can do here.
The identical result from ReduceResolution2 is encouraging, because that one uses a Cython function instead of a numpy/scipy function. It just seems to point once again to the root cause being something that numpy does under the hood.
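To illustrate the direct-versus-FFT distinction with a self-contained toy example (my own, not Telfit code): both lines below compute the same convolution, but the FFT route goes through whatever FFT/BLAS backend is installed and rounds differently.

import numpy as np
from scipy.signal import fftconvolve

rng = np.random.default_rng(0)
signal = rng.normal(size=10_000)
kernel = np.exp(-np.linspace(-3, 3, 51) ** 2)
kernel /= kernel.sum()

direct = np.convolve(signal, kernel, mode="same")
viafft = fftconvolve(signal, kernel, mode="same")
# Tiny but nonzero: same math, different order of floating-point operations.
print(np.max(np.abs(direct - viafft)))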
Interesting... with
x = np.array([ 4.8989581802489894e+05, -7.0047669989025399e+03,
-2.2587957736986473e+02, 1.5087547446587456e+02,
1.1385585807807672e+01, -1.0128377043463260e+01,
-1.0876553899647047e-01, 1.9202965875675754e-01])
y = x * 123.5123
np.convolve(x, y)
gives, on the two machines:
array([ 2.9642694170332156e+13, -8.4769111408234192e+11,
-2.1274766406506424e+10, 1.8649245144027988e+10,
1.1230761294922433e+09, -1.2538184692065442e+09,
6.5394999361436795e+06, 2.4416406305336077e+07,
-6.8768264709222293e+05, -4.3254782224562587e+04,
1.9521419709781570e+04, 8.1221405530880395e+02,
-4.7898905282819726e+02, -5.1594075072325101e+00,
4.5545642128112975e+00])
vs
array([ 2.9642694170332156e+13, -8.4769111408234192e+11,
-2.1274766406506424e+10, 1.8649245144027988e+10,
1.1230761294922435e+09, -1.2538184692065442e+09,
6.5394999361436777e+06, 2.4416406305336077e+07,
-6.8768264709222293e+05, -4.3254782224562587e+04,
1.9521419709781570e+04, 8.1221405530880395e+02,
-4.7898905282819726e+02, -5.1594075072325101e+00,
4.5545642128112975e+00])
As for scipy.signal.fftconvolve(x, y), it gives:
array([ 2.9642694170332156e+13, -8.4769111408234131e+11,
-2.1274766406504925e+10, 1.8649245144025188e+10,
1.1230761294921207e+09, -1.2538184692071297e+09,
6.5394999393915813e+06, 2.4416406305493101e+07,
-6.8768264690513606e+05, -4.3254778974914552e+04,
1.9521419116274516e+04, 8.1221379254659018e+02,
-4.7899185384114583e+02, -5.1578109741210936e+00,
4.5551582336425778e+00])
vs
array([ 2.9642694170332156e+13, -8.4769111408234131e+11,
-2.1274766406504925e+10, 1.8649245144025188e+10,
1.1230761294921207e+09, -1.2538184692071297e+09,
6.5394999393915813e+06, 2.4416406305493101e+07,
-6.8768264690513606e+05, -4.3254778974914552e+04,
1.9521419116274516e+04, 8.1221379254659018e+02,
-4.7899185384114583e+02, -5.1578109741210936e+00,
4.5551582336425778e+00])
So, np.convolve(x, y) can give different results, but scipy.signal.fftconvolve(x, y) seems to deliver identical results in this test.
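As an aside, a convenient way to check bit-for-bit equality across machines without pasting long printouts is to hash the raw array bytes (a suggestion, not existing Telfit code):

import hashlib
import numpy as np

def array_fingerprint(a):
    # Identical hex digest on two machines means bit-identical data.
    return hashlib.sha256(np.ascontiguousarray(a).tobytes()).hexdigest()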
Hi, some good news here. By forcing both machines to use Intel MKL and setting the environment variables
export KMP_DETERMINISTIC_REDUCTION=yes
export MKL_CBWR=AVX2
I can get identical results from np.convolve(x, y), scipy.signal.fftconvolve(x, y), and np.polyfit(x, y, 7) with the example arrays given above.
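The same setup can also be done from inside Python, as long as the variables are set before numpy is imported (this assumes an MKL-backed numpy; MKL_CBWR switches on MKL's conditional numerical reproducibility mode):

import os
os.environ["KMP_DETERMINISTIC_REDUCTION"] = "yes"
os.environ["MKL_CBWR"] = "AVX2"

import numpy as np  # import only after the environment is configured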
HOWEVER, if I run those functions with x and y from line 144 in FittingUtilities.pyx,
x = x2 - x2.mean()
y = y2
I can only get identical results from np.convolve(x, y), but not from scipy.signal.fftconvolve(x, y) or np.polyfit(x, y, 7).
np.convolve(x, y) gives:
[-3078157.903749539 -6190528.532554617 -9305887.57730984 ...
6943736.745801406 4624592.394061316 2317106.7351819864]
scipy.signal.fftconvolve(x, y) gives:
[-3078157.903749593 -6190528.53255491 -9305887.577309955 ...
6943736.745801346 4624592.394061731 2317106.7351820203]
vs
[-3078157.9037495367 -6190528.532554853 -9305887.57731001 ...
6943736.745800668 4624592.3940609405 2317106.7351819077]
np.polyfit(x, y, 7) gives:
[ 1.9202966156872722e-01 -1.0876553165167625e-01 -1.0128377142562083e+01
1.1385585653605061e+01 1.5087547496649788e+02 -2.2587957797804717e+02
-7.0047669935642043e+03 4.8989581803430733e+05]
vs
[ 1.9202965875671646e-01 -1.0876553899666826e-01 -1.0128377043461651e+01
1.1385585807815550e+01 1.5087547446584793e+02 -2.2587957736996029e+02
-7.0047669989023143e+03 4.8989581802489934e+05]
Hi @kgullikson88 ,
After all the digging/studying, I think this level of discrepancy is inevitable.
However, before closing this issue, I would like to suggest switching the default from ReduceResolution to ReduceResolution2. This can reduce the level of discrepancy between machines and potentially give better results, with only a small runtime penalty.
I am good with that switch in principle. Could you make some plots or something comparing the two functions, and include those in a PR that makes the change?
Done, let me know if you need more material on this.
Hi @gully and @kgullikson88 ,
While testing our IGRINS RV code on different machines, we got different RVs from them (up to 10 m/s...). After a day of digging, I found that it all started from Telfit. On different machines, the best-fit results from TelluricFitter.Fit are quite different. For example, the last printed fitting results from one machine gave:
the other one gave:
I tracked down the issue to the scipy.optimize.leastsq that you used at this line. If, instead of scipy.optimize.leastsq, we use scipy.optimize.least_squares, i.e., change the fitting line from ... to ...
then one machine gives:
and the other gives:
much more consistent results.
Because the two functions have different outputs, it looks like this will not be a quick fix: output[3] of scipy.optimize.least_squares is the Jacobian ("jac : ndarray, sparse matrix or LinearOperator, shape (m, n)"), not scipy.optimize.leastsq's message string ("mesg : str") anymore. Really need help here. Thank you!
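For concreteness, here is a sketch of how the two return conventions differ; errfunc and p0 below are made-up stand-ins for Telfit's residual function and starting parameters:

import numpy as np
from scipy.optimize import leastsq, least_squares

def errfunc(p):
    # Made-up residual function standing in for Telfit's.
    return np.array([p[0] - 1.0, p[1] - 2.0, p[0] * p[1] - 2.0])

p0 = np.array([0.5, 0.5])

# leastsq with full_output=True returns a 5-tuple:
# (best-fit params, covariance, info dict, message string, status flag)
fitpars, cov_x, infodict, mesg, ier = leastsq(errfunc, p0, full_output=True)

# least_squares instead returns a single OptimizeResult object:
result = least_squares(errfunc, p0)
fitpars = result.x        # best-fit parameters
mesg = result.message     # termination message (replaces the old output[3])
jac = result.jac          # Jacobian at the solution
success = result.success  # convergence flag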