kgullikson88 / Telluric-Fitter

Telluric fitting made easy
http://telfit.readthedocs.org/en/latest/
MIT License

Different fitting results between machines #31

Closed shihyuntang closed 3 years ago

shihyuntang commented 3 years ago

Hi @gully and @kgullikson88 ,

While testing our IGRINS RV code on different machines, we got different RVs from them (differences can be up to 10 m/s......). After a day of digging, I found out that it all traces back to Telfit: on different machines, the best-fit results from TelluricFitter.Fit are quite different.

For example, the last printed fitting results from one machine were:

Parameter       Value       Fitting?    Bounds
-------------   -----       -----       -----
h2o         4.20078E+01 True        1 - 99
co2         1.61491E+05 True        1 - inf
n2o         5.00000E-02 True        1e-05 - inf
ch4         1.35668E+00 True        0.1 - 10

the other one gave:

Parameter       Value       Fitting?    Bounds
-------------   -----       -----       -----
h2o         4.16462E+01 True        1 - 99
co2         1.06081E+05 True        1 - inf
n2o         5.00000E-02 True        1e-05 - inf
ch4         1.36286E+00 True        0.1 - 10

I tracked down the issue to the scipy.optimize.leastsq call that you use at this line. If, instead of scipy.optimize.leastsq, we use scipy.optimize.least_squares, i.e., change the fitting line from

output = leastsq(self.FitErrorFunction, fitpars, full_output=True, epsfcn=1e-1)

to

output = least_squares(self.FitErrorFunction, fitpars)

Then one machine gives:

Parameter       Value       Fitting?    Bounds
-------------   -----       -----       -----
h2o         2.19997E+01 True        1 - 99
co2         3.67500E+02 True        1 - inf
n2o         5.00000E-02 True        1e-05 - inf
ch4         1.80000E+00 True        0.1 - 10

and the other gives:

Parameter       Value       Fitting?    Bounds
-------------   -----       -----       -----
h2o         2.19966E+01 True        1 - 99
co2         3.67500E+02 True        1 - inf
n2o         5.00000E-02 True        1e-05 - inf
ch4         1.80003E+00 True        0.1 - 10

These are much more consistent results.

Because the two functions have different output formats, it looks like this will not be a quick fix: downstream code uses output[3], which for scipy.optimize.leastsq is the message string mesg, while scipy.optimize.least_squares returns an OptimizeResult whose fields are accessed by name (e.g. jac, an "ndarray, sparse matrix or LinearOperator, shape (m, n)", and message).
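
To make the mapping concrete, here is a minimal standalone sketch (a toy residual function stands in for FitErrorFunction; this is not Telfit code) showing how the two return formats correspond:

import numpy as np
from scipy.optimize import leastsq, least_squares

def residuals(p):
    # Toy residual function standing in for self.FitErrorFunction.
    x = np.linspace(0.0, 1.0, 50)
    return p[0] * x + p[1] - (2.0 * x + 0.5)

p0 = [1.0, 0.0]

# Old API: a tuple; output[0] is the solution, output[3] the message string.
output = leastsq(residuals, p0, full_output=True, epsfcn=1e-1)
best_old, mesg = output[0], output[3]

# New API: an OptimizeResult; the message lives in .message and the Jacobian
# (ndarray, sparse matrix or LinearOperator) in .jac.
result = least_squares(residuals, p0)
best_new, message, jac = result.x, result.message, result.jac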

Really need help here. Thank you!

shihyuntang commented 3 years ago

Also, scipy.optimize.least_squares is much faster than scipy.optimize.leastsq when run with the default tolerances. So... there seems to be a trade-off between runtime and fitting precision here.

kgullikson88 commented 3 years ago

Are the results consistent on a given machine? I.e., if you run it several times on one machine, do you get basically the same result? Or do the results vary as much within a machine as they do between machines? I would be surprised if scipy varied much between OS's or something, but I could see things being sensitive to potential randomness in the initialization or something.

From what I remember, I did put a good amount of work into the optimization parameters to get things to consistently converge. So I am definitely hesitant to change the optimization method without a lot of validation work that I don't have time for now.

shihyuntang commented 3 years ago

Yes, results are consistent on the same machine (I reran the demo data many times, so I am sure about this). So far we have tested the data on Mac, Ubuntu, and Red Hat, and all three gave different results. I am sure that the input data to the Fit function are the same between machines, but I am not sure what goes wrong inside Fit or the other functions.

Replacing it with scipy.optimize.least_squares causes other issues later on... so, yes, I agree that this part of the code is too critical for a quick modification without testing. However, did you ever see this kind of between-machine discrepancy while testing Telfit?

Thank you!

kgullikson88 commented 3 years ago

Hmm, that is odd and concerning. I don't recall seeing that in initial development, but I am not sure I explicitly checked. That was also a decade ago, so I don't really remember details at that level. I guess the last hail mary would be: are the Python environments the same on the different computers? Especially the scipy version, I guess? Or is there a known scipy issue with scipy.optimize.leastsq? It seems like something people would have found in other contexts if that function gave different results on different OS's.
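
Something like this on each machine would show the versions and which BLAS backend numpy was built against (standard introspection calls, nothing Telfit-specific):

import sys
import numpy
import scipy

print(sys.version)
print("numpy:", numpy.__version__)
print("scipy:", scipy.__version__)
numpy.show_config()  # which BLAS/LAPACK (MKL, OpenBLAS, ...) numpy links against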

shihyuntang commented 3 years ago

Hi @kgullikson88 ,

Sorry for the late reply. Here are some of my findings after more digging. First, the good news: after making sure all systems are on the same scipy and NumPy versions, I can get identical results from two machines running macOS 10.15.7 & 11.2.3 with the same Fortran compiler: GNU Fortran (Homebrew GCC 7.5.0_3) 7.5.0.

By identical I mean that the h2o and X^2 values in the chisq_summary.dat output by examples/fit_example/Fit.py are the same down to the 10th digit.

However, values between the Ubuntu and Mac machines are still different.

On Ubuntu:

h2o X^2
54  0.235288
54  0.235288
54  0.235288
55.2606176464   0.258027
53.8585871631   0.220858
55.0749262449   0.274078
53.5756656382   0.212992
54.703325034    0.25979
53.0106362097   0.199938
53.9608542816   0.222895
51.8794464668   0.183323
52.4733351045   0.190659
51.6553386305   0.179918
52.1784989922   0.193145
51.7132900214   0.180796
51.6868576119   0.181124
51.6584905597   0.179876
52.1826459104   0.189396
51.6521866944   0.180012
51.657188547    0.179998
51.6583603585   0.179888
51.6584775396   0.179968
51.6584892577   0.18013
51.6584904295   0.179876

On Mac:

h2o X^2
54  0.224346
54  0.224346
54  0.224346
54.891717562    0.266551
52.9307450765   0.2161
53.5850150648   0.21364
53.2878114005   0.205976
54.021484902    0.224917
52.5735224967   0.212621
53.0337941823   0.217462
53.2450652335   0.20502
53.9692383476   0.233988
53.162098951    0.2032
53.8678284146   0.220375
53.0053323898   0.200053
53.6761971659   0.215583
52.6917082273   0.213801
52.955406094    0.19945
53.6151633291   0.214001
52.9064245889   0.198321
53.5552826565   0.212461
52.8123522338   0.215056
52.8966683745   0.215796
52.9054453348   0.1983
53.5540854845   0.212455
52.903487989    0.215877
52.9052494839   0.215904
52.9054257497   0.198319
52.9054433763   0.215921
52.905445139    0.1983
52.9054452369   0.1983
52.9054452859   0.1983

It caught my eye that with the same h2o value, 54, the two machines already give different X^2 values. After some tracking, I found that the discrepancy starts at Line 707,

model = FittingUtilities.ReduceResolution(model, resolution)

in the GenerateModel function.

Results are identical between Ubuntu and Mac at Line 676,

model = self.Modeler.MakeModel(pressure, temperature, wavenum_start, wavenum_end, angle, h2o, co2, o3, n2o, co,
                                           ch4, o2, no, so2, no2, nh3, hno3, lat=lat, alt=alt, wavegrid=None,
                                           resolution=None, vac2air=air_wave_overwrite)
print(model.x, model.y)

we have, on Ubuntu:

[630.7348461410271 630.7349935640831 630.735140987139  ...
 681.7274492307195 681.7275966537753 681.7277440768313]
[0.9999813437461853 0.9999809465631525 0.9999805829784827 ...
 0.9998752856131503 0.9998762534948515 0.9998770952224731]

On Mac:

[630.7348461410271 630.7349935640831 630.735140987139  ...
 681.7274492307195 681.7275966537753 681.7277440768313]
[0.9999813437461853 0.9999809465631525 0.9999805829784827 ...
 0.9998752856131503 0.9998762534948515 0.9998770952224731]

BUT, the deviation starts at Line 707,

model = FittingUtilities.ReduceResolution(model, resolution)
print(model.x, model.y)

we have, on Ubuntu:

[630.7348461410271 630.7349935640831 630.735140987139  ...
 681.7274492307195 681.7275966537753 681.7277440768313]
[0.9999106645364759 0.9999124500210417 0.9999141916047274 ...
 0.9999050645314573 0.9999069698621474 0.9999088371642161]

On Mac:

[630.7348461410271 630.7349935640831 630.735140987139  ...
 681.7274492307195 681.7275966537753 681.7277440768313]
[0.9999106645364759 0.9999124500210416 0.999914191604727  ...
 0.9999050645314572 0.9999069698621474 0.9999088371642163]

and the deviation carries on to Line 708,

model = FittingUtilities.RebinData(model, data.x)
print(model.x, model.y)

we have, on Ubuntu:

[650.9096  650.91289 650.91618 ... 661.91209 661.91415 661.91621]
[0.9986356206429381 0.9988175270366134 0.9989386076560012 ...
 0.9997714869438686 0.9998380392219246 0.9998752451923993]

On Mac:

[650.9096  650.91289 650.91618 ... 661.91209 661.91415 661.91621]
[0.9986356206429381 0.9988175270366138 0.9989386076560012 ...
 0.9997714869438689 0.9998380392219246 0.9998752451923993]

Since we call the GenerateModel function many times during the leastsq fitting... you can see how this ends up. Do you have any idea why the functions under FittingUtilities (or Cython) are giving us such a hard time?
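
In case it helps reproduce this, one way to quantify the last-bit drift (a sketch; the .npy file names are hypothetical) is to dump model.y on each machine and compare in units in the last place (ULPs):

import numpy as np

# Hypothetical dumps: np.save("model_y_ubuntu.npy", model.y) on each machine.
y_ubuntu = np.load("model_y_ubuntu.npy")
y_mac = np.load("model_y_mac.npy")

print("max abs diff:", np.abs(y_ubuntu - y_mac).max())
# For float64 values of the same sign, the int64 bit patterns are ordered,
# so the difference of the views counts ULPs between the two values.
print("max ULP diff:", np.abs(y_ubuntu.view(np.int64) - y_mac.view(np.int64)).max())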

Thank you.

shihyuntang commented 3 years ago

Updates:

By replacing

model = FittingUtilities.ReduceResolution(model, resolution)

with

model = FittingUtilities.ReduceResolution2(model, resolution)

I am able to get identical results until line 732:

data.cont = FittingUtilities.Continuum(data.x, resid, fitorder=self.continuum_fit_order, lowreject=2,
                                               highreject=3)

The discrepancy shows up again because of line 144 in FittingUtilities.pyx,

fit = np.poly1d(np.polyfit(x2 - x2.mean(), y2, fitorder)) 

The fitted coefficients differ starting at around the 8th digit. An example:

array([ 4.8989581802489894e+05, -7.0047669989025399e+03,
       -2.2587957736986473e+02,  1.5087547446587456e+02,
        1.1385585807807672e+01, -1.0128377043463260e+01,
       -1.0876553899647047e-01,  1.9202965875675754e-01])

vs

array([ 4.8989581803430786e+05, -7.0047669935640715e+03,
      -2.2587957797811950e+02,  1.5087547496647005e+02,
       1.1385585653610732e+01, -1.0128377142560113e+01,
      -1.0876553165186138e-01,  1.9202966156868048e-01])

I also tried

np.polynomial.polynomial.polyfit(x2 - np.mean(x2), y2, fitorder)
np.polynomial.polynomial.Polynomial.fit(x2 - np.mean(x2), y2, fitorder).convert().coef

instead of np.polyfit(x2 - np.mean(x2), y2, fitorder), but no luck there either...
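
For what it's worth, the amplification is probably just conditioning: a degree-7 fit has an ill-conditioned design matrix, so last-bit input differences get magnified in the coefficients. A rough illustration with a toy x range (not the real x2):

import numpy as np

x = np.linspace(-5.0, 5.0, 2000)  # stand-in for x2 - x2.mean()
V = np.vander(x, 8)               # design matrix for a degree-7 polynomial fit
print("condition number: %.3e" % np.linalg.cond(V))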

kgullikson88 commented 3 years ago

Hmm... this is starting to seem like a difference in the underlying Fortran/C code that numpy is calling. Maybe something like using Intel/MKL on one and OpenBLAS on the other? If you make some identical numpy arrays, does numpy.convolve give different results? How about scipy.signal.fftconvolve? If those are different, I don't think there is much we can do here.

The identical result from ReduceResolution2 is encouraging, because that one uses a Cython function instead of a numpy/scipy function. It just seems to point once again to the root cause being something that numpy is doing under the hood.

shihyuntang commented 3 years ago

Interesting... with

import numpy as np

x = np.array([ 4.8989581802489894e+05, -7.0047669989025399e+03,
       -2.2587957736986473e+02,  1.5087547446587456e+02,
        1.1385585807807672e+01, -1.0128377043463260e+01,
       -1.0876553899647047e-01,  1.9202965875675754e-01])
y = x * 123.5123

np.convolve(x, y) gives:

array([ 2.9642694170332156e+13, -8.4769111408234192e+11,
       -2.1274766406506424e+10,  1.8649245144027988e+10,
        1.1230761294922433e+09, -1.2538184692065442e+09,
        6.5394999361436795e+06,  2.4416406305336077e+07,
       -6.8768264709222293e+05, -4.3254782224562587e+04,
        1.9521419709781570e+04,  8.1221405530880395e+02,
       -4.7898905282819726e+02, -5.1594075072325101e+00,
        4.5545642128112975e+00])

vs

array([ 2.9642694170332156e+13, -8.4769111408234192e+11,
       -2.1274766406506424e+10,  1.8649245144027988e+10,
        1.1230761294922435e+09, -1.2538184692065442e+09,
        6.5394999361436777e+06,  2.4416406305336077e+07,
       -6.8768264709222293e+05, -4.3254782224562587e+04,
        1.9521419709781570e+04,  8.1221405530880395e+02,
       -4.7898905282819726e+02, -5.1594075072325101e+00,
        4.5545642128112975e+00])

As for scipy.signal.fftconvolve(x, y), it gives:

array([ 2.9642694170332156e+13, -8.4769111408234131e+11,
       -2.1274766406504925e+10,  1.8649245144025188e+10,
        1.1230761294921207e+09, -1.2538184692071297e+09,
        6.5394999393915813e+06,  2.4416406305493101e+07,
       -6.8768264690513606e+05, -4.3254778974914552e+04,
        1.9521419116274516e+04,  8.1221379254659018e+02,
       -4.7899185384114583e+02, -5.1578109741210936e+00,
        4.5551582336425778e+00])

vs

array([ 2.9642694170332156e+13, -8.4769111408234131e+11,
       -2.1274766406504925e+10,  1.8649245144025188e+10,
        1.1230761294921207e+09, -1.2538184692071297e+09,
        6.5394999393915813e+06,  2.4416406305493101e+07,
       -6.8768264690513606e+05, -4.3254778974914552e+04,
        1.9521419116274516e+04,  8.1221379254659018e+02,
       -4.7899185384114583e+02, -5.1578109741210936e+00,
        4.5551582336425778e+00])

So, np.convolve(x, y) can give different results, but scipy.signal.fftconvolve(x, y) seems to deliver identical results in this test.

shihyuntang commented 3 years ago

Hi, some good news here. By forcing both machines to use Intel/MKL and setting the environment variables

export KMP_DETERMINISTIC_REDUCTION=yes
export MKL_CBWR=AVX2
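
(The same settings can be made from inside Python, as long as they land in the environment before numpy first loads MKL — a sketch:)

import os
# Must be set before numpy (and hence MKL) is imported for the first time.
os.environ["KMP_DETERMINISTIC_REDUCTION"] = "yes"
os.environ["MKL_CBWR"] = "AVX2"  # pin MKL to a single code path across CPUs
import numpy as np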

I can get identical results from np.convolve(x, y), scipy.signal.fftconvolve(x, y), and np.polyfit(x, y, 7) with the example array given above.

HOWEVER, if I run those functions with the x and y from line 144 of FittingUtilities.pyx,

x = x2 - x2.mean()
y = y2

we only get identical results from np.convolve(x, y), not from scipy.signal.fftconvolve(x, y) or np.polyfit(x, y, 7). np.convolve(x, y) gives:

[-3078157.903749539  -6190528.532554617  -9305887.57730984   ...
  6943736.745801406   4624592.394061316   2317106.7351819864]

scipy.signal.fftconvolve(x, y) gives:

[-3078157.903749593  -6190528.53255491   -9305887.577309955  ...
  6943736.745801346   4624592.394061731   2317106.7351820203]

vs

[-3078157.9037495367 -6190528.532554853  -9305887.57731001   ...
  6943736.745800668   4624592.3940609405  2317106.7351819077]

np.polyfit(x, y, 7) gives:

[ 1.9202966156872722e-01 -1.0876553165167625e-01 -1.0128377142562083e+01
  1.1385585653605061e+01  1.5087547496649788e+02 -2.2587957797804717e+02
 -7.0047669935642043e+03  4.8989581803430733e+05]

vs

[ 1.9202965875671646e-01 -1.0876553899666826e-01 -1.0128377043461651e+01
  1.1385585807815550e+01  1.5087547446584793e+02 -2.2587957736996029e+02
 -7.0047669989023143e+03  4.8989581802489934e+05]

shihyuntang commented 3 years ago

Hi @kgullikson88 ,

After all the digging/studying, I think this level of discrepancy is inevitable. However, before closing this issue, I would like to suggest switching the default from ReduceResolution to ReduceResolution2. This reduces the level of discrepancy between machines and potentially gives better results, with little runtime penalty.

kgullikson88 commented 3 years ago

I am good with that switch in principle. Could you make some plots or something comparing the two functions, and include those in a PR that makes the change?
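
Something along these lines would be plenty (a sketch; it assumes model and resolution are the same objects passed to ReduceResolution in the GenerateModel snippets above):

import copy
import matplotlib.pyplot as plt
import FittingUtilities

# Run both functions on identical copies of the same model.
low1 = FittingUtilities.ReduceResolution(copy.deepcopy(model), resolution)
low2 = FittingUtilities.ReduceResolution2(copy.deepcopy(model), resolution)

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
ax1.plot(low1.x, low1.y, label="ReduceResolution")
ax1.plot(low2.x, low2.y, label="ReduceResolution2", alpha=0.7)
ax1.legend(loc="best")
ax2.plot(low1.x, low1.y - low2.y)  # same wavelength grid, so direct difference
ax2.set_xlabel("wavelength (nm)")
ax2.set_ylabel("difference")
plt.show()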

shihyuntang commented 3 years ago

Done, let me know if you need more material on this.