keflavich / pyradex

Python interface to RADEX
BSD 3-Clause "New" or "Revised" License
18 stars 12 forks source link

Jupyter Kernal Crashing Upon running Radex() #38

Open SpacialTree opened 2 years ago

SpacialTree commented 2 years ago

I've been having an issue where I run the following code:

import pyradex
# The path should refer to where the molecular data files are stored.
data='./data/'
R = pyradex.Radex(density=10**5, column=10**15, temperature=50, species='sio', datapath=data)
Tlvg = R(escapeProbGeom='lvg')
Tlvg

Using jupyter lab, the kernal will either crash and restart, or gives the attached table of values with zeroes as brightness temperatures. image

Using ipython on my local machine and running the same code, ipython will crash and give the same error in Issue #30. Once, instead of crashing it gave the same attached table of values with zeroes as brightness temperatures.

Run correctly, the table should look something like this:

<Table length=20>
       Tex                  tau           frequency  upperstateenergy upperlevel lowerlevel     upperlevelpop          lowerlevelpop            brightness                T_B
        K                                    GHz            K                                                                              erg / (cm2 Hz s sr)             K
     float64              float64          float64       float64        bytes6     bytes6          float64                float64                float64                float64
------------------ --------------------- ----------- ---------------- ---------- ---------- ---------------------- ---------------------- ---------------------- ----------------------
 43.00127043761652    1.6551126000974052   43.423853             2.08     01         00        0.08887736421923301   0.031096939370898864 1.8814463305880553e-14      32.47605365459304
 38.87877786030384     6.776549531318966   86.846985             6.25     02         01         0.1330701596849882    0.08887736421923301  8.257497311145349e-14      35.63420519567343
37.340311997153876    13.843238069818273  130.268683             12.5     03         02        0.15757756362185393     0.1330701596849882 1.7516230636607365e-13      33.59609010935477
 37.58801347625897      20.1529917602968  173.688238            20.84     04         03        0.16230317399422695    0.15757756362185393  3.073500959343751e-13      33.16046879773295
 36.18337967090943    25.386539729231924  217.104919            31.26     05         04        0.14873642779677212    0.16230317399422695 4.4875517508389066e-13      30.98833701318935
 33.51212300554245     28.42799825707387  260.518009            43.76     06         05         0.1210428844252477    0.14873642779677212  5.738125559477095e-13     27.518367974592817
29.562182976653645    28.566256610346066  303.926812            58.35     07         06        0.08527102613157889     0.1210428844252477  6.469563663579779e-13      22.79633808098519
24.270021493757877     25.43467910467738  347.330581            75.02     08         07        0.04862666783083065    0.08527102613157889  6.243387188244012e-13      16.84466393116842
  17.9407230793653    18.787824978339522  390.728631            93.77     09         08       0.019109279304423483    0.04862666783083065  4.760644179573459e-13     10.149470338438322
12.327633352016502       9.2313583198246  434.120218            114.6     10         09       0.003896931916487397   0.019109279304423483  2.723271071224652e-13      4.703259867462964
 9.028588435853798    2.1165525894261155  477.504635           137.52     11         10      0.0003372015797760192   0.003896931916487397 1.2080733290093796e-13     1.7245120799255027
11.336345434950275   0.17620777896751225  520.881187           162.52     12         11     4.0403250404407386e-05  0.0003372015797760192  4.167265836377032e-14     0.4999217007303331
18.662274525439337  0.018110530772826736  564.249098            189.6     13         12     1.0224944523856482e-05 4.0403250404407386e-05  1.454685604294231e-14       0.14871525549452
21.264971660003923  0.004455645988704638  607.607719           218.76     14         13       2.78705125200321e-06 1.0224944523856482e-05  5.000528684488013e-15    0.04408567405930072
22.188434559698436 0.0012267247511957037   650.95629            250.0     15         14      7.288317528932284e-07   2.78705125200321e-06  1.614841142844157e-15   0.012403786367217581
22.918630347532385 0.0003247326221445022  694.294114           283.32     16         15     1.8128792048655494e-07  7.288317528932284e-07   4.88530158972442e-16   0.003298621917150412
23.599749117026015 8.171601144834894e-05  737.620462           318.72     17         16      4.290102895084982e-08 1.8128792048655494e-07  1.388757804192125e-16  0.0008307851564600045
 24.17809073361704 1.958266336001464e-05  780.934648            356.2     18         17      9.624760931550455e-09  4.290102895084982e-08  3.704610742103023e-17 0.00019771577695101698
23.698298801769916 4.519659897618835e-06    824.2359           395.76     19         18     1.9113073923531206e-09  9.624760931550455e-09  8.662373284063449e-18    4.1501325634804e-05
 22.68989207929122  9.28309376935288e-07 867.5235458           437.39     20         19      3.207320330689684e-10 1.9113073923531206e-09 1.6974743004721564e-18   7.34122936961071e-06
autocorr commented 2 years ago

I suspect we're having a similar problem, so I'll post some information in this issue. When the Python kernel dies in Jupyter, I suspect that's from a fatal error in the wrapped Fortran part pyradex: the function SGEIR from the slatec.f file in RADEX. If the kernel dies, it probably doesn't print the error to standard output for one to see. What do you get when you run the code in an IPython console session? This is what I get when running the README example:

In [1]: import pyradex

In [2]: R = pyradex.Radex(collider_densities={'oH2':900,'pH2':100}, column=1e16,
   ...:  species='co', temperature=20)

In [3]: Tlvg = R(escapeProbGeom='lvg')
 ***MESSAGE FROM ROUTINE SGEIR IN LIBRARY SLATEC.
 ***POTENTIALLY RECOVERABLE ERROR, PROG ABORTED, TRACEBACK REQUESTED
 *  SINGULAR MATRIX A - NO SOLUTION
 *  ERROR NUMBER = -4
 *
 ***END OF MESSAGE

 ***JOB ABORT DUE TO UNRECOVERED ERROR.
0          ERROR MESSAGE SUMMARY
 LIBRARY    SUBROUTINE MESSAGE START             NERR     LEVEL     COUNT
 SLATEC     SGEIR      SINGULAR MATRIX A -         -4         1         1

When using different values for HCO+ instead of CO, it didn't error per se, but produced similar very small radiation temperatures and level populations as your SiO trials.

The error above comes from trying to solve a singular matrix, i.e. on that is not invertable and is degenerate. This is probably because it's getting a bunch of small, practically random numbers (although given the next test below, I'm not so sure). The way SGEIR solves things is a pretty robust way to do it (LU decomposition with back-substition), so it'll power through with a solution even if the rate matrix is very poorly conditioned. Perhaps f2py changed some default behavior in how some variables are typed, or perhaps a unit conversion has gone awry to give too small (or large) numbers for some things?

Running the code for the HCO+ test example in the RADEX source with debug set to True so RADEX can print some values while it's running shows that it errors on the first iteration:

In [4]: R = pyradex.Radex(species="hco+", density=1e4, temperature=20.0, column=
   ...: 1e13, deltav=1.0, tbackground=2.73, escapeProbGeom="sphere", debug=True)

In [5]: R()
 niter =            0
   3.3590602320021233E-005  -5.3706867440007082E-005  -1.0000000000000000E-026  -1.0000000000000000E-026
  -3.3590602320021233E-005   8.4614333904682631E-005  -4.2664447987880532E-004  -1.0000000000000000E-026
  -1.0000000000000000E-026  -3.0907466464675555E-005   4.4554304251878180E-004  -1.4894989733142689E-003
  -1.0000000000000000E-026  -1.0000000000000000E-026  -1.8898562639976494E-005   1.4983283767312687E-003
   4.4904507087917489E-005  -5.6006867440007084E-005  -1.1999999999999999E-006  -9.0999999999999997E-007
  -3.9161198760006466E-005   9.4035277243344496E-005  -4.3044447987880530E-004  -2.5000000000000002E-006
  -3.1572656943602856E-006  -3.5035465812979639E-005   4.5534785993392750E-004  -1.4937989733142690E-003
  -1.7638651316949698E-006  -2.0007375828402172E-006  -2.2066397646252852E-005   1.5091048790461584E-003
 inverting non-reduced matrix...
 ***MESSAGE FROM ROUTINE SGEIR IN LIBRARY SLATEC.
 ***POTENTIALLY RECOVERABLE ERROR, PROG ABORTED, TRACEBACK REQUESTED
 *  SINGULAR MATRIX A - NO SOLUTION
 *  ERROR NUMBER = -4
 *
 ***END OF MESSAGE

 ***JOB ABORT DUE TO UNRECOVERED ERROR.
0          ERROR MESSAGE SUMMARY
 LIBRARY    SUBROUTINE MESSAGE START             NERR     LEVEL     COUNT
 SLATEC     SGEIR      SINGULAR MATRIX A -         -4         1         1

Unfortunately these numbers (the first four rows and columns of the rate matrix) after first adding the radiative terms and then adding the collisional terms, are precisely what they should be. The next step calls SGEIR (where it says "inverting non-reduced matrix..."), so I'm lost :confused: . Further debugging requires reaching into SGEIR which is really gnarly. Perhaps f2py changed something in how the slatec.f file is interpreted? One approach may be to try several different older versions of numpy/f2py and see if you go back far enough one of them works.

keflavich commented 2 years ago

@autocorr what version of numpy + f2py are you using?

The weirdest thing is that Savannah encountered these errors when running exactly the version I was running - i.e., the same executable on the same machine. That leads me to suspect that the problem comes from different versions of libraries getting pulled in depending on which global PATHs are set, which hints perhaps at a version mismatch between the compiled version and the libraries. But I don't really know how to track that sort of thing down when fortran's involved.

autocorr commented 2 years ago

@keflavich I'm running this on numpy 1.20.3 in Python 3.9.7 from a new Anaconda install. That's a good point, I think you're onto something. SGEIR calls out to different LAPACK routines which are probably dynamically linked libraries that could be different versions depending on the path/environment of the machine. Re-compiling / re-installing things gives the same error though, so that may rule out the "different machines using the same binary".

Unfortunately I'm not sure how to inspect what libs it's using at runtime either...

keflavich commented 2 years ago

I am worried about specific versions of numpy doing weird things with the compiled fortran too - there are occasional changes in small things. Also, not so sure if changing the version of numpy (upgrading it) without re-compiling the fortran has any possible effects.

keflavich commented 2 months ago

@autocorr you ever get anywhere further with this? I'm still encountering the same issue:

 ***MESSAGE FROM ROUTINE SGEIR IN LIBRARY SLATEC.
 ***POTENTIALLY RECOVERABLE ERROR, PROG ABORTED, TRACEBACK REQUESTED
 *  SINGULAR MATRIX A - NO SOLUTION
 *  ERROR NUMBER = -4
 *
 ***END OF MESSAGE

 ***JOB ABORT DUE TO UNRECOVERED ERROR.
0          ERROR MESSAGE SUMMARY
 LIBRARY    SUBROUTINE MESSAGE START             NERR     LEVEL     COUNT
 SLATEC     SGEIR      SINGULAR MATRIX A -         -4         1         1

but it affects linux (redhat), not mac. So is this a difference in the version of the slatec library? Or more fundamentally how nix is handling floats?