SpacialTree opened this issue 2 years ago (status: Open).
I suspect we're having a similar problem, so I'll post some information in this issue. When the Python kernel dies in Jupyter, I suspect it's from a fatal error in the Fortran code that pyradex wraps: the function `SGEIR` from the `slatec.f` file in RADEX. If the kernel dies, the error probably never reaches standard output where you could see it. What do you get when you run the code in an IPython console session? This is what I get when running the README example:
```
In [1]: import pyradex

In [2]: R = pyradex.Radex(collider_densities={'oH2':900,'pH2':100}, column=1e16,
   ...: species='co', temperature=20)

In [3]: Tlvg = R(escapeProbGeom='lvg')
 ***MESSAGE FROM ROUTINE SGEIR IN LIBRARY SLATEC.
 ***POTENTIALLY RECOVERABLE ERROR, PROG ABORTED, TRACEBACK REQUESTED
 *  SINGULAR MATRIX A - NO SOLUTION
 *  ERROR NUMBER = -4
 *
 ***END OF MESSAGE
 ***JOB ABORT DUE TO UNRECOVERED ERROR.
0          ERROR MESSAGE SUMMARY
 LIBRARY    SUBROUTINE MESSAGE START             NERR     LEVEL     COUNT
 SLATEC     SGEIR      SINGULAR MATRIX A -         -4         1         1
```
When using HCO+ with different input values instead of CO, it didn't error per se, but it produced very small radiation temperatures and level populations, similar to your SiO trials.
The error above comes from trying to solve a singular matrix, i.e. one that is degenerate and has no inverse. This is probably because it's getting a bunch of small, practically random numbers (although, given the next test below, I'm not so sure). `SGEIR` solves the system in a fairly robust way (LU decomposition with back-substitution), so it will power through with a solution even if the rate matrix is very poorly conditioned; for it to give up, it has to hit an exactly zero pivot. Perhaps `f2py` changed some default behavior in how some variables are typed, or perhaps a unit conversion has gone awry and produced numbers that are too small (or too large)?
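To make the "singular matrix" point concrete, here is a hedged toy sketch in NumPy. The numbers are invented and this is not RADEX's rate matrix or its actual code path. A population rate matrix whose columns each sum to zero is singular by construction, so statistical-equilibrium solvers commonly replace one equation with the normalization condition that the populations sum to one; whether RADEX does exactly this is an assumption here, not something taken from its source.

```python
import numpy as np

# Hypothetical two-level toy system (invented numbers, not RADEX's rates):
# a = total upward rate 1 -> 2, b = total downward rate 2 -> 1.
a, b = 1.0, 2.0

# dn/dt = M @ n. Each column sums to zero because population only moves
# between levels, so M is singular by construction.
M = np.array([[-a, b],
              [a, -b]])

try:
    np.linalg.solve(M, np.zeros(2))
    singular = False
except np.linalg.LinAlgError:
    # LAPACK's LU factorization hits an exactly zero pivot.
    singular = True

# Common fix: replace one equation with the normalization n1 + n2 = 1.
M_reduced = M.copy()
M_reduced[-1, :] = 1.0
n = np.linalg.solve(M_reduced, np.array([0.0, 1.0]))

print("full rate matrix singular:", singular)
print("level populations:", n)  # satisfies a * n1 == b * n2
print("condition number of reduced matrix:", np.linalg.cond(M_reduced))
```

The unreduced system fails in the same way `SGEIR` reports (a zero pivot), while the reduced one solves cleanly to populations of 2/3 and 1/3.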
Running the HCO+ test example from the RADEX source with `debug` set to `True`, so that RADEX prints some values while it runs, shows that it fails on the first iteration:
```
In [4]: R = pyradex.Radex(species="hco+", density=1e4, temperature=20.0, column=
   ...: 1e13, deltav=1.0, tbackground=2.73, escapeProbGeom="sphere", debug=True)

In [5]: R()
 niter = 0
  3.3590602320021233E-005  -5.3706867440007082E-005  -1.0000000000000000E-026  -1.0000000000000000E-026
 -3.3590602320021233E-005   8.4614333904682631E-005  -4.2664447987880532E-004  -1.0000000000000000E-026
 -1.0000000000000000E-026  -3.0907466464675555E-005   4.4554304251878180E-004  -1.4894989733142689E-003
 -1.0000000000000000E-026  -1.0000000000000000E-026  -1.8898562639976494E-005   1.4983283767312687E-003
  4.4904507087917489E-005  -5.6006867440007084E-005  -1.1999999999999999E-006  -9.0999999999999997E-007
 -3.9161198760006466E-005   9.4035277243344496E-005  -4.3044447987880530E-004  -2.5000000000000002E-006
 -3.1572656943602856E-006  -3.5035465812979639E-005   4.5534785993392750E-004  -1.4937989733142690E-003
 -1.7638651316949698E-006  -2.0007375828402172E-006  -2.2066397646252852E-005   1.5091048790461584E-003
 inverting non-reduced matrix...
 ***MESSAGE FROM ROUTINE SGEIR IN LIBRARY SLATEC.
 ***POTENTIALLY RECOVERABLE ERROR, PROG ABORTED, TRACEBACK REQUESTED
 *  SINGULAR MATRIX A - NO SOLUTION
 *  ERROR NUMBER = -4
 *
 ***END OF MESSAGE
 ***JOB ABORT DUE TO UNRECOVERED ERROR.
0          ERROR MESSAGE SUMMARY
 LIBRARY    SUBROUTINE MESSAGE START             NERR     LEVEL     COUNT
 SLATEC     SGEIR      SINGULAR MATRIX A -         -4         1         1
```
Unfortunately, these numbers (the first four rows and columns of the rate matrix, first after adding the radiative terms and then after adding the collisional terms) are exactly what they should be. The next step calls `SGEIR` (where it prints "inverting non-reduced matrix..."), so I'm lost :confused:. Further debugging requires reaching into `SGEIR`, which is really gnarly. Perhaps `f2py` changed something in how the `slatec.f` file is interpreted? One approach would be to try progressively older versions of numpy/f2py and see whether going back far enough gives one that works.
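One hedged way to take `SGEIR` out of the picture is to redo the linear algebra in NumPy, which calls LAPACK in double precision. The sketch below pastes in the second group of four rows from the debug output (after the collisional terms) and checks its singular values. It only covers the printed corner of the rate matrix, not the full matrix RADEX hands to `SGEIR`, so a clean result here is a partial sanity check, not proof that the full matrix is nonsingular. It also doesn't matter whether RADEX prints rows or columns: a matrix and its transpose have the same singular values.

```python
import numpy as np

# Second 4x4 block from the debug output above (rate matrix after the
# collisional terms are added). Only this corner of the full matrix is known.
block = np.array([
    [ 4.4904507087917489e-05, -5.6006867440007084e-05, -1.1999999999999999e-06, -9.0999999999999997e-07],
    [-3.9161198760006466e-05,  9.4035277243344496e-05, -4.3044447987880530e-04, -2.5000000000000002e-06],
    [-3.1572656943602856e-06, -3.5035465812979639e-05,  4.5534785993392750e-04, -1.4937989733142690e-03],
    [-1.7638651316949698e-06, -2.0007375828402172e-06, -2.2066397646252852e-05,  1.5091048790461584e-03],
])

# An exactly singular matrix has a zero singular value; a tiny ratio
# sv[-1] / sv[0] means the matrix is badly conditioned.
sv = np.linalg.svd(block, compute_uv=False)
cond = sv[0] / sv[-1]
print("singular values:", sv)
print("condition number:", cond)

# Double-precision LU solve (LAPACK gesv via NumPy) with an arbitrary
# right-hand side, plus the residual as a quick accuracy check.
x = np.linalg.solve(block, np.ones(4))
print("residual norm:", np.linalg.norm(block @ x - np.ones(4)))
```

Dumping the full matrix from the Fortran just before the `SGEIR` call and feeding it to the same check would be the more decisive test.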
@autocorr what version of numpy + f2py are you using?
The weirdest thing is that Savannah encountered these errors when running exactly the same build I was running, i.e., the same executable on the same machine. That leads me to suspect that different versions of libraries get pulled in depending on which environment paths are set, which would point to a version mismatch between the compiled code and the libraries it loads. But I don't really know how to track that sort of thing down when Fortran is involved.
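For comparing a working and a failing setup, a small hedged helper that collects the version details worth posting. It assumes pyradex was installed under the distribution name `pyradex`. Since f2py is bundled with numpy, numpy's version also identifies f2py. `numpy.show_config()` prints the BLAS/LAPACK that numpy itself was built against, which is not necessarily what the pyradex extension links to.

```python
import platform
import sys
from importlib import metadata

import numpy as np


def environment_report(packages=("numpy", "pyradex")):
    """Collect the version details worth pasting into a bug report."""
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        "executable": sys.executable,
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    # f2py is shipped inside numpy, so numpy's version also identifies f2py.
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
    # BLAS/LAPACK that numpy itself was built against (not necessarily
    # what the pyradex extension module links to).
    np.show_config()
```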
@keflavich I'm running this on numpy 1.20.3 in Python 3.9.7 from a new Anaconda install. That's a good point; I think you're onto something. `SGEIR` calls out to LINPACK and BLAS routines (e.g. `SGEFA` and `SGESL`), which could come from dynamically linked libraries whose versions differ depending on the path/environment of the machine. Re-compiling / re-installing things gives the same error, though, so that may rule out the "different machines using the same binary" explanation.
Unfortunately I'm not sure how to inspect what libs it's using at runtime either...
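On Linux, one hedged way to see which shared libraries a process actually has loaded is to read `/proc/self/maps` after importing the module of interest. The helper below is a sketch; it is Linux-only and lists everything mapped into the Python process, not just what pyradex pulled in, so the example compares the list before and after an import. `numpy` is used as a stand-in target so the snippet runs anywhere; on the affected machine you would pass `"pyradex"` instead.

```python
import importlib
import os
import sys


def loaded_shared_libraries(module_name=None):
    """List shared objects mapped into the current process (Linux only).

    If module_name is given it is imported first, so that any libraries
    its compiled extension pulls in are loaded before we look.
    """
    if not sys.platform.startswith("linux"):
        raise OSError("/proc/self/maps is only available on Linux")
    if module_name is not None:
        importlib.import_module(module_name)
    paths = set()
    with open("/proc/self/maps") as maps:
        for line in maps:
            # Columns: address perms offset dev inode [pathname]
            fields = line.split(maxsplit=5)
            if len(fields) == 6:
                path = fields[5].strip()
                if ".so" in os.path.basename(path):
                    paths.add(path)
    return sorted(paths)


if __name__ == "__main__" and sys.platform.startswith("linux"):
    before = set(loaded_shared_libraries())
    # Stand-in target; use "pyradex" on the machine that shows the error.
    after = set(loaded_shared_libraries("numpy"))
    for path in sorted(after - before):
        print(path)
```

Alternatively, running `ldd` on the compiled `.so` extension inside the installed pyradex package shows what it links against at load time; environment variables such as `LD_LIBRARY_PATH` can change which copies get resolved.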
I'm also worried about specific versions of numpy doing weird things with the compiled Fortran; small things change occasionally. I'm also not sure whether upgrading numpy without re-compiling the Fortran could have any effect.
@autocorr did you ever get any further with this? I'm still encountering the same issue:
```
 ***MESSAGE FROM ROUTINE SGEIR IN LIBRARY SLATEC.
 ***POTENTIALLY RECOVERABLE ERROR, PROG ABORTED, TRACEBACK REQUESTED
 *  SINGULAR MATRIX A - NO SOLUTION
 *  ERROR NUMBER = -4
 *
 ***END OF MESSAGE
 ***JOB ABORT DUE TO UNRECOVERED ERROR.
0          ERROR MESSAGE SUMMARY
 LIBRARY    SUBROUTINE MESSAGE START             NERR     LEVEL     COUNT
 SLATEC     SGEIR      SINGULAR MATRIX A -         -4         1         1
```
It affects Linux (Red Hat) but not macOS. So is this a difference in the version of the SLATEC library? Or something more fundamental about how Linux handles floats?
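On the float question, a hedged check is to compare the platform's single-precision constants with what SLATEC's machine-constant function `R1MACH` is documented to return for IEEE 754 arithmetic. In the standard SLATEC library, `SGEIR` is the single-precision routine and calls `R1MACH`; whether RADEX's bundled `slatec.f` differs from the standard version is not something this checks. It also only shows what NumPy sees, not what the compiled Fortran's `R1MACH` actually returns.

```python
import numpy as np

# SLATEC's SGEIR is the single-precision ("S") routine, so look at float32.
f32 = np.finfo(np.float32)

# What R1MACH(1)..R1MACH(4) should return on an IEEE 754 machine,
# written out from the binary format (base 2, 24-bit significand).
expected = {
    "R1MACH(1) smallest positive normal": 2.0 ** -126,
    "R1MACH(2) largest finite": (2.0 - 2.0 ** -23) * 2.0 ** 127,
    "R1MACH(3) smallest relative spacing": 2.0 ** -24,
    "R1MACH(4) largest relative spacing": 2.0 ** -23,
}

# The same quantities as NumPy reports them for this platform.
numpy_values = {
    "R1MACH(1) smallest positive normal": float(f32.tiny),
    "R1MACH(2) largest finite": float(f32.max),
    "R1MACH(3) smallest relative spacing": float(f32.epsneg),
    "R1MACH(4) largest relative spacing": float(f32.eps),
}

for key, value in expected.items():
    status = "ok" if numpy_values[key] == value else "MISMATCH"
    print(f"{key:38s} {value:.8e}  {numpy_values[key]:.8e}  {status}")
```

Mainstream Linux and macOS machines both use IEEE 754 floats, so these should match on both. If they do, the Linux/macOS difference more plausibly comes from how the Fortran was compiled and linked than from the float format itself.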
I've been having an issue where I run the following code:
Using JupyterLab, the kernel either crashes and restarts, or returns the attached table of values with zeroes as brightness temperatures.

![image](https://user-images.githubusercontent.com/23216784/151068398-9cc195a8-bea9-4c4d-a92c-e3663731084e.png)
Running the same code in IPython on my local machine, IPython crashes and gives the same error as in Issue #30. Once, instead of crashing, it gave the same table of values with zeroes as brightness temperatures.
When the code runs correctly, the table should look something like this: