sirocco-rt / sirocco

This is the repository for Sirocco, the radiative transfer code used to model winds in AGN and other systems
GNU General Public License v3.0

"macro atom level has no way out causes" segfault. #486

Closed kslong closed 5 years ago

kslong commented 5 years ago

The problem with the c345 model of a CV turns out to be associated with the fact that the system had not converged, but when I tried to extend the run, python then failed with the error:

matom: macro atom level has no way out 12 0 0

Thus @ssim @jhmatthews I am stuck until someone either fixes the problem or figures out how to get past it in a sensible way, like killing off the photon and continuing. Philosophically, I am opposed to having errors like this crash the system, because they basically stop all progress until they are fixed, as I have stated in the past. This is just another example of this.
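For reference, this is roughly the situation the error is describing. The sketch below uses made-up names, not the actual sirocco data structures; the point is simply that the error fires when every rate out of the activated level is zero, so the exit probabilities cannot be normalised:

```c
#include <stdio.h>

/* Illustrative only, not the real matom() code: if every emission,
 * collisional and jump rate out of the activated level is zero, the
 * packet's energy has literally no route out. */
int
choose_exit_channel (double *rates, int nrates, double xi)
{
  double total = 0.0;
  double run;
  int n;

  for (n = 0; n < nrates; n++)
    total += rates[n];

  if (total <= 0.0)
    {
      /* Analogue of "matom: macro atom level has no way out". */
      fprintf (stderr, "matom: macro atom level has no way out\n");
      return (-1);              /* the caller must decide how to recover */
    }

  /* Otherwise pick an exit channel in proportion to its rate, using
   * the uniform random number xi in [0,1). */
  run = xi * total;
  for (n = 0; n < nrates; n++)
    {
      run -= rates[n];
      if (run <= 0.0)
        return (n);
    }
  return (nrates - 1);
}
```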

jhmatthews commented 5 years ago

Hi Knox. Can you provide more info? I'm confused because I thought we'd run into this problem and fixed it and that the 345 models were fundamentally all running now with minimal errors.

Regarding the general philosophy, I suppose I don't disagree, except that fatal errors like this should really really never happen. I don't think it's a case of an error that could happen sometimes in a working model, I think it has to be a fundamental numerical problem or error in code logic.

kslong commented 5 years ago

We did succeed in getting a long run of the C345 model to run, but we really only checked things on one specific run. This was a different pf file, and the problem occurred deep into the run. Indeed, the sequence was that I was trying to extend an earlier run of a CV tuned to parameters for IX Vel, using system_type=Previous in this case, which could in principle be implicated here. But I am not sure. The failure occurred deep into the ionization cycles for this run, and so it could be simply that the failures are very rare.

Because these failures take such a long time to reproduce and arise in multiprocessor mode, they are going to be very hard to debug, even though I agree with you @jhmatthews that it is important to do so.

What we need to do in such cases is to print out as much information as possible when one of these errors occurs, including the cycle and photon number, and try to let the program continue, so that perhaps we can figure out what causes them. If we can identify the cycle and photon number (and thread), we may be able to dump information about how that particular activation moves through the macroatom machinery.
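For example, something along the lines of the sketch below (hypothetical names; only the idea of tagging the message with cycle, photon number and rank comes from the suggestion above) would give us enough to re-trace a failing activation:

```c
#include <stdio.h>

/* Hypothetical helper: log enough context (cycle, photon number, MPI
 * rank, level) that a single failing activation can be followed
 * through the macro-atom machinery later, then return an error code
 * instead of exiting. */
int
report_matom_no_way_out (int cycle, int nphot, int rank, int uplvl)
{
  fprintf (stderr,
           "matom: no way out: cycle %d photon %d rank %d level %d\n",
           cycle, nphot, rank, uplvl);
  return (-1);                  /* caller kills the photon and continues */
}
```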

Ideally we would have a standard way to "escape" one of these errors, which we could use for all of the checks that call Exit currently in the matom case, including the "stuck photon" problem which also still exists.

kslong commented 5 years ago

I have made changes so that photons that cause errors in Matom no longer cause the program to crash. Instead they are given a status P_ERROR_MATOM, which effectively kills the photon off at the point the error was triggered.
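Schematically, the change works like the sketch below. Only the status name P_ERROR_MATOM is from the real change; the structure, field names and the numeric value are illustrative:

```c
/* Sketch of the escape pattern: rather than calling Exit() when the
 * macro-atom machinery gets stuck, flag the photon so the transport
 * loop can simply discard it and carry on. */
#define P_ERROR_MATOM 8         /* assumed value, for illustration only */

struct photon_stub
{
  int istat;                    /* photon status flag */
  double w;                     /* photon weight */
};

int
matom_escape_sketch (struct photon_stub *p, double total_exit_rate)
{
  if (total_exit_rate <= 0.0)
    {
      /* Old behaviour: log the error and Exit(). New behaviour: mark
       * the photon as errored and return to the caller. */
      p->istat = P_ERROR_MATOM;
      return (-1);
    }
  return (0);
}
```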

I've verified that this works in a long run with c345 macro atoms. The error log from that run is as follows:

Error summary: End of program, Thread 0 only
Recurrences -- Description
77 -- Get_atomic_data: Simple line data ignored for Macro-ion: %d
7 -- getatomic_data: line input incomplete: %s
248 -- get_atomicdata: Could not interpret line %d in file %s: %s
4 -- error_count: This error will no longer be logged: %s
1969 -- Get_atomic_data: more than one electron yield record for inner_cross %i
32 -- get_atomicdata: No inner electron yield data for inner cross section %i
1 -- get_atomicdata: macro atom but no topbase xsection! ion %i z %i istate %i, yet marked as topbase xsection!
1 -- check_grid: velocity changes by >1,000 km/s in %i cells
1 -- check_grid: some cells have large changes. Consider modifying zlog_scale or grid dims
375 -- one_ff: Bad inputs f2 %g < f1 %g returning 0.0 t_e %g
19 -- photon_checks: nphot origin nres freq freqmin freqmax
74 -- photon_checks: %id %5d %5d %10.4e %10.4e %10.4e freq out of range
649 -- The net flow out of simple ion pool (%8.4e) > than the net flow in (%8.4e) in cell %d
32 -- scatter: kpkt probability (%8.4e) < 0, zeroing
25 -- Solve_matrix: gsl_linalg_LU_solve failure (%d %.3e) for cell %i
25 -- matrix_ion_populations: bad return from solve_matrix
25 -- matrix_ion_populations: Unsolvable matrix! Determinant is zero. Defaulting to no change.
25 -- ionization_on_the_spot: nebular_concentrations failed to converge
25 -- ionization_on_the_spot: j %8.2e t_e %8.2e t_r %8.2e w %8.2e nphot %i
3 -- matom: macro atom level has no way out %d %g %g
25 -- calculate_ds: v same at both sides of cell %d
5 -- walls: distance %g<0. Position %g %g %g
3 -- trans_phot: %ld photons were lost due to DFUDGE (=%8.4e) pushing them outside of the wind after scatter
5 -- pillbox %d interfaces to pillbox is impossible

I've merged the changes into dev

kslong commented 5 years ago

This is the error summary when using only c34

Error summary: End of program, Thread 12 only
Recurrences -- Description
59 -- Get_atomic_data: Simple line data ignored for Macro-ion: %d
7 -- getatomic_data: line input incomplete: %s
248 -- get_atomicdata: Could not interpret line %d in file %s: %s
4 -- error_count: This error will no longer be logged: %s
1969 -- Get_atomic_data: more than one electron yield record for inner_cross %i
32 -- get_atomicdata: No inner electron yield data for inner cross section %i
1 -- get_atomicdata: macro atom but no topbase xsection! ion %i z %i istate %i, yet marked as topbase xsection!
1 -- get_wind_params: zdom[ndom].rmax 0 for wind type %d
1 -- wind2d: Cell %3d (%2d,%2d) in domain %d has %d corners in wind, but zero volume
1 -- check_grid: velocity changes by >1,000 km/s in %i cells
1 -- check_grid: some cells have large changes. Consider modifying zlog_scale or grid dims
29 -- photon_checks: nphot origin nres freq freqmin freqmax
104 -- photon_checks: %id %5d %5d %10.4e %10.4e %10.4e freq out of range
655 -- The net flow out of simple ion pool (%8.4e) > than the net flow in (%8.4e) in cell %d
4 -- trans_phot: %ld photons were lost due to DFUDGE (=%8.4e) pushing them outside of the wind after scatter
4 -- scatter: kpkt probability (%8.4e) < 0, zeroing
6 -- walls: distance %g<0. Position %g %g %g
3 -- pillbox %d interfaces to pillbox is impossible

kslong commented 5 years ago

@jhmatthews @ssim Did you all decide that this error

Error: get_atomicdata: macro atom but no topbase xsection! ion 8 z 6 istate 6, yet marked as topbase xsection!

which appears in the C345 model but not the C34 model, does not matter? If it does not matter, why don't we get the same error for He II? (He II is not in the C345 model, but it is in the h20_hetop_standard78 model, and I checked that the error does not appear there.)

More generally, shouldn't we try to get rid of all of the get_atomicdata errors for c345 so that at least we know the data is "clean"?

jhmatthews commented 5 years ago

I haven't decided if this topbase xsection error is a problem yet, sorry. It's definitely suspicious, I just haven't had time to look at it.

Regarding the data, yes, I agree; this occurred to me while I was looking at it, so I commented on #451, but I didn't know enough about the data errors there (atomic yields and collisions in particular). We could perhaps remove the one for "simple line data ignored", or turn it into a single error that says "Ignored N lines of simple line data for macro-atoms".
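Something like the following sketch (hypothetical names, not actual sirocco code) is what I have in mind for the summarised version: count the ignored records while reading the data and report a single total at the end, instead of one error per line.

```c
#include <stdio.h>

/* Hypothetical: tally the simple-line records skipped for macro-ions
 * during data reading and emit one summary line afterwards. */
static int n_simple_lines_ignored = 0;

void
note_simple_line_ignored (void)
{
  n_simple_lines_ignored++;     /* called wherever the record is skipped */
}

void
report_simple_lines_ignored (void)
{
  if (n_simple_lines_ignored > 0)
    printf ("Ignored %d lines of simple line data for macro-atoms\n",
            n_simple_lines_ignored);
}
```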

kslong commented 5 years ago

Currently c345 runs through all its ionization cycles, as we have been discussing. During those cycles, it produces the errors above. However, it ultimately fails in the first spectrum generation cycle because it generates too many errors of the form:

cdf_gen_from_array: all y's were zero or xmin xmax out of range of array x-- returning uniform distribution 0

I've now tracked down why this is happening. Here is the backtrace:

#0  one_ff (one=0x3e2e9030, f1=1427583333333333.5, f2=3526970588235294) at emission.c:728
#1  0x0000000000447e34 in kpkt (p=0x7fffffffd160, nres=0x7fffffffd15c, escape=0x7fffffffd10c, mode=1) at matom.c:1172
#2  0x0000000000464ec8 in macro_gov (p=0x7fffffffd160, nres=0x7fffffffd15c, matom_or_kpkt=2, which_out=0x7fffffffd158) at macro_gov.c:164
#3  0x0000000000463a08 in get_matom_f (mode=0) at photo_gen_matom.c:361
#4  0x00000000004107cc in xdefine_phot (f1=1427583333333333.5, f2=3526970588235294, ioniz_or_final=1, iwind=-1, print_mode=1) at photon_gen.c:398
#5  0x000000000040fe12 in define_phot (p=0x2aaab21af010, f1=1427583333333333.5, f2=3526970588235294, nphot_tot=50000000, ioniz_or_final=1, iwind=-1, freq_sampling=0) at photon_gen.c:113
#6  0x0000000000473035 in make_spectra (restart_stat=1) at run.c:564
#7  0x0000000000402817 in main (argc=3, argv=0x7fffffffd918) at python.c:905

The reason it is happening is that, for the cell where the error occurs, the temperature is 59 K. The ff routine in emission.c (around line 623) is hardwired to return 0 if the temperature is below 100 K, regardless of the frequency. So evidently when you try to construct a pdf for ff, you get all zeros for the pdf and hence for the cdf.

There is no corresponding limit in total_free, which seems like a problem. I am going to set a limit there and see if I can get rid of the problem.
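The kind of guard I have in mind is sketched below. The constant name, the signature and the schematic emissivity are illustrative only, not the real code in emission.c:

```c
#include <math.h>

/* Sketch: give total_free() the same temperature floor that one_ff()
 * already has, so the two routines cannot disagree about whether a
 * cell produces any ff emission. */
#define FF_TMIN 100.0           /* K; matches the floor in one_ff() */

double
total_free_sketch (double t_e, double ne, double ni, double volume)
{
  if (t_e < FF_TMIN)
    return (0.0);               /* one_ff() would return zero here anyway */

  /* ...otherwise compute the usual free-free luminosity (schematic,
   * ignoring the gaunt factor and frequency limits). */
  return (1.42e-27 * sqrt (t_e) * ne * ni * volume);
}
```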

But I wonder if this will generate another "no way out" issue for macro atoms. Comments @jhmatthews @ssim and @Higginbottom (Nick, I added you to this, since you might also have some thoughts).

kslong commented 5 years ago

I made changes on my diagnostic branch to make sure that total_free returns 0 at the same temperature where one_ff always returns 0. This did prevent the ixvel_c345 run from halting during the spectral cycles due to the cdf_gen_from_array error described above. The spectrum produced by the run looks plausible. Temperatures in the outer wind still fall to very low values.

However, many of the various errors we have been talking about still occur, as indicated in this summary:

screenshot_948

We still do see a number of the macro related errors we have been discussing (and various get_atomic_data related errors).

kslong commented 5 years ago

Stuart provided a new version of the lines file for the C345 model, which has fake rates for the metastable transitions.

CV_c10_lines-allcoupled.py.txt

which I ran as part of the C345 model. The errors for the 16 thread run are below:

screenshot_953

This gets rid of a lot of the macro-related errors we have been seeing, including the 'no way out' problem. So @ssim does this imply a problem with the CV model or with the routine? I thought that in this case the macro-atom would collisionally de-excite, in which case the energy would be transferred to the thermal pool and turn into a radiative packet by some other path.
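Schematically, what I expected was something like the sketch below (illustrative names only, not the actual sirocco routines): even if a level has no radiative route out, collisional de-excitation should hand the energy to the k-packet pool rather than leaving the packet stuck.

```c
/* Schematic of the expected behaviour for a level with no permitted
 * radiative route out (e.g. the metastable levels in the 1909
 * complex): the collisional channel passes the energy to the thermal
 * (k-packet) pool, which then escapes by some other process. */
enum exit_kind
{ EXIT_RPACKET, EXIT_KPACKET, EXIT_NONE };

enum exit_kind
deactivate_sketch (double rad_rate, double coll_rate, double xi)
{
  double total = rad_rate + coll_rate;

  if (total <= 0.0)
    return (EXIT_NONE);         /* the "no way out" situation */

  /* xi is a uniform random number in [0,1). */
  if (xi * total < rad_rate)
    return (EXIT_RPACKET);      /* emit a line photon directly */

  return (EXIT_KPACKET);        /* energy goes to the thermal pool */
}
```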

kslong commented 5 years ago

This is where the no-way-out problem occurs. It's in the CIII atom, in the CIII 1909 complex.

image

kslong commented 5 years ago

@ssim provided a new file data/atomic_macro/CIII_c10_lines-L2Coupled.py, which had finite decay probabilities for the metastable transitions in CIII for the lambda 1909 complex. The expectation was that, although this was not physical, it would eliminate the "no way out" errors. Unfortunately this was not the case. They are still there in the run with the C345 atomic data.

The errors are shown below:

screenshot_971

The parameter file that I ran is below:

stuart3.pf.txt

As they say, there is no joy in Mudville. Mighty Casey has struck out.

kslong commented 5 years ago

@ssim @jhmatthews The problems I reported yesterday regarding Stuart's updated CIII files were due to "operator error". I had not set up the master file correctly. I found this when I was testing a second modified file in which all of the f values for the metastable lines were set to 0.1.

I have not yet re-run the original updated file, but I have run this second file. This run does not produce any "no way out" errors.

I have started a new run using Stuart's original updated file, but I expect the results will be similar.

So "Might Casey did not strike out".

Here are all of the files to produce these runs (if you also use stuart3.pf above):

CIIICIVCV_c10b.txt

CIII_c10_lines-L2Coupled.py.txt CIII_c10_lines-L2Coupled.py.updated.txt

kslong commented 5 years ago

So, James asked me to retry ixvel_c345 with his new version of the code to see whether the "no way out" problem still existed. It does NOT. So something he has done avoids this problem. Here are the errors produced in this run:

screenshot_03

This sounds very positive. It's not clear to me whether this is because the problem is solved, or because the changes that James made produced a solution where the error is simply less likely to appear. The temperatures in the outer wind are higher than previously, and the model is better converged.

Below is the output of run_check.py for the new version and for the run that was generated a few weeks ago when we got back into this.

New

james.pdf

Old

ixvel_c345.pdf

jhmatthews commented 5 years ago

That's great Knox. I agree that it is still not clear whether something is going wrong in certain temperature regimes, but this looks like a decent step forward.

I do think that the changes in max_ion_stage need very careful checking, because they make some modifications to our primary ionization solver. It might be good if both you and @Higginbottom could check my code changes in branch max_ion_stage and we could run a careful series of standard tests. In principle this should make no difference whatsoever to the normal matrix ionization runs, but it would be easy to have messed it up slightly.

I think once we've done that, we can then address the free-free issue #492, if you haven't already, and the other issue I flagged recently, #498.

jhmatthews commented 5 years ago

As far as we know, this seems to have been fixed by #502 and #501 - but any recurrences of the problem should lead to this issue being reopened.