Closed: saimn closed this issue 7 years ago
"Recipe crashed" appears if the recipe disappears (without creating a stacktrace) before it could report the success state back to Python, for example by calling exit()
by itself. This is something that usually should not happen, so it may indicate a bug in python-cpl.
This should be independent of `output_dir`, since even with `output_dir=None` a temporary directory is created (and deleted after the `HDUList`s are created).
Could you create a small example (e.g. a pixtable shortened to a few entries) that allows reproducing the problem?
OK, thanks. I'm trying to make a minimal example to reproduce it, but then it works... I've been getting this error sporadically for a few days, but I haven't found a way to isolate the issue. What is weird is that the DRS reports that it has saved the FITS files, and I can see the files in the temporary directory before the exception is raised.
That information already helps a bit to narrow down where it may happen. Which pipeline (muse) version do you use?
It's the latest SVN version.
I tried to run the 2 exposures that were failing again, and it worked. But now it crashes one step later, in another scipost run (with OFFSET_LIST this time), and with a little more information:
Traceback (most recent call last):
File "/home/conseil/lib/doit/doit/action.py", line 387, in execute
returned_value = self.py_callable(*self.args, **kwargs)
File "/home/conseil/lib/mydrs/mydrs/reduction/tasks.py", line 106, in action
result = self.run(exp['name'], exp, **kwargs)
File "/home/conseil/lib/mydrs/mydrs/reduction/tasks.py", line 469, in run
std_telluric=exp['std_telluric'], **runpar)
File "/home/conseil/lib/mydrs/mydrs/recipe.py", line 135, in run_drs
results = self._run_drs(flist, *args, **kwargs)
File "/home/conseil/lib/mydrs/mydrs/science.py", line 359, in _run_drs
results = self._recipe(raw=raw)
File "/home/conseil/miniconda2/lib/python2.7/site-packages/cpl/recipe.py", line 446, in __call__
input_len, logger, output_format, delete, mtrace)
File "/home/conseil/miniconda2/lib/python2.7/site-packages/cpl/recipe.py", line 459, in _exec
self.memory_dump, mtrace),
IOError: Recipe crashed
[35871] WARNING - cpl.muse_scipost.muse_postproc_qc_fwhm: No sources found for FWHM measurement down to 3.0 sigma limit on plane 1841, QC parameters will not contain useful information
[35871] ERROR - cpl.muse_scipost.muse_image_save: Could not save header: File read/write error: "couldn't create the named file" from CFITSIO (ver. 3.37) ffinit()=105. filename='IMAGE_FOV_0001.fits', mode=2
[35871] WARNING - cpl.muse_scipost.muse_processing_prepare_property: Property ESO QC SCIPOST FWHM NVALID (Number of detected sources with valid FWHM in output cube.) not used
[35871] WARNING - cpl.muse_scipost.muse_processing_prepare_property: Property ESO QC SCIPOST FWHM MEDIAN ([arcsec] Median FWHM of all sources with valid FWHM measurement (in x- and y-direction) in output cube. If less than three sources with valid FWHM are detected, this value is zero.) not used
[35871] WARNING - cpl.muse_scipost.muse_processing_prepare_property: Property ESO QC SCIPOST FWHM MAD ([arcsec] Median absolute deviation of the FWHM of all sources with valid FWHM measurement (in x- and y-direction) in output cube. If less than three sources with valid FWHM are detected, this value is zero.) not used
[ ERROR ] [tid=000] could not update the product header
[35871] ERROR - cpl.muse_scipost.muse_cplerrorstate_dump_some: Dumping the 4 most recent error(s) out of a total of 4 errors:
[35871] ERROR - cpl.muse_scipost.muse_cplerrorstate_dump_some: [1/4] 'File read/write error: "couldn't create the named file" from CFITSIO (ver. 3.37) ffinit()=105. filename='IMAGE_FOV_0001.fits', mode=2' (5) at cpl_propertylist_save:cpl_propertylist.c:7260
[35871] ERROR - cpl.muse_scipost.muse_cplerrorstate_dump_some: [2/4] 'File read/write error: "couldn't create the named file" from CFITSIO (ver. 3.37) ffinit()=105. filename='DATACUBE_FINAL.fits', mode=2' (5) at cpl_propertylist_save:cpl_propertylist.c:7260
[35871] ERROR - cpl.muse_scipost.muse_cplerrorstate_dump_some: [3/4] 'File read/write error: "could not open the named file" from CFITSIO (ver. 3.37) ffdkopn()=104. filename='DATACUBE_FINAL.fits', type=65536, mode=4' (5) at cpl_image_save_:cpl_image_io.c:2239
[35871] ERROR - cpl.muse_scipost.muse_cplerrorstate_dump_some: [4/4] 'File read/write error' (5) at cpl_imagelist_save:cpl_imagelist_io.c:691
[35871] ERROR - cpl.muse_scipost.CPL_recipe_exec: could not update the product header
Just a trivial guess: do you have plenty of space on the disk, so that the full result set fits? If not, you may use the `temp_dir` attribute of the recipe to specify a directory on a partition with sufficient space. In any case, it should then report a bit more information about the recipe (at least the exit value).
Yep, no space issue, and `temp_dir` is already set to use `/tmp` (which uses tmpfs). I also tried another location for `temp_dir` and got the same crash.
Where should I find the exit value?
IMO this is actually two-fold. First, there is a problem within the MUSE pipeline that produces the logged errors. They clearly let the recipe fail, and I have no idea why they happen. Did you recently update the cfitsio version? What OS do you use? But then, python-cpl should complain about a non-zero return value of the recipe and not just report a crash. This is, however, still mysterious. I would suspect that somehow the recipe silently exits, so that python-cpl cannot find out more. You could try two things:
- Use `output_dir` instead of `temp_dir`. This keeps the written files for later investigation.
- One point about `/tmp`: usually, the tmpfs on `/tmp` is quite small, some gigabytes at most. Are you sure that the output files do not fill up your temp space completely? Can you print the output of `df -h /tmp/`? And the whole set of parameters you use for `muse_scipost` (to see what output is actually going to be created)?
Our `/tmp` has the size of the RAM, so no space issue here:
❯ df -h /tmp
Filesystem Size Used Avail Use% Mounted on
tmpfs 512G 40G 473G 8% /tmp
I have tried both setting `output_dir` to a string, and setting `delete=False` in python-cpl's code to keep results in the `temp_dir`, and it works: the result files are created in the `temp_dir` and seem valid before the recipe crashes.
❯ l /tmp/conseil/muse_scipost-*
/tmp/conseil/muse_scipost-bUHP5s:
total 2.9G
-rw-rw-r-- 1 conseil 2.9G Sep 16 09:48 DATACUBE_FINAL.fits
-rw-rw-r-- 1 conseil 518K Sep 16 09:48 IMAGE_FOV_0001.fits
-rw-rw-r-- 1 conseil 341K Sep 16 09:48 SKY_CONTINUUM_0001.fits
-rw-rw-r-- 1 conseil 608K Sep 16 09:48 SKY_LINES_0001.fits
-rw-rw-r-- 1 conseil 484K Sep 16 09:48 SKY_MASK_0001.fits
-rw-rw-r-- 1 conseil 512K Sep 16 09:48 SKY_SPECTRUM_0001.fits
/tmp/conseil/muse_scipost-e1wECN:
total 12G
-rw-rw-r-- 1 conseil 2.9G Sep 15 17:40 DATACUBE_FINAL.fits
-rw-rw-r-- 1 conseil 507K Sep 15 17:40 IMAGE_FOV_0001.fits
-rw-rw-r-- 1 conseil 8.5G Sep 15 17:40 PIXTABLE_POSITIONED_0001.fits
(This is for two different scipost runs.)
Sorry, forgot to answer this:
Did you recently update the cfitsio version? What OS do you use?
OS is Debian 8.5, cfitsio 3.37 from the DRS install on our server (version 1.2).
At the moment, I don't see anything obvious that could lead to such behaviour. The last CPL error indeed comes from `CPL_recipe.c`, after the recipe itself was executed. However, the lines after that look good to me: they serialize the information and then send it back to the parent via a pipe. The parent process reads the serialized object, and first checks whether the number of bytes is as expected. If not, it creates the mentioned IOError. So the problem could only be that the process disappeared somewhere in between, and I don't see how that can happen.
I am still suspicious about where the CFITSIO error `couldn't create the named file` comes from; this looks like the beginning of the mystery to me.
I tried to put in some prints to see what happens: when the recipe crashes, the first `read` returns `-1` before the child process writes to the pipe. Printing `errno` in this case gives an `Interrupted system call` message.
This SO page gives a few answers that seem related (and maybe solutions?), but I'm not sure I understand everything here ;). And I don't see why this would happen specifically for a few exposures.
The communication in python-cpl does not use signals, so there is no reason why a signal would be raised. A signal would rather mean that the child died before. In my opinion, the major problem is the CFITSIO errors before. They should not happen there at all, and I suspect that already at that time something has changed in the child (which prevents the final write).
Hmm, I don't know, but from what I can see, when the `read` returns -1 the child process is still running, and the main process is blocked in `waitpid` (checked by attaching gdb to the process).
I saw the cfitsio errors only once, and I wonder if they could be a consequence of the same kind of error, i.e. if the `waitpid` call is interrupted, the main process cleans the temp directory, and then the child process fails to save the FITS files.
I still see no reason why a signal should occur here. However, if you succeeded in attaching gdb: couldn't you attach it to the child process as well and see what happens? I would do it myself; however, I have no example where this problem appears.
Hi Ole,
I went back to this, and realized that the crashes occur when I resize my terminal or tmux windows. So it must have something to do with `SIGWINCH`. It also seems related to the latest Python 2.7 version (2.7.12): if I downgrade to 2.7.11 I cannot reproduce it, which explains why I didn't notice this issue before.
The best solution seems to be to do the read in a loop, check for `EINTR`, and retry if needed. I could also try to just ignore `SIGWINCH`, but the loop seems like a more robust solution.
I haven't checked the behavior on Python 3 yet.
The easiest solution would probably be to ignore at least this signal, or all not-so-important signals (which may be a bit dangerous).
I guess it is the fix for this Python issue that causes the problem, so I would pragmatically just do a
signal(SIGWINCH, SIG_IGN);
There is no use case in which a recipe should react to this at all...
Ah yes, I missed this issue in the Python NEWS file. Glad that we finally found the root of the issue!
I was testing with a loop to ignore `EINTR`, which works, but your solution works as well.
Well, if you have a pull request ready, I am happy to include it... :-)
Hi Ole, I'm getting "Recipe crashed" errors while running scipost :
It seems to be an issue with the output files (I use
output_dir=None
to get aHDUList
object). The exception is raised from https://github.com/olebole/python-cpl/blob/bdbcd6afaaffc4c7882f05c2a93805b1f9105031/cpl/CPL_recipe.c#L907 I think, how can I get more details on what is happening ?