olebole / python-cpl

Python bindings for CPL recipes
http://pypi.python.org/pypi/python-cpl
GNU General Public License v2.0

IOError: Recipe crashed #8

Closed · saimn closed this 7 years ago

saimn commented 8 years ago

Hi Ole, I'm getting "Recipe crashed" errors while running scipost:

Traceback (most recent call last):
  File "/home/conseil/lib/doit/doit/action.py", line 387, in execute
    returned_value = self.py_callable(*self.args, **kwargs)
  File "/home/conseil/lib/mydrs/mydrs/reduction/tasks.py", line 106, in action
    result = self.run(exp['name'], exp, **kwargs)
  File "/home/conseil/lib/mydrs/mydrs/reduction/tasks.py", line 469, in run
    std_telluric=exp['std_telluric'], **runpar)
  File "/home/conseil/lib/mydrs/mydrs/recipe.py", line 135, in run_drs
    results = self._run_drs(flist, *args, **kwargs)
  File "/home/conseil/lib/mydrs/mydrs/science.py", line 359, in _run_drs
    results = self._recipe(raw=raw)
  File "/home/conseil/miniconda2/lib/python2.7/site-packages/cpl/recipe.py", line 446, in __call__
    input_len, logger, output_format, delete, mtrace)
  File "/home/conseil/miniconda2/lib/python2.7/site-packages/cpl/recipe.py", line 459, in _exec
    self.memory_dump, mtrace),
IOError: Recipe crashed

It seems to be an issue with the output files (I use output_dir=None to get an HDUList object). I think the exception is raised from https://github.com/olebole/python-cpl/blob/bdbcd6afaaffc4c7882f05c2a93805b1f9105031/cpl/CPL_recipe.c#L907. How can I get more details on what is happening?

olebole commented 8 years ago

"Recipe crashed" appears if the recipe disappears (without creating a stacktrace) before it could report the success state back to Python, for example by calling exit() by itself. This is something that usually should not happen, so it may indicate a bug in python-cpl.

This should be independent of output_dir, since even with output_dir=None, a temporary directory is created (and deleted after the HDULists are created).

Could you create a small example (e.g. a pixtable shortened to a few entries) that allows me to reproduce the problem?

saimn commented 8 years ago

Ok, thanks. I'm trying to make a minimal example to reproduce it, but then it works... I've been getting this error sporadically for a few days, but I haven't found a way to isolate the issue. What is weird is that the DRS reports that it has saved the FITS files, and I can see the files in the temporary directory before the exception is raised.

olebole commented 8 years ago

That information already helps a bit to narrow down where it may happen. Which pipeline (MUSE) version do you use?

saimn commented 8 years ago

It's the latest SVN version.

saimn commented 8 years ago

I tried to re-run the two exposures that were failing, and it worked. But now it crashes one step later, in another scipost run (with OFFSET_LIST this time), and with a little more information:

Traceback (most recent call last):
  File "/home/conseil/lib/doit/doit/action.py", line 387, in execute
    returned_value = self.py_callable(*self.args, **kwargs)
  File "/home/conseil/lib/mydrs/mydrs/reduction/tasks.py", line 106, in action
    result = self.run(exp['name'], exp, **kwargs)
  File "/home/conseil/lib/mydrs/mydrs/reduction/tasks.py", line 469, in run
    std_telluric=exp['std_telluric'], **runpar)
  File "/home/conseil/lib/mydrs/mydrs/recipe.py", line 135, in run_drs
    results = self._run_drs(flist, *args, **kwargs)
  File "/home/conseil/lib/mydrs/mydrs/science.py", line 359, in _run_drs
    results = self._recipe(raw=raw)
  File "/home/conseil/miniconda2/lib/python2.7/site-packages/cpl/recipe.py", line 446, in __call__
    input_len, logger, output_format, delete, mtrace)
  File "/home/conseil/miniconda2/lib/python2.7/site-packages/cpl/recipe.py", line 459, in _exec
    self.memory_dump, mtrace),
IOError: Recipe crashed

[35871] WARNING - cpl.muse_scipost.muse_postproc_qc_fwhm: No sources found for FWHM measurement down to 3.0 sigma limit on plane 1841, QC parameters will not contain useful information
[35871] ERROR - cpl.muse_scipost.muse_image_save: Could not save header: File read/write error: "couldn't create the named file" from CFITSIO (ver. 3.37) ffinit()=105. filename='IMAGE_FOV_0001.fits', mode=2
[35871] WARNING - cpl.muse_scipost.muse_processing_prepare_property: Property ESO QC SCIPOST FWHM NVALID (Number of detected sources with valid FWHM in output cube.) not used
[35871] WARNING - cpl.muse_scipost.muse_processing_prepare_property: Property ESO QC SCIPOST FWHM MEDIAN ([arcsec] Median FWHM of all sources with valid FWHM measurement (in x- and y-direction) in output cube. If less than three sources with valid FWHM are detected, this value is zero.) not used
[35871] WARNING - cpl.muse_scipost.muse_processing_prepare_property: Property ESO QC SCIPOST FWHM MAD ([arcsec] Median absolute deviation of the FWHM of all sources with valid FWHM measurement (in x- and y-direction) in output cube. If less than three sources with valid FWHM are detected, this value is zero.) not used

[ ERROR ] [tid=000] could not update the product header
[35871] ERROR - cpl.muse_scipost.muse_cplerrorstate_dump_some: Dumping the 4 most recent error(s) out of a total of 4 errors:
[35871] ERROR - cpl.muse_scipost.muse_cplerrorstate_dump_some:   [1/4] 'File read/write error: "couldn't create the named file" from CFITSIO (ver. 3.37) ffinit()=105. filename='IMAGE_FOV_0001.fits', mode=2' (5) at cpl_propertylist_save:cpl_propertylist.c:7260
[35871] ERROR - cpl.muse_scipost.muse_cplerrorstate_dump_some:   [2/4] 'File read/write error: "couldn't create the named file" from CFITSIO (ver. 3.37) ffinit()=105. filename='DATACUBE_FINAL.fits', mode=2' (5) at cpl_propertylist_save:cpl_propertylist.c:7260
[35871] ERROR - cpl.muse_scipost.muse_cplerrorstate_dump_some:   [3/4] 'File read/write error: "could not open the named file" from CFITSIO (ver. 3.37) ffdkopn()=104. filename='DATACUBE_FINAL.fits', type=65536, mode=4' (5) at cpl_image_save_:cpl_image_io.c:2239
[35871] ERROR - cpl.muse_scipost.muse_cplerrorstate_dump_some:   [4/4] 'File read/write error' (5) at cpl_imagelist_save:cpl_imagelist_io.c:691
[35871] ERROR - cpl.muse_scipost.CPL_recipe_exec: could not update the product header

olebole commented 8 years ago

Just a trivial guess: do you have enough space on the disk for the full result set to fit? If not, you can use the temp_dir attribute of the recipe to specify a directory on a partition with sufficient space. In any case, you should then get a bit more information about the recipe (at least the exit value).

saimn commented 8 years ago

Yep, no space issue, and temp_dir is already set to use /tmp (which uses tmpfs). I also tried another location for temp_dir and got the same crash. Where should I find the exit value?

olebole commented 8 years ago

IMO this is actually two-fold. First, there is a problem within the MUSE pipeline that produces the logged errors. They will clearly make the recipe fail, and I have no idea why they happen. Did you recently update the cfitsio version? What OS do you use? But then, python-cpl should complain about a non-zero return value of the recipe and not just report a crash. This is, however, still mysterious. I suspect that somehow the recipe exits silently, so that python-cpl cannot find out more. You could try two things:

  1. Use output_dir instead of temp_dir. This keeps the written files for later investigation.
  2. Try with esorex. If this fails as well, there must be some problem in the pipeline (or cfitsio).

olebole commented 8 years ago

One point about /tmp: usually, the tmpfs on /tmp is quite small, a few gigabytes at most. Are you sure that the output files do not fill up your temp space completely? Can you post the output of df -h /tmp/? And the whole set of parameters you use for muse_scipost (to see what output is actually going to be created)?

saimn commented 8 years ago

Our /tmp has the size of the RAM, so there is no space issue here:

❯ df -h /tmp
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           512G   40G  473G   8% /tmp

I have tried both setting output_dir to a string and setting delete=False in python-cpl's code to keep results in the temp_dir, and in both cases the result files are created in the temp_dir and seem valid before the recipe crashes.

❯ l /tmp/conseil/muse_scipost-*
/tmp/conseil/muse_scipost-bUHP5s:
total 2.9G
-rw-rw-r-- 1 conseil 2.9G Sep 16 09:48 DATACUBE_FINAL.fits
-rw-rw-r-- 1 conseil 518K Sep 16 09:48 IMAGE_FOV_0001.fits
-rw-rw-r-- 1 conseil 341K Sep 16 09:48 SKY_CONTINUUM_0001.fits
-rw-rw-r-- 1 conseil 608K Sep 16 09:48 SKY_LINES_0001.fits
-rw-rw-r-- 1 conseil 484K Sep 16 09:48 SKY_MASK_0001.fits
-rw-rw-r-- 1 conseil 512K Sep 16 09:48 SKY_SPECTRUM_0001.fits

/tmp/conseil/muse_scipost-e1wECN:
total 12G
-rw-rw-r-- 1 conseil 2.9G Sep 15 17:40 DATACUBE_FINAL.fits
-rw-rw-r-- 1 conseil 507K Sep 15 17:40 IMAGE_FOV_0001.fits
-rw-rw-r-- 1 conseil 8.5G Sep 15 17:40 PIXTABLE_POSITIONED_0001.fits

(This is for two different scipost runs.)

saimn commented 8 years ago

Sorry, forgot to answer this:

Did you recently update the cfitsio version? What OS do you use?

OS is Debian 8.5, with cfitsio 3.37 from the DRS install on our server (version 1.2).

olebole commented 8 years ago

At the moment, I don't see anything obvious that could lead to such behaviour. The last CPL error indeed comes from CPL_recipe.c, after the recipe itself was executed. However, the lines after that look good to me: they serialize the information and then send it back to the parent via a pipe. The parent process reads the serialized object and first checks whether the number of bytes is as expected. If not, it creates the mentioned IOError. So the problem can only be that the process disappeared somewhere in between, and I don't see how that can happen. I am still suspicious about where the CFITSIO error "couldn't create the named file" comes from; this looks like the beginning of the mystery to me.
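
As a sketch of that parent-side check (the real code serializes more than a length-prefixed blob, so this is only the rough idea):

    /* Rough idea of the parent-side read: the child announces the size of
     * the serialized result, then sends the payload; a short read means
     * the child disappeared and leads to "IOError: Recipe crashed". */
    #include <stdlib.h>
    #include <unistd.h>

    static void *read_result(int fd, size_t *len) {
        if (read(fd, len, sizeof *len) != (ssize_t) sizeof *len)
            return NULL;                  /* child died before reporting */
        void *buf = malloc(*len);
        if (buf && read(fd, buf, *len) != (ssize_t) *len) {
            free(buf);                    /* fewer bytes than announced */
            buf = NULL;
        }
        return buf;
    }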

saimn commented 8 years ago

I tried to put in some prints to see what happens: when the recipe crashes, the first read returns -1 before the child process writes to the pipe. Printing errno in this case gives an "Interrupted system call" message.

This SO page gives a few answers that seem related (and maybe solutions?), but I'm not sure I understand everything there ;). And I don't see why this would happen only for a few specific exposures.
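
Roughly, the debug check I added looks like this (a simplified sketch, not the actual code):

    /* Simplified sketch of the debug output around the failing read. */
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static ssize_t checked_read(int fd, void *buf, size_t n) {
        ssize_t ret = read(fd, buf, n);
        if (ret < 0)
            /* When the recipe "crashes", this prints
             * "read: Interrupted system call", i.e. errno == EINTR. */
            fprintf(stderr, "read: %s (errno=%d)\n", strerror(errno), errno);
        return ret;
    }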

olebole commented 8 years ago

The communication in python-cpl does not use signals, so there is no reason why a signal would be raised. A signal would rather suggest that the child died before. In my opinion, the major problem is the CFITSIO errors before that. They should not happen there at all, and I suspect that something has already changed in the child at that point (which prevents the final write).

saimn commented 7 years ago

Hmm, I don't know, but from what I can see, the child process is still running when the read returns -1. And the main process is blocked on waitpid (checked by attaching gdb to the process). I saw the cfitsio errors only once, and I wonder if they could be a consequence of the same kind of error, i.e. if the waitpid call is interrupted, the main process cleans up the temp directory, and then the child process fails to save the FITS files.

olebole commented 7 years ago

I still see no reason why a signal could occur here either. However, if you succeeded in attaching gdb: couldn't you attach it to the child process as well and see what happens? I would do it myself, but I have no example where this problem appears.

saimn commented 7 years ago

Hi Ole, I went back to this and realized that the crashes occur when I resize my terminal or tmux windows, so it must have something to do with SIGWINCH. It also seems related to the latest Python 2.7 version (2.7.12): if I downgrade to 2.7.11, I cannot reproduce it, which explains why I didn't notice this issue before. The best solution seems to be to do the read in a loop, check for EINTR, and retry if needed (see the sketch below). I could also try to just ignore SIGWINCH, but the loop seems a more robust solution. I haven't checked the behavior on Python 3 yet.
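
Something like this is what I have in mind (a sketch only; the same retry pattern would apply to waitpid):

    /* Sketch of an EINTR-safe read loop: restart the read if a signal
     * (e.g. SIGWINCH) interrupted it before any data arrived. */
    #include <errno.h>
    #include <unistd.h>

    static ssize_t read_retry(int fd, void *buf, size_t n) {
        ssize_t ret;
        do {
            ret = read(fd, buf, n);
        } while (ret < 0 && errno == EINTR);
        return ret;
    }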

olebole commented 7 years ago

The easiest solution would probably be to ignore at least this signal, or all not-so-important signals (which may be a bit dangerous).

olebole commented 7 years ago

I guess it is the fix for this Python issue that causes the problem, so I would pragmatically just do:

    signal(SIGWINCH, SIG_IGN);

There is no use case in which a recipe should react to this at all...
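
If one wants to be careful, the previous disposition can be saved and restored afterwards; a minimal sketch (the placement around the recipe call is an assumption, not the committed fix):

    /* Sketch: ignore SIGWINCH while the recipe runs and the result is
     * read back, then restore whatever handler was installed before. */
    #include <signal.h>

    static void with_sigwinch_ignored(void (*run_recipe)(void)) {
        void (*old)(int) = signal(SIGWINCH, SIG_IGN);
        run_recipe();            /* fork/exec the recipe, read the pipe */
        signal(SIGWINCH, old);
    }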

saimn commented 7 years ago

Ah yes, I missed this issue in the Python NEWS file. Glad that we finally found the root of the issue! I was testing with a loop to ignore EINTR, which works, but your solution works as well.

olebole commented 7 years ago

Well, if you have a pull request ready, I am happy to include it... :-)
