lofar-astron / factor

Facet calibration for LOFAR
http://www.astron.nl/citt/facet-doc
GNU General Public License v2.0
19 stars 12 forks source link

smooth_amps_spline failed #209

Closed botteon closed 7 years ago

botteon commented 7 years ago

My factor run stops at smooth_amp1. Here the final part of the log:

2017-04-28 00:37:41 ERROR   facetselfcal_facet_patch_392.executable_args: Remote process python /opt2/lofar/lib/python2.7/site-packages//lofarpipe/recipes/nodes/python_plugin.py ['/data/botteon/a1758_p42/factor_runs/factor1/results/facetselfcal/facet_patch_392/L230425_SB284_uv.dppp.pre-cal_1248F9070t_129MHz.pre-cal_chunk11_1248FF796t_3g.merge_amp_parmdbs1', '/opt2/factor/factor/scripts/smooth_amps_spline.py', ['/data/botteon/a1758_p42/factor_runs/factor1/results/facetselfcal/facet_patch_392/L230425_SB284_uv.dppp.pre-cal_1248F9070t_129MHz.pre-cal_chunk11_1248FF796t_3g.merge_amp_parmdbs1', '/data/botteon/a1758_p42/factor_runs/factor1/results/facetselfcal/facet_patch_392/L230425_SB284_uv.dppp.pre-cal_1248F9070t_129MHz.pre-cal_chunk11_1248FF796t_3g.smooth_amp1'], {}, '/data/botteon/a1758_p42/factor_runs/factor1/results/facetselfcal/facet_patch_392', False, {'args_format_option_argument': '=', 'args_format_option': '-', 'args_format': 'gnu', 'args_formatlongoption': '--', 'args_format_argument': ''}, {'LOFARDATAROOT': '/opt/lofar/data', 'OMP_NUM_THREADS': '8', 'PYTHONPATH': '/opt2/jinja2/lib/python2.7/site-packages/:/opt2/matplotlib/lib/python2.7/site-packages/:/opt2/aplpy/:/opt2/LSMTool/:/opt2/shapely/lib/python2.7/site-packages/:/opt2/astropy/lib/python2.7/site-packages/:/opt2/lofar/lib/python2.7/site-packages/:/opt2/factor/:/opt2/wcsaxes/lib/python2.7/site-packages/:/opt2/RMextract/:/opt2/python-casacore/python-casacore/::/opt2/lofar/lib/python2.7/site-packages:/opt2/jinja2/lib/python2.7/site-packages/:/opt2/matplotlib/lib/python2.7/site-packages/:/opt2/aplpy/:/opt2/LSMTool/:/opt2/shapely/lib/python2.7/site-packages/:/opt2/astropy/lib/python2.7/site-packages/:/opt2/lofar/lib/python2.7/site-packages/:/opt2/factor/:/opt2/wcsaxes/lib/python2.7/site-packages/:/opt2/RMextract/:/opt2/python-casacore/python-casacore/::/opt2/python-casacore/lib/python2.7/site-packages/:/opt2/python-casacore/lib/python2.7/site-packages/', 'LOFARROOT': '/opt2/lofar', 'PATH': ':/opt2/lofar/sbin:/opt2/lofar/bin::/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/homes/botteon/bin:/opt2/casacore/bin:/opt2/prefactor/bin/:/opt2/factor/bin/:/opt2/astropy/bin/:/opt2/LSMTool/bin/:/iranet/arcsoft/casapy/casapy-release-4.7.0-1-el7/bin/:/opt2/wsclean/bin:/opt2/UDR-0.9.4/src:/opt2/casacore/bin:/opt2/prefactor/bin/:/opt2/factor/bin/:/opt2/astropy/bin/:/opt2/LSMTool/bin/:/iranet/arcsoft/casapy/casapy-release-4.7.0-1-el7/bin/:/opt2/wsclean/bin:/opt2/UDR-0.9.4/src', 'LD_LIBRARY_PATH': ':::/opt2/lofar/lib:/opt2/casacore/lib:/opt2/cfitsio/lib:/opt2/wsclean/lib'}] failed on localhost
2017-04-28 00:37:41 DEBUG   facetselfcal_facet_patch_392.executable_args: compute.dispatch results job 0: job_duration: 1.31924510002, returncode: 1 
2017-04-28 00:37:43 DEBUG   facetselfcal_facet_patch_392.executable_args: Adding node_logging_information
2017-04-28 00:37:43 ERROR   facetselfcal_facet_patch_392.executable_args: A job has failed with returncode 1 and error_tolerance is not set. Bailing out!
2017-04-28 00:37:43 WARNING facetselfcal_facet_patch_392.executable_args: Note: recipe outputs are not complete
2017-04-28 00:37:43 WARNING facetselfcal_facet_patch_392.executable_args: recipe executable_args completed with errors
2017-04-28 00:37:43 WARNING facetselfcal_facet_patch_392: smooth_amps_spline reports failure (using executable_args recipe)
2017-04-28 00:37:43 ERROR   facetselfcal_facet_patch_392: *******************************************
2017-04-28 00:37:43 ERROR   facetselfcal_facet_patch_392: Failed pipeline run: facet_patch_392
2017-04-28 00:37:43 ERROR   facetselfcal_facet_patch_392: Detailed exception information:
2017-04-28 00:37:43 ERROR   facetselfcal_facet_patch_392: <class 'lofarpipe.support.lofarexceptions.PipelineRecipeFailed'>
2017-04-28 00:37:43 ERROR   facetselfcal_facet_patch_392: smooth_amps_spline failed
2017-04-28 00:37:43 ERROR   facetselfcal_facet_patch_392: *******************************************
2017-04-28 00:37:43 ERROR   facetselfcal_facet_patch_392: LOFAR Pipeline finished unsuccesfully.
2017-04-28 00:37:43 WARNING facetselfcal_facet_patch_392: recipe facetselfcal_facet_patch_392 completed with errors

Differently from #142, the *.merge_amp_parmdbs1 file is correctly displayed by parmdbplot.py. No *.smooth_amp1 is present in the facet directory.

soumyajitmandal commented 7 years ago

can you reproduce the error outside factor? check the command it used while running smooth_amps_spline from the factor log and try to run smooth_amps_spline.py; it should be a bit more specific about the error.

AHorneffer commented 7 years ago

Not part of the Factor documentation, but still applies here: Getting help with errors

botteon commented 7 years ago

Outside factor the script works (I source the same software). The error seems to occur randomly. I tried to run several times the same parset, one time it occurred in the fist facet, for another run it occurred in the third. Factor can process successfully 2-3 facets in a row and then fails at smooth_amp step in the latest facet. If I resubmit the run from where it crashed it seems to work. Apparently something is causing instability in the software (?).

botteon commented 7 years ago

Earlier in the facet log file:


2017-05-01 22:13:05 DEBUG   facetselfcal_facet_patch_111.executable_args: Request for job 0 from ('127.0.0.1', 33834)
2017-05-01 22:13:05 DEBUG   node.lofar.python_plugin: infile            = /data/botteon/a1758_p42/factor_runs/factor1/results/facetselfcal/facet_patch_111/L230425_SB284_uv.dppp.pre-cal_1248F9070t_129MHz.pre-cal_chunk11_1248FF796t_3g.merge_amp_parmdbs1
2017-05-01 22:13:05 DEBUG   node.lofar.python_plugin: executable        = /opt2/factor/factor/scripts/smooth_amps_spline.py
2017-05-01 22:13:05 DEBUG   node.lofar.python_plugin: working directory = /data/botteon/a1758_p42/factor_runs/factor1/results/facetselfcal/facet_patch_111
2017-05-01 22:13:05 DEBUG   node.lofar.python_plugin: arguments         = ['/data/botteon/a1758_p42/factor_runs/factor1/results/facetselfcal/facet_patch_111/L230425_SB284_uv.dppp.pre-cal_1248F9070t_129MHz.pre-cal_chunk11_1248FF796t_3g.merge_amp_parmdbs1', '/data/botteon/a1758_p42/factor_runs/factor1/results/facetselfcal/facet_patch_111/L230425_SB284_uv.dppp.pre-cal_1248F9070t_129MHz.pre-cal_chunk11_1248FF796t_3g.smooth_amp1']
2017-05-01 22:13:05 DEBUG   node.lofar.python_plugin: arg dictionary    = {}
2017-05-01 22:13:05 DEBUG   node.lofar.python_plugin: environment       = {'LOFARDATAROOT': '/opt2/lofar/data/', 'OMP_NUM_THREADS': '8', 'PYTHONPATH': '/opt2/jinja2/lib/python2.7/site-packages/:/opt2/matplotlib/lib/python2.7/site-packages/:/opt2/aplpy/:/opt2/LSMTool/:/opt2/shapely/lib/python2.7/site-packages/:/opt2/astropy/lib/python2.7/site-packages/:/opt2/lofar/lib/python2.7/site-packages/:/opt2/factor/:/opt2/wcsaxes/lib/python2.7/site-packages/:/opt2/RMextract/:/opt2/python-casacore/python-casacore/::/opt2/lofar/lib/python2.7/site-packages:/opt2/jinja2/lib/python2.7/site-packages/:/opt2/matplotlib/lib/python2.7/site-packages/:/opt2/aplpy/:/opt2/LSMTool/:/opt2/shapely/lib/python2.7/site-packages/:/opt2/astropy/lib/python2.7/site-packages/:/opt2/lofar/lib/python2.7/site-packages/:/opt2/factor/:/opt2/wcsaxes/lib/python2.7/site-packages/:/opt2/RMextract/:/opt2/python-casacore/python-casacore/::/opt2/python-casacore/lib/python2.7/site-packages/:/opt2/python-casacore/lib/python2.7/site-packages/', 'LOFARROOT': '/opt2/lofar', 'PATH': ':/opt2/lofar/sbin:/opt2/lofar/bin:/iranet/arcsoft/ds9/ds9-7.7.5/bin::/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/homes/botteon/bin:/opt2/casacore/bin:/opt2/prefactor/bin/:/opt2/factor/bin/:/opt2/astropy/bin/:/opt2/LSMTool/bin/:/iranet/arcsoft/casapy/casapy-release-4.7.0-1-el7/bin/:/opt2/wsclean/bin:/opt2/UDR-0.9.4/src:/opt2/casacore/bin:/opt2/prefactor/bin/:/opt2/factor/bin/:/opt2/astropy/bin/:/opt2/LSMTool/bin/:/iranet/arcsoft/casapy/casapy-release-4.7.0-1-el7/bin/:/opt2/wsclean/bin:/opt2/UDR-0.9.4/src', 'LD_LIBRARY_PATH': ':::/opt2/lofar/lib:/opt2/casacore/lib:/opt2/cfitsio/lib:/opt2/wsclean/lib'}
2017-05-01 22:13:05 INFO    node.lofar.python_plugin: Processing /data/botteon/a1758_p42/factor_runs/factor1/results/facetselfcal/facet_patch_111/L230425_SB284_uv.dppp.pre-cal_1248F9070t_129MHz.pre-cal_chunk11_1248FF796t_3g.merge_amp_parmdbs1
2017-05-01 22:13:06 WARNING facetselfcal_facet_patch_111.executable_args: /opt2/lofar/lib/python2.7/site-packages/lofar/parmdb/__init__.py:22: RuntimeWarning: to-Python converter for std::vector<double, std::allocator<double> > already registered; second conversion method ignored.
2017-05-01 22:13:06 WARNING facetselfcal_facet_patch_111.executable_args:   from _parmdb import ParmDB
2017-05-01 22:13:06 WARNING facetselfcal_facet_patch_111.executable_args: /opt2/lofar/lib/python2.7/site-packages/lofar/parmdb/__init__.py:22: RuntimeWarning: to-Python converter for std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > already registered; second conversion method ignored.
2017-05-01 22:13:06 WARNING facetselfcal_facet_patch_111.executable_args:   from _parmdb import ParmDB
2017-05-01 22:13:06 WARNING facetselfcal_facet_patch_111.executable_args: /usr/lib/python2.7/dist-packages/gtk-2.0/gtk/__init__.py:57: GtkWarning: could not open display
2017-05-01 22:13:06 WARNING facetselfcal_facet_patch_111.executable_args:   warnings.warn(str(e), _gtk.Warning)
2017-05-01 22:13:06 WARNING facetselfcal_facet_patch_111.executable_args: /opt2/matplotlib/lib/python2.7/site-packages/matplotlib/backends/backend_gtk.py:58: GtkWarning: IA__gdk_cursor_new_for_display: assertion 'GDK_IS_DISPLAY (display)' failed
2017-05-01 22:13:06 WARNING facetselfcal_facet_patch_111.executable_args:   cursors.MOVE          : gdk.Cursor(gdk.FLEUR),
2017-05-01 22:13:06 DEBUG   facetselfcal_facet_patch_111.executable_args: Results for job 0 submitted by ('127.0.0.1', 33842)
2017-05-01 22:13:06 ERROR   node.lofar.python_plugin: could not create GdkCursor object
2017-05-01 22:13:06 INFO    node.lofar.python_plugin: Total time 0.3072s; user time: 0.2680s; system time: 0.0440s
2017-05-01 22:13:06 DEBUG   node.lofar.python_plugin: Start time was 1493669585.9925s; end time was 1493669586.2997s
2017-05-01 22:13:07 DEBUG   facetselfcal_facet_patch_111.executable_args: 
2017-05-01 22:13:07 WARNING facetselfcal_facet_patch_111.executable_args: 
2017-05-01 22:13:07 INFO    facetselfcal_facet_patch_111.executable_args: Subprocess completed with exit status 1: /bin/sh -c python /opt2/lofar/lib/python2.7/site-packages//lofarpipe/recipes/nodes/python_plugin.py 0 127.0.1.1 48552

Also, I should be worried about the following lines in the "main" factor log?

/opt2/lofar/lib/python2.7/site-packages/lofarpipe/support/utilities.pyc : Using default subprocess module!
/opt2/lofar/lib/python2.7/site-packages/lofar/parmdb/__init__.py:22: RuntimeWarning: to-Python converter for std::vector<double, std::allocator<double> > alre
ady registered; second conversion method ignored.
  from _parmdb import ParmDB
/opt2/lofar/lib/python2.7/site-packages/lofar/parmdb/__init__.py:22: RuntimeWarning: to-Python converter for std::vector<std::__cxx11::basic_string<char, std:
:char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > already registe
red; second conversion method ignored.
  from _parmdb import ParmDB
INFO - factor:parset - Reading parset file: /data/botteon/a1758_p42/parsets/factor1.parset
[...]
INFO - factor:facet_patch_574 - Using facet clean-mask region file /data/botteon/a1758_p42/masks/p42_masks/574.rgn
Trying `CDLL(libgeos_c.so.1)`
Library path: 'libgeos_c.so.1'
DLL: <CDLL 'libgeos_c.so.1', handle 35c6940 at 7fd8ec9b8dd0>
Trying `CDLL(libc.so.6)`
Library path: 'libc.so.6'
DLL: <CDLL 'libc.so.6', handle 7fd8ffd9a000 at 7fd8ec9b8e10>
INFO - factor:directions - Adjusting facets to avoid sources...
[...]

Thanks.

darafferty commented 7 years ago

This error is likely due to the DISPLAY environment variable not being set. I've updated the script to hopefully avoid this, as it's needed only for debugging purposes.

botteon commented 7 years ago

After updating Factor to the latest version I've processed three facets in a row without errors (and still running). Since this never happened before, I assume that the issue is solved. Thanks.

botteon commented 7 years ago

I shouldn't have talked. It crashed with the following error:

2017-05-03 18:02:26 INFO    facetselfcal_facet_patch_837: Beginning step copy_model_data2
2017-05-03 18:02:26 ERROR   facetselfcal_facet_patch_837: *******************************************
2017-05-03 18:02:26 ERROR   facetselfcal_facet_patch_837: Failed pipeline run: facet_patch_837
2017-05-03 18:02:26 ERROR   facetselfcal_facet_patch_837: Detailed exception information:
2017-05-03 18:02:26 ERROR   facetselfcal_facet_patch_837: <class 'ConfigParser.NoSectionError'>
2017-05-03 18:02:26 ERROR   facetselfcal_facet_patch_837: No section: ''
2017-05-03 18:02:26 ERROR   facetselfcal_facet_patch_837: *******************************************
2017-05-03 18:02:26 ERROR   facetselfcal_facet_patch_837: LOFAR Pipeline finished unsuccesfully.
2017-05-03 18:02:26 WARNING facetselfcal_facet_patch_837: recipe facetselfcal_facet_patch_837 completed with errors

and

ERROR - factor:scheduler - Operation facetselfcal failed due to an error (direction: facet_patch_837)
ERROR - factor:scheduler - Caught an (Keyboard-)Interrupt, stopping all pipelines.
ERROR - factor:scheduler - One or more operations failed due to an error. Exiting...
darafferty commented 7 years ago

Ah -- this problem was fixed yesterday with commit 9f763ec.