lofar-astron / factor

Facet calibration for LOFAR
http://www.astron.nl/citt/facet-doc
GNU General Public License v2.0

WSclean: Too many segfaults #193

Closed duyhoang-astro closed 7 years ago

duyhoang-astro commented 7 years ago

I am having trouble with wsclean during the selfcal loops. The log says that too many segfaults were found when imaging a facet calibrator (it always stops at image 31). This happens with a full-bandwidth run on one field, but not with a 4-band run on another field. Does anyone know why that is and how to fix it?

Here is part of the log.

Reading mask '/home/hoang/para/h112/factor/results/facetselfcal/facet_patch_888/target_121MHz.pre-cal_chunk10_1294E892Et_0g.premask_selfcal'...

2017-02-23 00:51:51 WARNING node.lofar1.strw.leidenuniv.nl.executable_args.target_121MHz.pre-cal_chunk10_1294E88FEt_5g.mssort_into_Groups_apply_output]: /net/lofar1/data1/oonk/rh7_wsclean_jan2017_2_0/wsclean-2.0/build/wsclean process segfaulted!
2017-02-23 00:51:51 ERROR node.lofar1.strw.leidenuniv.nl.executable_args.target_121MHz.pre-cal_chunk10_1294E88FEt_5g.mssort_into_Groups_apply_output]: Too many segfaults from /net/lofar1/data1/oonk/rh7_wsclean_jan2017_2_0/wsclean-2.0/build/wsclean; aborted
...
2017-02-23 00:51:53 DEBUG facetselfcal_facet_patch_888.executable_args: compute.dispatch results job 0: job_duration: 180.460530043, returncode: 1
2017-02-23 00:51:54 DEBUG facetselfcal_facet_patch_888.executable_args: Adding node_logging_information
2017-02-23 00:51:54 ERROR facetselfcal_facet_patch_888.executable_args: A job has failed with returncode 1 and error_tolerance is not set. Bailing out!
2017-02-23 00:51:54 WARNING facetselfcal_facet_patch_888.executable_args: Note: recipe outputs are not complete
2017-02-23 00:51:54 WARNING facetselfcal_facet_patch_888.executable_args: recipe executable_args completed with errors
2017-02-23 00:51:54 WARNING facetselfcal_facet_patch_888: wsclean reports failure (using executable_args recipe)
2017-02-23 00:51:54 ERROR facetselfcal_facet_patch_888:
2017-02-23 00:51:54 ERROR facetselfcal_facet_patch_888: Failed pipeline run: facet_patch_888
2017-02-23 00:51:54 ERROR facetselfcal_facet_patch_888: Detailed exception information:
2017-02-23 00:51:54 ERROR facetselfcal_facet_patch_888: <class 'lofarpipe.support.lofarexceptions.PipelineRecipeFailed'>
2017-02-23 00:51:54 ERROR facetselfcal_facet_patch_888: wsclean failed
2017-02-23 00:51:54 ERROR facetselfcal_facet_patch_888:
2017-02-23 00:51:54 ERROR facetselfcal_facet_patch_888: LOFAR Pipeline finished unsuccesfully.
2017-02-23 00:51:55 WARNING facetselfcal_facet_patch_888: recipe facetselfcal_facet_patch_888 completed with errors

Thanks, Duy

AHorneffer commented 7 years ago

Reading mask '/home/hoang/para/h112/factor/results/facetselfcal/facet_patch_888/target_121MHz.pre-cal_chunk10_1294E892Et_0g.premask_selfcal'...

Did you check if that file exists and is a usable mask?

duyhoang-astro commented 7 years ago

Yes, I checked the mask, and it seems to be fine (image attached).

twshimwell commented 7 years ago

Maybe worth having a try running the wsclean command outside of factor? I've seen this before in aoflagger, but I can't recall what caused it or what fixed it. It's not running out of memory or struggling to write to disk or anything?

AHorneffer commented 7 years ago

Maybe worth having a try running the wsclean command outside of factor?

That would also be my suggestion!

If that doesn't help, then put the full logfile somewhere where we can have a look at it. Maybe we can find something in there.
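One way to follow the suggestion above is to copy the wsclean command line from the Factor log and run it through a small wrapper that reports how the process ended. The sketch below is only an illustration: `run_and_diagnose` is a hypothetical helper name, and it relies on the POSIX behaviour of Python's `subprocess` module, where a negative return code means the child was killed by a signal (a segfault shows up as SIGSEGV).

```python
import signal
import subprocess
import sys


def run_and_diagnose(cmd):
    """Run cmd (a list of arguments) and return (returncode, description)."""
    result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    code = result.returncode
    if code < 0:
        # On POSIX, a negative return code means the child died from a signal.
        try:
            name = signal.Signals(-code).name
        except ValueError:
            name = "signal %d" % -code
        return code, "killed by signal " + name
    if code == 0:
        return code, "exited normally"
    return code, "exited with error code %d" % code


if __name__ == "__main__" and len(sys.argv) > 1:
    # Everything after the script name is treated as the command to run,
    # for example the wsclean invocation copied from the pipeline log.
    _, description = run_and_diagnose(sys.argv[1:])
    print(description)
```

Running the command this way, outside the pipeline, separates a crash in the imager itself from a problem in how the pipeline launches it.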

aroffringa commented 7 years ago

I'm not sure this is the issue, but I've had several people report WSClean segfaults lately when they use an odd image size. It was due to a technical issue in which some FFTW versions are not able to work on data that is not 16-byte aligned, which can occur with odd image sizes. I've now avoided this on trunk, so updating WSClean will at least fix that segfault. If this imaging run uses an odd image size, that could be the issue.

If you are using odd image sizes, it would be better to avoid them anyway: a standard FFT uses frequencies -n/2, ..., -1, 0, ..., n/2-1, where '0' is thus the centre pixel. If you make the image size odd, the centre pixel is not well defined, and I'm actually not sure all software handles that well, including WSClean. Apart from that, it's also faster to use nicely factorizable image sizes, so even sizes make sense. Bottom line: I recommend using only even sizes, both for the trimmed size and for the full size.
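The frequency layout described above can be written out for a small even size. This is a minimal sketch in plain Python, not code from WSClean; `fft_pixel_frequencies` is a hypothetical helper, and it deliberately rejects odd sizes because the convention for those differs between packages.

```python
def fft_pixel_frequencies(n):
    """Centred frequency index of each pixel along an axis of even length n.

    Returns -n/2, ..., -1, 0, ..., n/2 - 1, so the zero frequency sits
    at pixel index n/2.
    """
    if n <= 0 or n % 2 != 0:
        raise ValueError("this sketch only covers positive, even sizes")
    return [i - n // 2 for i in range(n)]


freqs = fft_pixel_frequencies(8)
print(freqs)           # [-4, -3, -2, -1, 0, 1, 2, 3]
print(freqs.index(0))  # 4, i.e. pixel n/2 is the centre
```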

AHorneffer commented 7 years ago

@aroffringa: Factor makes sure that the size of an image is even and has only prime factors smaller than or equal to 7. So this should not be a problem here.
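The size rule stated above (even, with no prime factor larger than 7) can be checked with a few lines of Python. This is an illustrative sketch, not Factor's actual implementation; both function names are hypothetical.

```python
def has_only_small_factors(n, primes=(2, 3, 5, 7)):
    """True if n is a positive integer whose prime factors are all in primes."""
    if n < 1:
        return False
    for p in primes:
        while n % p == 0:
            n //= p
    return n == 1


def next_valid_image_size(min_size):
    """Smallest even size >= min_size with no prime factor above 7."""
    size = max(2, min_size)
    if size % 2:
        size += 1  # move to the next even number
    while not has_only_small_factors(size):
        size += 2  # stay on even numbers
    return size


print(next_valid_image_size(1023))  # 1024 = 2**10
print(next_valid_image_size(1025))  # 1050 = 2 * 3 * 5**2 * 7
```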

aroffringa commented 7 years ago

Ok, thanks, wrong guess ;). Then I've no idea why it would segfault....

duyhoang-astro commented 7 years ago

It seems that something is wrong with two of my measurement sets: one in the middle and one at the end of the bandwidth. When I remove these measurement sets from the ms input directory, I don't see the error at the wsclean step. These two bands are 42% and 48% flagged.
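The process of elimination described above can be automated when the failing step can be run on one measurement set at a time, which is an assumption and may not hold for every pipeline step. The sketch below is generic Python: `find_failing_inputs` and the `run_step` callable are hypothetical, and `run_step` stands in for whatever command is being tested.

```python
def find_failing_inputs(inputs, run_step):
    """Return the inputs for which run_step(item) returns False or raises.

    run_step is supplied by the caller; it should return True on success.
    """
    failing = []
    for item in inputs:
        try:
            ok = bool(run_step(item))
        except Exception:
            ok = False
        if not ok:
            failing.append(item)
    return failing


if __name__ == "__main__":
    # Stand-in step for demonstration only: pretend two inputs are bad.
    bad = {"band_b.ms", "band_d.ms"}
    names = ["band_a.ms", "band_b.ms", "band_c.ms", "band_d.ms"]
    print(find_failing_inputs(names, lambda name: name not in bad))
```

If the crash only appears when several measurement sets are imaged together, testing them one by one will not reproduce it, and removing them from the full set in turn is the safer approach.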