ratt-ru / QuartiCal

CubiCal, but with greater power.
MIT License

Segmentation fault (core dumped) #293

Closed: AKHughes1994 closed this issue 10 months ago

AKHughes1994 commented 10 months ago

Hi,

I'm in the process of switching from CC to QC. Right now, I'm trying to match past CC self-calibration to check that I'm getting similar image fidelity/improvement.

I'm getting repeated segmentation faults that kill my QC runs. Oddly, they seem to be stochastic: sometimes the command executes fully, but most of the time the run is killed. I wouldn't expect QC to have problems with this setup, as I could run the equivalent in CC (on the same machine) without issues (the only difference is that I've changed the f-slope solver to delay_and_offset).

I'm not exactly sure what information/documents would help debug this issue, but if you let me know what you need to reproduce the error, I can provide it.

(Virtual) machine specs: 64 GB, 8 cores

Data: S-band VLA data (i.e., 2 x 8 SPW basebands, each with 512 -- 2 MHz -- channels)

JSKenyon commented 10 months ago

Hi @AKHughes1994! Sorry that you seem to have run into a bug. If it seems stochastic, it may be thread-safety related. Could you please share both your log file and your QuartiCal config file/command line?

AKHughes1994 commented 10 months ago

Hi @JSKenyon, I've attached the log file and the .yaml file.

The command I run is:

goquartical ../quartical_parsets/DI_bb.yaml input_ms.path=ms.ms input_ms.select_ddids=[8,9,10,11,12,13,14,15] input_ms.freq_chunk=512 K.freq_interval=512

DI_bb.txt 20230829_194411.log.qc.txt

JSKenyon commented 10 months ago

OK, I can reproduce this on an arbitrary dataset, which suggests it is a bug in the code and not some peculiarity of the data. I will drill down and find it.

JSKenyon commented 10 months ago

I believe I have found the problem. Could you please unset output.subtract_directions? Please let me know if that works for you, as it seems to resolve the segfaults (caused by an out-of-bounds access) for me.

Thanks for the bug report - I will put in a check to ensure this doesn't trouble anyone else.

Edit: Just to clarify, the problem is that the corrected residual code is attempting to subtract direction 1 which doesn't actually exist in this case. This leads to an out-of-bounds access which may or may not cause a segfault. The solution is to check that all values in output.subtract_directions correspond to real directions. This can be done in the dask layer of the residual computation.
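For anyone reading later, here is a minimal sketch of the kind of check described in the edit above. It is not QuartiCal's actual code: the function name `validate_subtract_directions`, the `n_dir` argument, and the `(row, chan, dir, corr)` model layout are all assumptions made for illustration. The idea is simply to reject any requested direction index that does not exist before the residual code ever indexes the model.

```python
import numpy as np


def validate_subtract_directions(subtract_directions, n_dir):
    """Return the requested directions as a list, or raise if any are invalid.

    Hypothetical helper: `n_dir` stands for the length of the direction
    axis of the model visibilities.
    """
    requested = [] if subtract_directions is None else list(subtract_directions)
    # A direction index is only valid if it lies in [0, n_dir).
    invalid = [d for d in requested if not 0 <= d < n_dir]
    if invalid:
        raise ValueError(
            f"output.subtract_directions contains {invalid}, but the model "
            f"only has {n_dir} direction(s) (valid indices: 0 to {n_dir - 1})."
        )
    return requested


# Direction-independent case: the model has a single direction (index 0).
# The axis ordering here is an assumption for the sake of the example.
model = np.zeros((10, 4, 1, 4), dtype=np.complex128)  # (row, chan, dir, corr)
n_dir = model.shape[2]

# Requesting direction 0 is fine.
print(validate_subtract_directions([0], n_dir))

# Requesting direction 1 (as in a leftover DD config) is rejected up front
# with a readable error instead of reading past the end of the array.
try:
    validate_subtract_directions([1], n_dir)
except ValueError as err:
    print(f"Rejected: {err}")
```

Failing early like this turns an intermittent crash into a deterministic, readable error, which is the behaviour the planned check is meant to provide.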

AKHughes1994 commented 10 months ago

Ahhhhh!

I modified a DD yaml file into a DI yaml file and absolutely should have caught that issue. Apologies.

Thanks for finding it, Andrew