Open IanHeywood opened 6 years ago
Thanks Ian - I had started a clean run with a fresh measurement set, but I guess it's possible that previous failures have somehow carried through to this error.
@o-smirnov will likely have a better intuition for the behaviour of shared_dict.py. From my side, I suspect @IanHeywood is correct. The problem may be zombie CubiCal processes, an unfortunate side effect of interrupting CubiCal (via ctrl-c, for example) in multiprocessing mode. So it may not be the MS at fault. I would check htop for lingering jobs and kill them manually. If the error persists after killing all old CubiCal processes, @drjamesallison, could you please follow up?
Do you think there are other checks we can put in, @o-smirnov, to make the presence of the zombies more obvious to the user?
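For anyone who would rather script the manual check than eyeball htop, here is a minimal sketch. It is not part of CubiCal: the helper names (`parse_ps_output`, `find_lingering`, `list_lingering`) are hypothetical, and matching on the substring "cubical" in the command line is an assumption about how the processes show up on your system. It also assumes a POSIX `ps` is available.

```python
import os
import subprocess


def parse_ps_output(text):
    """Parse `ps -e -o pid= -o args=` output into (pid, command line) tuples."""
    procs = []
    for line in text.splitlines():
        parts = line.strip().split(None, 1)
        # Skip blank or malformed lines; keep only "<pid> <command line>".
        if len(parts) == 2 and parts[0].isdigit():
            procs.append((int(parts[0]), parts[1]))
    return procs


def find_lingering(procs, pattern="cubical", exclude_pids=()):
    """Return processes whose command line contains `pattern` (case-insensitive).

    The default pattern is an assumption, not something CubiCal guarantees.
    """
    pattern = pattern.lower()
    return [(pid, cmd) for pid, cmd in procs
            if pattern in cmd.lower() and pid not in exclude_pids]


def list_lingering(pattern="cubical"):
    """Query the running system and list candidate leftover processes."""
    out = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "args="],
        capture_output=True, text=True, check=True,
    ).stdout
    # Exclude this script itself in case its own command line matches.
    return find_lingering(parse_ps_output(out), pattern,
                          exclude_pids={os.getpid()})
```

Calling `list_lingering()` before starting a new run returns a list of `(pid, command)` pairs to inspect. Anything that really is a leftover can then be killed by hand, as suggested above. The sketch deliberately does not kill anything itself, since a substring match can also catch unrelated processes.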
Very odd. Doesn't feel like it should be zombie-related (the PID of the current process is in the pathname...), and I can see @drjamesallison was running in single-CPU mode so it's not a race condition either. If it happens again, could you post a full log please?
Thanks @JSKenyon and @o-smirnov. Indeed I was running in single-CPU mode. We were finding that processes were hanging when performing I/O for some tiles, for which single-CPU mode seemed to provide a quick fix (but that is another story). I was using CubiCal to apply a pre-existing gain table from a low-res (1k chan) MeerKAT MS to a full-spectral-resolution (32k chan) MS, and produce corrected residuals (-ar).
I'll re-run and post a full log if it happens again
@drjamesallison Please do open an issue regarding the hanging I/O - it is definitely something we would want to fix.
@o-smirnov Ah, I missed that. Not something I have hit before.
@o-smirnov and @JSKenyon I did a clean run using a fresh MS and unfortunately got the same error. Attached is the log, cheers!
@JSKenyon I'll open a separate issue with regard to the hanging I/O when running multiple processes
Ok, @o-smirnov, I am leaving this one to you. It might be related to #201, as it seems this might be an old version.
thanks guys, I will try and update my version (not sure why it was so out of date) and re-run
Just to confirm, this is inside a Singularity container?
Yes that's correct
Via James Allison, running CubiCal on an IDIA node... I'm guessing this is related to a previous run that didn't finish, but this isn't something I've seen before:
Any suggestions?
Thanks.