ratt-ru / CubiCal

A fast radio interferometric calibration suite.
GNU General Public License v2.0

Hanging processes during I/O when running in multiprocess mode #201

Open drjamesallison opened 6 years ago

drjamesallison commented 6 years ago

When running CubiCal on an IDIA node with the following (res_corr.parset attached):

gocubical res_corr.parset --data-ms=1524597185_1284.full_pol_full_specJ1458p0416.ms --g-solvable=0 --g-xfer-from=1524597185_1284.full_polJ1458p0416.ms/G-field:0-ddid:None.parmdb

The process hangs for some tiles during I/O when --dist-ncpu > 0.

res_corr.parset.txt

JSKenyon commented 6 years ago

Just to verify, is this the latest version?

drjamesallison commented 6 years ago

gocubical version 0.9.3

I installed this locally about a week or so ago

JSKenyon commented 6 years ago

That is very out of date. Current version is 1.2.1. I would suggest following the instructions here and installing from PyPI.

JSKenyon commented 6 years ago

Obviously, this is just to make sure that the bug persists in the latest version. If you can confirm that it is still around, we can track it down.

drjamesallison commented 6 years ago

Right - then yes, this could be the issue (presumably for #200 as well).

Not sure how I managed to install such an out-of-date version; I will check my paths.
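
One quick way to check which installation is actually being picked up is sketched below (an illustrative snippet, not part of the original thread; it assumes the package imports as cubical, and the version attribute is guarded with getattr because it may not be present in all releases):

import cubical

# Show where the package is being imported from and what version it reports.
# If this path points somewhere unexpected, the wrong installation is on the path.
print(cubical.__file__)
print(getattr(cubical, "__version__", "version attribute not found"))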

IanHeywood commented 6 years ago

gocubical --version
gocubical version 0.9.3

however

gocubical -h | grep mad
    --madmax-enable=0|1
    --madmax-estimate=['corr', 'all', 'diag', 'offdiag']
    --madmax-diag=0|1   Flag on on-diagonal (parallel-hand) residuals.
    --madmax-offdiag=0|1
    --madmax-threshold=S
    --madmax-global-threshold=S
                        --madmax-threshold. (default: [0, 12])
    --madmax-plot=['0', '1', 'show']
                        --madmax-plot-frac-above. (default: 1)
    --madmax-plot-frac-above=PLOT-FRAC-ABOVE

Possibly related to the hokey way we have to install it on the IDIA machines:

cd ~/Software/
git clone https://github.com/ratt-ru/CubiCal.git
git clone https://github.com/ska-sa/montblanc.git
singularity shell /users/ianh/containers/katdal-k3.img
pip install --user -e montblanc/
cd CubiCal
python setup.py gocythonize
cd ../
pip install --user -e CubiCal/
export PATH=~/.local/bin/:$PATH

JSKenyon commented 6 years ago

Ah ok. I think that there is a version tag in an init file that hasn't been updated - that one is on me. So back to figuring out what is going wrong. If this is running inside something like docker/singularity, it may be necessary to explicitly up the shared memory sizes. I know that broke things for me at one point, but no guarantee that it is the same problem.
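
Since shared memory came up: a quick way to check how much shared memory is visible inside the container is sketched below (an illustrative snippet, not from the thread; it assumes a Linux host where /dev/shm is the tmpfs backing shared memory, which is the usual arrangement):

import os

# Report the size of /dev/shm; if it is very small inside the container,
# large shared arrays used by the worker processes can run into trouble.
st = os.statvfs("/dev/shm")
gib = 2.0 ** 30
print("Shared memory: %.1f GiB free of %.1f GiB"
      % (st.f_bavail * st.f_frsize / gib, st.f_blocks * st.f_frsize / gib))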

JSKenyon commented 6 years ago

If possible, could you send a log of a run with --dist-ncpu > 0? It is ok if the log is incomplete - I would just like to see the output, particularly the stuff at the top of the log. Unfortunately, without a reproducible example on my end, I am really only guessing.

drjamesallison commented 6 years ago

Sure, I'll give that a go.

drjamesallison commented 6 years ago

I ran it again with --dist-ncpu = 4. It apparently hung around 15:24 and I killed the process with a keyboard interrupt at 18:15. Note that I used a Singularity container.

pcal.log

JSKenyon commented 6 years ago

Hmmm - looking at that, it seems it may be stuck on a MODEL_DATA read. Out of interest, if you attempt to read MODEL_DATA using pyrap/python-casacore, does it load happily?

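A minimal python-casacore check along these lines is sketched below (illustrative only, not from the thread; the MS name is the one from the original report, and only a slab of rows is read to keep it quick):

from casacore.tables import table

# Open the MS read-only and try to pull a slab of MODEL_DATA into memory.
# If this hangs or raises, the problem likely lies with the column itself
# rather than with CubiCal.
ms = table("1524597185_1284.full_pol_full_specJ1458p0416.ms", readonly=True)
model = ms.getcol("MODEL_DATA", startrow=0, nrow=5000)
print(model.shape, model.dtype)
ms.close()
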
The other alternative I see is that the first chunk never signals its completion. Will try to test this on my own data tomorrow.

Unfortunately, I am not a Singularity user so I don't have any instinct for whether or not it could be the problem. I guess I would suggest (if it is possible) running CubiCal on a box that doesn't require Singularity. The MS doesn't look too unwieldy - would just appreciate ruling that out.

Wish I could be of more instantaneous help.

JSKenyon commented 6 years ago

Ok, have run this parset on some of my own VLA data. I didn't use singularity but all settings were otherwise identical. I didn't (EDITED) manage to reproduce the hanging in the I/O thread. Whilst this is by no means an exhaustive test, it makes it more likely that the problem stems from the particular measurement set or the environment.

In fact, is there a different data set which you could try? Would be interested to see if it hangs for all input.

JSKenyon commented 6 years ago

Of course, I did set --dist-ncpu to 4.

drjamesallison commented 6 years ago

@JSKenyon thanks for the update and for running that test.

I will investigate further using different MSs and environments, and report back to this thread.

o-smirnov commented 6 years ago

Any more experience with this problem?