drjamesallison opened 6 years ago
Just to verify, is this the latest version?
gocubical version 0.9.3
I installed this locally about a week or so ago
That is very out of date. Current version is 1.2.1. I would suggest following the instructions here and installing from PyPI.
Obviously, this is just to make sure that the bug persists in the latest version. If you can confirm that it is still around, we can track it down.
Right - then yes, this could be the issue (and presumably for #200 as well).
Not sure how I managed to install such an out-of-date version; I will check my paths.
gocubical --version
gocubical version 0.9.3
however
gocubical -h | grep mad
--madmax-enable=0|1
--madmax-estimate=['corr', 'all', 'diag', 'offdiag']
--madmax-diag=0|1 Flag on on-diagonal (parallel-hand) residuals.
--madmax-offdiag=0|1
--madmax-threshold=S
--madmax-global-threshold=S
--madmax-threshold. (default: [0, 12])
--madmax-plot=['0', '1', 'show']
--madmax-plot-frac-above. (default: 1)
--madmax-plot-frac-above=PLOT-FRAC-ABOVE
Possibly related to the hokey way we have to install it on the IDIA machines:
cd ~/Software/
git clone https://github.com/ratt-ru/CubiCal.git
git clone https://github.com/ska-sa/montblanc.git
singularity shell /users/ianh/containers/katdal-k3.img
pip install --user -e montblanc/
cd CubiCal
python setup.py gocythonize
cd ../
pip install --user -e CubiCal/
export PATH=~/.local/bin/:$PATH
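To track down where a stale install is coming from, something like the following can show which `gocubical` is actually live (a sketch: it assumes the package imports as `cubical` and that the pip/PyPI name is `cubical`; adjust if your install differs):

```shell
# Which gocubical script(s) are on PATH, in order of precedence?
which -a gocubical

# Which installed copy does Python itself resolve?
python -c "import cubical; print(cubical.__file__)"

# What does pip think is installed, and where?
pip show cubical
```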
Ah ok. I think there is a version tag in an init file that hasn't been updated - that one is on me. So, back to figuring out what is going wrong. If this is running inside something like Docker/Singularity, it may be necessary to explicitly increase the shared memory sizes. I know that broke things for me at one point, but no guarantee that it is the same problem.
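For what it's worth, a quick way to check the shared-memory allocation a container actually sees (a sketch; the 2g figure below is illustrative, not a CubiCal recommendation):

```shell
# Inspect the size and usage of the shared-memory mount inside the container.
df -h /dev/shm

# Docker caps /dev/shm at 64 MB by default; it can be raised at run time:
#   docker run --shm-size=2g ...
# Singularity normally binds the host's /dev/shm, so check the host limit too.
```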
Would it be possible to get a log of a run with --dist-ncpu > 0? It is ok if the log is incomplete - I would just like to see some of the output at the top of the log. Unfortunately, without a reproducible example on my end, I am really only guessing.
Sure, I'll give that a go.
I ran it again with ncpus = 4. It apparently hung around 15:24 and I killed the process with a keyboard interrupt at 18:15. Note that I used a Singularity container.
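As an aside, rather than killing a hung run blind, the standard library's `faulthandler` can dump every thread's stack on demand, which would show exactly where the I/O thread is stuck. This is a generic Python sketch, not something CubiCal does itself:

```python
# Register a handler so that sending SIGUSR1 to the process, i.e.
#   kill -USR1 <pid>
# dumps the traceback of every thread to stderr without killing the run.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)
```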
Hmmm - looking at that it looks like it may be stuck on a MODEL_DATA read. Out of interest, if you attempt to read MODEL_DATA using pyrap/python-casacore, does it load it happily?
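The suggested check could look something like this (a sketch: `my.ms` is a placeholder path, and python-casacore must be installed; reading a limited slice first helps separate "column is unreadable" from "column is just slow to load"):

```python
# Try reading a small slice of MODEL_DATA via python-casacore.
try:
    from casacore.tables import table
except ImportError:
    table = None  # python-casacore not available in this environment

def read_model_data_slice(ms_path, nrow=1000):
    """Read the first nrow rows of MODEL_DATA; return the array shape,
    or None if python-casacore is not installed."""
    if table is None:
        return None
    t = table(ms_path)
    try:
        data = t.getcol("MODEL_DATA", startrow=0, nrow=nrow)
        return data.shape
    finally:
        t.close()

if __name__ == "__main__":
    import os
    if os.path.isdir("my.ms"):  # placeholder: point this at the real MS
        print(read_model_data_slice("my.ms"))
```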
The other possibility I see is that the first chunk never signals its completion. Will try to test this on my own data tomorrow.
Unfortunately, I am not a Singularity user so I don't have any instinct for whether or not it could be the problem. I guess I would suggest (if it is possible) running CubiCal on a box that doesn't require Singularity. The MS doesn't look too unwieldy - would just appreciate ruling that out.
Wish I could be of more immediate help.
Ok, I have run this parset on some of my own VLA data. I didn't use Singularity, but all settings were otherwise identical. I didn't manage to reproduce the hanging in the I/O thread. Whilst this is by no means an exhaustive test, it makes it more likely that the problem stems from the particular measurement set or the environment.
In fact, is there a different data set which you could try? Would be interested to see if it hangs for all input.
Of course, I did set --dist-ncpu to 4.
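For reference, the equivalent parset fragment would be something like this (a sketch; CubiCal parsets are INI-style, with sections matching the option groups, so `--dist-ncpu 4` on the command line corresponds to):

```ini
[dist]
ncpu = 4
```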
@JSKenyon thanks for the update and for running that test.
I will investigate further using different MSs and environments and report back to this thread.
Any more experience with this problem?
When running CubiCal on an IDIA node with the attached parset (res_corr.parset), the process hangs for some tiles during I/O when --dist-ncpu > 0.
res_corr.parset.txt