ratt-ru / CubiCal

A fast radio interferometric calibration suite.
GNU General Public License v2.0

Cubical Montblanc error #409

Open joesbright opened 4 years ago

joesbright commented 4 years ago

Hi,

I'm running into an issue with cubical when trying to run a G/dE correction. I have a model in my model column and a tigger sky-model (converted from pyBDSM) of the problem source. I've attached the logs below. The error seems to be from montblanc, but I can't immediately see what is causing the issue.

Appreciate any information on this you could share.

Thanks, Joe

cubical_log.txt

JSKenyon commented 4 years ago

Hi @joesbright. If it is possible, would you mind uploading your sky model file? I want to check that it isn't something specific to the model.

A second possibility: looking at the log, I see that the MS has two spectral windows with a single channel in each. I have never run CubiCal on this particular case, so the problem could stem from that.

joesbright commented 4 years ago

Hi @JSKenyon,

See the sky model below. The data are old VLA data, hence the single channel per SPW.

Thanks for the quick response!

bright_source.lsm.html.zip

JSKenyon commented 4 years ago

Great! I will take a look in the morning.

joesbright commented 4 years ago

Just a quick update: when I combine the SPWs with mstransform and rerun, I get the same 'future warnings' as in the previous log file, but then simply get an 'illegal instruction (core dumped)' error. This is similar to what @IanHeywood saw in issue #238, but I am not running on IDIA.

Thanks again, Joe

bennahugo commented 4 years ago

It means one of the upstream dependencies has been compiled with -march set for a specific machine architecture, so its shared objects contain instructions that are illegal on your CPU. I think we have to run it with gdb to find out which one.
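
Before reaching for gdb, one way to narrow down the culprit is to import each compiled package in its own child interpreter and see which one is killed by SIGILL. The following is a minimal sketch, not something posted in the thread: `probe_code`, `probe_import`, `find_crashing` and the `CANDIDATES` list are hypothetical names, and the candidates are only a guess based on the packages mentioned in this issue. It assumes a POSIX system, and a crash that happens after import time (for example inside a compute kernel) would not be caught this way.

```python
import signal
import subprocess
import sys

# Assumed import names of compiled dependencies worth checking (a guess).
CANDIDATES = ["numpy", "scipy", "numba", "casacore", "tensorflow", "montblanc"]


def probe_code(code):
    """Run `code` in a child interpreter.

    Returns the name of the signal that killed the child (e.g. "SIGILL"),
    or None if the child exited on its own, even with an ordinary error.
    """
    proc = subprocess.run(
        # -X faulthandler makes the child dump a Python traceback on fatal signals.
        [sys.executable, "-X", "faulthandler", "-c", code],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
    )
    if proc.returncode < 0:  # on POSIX, a negative code means "killed by signal"
        return signal.Signals(-proc.returncode).name
    return None


def probe_import(module):
    """Check whether merely importing `module` triggers a fatal signal."""
    return probe_code(f"import {module}")


def find_crashing(modules=CANDIDATES):
    """Return {module: signal_name} for every module whose import was fatal."""
    results = {name: probe_import(name) for name in modules}
    return {name: sig for name, sig in results.items() if sig is not None}
```

Running `find_crashing()` on the affected node would list any package whose import alone dies with a signal. An empty result only means the illegal instruction is hit later, in which case gdb remains the better tool.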

--

Benjamin Hugo

PhD. student, Centre for Radio Astronomy Techniques and Technologies Department of Physics and Electronics Rhodes University

Junior software developer Radio Astronomy Research Group South African Radio Astronomy Observatory Black River Business Park Observatory Cape Town

JSKenyon commented 4 years ago

I cannot seem to reproduce this error. I have tried creating a 2 band MS with a single channel (using simms) in each and predicting using your sky model. I am not quite sure how to help further unless you can share the data? I understand if that is impossible though. I am running in a python3 (3.6.9) virtualenv with a fresh install of all python dependencies, ignoring cached installs. In theory everything should be up-to-date.

JSKenyon commented 4 years ago

Out of interest, where are you running this? Is it on a local laptop/desktop? Or is it on a server somewhere?

bennahugo commented 4 years ago

Can you post a pip freeze for reference?

JSKenyon commented 4 years ago
pip freeze

absl-py==0.10.0
astLib==0.11.4
astor==0.8.1
astro-kittens==1.4.3
astro-tigger-lsm==1.6.0
astropy==4.0.1.post1
attrdict==2.0.1
attrs==20.2.0
backcall==0.2.0
bleach==1.5.0
Cerberus==1.3.2
configparser==5.0.0
-e git+https://github.com/ratt-ru/CubiCal.git@2dd52a0df04ef506bf89cf9cd6d4e5889bcddc3d#egg=cubical
cycler==0.10.0
decorator==4.4.2
funcsigs==1.0.2
future==0.18.2
gast==0.4.0
grpcio==1.32.0
html5lib==0.9999999
hypercube==0.3.4
importlib-metadata==2.0.0
ipython==7.16.1
ipython-genutils==0.2.0
jedi==0.17.2
kiwisolver==1.2.0
llvmlite==0.34.0
Markdown==3.2.2
matplotlib==2.2.5
montblanc @ git+https://github.com/ska-sa/montblanc.git@547008faa46d5798f682d9d00597351a67f1915e
nose==1.3.7
numba==0.51.2
numpy==1.19.2
parso==0.7.1
pexpect==4.8.0
pickleshare==0.7.5
pkg-resources==0.0.0
prompt-toolkit==3.0.7
protobuf==3.13.0
psutil==5.7.2
ptyprocess==0.6.0
Pygments==2.7.1
pyparsing==2.4.7
python-casacore==3.3.1
python-dateutil==2.8.1
pytz==2020.1
ruamel.yaml==0.16.12
ruamel.yaml.clib==0.2.2
scipy==1.5.2
SharedArray @ git+https://gitlab.com/bennahugo/shared-array.git@dc90bd2855ddcb7c1bbc473d24a1f42c60436be0
six==1.15.0
tabulate==0.8.7
tensorboard==1.8.0
tensorflow==1.8.0
termcolor==1.1.0
traitlets==4.3.3
wcwidth==0.2.5
Werkzeug==1.0.1
zipp==3.2.0

joesbright commented 4 years ago

This is running on a server with multiple nodes. I successfully ran on one of the newer nodes via a singularity shell, but I've also attached a pip freeze from the older node where I was having the issues I mentioned previously. Thanks @JSKenyon and @bennahugo for the help.

WARNING: Could not generate requirement for distribution .wh.pip 20.0.2 (/usr/local/lib/python3.6/dist-packages): Parse error at "'.wh.pip='": Expected W:(abcd...)
absl-py==0.9.0
asn1crypto==0.24.0
astLib==0.11.4
astor==0.8.1
astro-kittens==1.4.3
astro-tigger-lsm==1.6.0
astropy==4.0.1.post1
attrdict==2.0.1
attrs==19.3.0
bleach==1.5.0
Cerberus==1.3.2
configparser==5.0.0
corner==2.0.1
cryptography==2.1.4
cubical @ git+https://github.com/ratt-ru/CubiCal.git@e49aff3975d8cc91a29a707c2d60036ad883bc8e
cycler==0.10.0
emcee==3.0.2
funcsigs==1.0.2
future==0.18.2
gast==0.3.3
grpcio==1.29.0
html5lib==0.9999999
hypercube==0.3.4
idna==2.6
importlib-metadata==1.6.0
keyring==10.6.0
keyrings.alt==3.0
kiwisolver==1.2.0
llvmlite==0.32.1
Markdown==3.2.2
matplotlib==3.2.1
montblanc @ git+https://github.com/ska-sa/montblanc.git@547008faa46d5798f682d9d00597351a67f1915e
nose==1.3.7
numba==0.49.1
numpy==1.18.4
pandas==1.0.3
protobuf==3.12.0
psutil==5.7.0
pycrypto==2.6.1
pygobject==3.26.1
pyparsing==2.4.7
python-apt==1.6.4
python-casacore==3.3.1
python-dateutil==2.8.1
pytz==2020.1
pyxdg==0.25
PyYAML==5.3
ruamel.yaml==0.16.10
ruamel.yaml.clib==0.2.0
scipy==1.4.1
SecretStorage==2.3.1
SharedArray @ git+https://gitlab.com/bennahugo/shared-array.git@dc90bd2855ddcb7c1bbc473d24a1f42c60436be0
six==1.14.0
tabulate==0.8.7
tensorboard==1.8.0
tensorflow==1.8.0
termcolor==1.1.0
unattended-upgrades==0.1
Werkzeug==1.0.1
zipp==3.1.0
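
With two freezes in the thread, a quick way to see where the environments differ is to parse both and compare them. The sketch below is illustrative only: `parse_freeze` and `diff_freezes` are hypothetical helper names, and the two strings are short excerpts of the freezes posted in this issue rather than the full lists.

```python
def parse_freeze(text):
    """Map lower-cased package name -> version (or VCS source) from `pip freeze` output."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("WARNING", "#")):
            continue  # skip blanks, pip warnings and comments
        if line.startswith("-e "):
            # Editable install: the name follows '#egg='.
            source = line[3:]
            packages[source.split("#egg=")[-1].lower()] = source
        elif " @ " in line:
            # Direct reference, e.g. 'montblanc @ git+https://...'.
            name, source = line.split(" @ ", 1)
            packages[name.lower()] = source
        elif "==" in line:
            name, version = line.split("==", 1)
            packages[name.lower()] = version
    return packages


def diff_freezes(first, second):
    """Return {name: (entry_in_first, entry_in_second)} where the two differ.

    None marks a package that is absent from one of the environments.
    """
    a, b = parse_freeze(first), parse_freeze(second)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


# Excerpts from the two freezes in this thread.
newer_env = """\
-e git+https://github.com/ratt-ru/CubiCal.git@2dd52a0df04ef506bf89cf9cd6d4e5889bcddc3d#egg=cubical
numba==0.51.2
numpy==1.19.2
tensorflow==1.8.0
"""
older_node = """\
cubical @ git+https://github.com/ratt-ru/CubiCal.git@e49aff3975d8cc91a29a707c2d60036ad883bc8e
numba==0.49.1
numpy==1.18.4
pandas==1.0.3
tensorflow==1.8.0
"""

for name, versions in diff_freezes(newer_env, older_node).items():
    print(name, versions)
```

Packages pinned to the same version in both environments (tensorflow and montblanc here) do not show up in the diff. That does not rule them out, because the same wheel can behave differently on different CPUs.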
JSKenyon commented 4 years ago

This is still mysterious, as it was in #238. The fact that it works on one node and not another suggests something system-related, but I have no instinct for the cause.

bennahugo commented 4 years ago

As I mentioned, instruction sets are CPU-model specific. If you enable -march=native, the SSE/AVX and manufacturer-specific instructions supported by the build machine are compiled into the machine code of the distributed binary wheel. It may work on a range of architectures, but not necessarily across generations or manufacturers. This is why you get SIGILL (illegal instruction) from the kernel.
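
To check this on a given node, the CPU feature flags reported by the kernel can be compared between the working and the failing machine. The following is a rough sketch for Linux on x86 only: `cpu_flags`, `missing_simd` and the `SIMD_FLAGS` selection are hypothetical and chosen for illustration. As one widely reported example of the pattern, the official TensorFlow wheels from version 1.6 onwards are built with AVX, so a missing `avx` flag is worth looking for, although nothing in this thread confirms that as the cause here.

```python
# Illustrative selection of SIMD extensions that prebuilt binaries often assume.
SIMD_FLAGS = ("sse4_1", "sse4_2", "avx", "avx2", "fma", "avx512f")


def cpu_flags(cpuinfo_text):
    """Return the set of feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            # All cores normally report the same flags, so the first entry suffices.
            return set(value.split())
    return set()


def missing_simd(cpuinfo_text, wanted=SIMD_FLAGS):
    """List the wanted SIMD extensions that this CPU does not advertise."""
    flags = cpu_flags(cpuinfo_text)
    return [flag for flag in wanted if flag not in flags]
```

On each node, something like `missing_simd(open("/proc/cpuinfo").read())` would then show which of these extensions are absent. A flag that is missing on the failing node but present on the working one would be consistent with the explanation above.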

bennahugo commented 4 years ago

You can try pip's --no-binary option to force packages to be built from source instead of installing prebuilt wheels, both to check whether the issue arises in the wheels and as a workaround.

I will look into this with gdb tomorrow.
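
As a sketch of what that could look like, the helper below assembles a pip command that rebuilds a single package from source. `source_reinstall_command` and `reinstall_from_source` are hypothetical names, and which package to rebuild is an assumption, since the thread has not yet identified the offending dependency. The pip options used (`--no-binary`, `--no-cache-dir`, `--force-reinstall`, `--no-deps`) are standard ones.

```python
import subprocess
import sys


def source_reinstall_command(package):
    """Build a pip command that reinstalls `package` from its source distribution."""
    return [
        sys.executable, "-m", "pip", "install",
        "--force-reinstall",  # reinstall even if the same version is present
        "--no-deps",          # leave the package's dependencies untouched
        "--no-cache-dir",     # do not reuse a previously cached wheel
        "--no-binary", package,  # forbid prebuilt wheels for this package
        package,
    ]


def reinstall_from_source(package):
    """Run the reinstall in the current environment; raises if pip fails."""
    subprocess.run(source_reinstall_command(package), check=True)
```

For this to help, the build presumably has to run on the affected node itself, so that the compiler targets that node's CPU. A source build also needs the package's compiler toolchain and headers to be available there.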

JSKenyon commented 4 years ago

Thanks @bennahugo!