Closed o-smirnov closed 4 years ago
I have taken a look at this. The offending code is here and here. It crashes because there are eight spectral windows, which means that some of the shapes end up disagreeing: the Montblanc Stokes array gets something which is (nband, nchan) as opposed to (nband*nchan). I have implemented a "fix" in this branch. @sjperkins could you please verify that what I am doing isn't going to break things? @o-smirnov your example runs through for me with the changes, though I have no idea whether the results are correct.
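For illustration, the shape disagreement described above can be reproduced with plain NumPy: a 2D (nband, nchan) grid and its flattened (nband*nchan,) counterpart hold the same values but have incompatible shapes, and a ravel() reconciles them. This is a sketch with made-up values (nband=8 for the eight spectral windows, nchan=64), not the actual data dimensions.

```python
import numpy as np

nband, nchan = 8, 64  # hypothetical values: eight spectral windows

# What the Stokes array ends up receiving: a 2D (nband, nchan) grid.
freqs_2d = np.arange(nband * nchan, dtype=np.float64).reshape(nband, nchan)

# What it expects: a single flat (nband * nchan,) frequency axis.
freqs_flat = freqs_2d.ravel()

assert freqs_2d.shape == (nband, nchan)
assert freqs_flat.shape == (nband * nchan,)
```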
Thanks for taking a look at this @JSKenyon. The SourceProvider's need to be set up to return the correct data extents and chunks. It certainly should be possible to configure them, per DDID. I have no idea if Cubical's data management loads in multiple DDID's at once. Is this the case? If so, the SourceProviders probably need to be configured to iterate over each one separately.
I'm unsure about the ravelling of self._freqs[lc:uc], since my current understanding is that it is a 1D (nchan,) array of CHAN_FREQ. Has a getcol been changed to a getvarcol somewhere, introducing an extra singleton row, i.e. (row, nchan)?
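To illustrate the concern above: if a spurious singleton row has crept in, ravelling silently papers over it, so the result looks correct without revealing where the extra axis came from. A NumPy sketch with made-up frequency values:

```python
import numpy as np

nchan = 4
# The expected 1D (nchan,) CHAN_FREQ array (values are invented).
chan_freq_1d = np.linspace(1.0e9, 1.3e9, nchan)
# The same data with a spurious singleton row, i.e. shape (1, nchan).
chan_freq_2d = chan_freq_1d[None, :]

assert chan_freq_2d.shape == (1, nchan)
# ravel() makes both look identical, hiding the extra axis.
assert np.array_equal(chan_freq_2d.ravel(), chan_freq_1d)
```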
I'm a bit confused too. We iterate over DDIDs, right? So it's only ever called for one DDID at a time. Where does the extra nband dimension come in?
Ah no, wait, we actually set up the simulation per tile, which means a bunch of DDIDs are predicted together.
Precisely - I think we just stack them along the frequency axis.
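The stacking just described can be sketched in NumPy: concatenating the per-DDID channel arrays along the frequency axis yields the single flat (nband*nchan,) axis. The per-DDID frequencies below are invented; the real ones come from the measurement set.

```python
import numpy as np

nband, nchan = 8, 64  # hypothetical: eight DDIDs, 64 channels each

# Hypothetical per-DDID channel frequencies (one array per spectral window).
ddid_freqs = [np.linspace(1.0e9 + i * 1.0e8, 1.1e9 + i * 1.0e8, nchan)
              for i in range(nband)]

# Stacking along the frequency axis produces one flat (nband * nchan,) axis.
stacked = np.concatenate(ddid_freqs)
assert stacked.shape == (nband * nchan,)
```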
Strangely, I'm getting it to segfault now, with only a few changes to the options. @JSKenyon could you please run it with cyg.parset to see if you get the same?
Working on it - I did see the process pool crash out, but I suspect it happens when nthread is greater than 1.
Ah of course, that must be the crucial change. But does this mean that Montblanc threading is broken?
I don't think it is dying in Montblanc - still working on it.
Yeah - this definitely makes it to the solver before dying.
The issue is likely something in a numba kernel.
Lovely. Seems like the simplest kernel of them all, right?
Hmmm, this is proving to be far, far stranger than I would have anticipated.
Something is VERY dodgy when using parallel kernels and caching.
@o-smirnov Giving up on this one for today - I am downloading the test data to my local machine to speed up testing. There is some weird stuff happening, such as arrays having their shapes changed when they enter the numba layer.
Are you guaranteeing that the arrays are C-contiguous for the shape you want? You may need to make sure .copy() is called.
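As a sketch of that suggestion: a transposed NumPy array is Fortran-ordered, and either .copy() (which defaults to C order) or np.ascontiguousarray() produces a C-contiguous buffer.

```python
import numpy as np

a = np.ones((4, 3)).T            # transposing leaves the buffer Fortran-ordered
assert not a.flags.c_contiguous

b = np.ascontiguousarray(a)      # equivalently, a.copy() defaults to C order
assert b.flags.c_contiguous
assert np.array_equal(a, b)      # same values, contiguous layout
```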
Interesting! I can always run single-threaded, of course, but it would be nice to find the problem.
I think numba handles this OK:
```python
import numba
import numpy as np

@numba.njit
def fn(a):
    # Sum all elements, indexing in C order regardless of the array's layout.
    result = a.dtype.type(0)
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            result += a[i, j]
    return result

# A transposed array is Fortran-ordered, i.e. not C-contiguous.
data = np.ones((20, 20)).T
assert data.flags.c_contiguous is False
assert fn(data) == data.sum()
```
Ok nvm. Just thought it might be the reason for the segfault.
This appears to be some sort of regression in Numba - the segfault appears from 0.49 onward. This issue seems to be the problem. Setting --dist-nthread to the number of cores (both physical and virtual) on my laptop makes the problem go away. You can see my replies to that issue for more details. In the meantime @o-smirnov, you can downgrade to numba==0.48.
Scary! Or I can not use threads, I guess...
For CubiCal (as opposed to QuartiCal), downgrading shouldn't be at all problematic - we aren't really using the newer features inside the kernels.
Just a quick update - the Numba devs are being really helpful. They have managed to reproduce the issue, but it seems like it might be a complicated one. For now, avoid using threads or downgrade Numba to 0.48.
@o-smirnov I think this is fixed in the PR I just put up. Can you please take a look when you have a moment?
Trying to reacquaint myself with some old VLA data, I get this error with an LSM predict:
Logs/parset/etc. can be found on /net/simon//home/oms/projects/VLA-CygA.