ratt-ru / CubiCal

A fast radio interferometric calibration suite.
GNU General Public License v2.0

montblanc predict crashes (fixed) and nthread>0 causes segfaults #395

Closed - o-smirnov closed this issue 4 years ago

o-smirnov commented 4 years ago

Trying to reacquaint myself with some old VLA data, I get this error with an LSM predict:


could not broadcast input array from shape (8,128) into shape (5,1024)

'point_stokes' data source information:
    Description: (I,Q,U,V) Stokes parameters.
    Units: Janskys
    Schema or abstract shape: ('npsrc', 'ntime', 'nchan', 4)
            where 'npsrc' is 'point sources'
            where 'ntime' is 'Timesteps'
            where 'nchan' is 'Channels'
    Global shape on this iteration: (4, 11, 1024, 4)
    Local shape for this context: (4, 5, 1024, 4)
    Lower extents within global shape: (0, 6, 0, 0)
    Upper extents within global shape: (4, 11, 1024, 4)

Iteration information:

Logs/parset/etc. can be found on /net/simon//home/oms/projects/VLA-CygA.

JSKenyon commented 4 years ago

I have taken a look at this. The offending code is here and here. It crashes because there are eight spectral windows: some of the shapes end up disagreeing, as the Montblanc Stokes array gets something shaped (nband, nchan) rather than (nband*nchan). I have implemented a "fix" in this branch. @sjperkins could you please verify that what I am doing isn't going to break things? @o-smirnov your example runs through for me with the changes, though I have no idea if the results are correct.
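
For illustration, a minimal numpy sketch of the mismatch - the array names and values here are hypothetical, chosen only to match the (8,128) vs (5,1024) shapes in the traceback above:

import numpy as np

nband, nchan = 8, 128      # eight spectral windows of 128 channels each
ntime_chunk = 5            # timesteps in the chunk being filled

# Per-band values come out shaped (nband, nchan)...
per_band = np.arange(nband * nchan, dtype=np.float64).reshape(nband, nchan)
# ...but the consumer expects a flat (ntime, nband*nchan) slot.
slot = np.empty((ntime_chunk, nband * nchan))

try:
    slot[:] = per_band       # (8, 128) into (5, 1024) -> ValueError
except ValueError as e:
    print(e)                 # "could not broadcast input array ..."

slot[:] = per_band.ravel()   # flattening the band axis makes it broadcast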

sjperkins commented 4 years ago

It crashes because there are eight spectral windows: some of the shapes end up disagreeing, as the Montblanc Stokes array gets something shaped (nband, nchan) rather than (nband*nchan).

Thanks for taking a look at this @JSKenyon. The SourceProviders need to be set up to return the correct data extents and chunks. It should certainly be possible to configure them per DDID. I have no idea if CubiCal's data management loads multiple DDIDs at once. Is that the case? If so, the SourceProviders probably need to be configured to iterate over each one separately.

sjperkins commented 4 years ago

I have implemented a "fix" in this branch. @sjperkins could you please verify that what I am doing isn't going to break things?

I'm unsure about the ravelling of self._freqs[lc:uc], since my current understanding is that it's a 1D (nchan,) array of CHAN_FREQ. Has a getcol been changed to a getvarcol somewhere, introducing an extra singleton row, i.e. (row, nchan)?
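
For reference, a hedged sketch of the kind of provider method being discussed - the method name and the context.dim_extents call follow Montblanc's SourceProvider convention as I understand it, but the details are illustrative, not CubiCal's actual code:

import numpy as np

class TileFrequencyProvider(object):
    def __init__(self, freqs):
        # Hypothetical: per-band channel frequencies for a tile spanning
        # several DDIDs, shape (nband, nchan).
        self._freqs = freqs

    def frequency(self, context):
        # Montblanc asks for a flat slice of the chunk's channel axis, so
        # the band axis must be ravelled before applying the extents
        # (assumed API: dim_extents returns the (lower, upper) extents).
        (lc, uc) = context.dim_extents('nchan')
        return self._freqs.ravel()[lc:uc]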

o-smirnov commented 4 years ago

I'm a bit confused too. We iterate over DDIDs, right? So it's only ever called for one DDID at a time. Where does the extra nband dimension come in?

o-smirnov commented 4 years ago

Ah no, wait, we actually set up the simulation per tile, which means a bunch of DDIDs are predicted together.

JSKenyon commented 4 years ago

Precisely - I think we just stack them along the frequency axis.
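
Illustratively (made-up numbers matching the 8 spectral windows of 128 channels in this dataset):

import numpy as np

# Each DDID contributes its own channel axis; a tile concatenates them
# into a single 1024-channel frequency axis for the predict.
ddid_freqs = [np.linspace(1.0e9 + i * 128e6, 1.0e9 + (i + 1) * 128e6, 128,
                          endpoint=False)
              for i in range(8)]
tile_freqs = np.concatenate(ddid_freqs)
assert tile_freqs.shape == (1024,)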

o-smirnov commented 4 years ago

Strangely, I'm getting it to segfault now, with only a few changes to the options. @JSKenyon could you run it with cyg.parset please to see if you get the same?

JSKenyon commented 4 years ago

Working on it - I did see the process pool crash out, but I suspect it happens when nthread is greater than 1.

o-smirnov commented 4 years ago

Ah of course, that must be the crucial change. But does this mean that Montblanc threading is broken?

JSKenyon commented 4 years ago

I don't think it is dying in Montblanc - still working on it.

JSKenyon commented 4 years ago

Yeah - this definitely makes it to the solver before dying.

JSKenyon commented 4 years ago

The issue is likely something in a numba kernel.

JSKenyon commented 4 years ago

The crash is happening in this line.

o-smirnov commented 4 years ago

Lovely. Seems like the simplest kernel of them all, right?

JSKenyon commented 4 years ago

Hmmm, this is proving to be far, far stranger than I would have anticipated.

JSKenyon commented 4 years ago

Something is VERY dodgy when using parallel kernels and caching.
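
For reference, the combination in question is just numba's parallel=True together with cache=True. A minimal example of that pattern (hypothetical kernel, not CubiCal's):

import numba
import numpy as np

@numba.njit(parallel=True, cache=True)
def scale(a, out):
    # Trivial embarrassingly-parallel loop: prange splits the iterations
    # across numba's thread pool; cache=True persists the compiled kernel
    # to disk between runs.
    for i in numba.prange(a.shape[0]):
        out[i] = 2.0 * a[i]

a = np.arange(1024, dtype=np.float64)
out = np.empty_like(a)
scale(a, out)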

JSKenyon commented 4 years ago

@o-smirnov Giving up on this one for today - I am downloading the test data to my local machine to speed up testing. There is some weird stuff happening, such as arrays having their shapes changed when they enter the numba layer.

bennahugo commented 4 years ago

Are you guaranteeing that the arrays are C-contiguous for the shape you want? You may need to make sure .copy() is called.

o-smirnov commented 4 years ago

Interesting! I can always run single-threaded, of course, but it would be nice to find the problem.

sjperkins commented 4 years ago

Are you guaranteeing that the arrays are C-contiguous for the shape you want? You may need to make sure .copy() is called.

I think numba handles this OK:

import numba
import numpy as np

@numba.njit
def fn(a):
    # Sum all elements with explicit loops; numba compiles this for
    # contiguous and non-contiguous inputs alike.
    result = a.dtype.type(0)

    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            result += a[i, j]

    return result

# A transposed view is not C-contiguous, yet the kernel still agrees
# with numpy's own sum.
data = np.ones((20, 20)).T
assert data.flags.c_contiguous is False
assert fn(data) == data.sum()

bennahugo commented 4 years ago

Ok nvm. Just thought it might be the reason for the segfaulting.

JSKenyon commented 4 years ago

This appears to be some sort of regression in Numba - the segfault appears from 0.49 onward. This issue seems to be the problem. Setting --dist-nthread to the number of cores (both physical and virtual) on my laptop makes the problem go away. You can see my replies to that issue for more details. In the meantime, @o-smirnov, you can downgrade to numba==0.48 (pip install numba==0.48).

o-smirnov commented 4 years ago

Scary! Or I can not use threads, I guess...

JSKenyon commented 4 years ago

For CubiCal (as opposed to QuartiCal), downgrading shouldn't be at all problematic - we aren't really using the newer features inside the kernels.

JSKenyon commented 4 years ago

Just a quick update - the Numba devs are being really helpful. They have managed to reproduce the issue, but it seems like it might be a complicated one. For now, avoid using threads or downgrade Numba to 0.48.

JSKenyon commented 4 years ago

@o-smirnov I think this is fixed in the PR I just put up. Can you please take a look when you have a moment?