ratt-ru / montblanc

GPU-accelerated RIME implementations. An offshoot of the BIRO projects, and one of the foothills of Mt Exaflop.

Remove O(Nsrc) GPU memory requirement #15

Closed o-smirnov closed 8 years ago

o-smirnov commented 10 years ago

Current memory requirements scale with the number of sources, which quickly overruns GPU capacity when using realistically-sized (100-1000 source) sky models. We need to find a way to reduce this -- unless #14 eliminates the problem.

sjperkins commented 10 years ago

@oms-ratt This relates to the perhaps broader issue of how to organise data, both on and off the GPU.

I think it would be good to have a clearer idea of the problem sizes we are aiming at. Off the top of my head, I'm aware of figures such as the following for SKA:

Perhaps this is too ambitious for the moment and I should be thinking about something like MeerKAT - 2,016 baselines (64 antennas), or LOFAR - 91 baselines (14 antennas).
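For reference, the baseline counts above follow from the standard interferometry relation nbl = na(na-1)/2 for na antennas without autocorrelations (nothing montblanc-specific here):

```python
def nbl(na):
    """Number of unique baselines for na antennas, excluding autocorrelations."""
    return na * (na - 1) // 2

print(nbl(64))  # MeerKAT: 2016
print(nbl(14))  # LOFAR example: 91
```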

Recalling our discussion yesterday, the problem dimensions are currently ordered as one large (4, nbl, nchan, ntime, nsrc) array, which, after the source summation, becomes a (4, nbl, nchan, ntime) visibility array. Currently, montblanc requires GPU memory for all of this.

Modifying the source dimension commensurately affects how much of the problem we can fit onto a single GPU, but other dimensions also have an impact here. Recall that I've ordered the source dimension as the most rapidly varying, because this makes the source summation efficient via parallel reduction.

My current strategy is to separate computation on the GPU by baseline, rather than by source. Then the GPU essentially has to store (4, nchan, ntime, nsrc) and reduce this to a single (4, nchan, ntime) visibility block as one unit of work. In my mind, this computation is easy enough to distribute both on a single GPU and across multiple GPUs/a cluster. These separately computed visibilities can then be collated to form the final (4, nbl, nchan, ntime) result.

Something like 4 x 1024 chans x 1440 time x 1000 sources == 5,898,240,000 complex floats, which when multiplied by 8 bytes yields ~ 44 GB of storage space required. That's already far too big for current GPUs, and I can see why you'd want to reduce the number of sources.
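The arithmetic above can be checked in a couple of lines (a trivial sketch; complex64 at 8 bytes per value is the assumption):

```python
nchan, ntime, nsrc = 1024, 1440, 1000

# Per-source values for the (4, nchan, ntime, nsrc) cube.
n_complex = 4 * nchan * ntime * nsrc   # 5,898,240,000 complex values
n_bytes = n_complex * 8                # complex64: 8 bytes each

print(n_complex)        # 5898240000
print(n_bytes / 2**30)  # ~43.9 GiB
```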

However, following my strategy we could further separate computation by both baseline and channel. So then you'd have a (4, ntime, nsrc) block == 8 bytes x 4 x 1440 x 1000 = ~44 MB of storage required per baseline and channel, which would be reduced to (4, ntime) by summation. Of course the separate results need to be collated.

There's space to make the granularity a bit finer here. So for example, we could do half or a third of the channels for one baseline as a unit of computation.
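A toy numpy sketch of this decomposition, with hypothetical (small) dimensions and a dummy stand-in for the real RIME kernel, not montblanc's actual code: each (baseline, channel) unit of work produces a (4, ntime, nsrc) chunk, reduces it over the source axis, and the results are collated into the full visibility cube.

```python
import numpy as np

# Hypothetical, deliberately tiny problem dimensions for illustration.
nbl, nchan, ntime, nsrc = 3, 2, 5, 7

def compute_chunk(bl, ch):
    """Stand-in for the per-(baseline, channel) RIME kernel.
    Returns a (4, ntime, nsrc) block of per-source visibilities."""
    rng = np.random.default_rng(bl * nchan + ch)
    return rng.standard_normal((4, ntime, nsrc)).astype(np.complex64)

# Collate: each unit of work is reduced over the source axis
# before being written into the final (4, nbl, nchan, ntime) cube.
vis = np.empty((4, nbl, nchan, ntime), dtype=np.complex64)
for bl in range(nbl):
    for ch in range(nchan):
        vis[:, bl, ch, :] = compute_chunk(bl, ch).sum(axis=-1)

print(vis.shape)  # (4, 3, 2, 5)
```

The per-chunk memory footprint is only 4 x ntime x nsrc complex values, independent of nbl and nchan, which is the point of the decomposition.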

As I see it, the big problem is that the number of baselines scales as O(N^2) in the number of antennas, so this is the dimension that is going to explode. @jtlz2 Jon mentioned something about the number of sources being related to sqrt(N), but my impression was that many of the sources would be faint and could be ignored, or wouldn't increase as significantly.

I think this is a big topic and worth discussing further. I've asked Patrick to sign up to github so that he can be involved in this discussion.

jtlz2 commented 10 years ago

I agree this is a huge and important topic. Should we start an off-github document laying out the technical limitations/calculations, our astronomical requirements/potential, and current use cases? What are other good ways to do this - a brainstorming session after an initial document?

sjperkins commented 10 years ago

@jtlz2 I reckon the wiki could be suitable for this? Using markdown to put a basic document together is pretty easy, and everyone can contribute, see the changes, or comment.

jtlz2 commented 10 years ago

OK.


sjperkins commented 10 years ago

@oms-ratt @jtlz2 With single precision floats, it seems that it's possible to handle the following problem size on jake:

RIME Simulation Dimensions
- Antennas: 15
- Baselines: 105
- Channels: 64
- Timesteps: 72
- Sources: 200
- GPU Memory: 2984 MB

Clearly the problem blows up when we start adding more sources/channels/timesteps/antennas, but is the above problem size sufficient for your current needs?

jtlz2 commented 10 years ago

Sounds like a good start to me. How does it scale in each dimension? E.g. I'd probably only want (say) 100 sources and 16 channels, but more timesteps.

One thing we haven't really discussed is gridding the uv data before processing it. I've always thought this could provide helpful compression. @sjperkins this only affects the way the uv points are presented: e.g. time/frequency -> u,v cells on a (modest) grid.

1 GPU helps with the RAM in a meaningful way, right?

sjperkins commented 10 years ago

@jtlz2 What size grid are you thinking of?

With our current setup, it looks like you may be able to get away with something like this:

RIME Simulation Dimensions
- Antennas: 15
- Baselines: 105
- Channels: 16
- Timesteps: 600
- Sources: 100
- GPU Memory: 3142 MB

jtlz2 commented 10 years ago

@sjperkins Need to think about that - sorry.

OK. Maybe montblanc should come with a problem-size calculator ;)

sjperkins commented 10 years ago

@jtlz2 It's kind of got one at the moment; unfortunately you have to allocate the GPU memory to figure out how much is used :tongue:
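A host-side estimate along these lines could avoid the allocation. This is a rough sketch only: the array inventory (just the per-source cube plus the reduced visibilities) and the flat 100 MB overhead term are my assumptions, not montblanc's actual accounting, so the numbers will only be ballpark.

```python
def estimate_mb(na, nchan, ntime, nsrc, overhead_mb=100):
    """Rough GPU memory estimate in MB, assuming complex64 (8 bytes per value).

    Counts only the per-source (4, nbl, nchan, ntime, nsrc) cube and the
    reduced (4, nbl, nchan, ntime) visibilities, plus a flat overhead guess.
    """
    nbl = na * (na - 1) // 2
    per_source = 4 * nbl * nchan * ntime * nsrc * 8
    reduced = 4 * nbl * nchan * ntime * 8
    return (per_source + reduced) / 2**20 + overhead_mb

# ~3068 MB: the same ballpark as the 2984 MB reported above for this problem.
print(round(estimate_mb(na=15, nchan=64, ntime=72, nsrc=200)))
```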

sjperkins commented 10 years ago

@jtlz2 FYI, jake has around 3300 MB of GPU RAM and elwood has 4800 MB. So if you want to run bigger problem sizes, you could use elwood for the moment.

jtlz2 commented 10 years ago

@sjperkins Am good for now - 16 MB max so far.........

sjperkins commented 8 years ago

Implemented in #87