ledatelescope / bifrost

A stream processing framework for high-throughput applications.
BSD 3-Clause "New" or "Revised" License

Request for information (NUMA awareness, DADA support, etc. etc.) #105

Open · ewanbarr opened this issue 7 years ago

ewanbarr commented 7 years ago

Off the bat, this is not really an issue; it is more a request for information on a few aspects of bifrost. The background for this request is that I am writing a multibeam beamformer/correlator for MeerKAT and am playing around with different frameworks for putting together the DSP pipeline.

To give some context, the beamformer will be a 32-node GPU cluster. The ingest rate is ~57 Gb/s per node, and we are planning to use SPEAD2 and PSRDADA for high-performance capture into a ring buffer, with processing then happening on a couple of GPUs per node before producing multiple SPEAD2 output streams for data products going to other instruments. The processing done on the GPUs will be standard beamforming and correlation (using dp4a support in beanfarmer and xGPU), plus a bunch of transposes and quantisation steps.

So questions:

  1. Where can I find some detailed example code showing a real-world application of bifrost (the GUPPI processing pipeline, for example)?
  2. In the same vein, I didn't see any performance measurements in the paper for a full pipeline execution. You talk about SLOC etc., which is all good, but it would be really useful to have some idea of what the pipeline overheads are like when using bifrost.
  3. Do you have or are you planning on having support for DADA ring buffer input and output?
  4. How is NUMA awareness handled when specifying cores for block affinities? Are there guarantees about the location of a core in the topology based on its number (i.e. does HWLOC handle this)?

Finally, good job. This is a pretty awesome piece of software.

Cheers, Ewan

MilesCranmer commented 7 years ago

Hi @ewanbarr, thanks for the interest! Please continue to keep us updated on your evaluation of Bifrost—you have a very exciting use case.

I hope @benbarsdell / @jaycedowell / @telegraphic can comment on your questions as well, but here's my take:

  1. Here's the GUPPI pipeline: https://github.com/telegraphic/bunyip/blob/master/bf_gpuspec.py. Note that linear pipelines can also now be written using block_chainer, which should make for even fewer tokens than listed in the paper (see the sketch after this list). The pipeline code used for LWA-SV is in a private repo, and I can't speak on the future of that (@benbarsdell, @jaycedowell?).
  2. The performance figures in the paper measure this pipeline: https://github.com/ledatelescope/bifrost/blob/master/test/benchmarks/performance_vs_serial/linear_fft_pipeline.py against this one: https://github.com/ledatelescope/bifrost/blob/master/test/benchmarks/performance_vs_serial/skcuda_fft_pipeline.py. By full pipeline execution, do you mean performance relative to a hand-optimized, pure-C++/CUDA equivalent? We had planned to do something along those lines using the original GUPPI->spectra code (https://github.com/UCBerkeleySETI/gbt_seti/blob/master/src/guppi2spectra.c) but ended up not doing it for various reasons. I did run a line-by-line profiler over the newest version's FFT pipeline, and the Python calls are small compared to the time spent in the C/CUDA functions; since they are nonzero, though, it could be useful to publish those numbers specifically and/or keep a log of them (a quick way to reproduce a rough breakdown is sketched after this list).
  3. This is a really good idea and has come up several times in telecons. Andrew Jameson has also (?) shown interest in something to link DADA and Bifrost like this. I don't think anybody has gotten around to creating it, but it should be doable...
  4. I think I will have to wait for Ben on this one...
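
To make the block_chainer point in item 1 concrete, here is roughly what a short linear pipeline looks like when chained that way. This is a sketch from memory of the testbench examples; the file name, block choices, and parameters are all illustrative, so check the repo for the exact signatures.

```python
import bifrost as bf

# Chain a simple linear spectrometer-style pipeline; each call appends a block
# to the default pipeline, so no intermediate ring variables are needed.
bc = bf.BlockChainer()
bc.blocks.read_wav(['input.wav'], gulp_nframe=4096)   # illustrative input source
bc.blocks.copy('cuda', gulp_nframe=4096)              # host -> GPU
bc.views.split_axis('time', 256, label='fine_time')   # reshape for the FFT
bc.blocks.fft(axes='fine_time', axis_labels='freq')   # channelize
bc.blocks.detect(mode='scalar')                       # form power spectra
bc.blocks.copy('cuda_host')                           # GPU -> host
bc.blocks.write_sigproc()                             # write filterbank output

pipeline = bf.get_default_pipeline()
pipeline.run()
```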
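And on item 2, if you just want a rough, source-unmodified breakdown of Python time versus time spent in the extension calls, an ordinary cProfile run over the FFT benchmark script linked above is enough (the line-by-line numbers I mentioned came from a separate profiler):

```python
import cProfile
import pstats
import runpy

# Run the benchmark pipeline under cProfile and print the top entries by
# cumulative time; most of it should sit inside the bifrost/libbifrost calls.
cProfile.run("runpy.run_path('linear_fft_pipeline.py', run_name='__main__')",
             'bifrost_fft.prof')
pstats.Stats('bifrost_fft.prof').sort_stats('cumulative').print_stats(30)
```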

Thanks again! Cheers, Miles

benbarsdell commented 7 years ago

Hey Ewan,

That would be a very cool application to try Bifrost out on; it's exactly the kind of thing it aims to tackle.

Your GPU cluster and software plans sound great. I have to say that I'm yet to be convinced about SPEAD (it always seemed very overcomplicated to me), but I guess there are lots of benefits to a standard protocol.

To answer your questions:

  1. I've added a simpler gpuspec example here: https://github.com/ledatelescope/bifrost/blob/master/testbench/gpuspec_simple.py
  2. I think Miles addressed most of this. I'll just add that under most practical circumstances I expect the overhead to be negligible. It should only become noticeable in cases where gulp sizes are very small.
  3. As Miles said, I don't think anyone actually started working on it, but we talked about it. I originally had some concerns over licensing issues, but I think AJ clarified that it should be ok.
  4. Cores are indexed by their absolute number within the system, so you can specify exactly which core on which NUMA node, and whether it's a hyperthread.
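
If it helps, here is a small sketch of how you could double-check which absolute core numbers live on which NUMA node before wiring them into block affinities. It just parses the standard Linux sysfs topology; the bifrost.affinity call in the trailing comment is from memory, so treat it as illustrative.

```python
import glob
import os

def cores_by_numa_node():
    """Return {numa_node_id: [absolute core ids]} parsed from Linux sysfs."""
    topology = {}
    for node_dir in glob.glob('/sys/devices/system/node/node[0-9]*'):
        node_id = int(os.path.basename(node_dir)[len('node'):])
        with open(os.path.join(node_dir, 'cpulist')) as f:
            cpulist = f.read().strip()          # e.g. "0-7,16-23"
        cores = []
        for part in cpulist.split(','):
            if '-' in part:
                lo, hi = part.split('-')
                cores.extend(range(int(lo), int(hi) + 1))
            else:
                cores.append(int(part))
        topology[node_id] = cores
    return topology

if __name__ == '__main__':
    topo = cores_by_numa_node()
    print(topo)
    # Then pick block cores from the node closest to your NIC/GPU, e.g.:
    # from bifrost import affinity
    # affinity.set_core(topo[1][0])
```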

Cheers,

Ben