ledatelescope / bifrost

A stream processing framework for high-throughput applications.
BSD 3-Clause "New" or "Revised" License

Use case: gpuspec for Breakthrough Listen #51

Closed by telegraphic 2 years ago

telegraphic commented 7 years ago

For Breakthrough Listen, we need a real-time pipeline for generation of spectral data from voltage data.

At the moment, we have a lot of data stored in guppi raw format, and need to convert it into spectra when not observing. We are currently using a big, messy C/CUDA code called gpuspec; I would like to recreate this using bifrost.

I have created an example Python script using PyCUDA and scikit-cuda, which is a lot quicker than regular Python, but still pretty slow. Essentially, I'm just using cuFFT and a CUDA multiply. The slowest parts of the code seem to be file reading, memcopies to/from the device, and the numpy fftshift function (is there a CUDA one??).

As the memcopy/file read is the slowest part, this seems ripe for pipelining. However, I'm not sure where to start. Advice / boilerplate appreciated!
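For reference, the per-integration operations described above (batched FFT per coarse channel, detect via a multiply, fftshift, accumulate) can be sketched on the CPU with plain NumPy. The helper name and toy array sizes here are illustrative, not taken from the actual gpuspec.py:

```python
import numpy as np

def gpuspec_integration(data):
    """CPU-side sketch of one gpuspec integration (hypothetical helper).

    data: complex64 array of shape (n_chan, n_time, n_pol),
    matching the guppi raw block layout discussed below.
    """
    # Batched FFT along the time axis: one FFT per coarse channel and pol
    spec = np.fft.fft(data, axis=1)
    # Detect: form power spectra |X|^2 (the "CUDA multiply" step)
    power = (spec * spec.conj()).real
    # Shift zero frequency to the centre of each fine spectrum
    power = np.fft.fftshift(power, axes=1)
    # Accumulate polarizations into total power
    return power.sum(axis=2)

x = (np.random.rand(4, 16, 2) + 1j * np.random.rand(4, 16, 2)).astype(np.complex64)
out = gpuspec_integration(x)
print(out.shape)  # (4, 16)
```

In the real pipeline the FFT, multiply, and shift run on the GPU; the point of the sketch is just the order of operations and the axis each one acts on.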

Getting example code

Clone the filterbank repository with the gpuspec.py code in it:

git clone https://github.com/ucBerkeleySETI/filterbank

Get some data from the public data archive (http://www.breakthroughinitiatives.org/OpenDataSearch):

wget http://blpd0.ssl.berkeley.edu/ratan/blc05_guppi_57628_20146_DIAG_HIP88194_0008.0001.raw

Running example code

Run gpuspec code:

./gpuspec.py blc05_guppi_57628_20146_DIAG_HIP88194_0008.0001.raw -N 5 -f 1024

Output shows some benchmarking:

Integration ID:  0 (1 of 5)
(Data load:      0.07s)
FFT Plan:          0.18s

XX polarization timing
    ..memcopy to gpu:     0.24s
    ..cuFFT:              0.00s
    ..memcopy from gpu:   0.06s
    ..cuMultiply:         0.20s
cuFFT total:              0.93s
DC blanking:              0.00s
FFT shift:                0.62s
Accumulate:               0.04s

Integration ID:  0 (2 of 5)
(Data load:      0.07s)
XX polarization timing
    ..memcopy to gpu:     0.24s
    ..cuFFT:              0.00s
    ..memcopy from gpu:   0.05s
    ..cuMultiply:         0.06s
cuFFT total:              0.85s
DC blanking:              0.00s
FFT shift:                0.51s
Accumulate:               0.04s
benbarsdell commented 7 years ago

It would be great to port this to Bifrost.

What's the purpose of the np.roll at the end?

telegraphic commented 7 years ago

Excellent question. It just needs it! I think it's to do with the coarse channelizer in the firmware not doing a shift.

This is probably an edge case for bifrost, as the FFT is very long, so spans / gulps will be large.


benbarsdell commented 7 years ago

I see. Does the roll not counteract the fftshift? They seem to be very similar operations, but my intuition could just be wrong.
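The two operations are indeed closely related; along a single axis, fftshift is exactly a roll by half the axis length, so one can undo the other. A quick NumPy check:

```python
import numpy as np

x = np.arange(8)
# For an axis of length n, fftshift is exactly a roll by n // 2
assert np.array_equal(np.fft.fftshift(x), np.roll(x, len(x) // 2))
# ...so a roll by -(n // 2) undoes it (this is what ifftshift does)
assert np.array_equal(np.roll(np.fft.fftshift(x), -(len(x) // 2)), x)
```

Whether the roll in gpuspec.py cancels the fftshift therefore depends entirely on which axes each one acts over, which is what the discussion below resolves.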

I'm also having trouble working out the array shapes at each point; would you be able to briefly summarise them? In particular the order of the axes and the size of the FFTs.

telegraphic commented 7 years ago

Ok, so there's a few things going on.

The data come out of the FPGA coarsely channelized into 64 channels. The .raw file is a header followed by data sprayed directly to disk.

If you open the raw file:

import guppi
raw = guppi.GuppiRaw('/bldata/gbt_raw/blc1_guppi_57388_HIP113357_0010.0000.raw')
header, data = raw.read_next_data_block()
print(data.shape)   # (64, 516608, 2)
print(data.dtype)   # complex64

This is 64 coarse channels, each of which has 516608 complex time samples, and there are 2 polarizations. That is, the array is (channels, time, pol).

The fft I'm doing is batched for channels (http://scikit-cuda.readthedocs.io/en/latest/generated/skcuda.fft.Plan.html)

#              Plan(N_fft,         input_dtype,  output_dtype, batch_size)
fft_plan = Plan(d_xx.shape[1], np.complex64, np.complex64, d_xx.shape[0])

To form the full spectra, you stitch together the fft'd output of each coarse channel.
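A toy NumPy sketch of the batched FFT and the stitching step (array sizes here are small illustrative values, not the real 64 × 516608 blocks):

```python
import numpy as np

n_chan, n_time = 4, 8  # toy sizes; the real data block is (64, 516608, 2)
xx = (np.random.rand(n_chan, n_time)
      + 1j * np.random.rand(n_chan, n_time)).astype(np.complex64)

# Equivalent of the batched cuFFT plan above:
# one n_time-point FFT per coarse channel
fine = np.fft.fft(xx, axis=1)

# Centre each coarse channel's fine spectrum, then stitch the channels
# side by side into one full spectrum
full_spectrum = np.fft.fftshift(fine, axes=1).ravel()
print(full_spectrum.shape)  # (n_chan * n_time,)
```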

The fftshift is along the time axis, the roll is along the channel axis (albeit after unravelling).

telegraphic commented 7 years ago

So after some tracking down, the second np.roll was indeed needed because np.fft.fftshift was operating over all axes by default, instead of just the coarse channel axis.
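This behaviour is easy to reproduce with NumPy: fftshift with no axes argument shifts every axis, and the difference from a single-axis shift is exactly a roll over the remaining axis, which is why the extra np.roll compensated for it:

```python
import numpy as np

a = np.arange(24).reshape(4, 6)  # toy (channel, time) array

# fftshift with no axes argument shifts every axis...
all_axes = np.fft.fftshift(a)

# ...whereas restricting it to one axis leaves the other unshifted;
# the difference is exactly a roll over that other axis
time_only = np.fft.fftshift(a, axes=1)
assert np.array_equal(all_axes, np.roll(time_only, a.shape[0] // 2, axis=0))
```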

I also found that page-locking the numpy array sped up the first memcopy quite significantly, so the newer code is a fair bit faster:

(Data load:      0.07s)
    ..memcopy to gpu:     0.05s
    ..cuFFT:              0.00s
    ..memcopy from gpu:   0.06s
    ..cuMultiply:         0.06s
cuFFT:            0.44s
DC blanking:              0.00s
FFT shift:                0.21s
Accumulate:               0.03s
telegraphic commented 7 years ago

For my future reference, this might be useful: https://github.com/marwan-abdellah/cufftShift

(it's LGPL licensed)

benbarsdell commented 7 years ago

Excellent, and thanks for the interesting reference. (quick note: I'm not sure why the kernel in cufftShift_2D_OP.cu is littered with __syncthreads calls; none of them are necessary).

I haven't tried it yet, but I think it should be quite easy to implement fftshift using the new "bfMap" function I have that lets you define a function and/or indexing to apply to a set of ND arrays. Should just be this for any number of dimensions: "out(_) = in(_ - in.shape()/2)".
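The indexing expression "out(_) = in(_ - in.shape()/2)" can be mimicked in plain NumPy with modular index arithmetic, which also shows why it generalises to any number of dimensions. This is just a CPU sketch of the idea, not the bfMap implementation:

```python
import numpy as np

def fftshift_by_indexing(a):
    """fftshift via pure index arithmetic: out[j] = in[(j - n//2) mod n]
    along every axis, mimicking out(_) = in(_ - shape/2)."""
    for axis, n in enumerate(a.shape):
        a = np.take(a, (np.arange(n) - n // 2) % n, axis=axis)
    return a

b = np.arange(12).reshape(3, 4)
assert np.array_equal(fftshift_by_indexing(b), np.fft.fftshift(b))
```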

telegraphic commented 7 years ago

FYI, my code for this is in https://github.com/telegraphic/bifrost/tree/master/gpuspec

It works in that it takes an input, does all the operations, and puts out an output, but there is some data corruption going on somewhere. Nevertheless, it looks vaguely correct which is somewhat reassuring.

telegraphic commented 7 years ago

(also note some extra cuda kernels in tree/master/src)

benbarsdell commented 7 years ago

Hey Danny, two quick questions re guppi raw format:

  1. Which header parameters can be used to derive the exact time the observation was made?
  2. Could the headers ever change within a single file?
benbarsdell commented 7 years ago

@telegraphic Nvm about (1), I got it from looking in the dspsr source: PKTIDX*PKTSIZE gives the byte offset, and STT_IMJD + STT_SMJD/86400. gives the starting MJD.
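A toy check of that start-time formula (the header values below are illustrative, not from a real file):

```python
# STT_IMJD is the integer MJD of the observation start and STT_SMJD the
# seconds past midnight, so the starting MJD is STT_IMJD + STT_SMJD / 86400.
stt_imjd = 57628   # illustrative values only
stt_smjd = 20146
start_mjd = stt_imjd + stt_smjd / 86400.0
print(start_mjd)   # a fractional MJD within day 57628
```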

benbarsdell commented 7 years ago

https://github.com/ledatelescope/bifrost/commit/88101a971d2af39727955d231c7187c78a48e5dc adds DetectBlock and FftShiftBlock, which are relevant to this issue.

benbarsdell commented 7 years ago

@telegraphic do you have an update on the status of this use-case?

A relatively complete example of this pipeline was added in https://github.com/ledatelescope/bifrost/commit/eaf1a93d05ae2f699113dc89460d4e697f65e5a1.

telegraphic commented 2 years ago

I think it's fair to close this issue at this stage! bifrost is being used in places within Breakthrough Listen, but the rawspec code (https://github.com/UCBerkeleySETI/rawspec) is also being used.