ledatelescope / bifrost

A stream processing framework for high-throughput applications.

Bifrost pipeline doesn't seem to be very fast #50

Closed: telegraphic closed this issue 2 years ago

telegraphic commented 7 years ago

I am finally playing around with the bifrost numpy interface, and it seems to be pretty slow. I made a script, your_first_pipeline.py, which takes a test vector and does FFT -> IFFT -> FFT -> IFFT (so the output should be equal to the input):

import time
import sys
import numpy as np
from bifrost.block import Pipeline, NumpyBlock, NumpySourceBlock

# Read data size and iteration count from the command line
arr_size = int(sys.argv[1])
n_iter   = int(sys.argv[2])

# Our simple test-bench
def generate_test_vectors():
    for ii in range(n_iter):
        yield np.arange(0, arr_size, 1, dtype='complex64')

def validate_test_vectors(x):
    v = np.arange(0, arr_size, 1, dtype='complex64')
    assert np.allclose(x, v)

# Benchmark with Python only
t0 = time.time()
for xarr in generate_test_vectors():
    xfft = np.fft.fft(xarr)
    xifft = np.fft.ifft(xfft)
    xfft = np.fft.fft(xifft)
    xifft = np.fft.ifft(xfft)
    validate_test_vectors(xifft)

t1 = time.time()
print "Time taken: ", t1-t0

# Benchmark with bifrost
t0 = time.time()
blocks = [
  (NumpySourceBlock(generate_test_vectors), {'out_1': 'raw'}),
  (NumpyBlock(np.fft.fft),  {'in_1': 'raw',   'out_1': 'fft0'}),
  (NumpyBlock(np.fft.ifft), {'in_1': 'fft0',  'out_1': 'ifft0'}),
  (NumpyBlock(np.fft.fft),  {'in_1': 'ifft0', 'out_1': 'fft1'}),
  (NumpyBlock(np.fft.ifft), {'in_1': 'fft1',  'out_1': 'ifft1'}),
  (NumpyBlock(validate_test_vectors, outputs=0), {'in_1': 'ifft1'})]

Pipeline(blocks).main()
t1 = time.time()
print "Bifrost time: ", t1-t0

And I am doing basic speed tests:

dancpr@bldcpr:/bldata/bifrost/examples$ python your_first_pipeline.py 1000 1000
Time taken:  0.0867929458618
Bifrost time:  5.81598901749
dancpr@bldcpr:/bldata/bifrost/examples$ python your_first_pipeline.py 10000 100
Time taken:  0.0630059242249
Bifrost time:  0.601629018784
dancpr@bldcpr:/bldata/bifrost/examples$ python your_first_pipeline.py 10000 1000
Time taken:  0.624031066895
Bifrost time:  5.99617195129

But it seems like Bifrost is slower in most circumstances... What's up?

MilesCranmer commented 7 years ago

I am finding similar results when I run this script. From reading through the code, everything looks set up exactly as it should be, so there does appear to be an internal problem with the Pipeline or NumpyBlock classes. The speed tests I ran two months ago, which showed a 2.5x speedup over the serial equivalent (10000-element complex64 arrays going through an 8-block FFT/IFFT pipeline), did use the NumpyBlock but did not use the NumpySourceBlock, so that might be one contributor to the slow speed. I can't think of any other recent changes that would cause the issue.

I will look a bit more for the cause of this before the telecon, and will try to study it more over the weekend.

Which branch are you using btw?

MilesCranmer commented 7 years ago

Hi Danny,

I figured part of the problem out. The speed tests I did in September did use the NumpySourceBlock, but had the following flag set: changing=False (this flag exists on the branches /example/fft-speed-test and /feature/multi-sequence-support). However, I don't think this will completely solve the speed problem even on those branches (see my next point below).

This flag defaults to True, and what it does is make the NumpySourceBlock decide for itself when it has to create a new sequence/header on the output ring. It does that by checking the generated arrays' shape/size/dtype for changes at every single iteration, which I can see makes it quite slow. I remember making the default True because I thought it would be safer to assume the user's generator might start outputting arrays of a different size, but I suppose this default behaviour is snail-paced. Would it be a good idea to change the default to changing=False instead? It comes down to safety vs. speed as the default behaviour (though a bottleneck can be hard to localize, whereas a broken block becomes evident immediately).
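
Conceptually it is doing something like this (a sketch of the idea only, not Bifrost's actual code; gen() stands in for the user's generator):

import numpy as np

def gen():
    # stand-in for the user's generator
    for _ in range(1000):
        yield np.arange(10000, dtype='complex64')

prev_signature = None
for arr in gen():
    signature = (arr.shape, arr.dtype)   # recomputed on every single iteration
    if signature != prev_signature:
        prev_signature = signature
        # ...a new sequence/header would be emitted on the output ring here...
    # ...arr would be written into the ring here...
# With changing=False this check would run once, for the first array only.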

The other part of the problem is that I don't think the current /master or the other most up-to-date branches are actually parallelizing blocks on my machine (this should happen via the core-affinity call in the _main() method of MultiTransformBlock). Even when I increase the array size to 1,000,000, so as to minimize the impact of Bifrost's overhead on the pipeline performance, the Bifrost pipeline speed only approaches the serial speed. I will have to look at this more next week. A quick way to check the pinning by hand is below.
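
This is a Linux-only sanity check using just the Python standard library, not Bifrost's own affinity code:

import os

# If per-block core pinning is working, a block's process/thread should
# report a single-core affinity mask rather than the full set of cores.
print(os.sched_getaffinity(0))   # cores the current process may run on

# Forcing a pin by hand, for comparison with what the pipeline should set:
os.sched_setaffinity(0, {1})     # restrict to core 1
print(os.sched_getaffinity(0))   # now prints {1}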

Cheers,

Miles

jaycedowell commented 2 years ago

Since I'm going through old tickets I decided to try this out under the autoconf branch and Py3.6:

$ python your_first_pipeline.py 1000 1000
Time taken:  0.3951113224029541
Bifrost time:  10.06708550453186
$ python your_first_pipeline.py 10000 100
Time taken:  0.19725704193115234
Bifrost time:  1.342881441116333
$ python your_first_pipeline.py 10000 1000
Time taken:  1.9654483795166016
Bifrost time:  9.767495155334473

I had similar results when setting changing=False on the NumpySourceBlock.

If I go to larger array sizes, the trend starts to reverse:

$ python your_first_pipeline.py 100000 1000
Time taken:  18.668033123016357
Bifrost time:  8.693476915359497
$ python your_first_pipeline.py 1000000 100
Time taken:  22.8755042552948
Bifrost time:  9.941920518875122

Digging into this a little more, it looks like the smaller-array Bifrost times are dominated by a slowdown in the input ring. For example, with python your_first_pipeline.py 1000 10, the NumPy time is ~4 ms and the Bifrost time is ~630 ms. Of that, ~550 ms can be attributed to a single slow span acquire. If you ignore that one gulp, the runtime drops to ~80 ms, which I think is much more in line with the expected overheads for Bifrost. What do you do about it? No idea. It always seems to be there, even if I have a pipeline that consists of only a single NumpySourceBlock. (One way to catch a slow call like that is sketched below.)
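
If anyone wants to chase this further, a minimal way to catch one anomalously slow call among many fast ones is to wrap the suspect method (e.g. whatever acquires the span) with a wall-clock timer. timed() here is a hypothetical helper, not part of Bifrost:

import functools
import time

def timed(fn, threshold=0.1):
    # Report any call to fn that takes longer than threshold seconds;
    # a single ~550 ms outlier would stand out immediately.
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        t0 = time.perf_counter()
        result = fn(*args, **kwargs)
        dt = time.perf_counter() - t0
        if dt > threshold:
            print('%s took %.3f s' % (fn.__name__, dt))
        return result
    return wrapper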

telegraphic commented 2 years ago

I think this is using an older version of Block -- not sure if anyone is actually using this? I've been using
https://github.com/ledatelescope/bifrost/blob/d9d23d2967c824f8c09158b4e9c81ca46b6db41e/python/bifrost/pipeline.py#L319

I think we should deprecate the block.py version in favor of supporting the pipeline.py version ...

telegraphic commented 2 years ago

PS: on the topic of slow performance, I noticed that if CUDA_CACHE_DIR is set to an NFS mount, pipeline startup is really slow. (I can't recall if I documented this elsewhere.)

(The broader suggestion is to set CUDA_CACHE_DIR and/or symlink the ~/.nv directory to /dev/shm on compute nodes that have a shared filesystem; a sketch of the symlink approach is below.)
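
Something like this, run once per node (a sketch only; it assumes /dev/shm is a node-local tmpfs and that nothing currently has the cache open):

import os

# Keep the CUDA JIT cache (~/.nv) on node-local tmpfs instead of NFS.
local_cache = os.path.join('/dev/shm', os.environ.get('USER', 'cuda'), 'nv')
os.makedirs(local_cache, exist_ok=True)
nv_dir = os.path.expanduser('~/.nv')
if os.path.isdir(nv_dir) and not os.path.islink(nv_dir):
    os.rename(nv_dir, nv_dir + '.bak')   # move any existing cache out of the way
if not os.path.exists(nv_dir):
    os.symlink(local_cache, nv_dir)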

jaycedowell commented 2 years ago

Yeah, the whole block.py vs pipeline.py thing seems to be kind of a theme in Bifrost right now where there are two ways to do the same thing. I'll open an issue on that (see https://github.com/ledatelescope/bifrost/issues/156).

On your second point we should try to collect this kind of wisdom somewhere. I'm not sure where, though. The documentation? The tutorials?

jaycedowell commented 2 years ago

After some additional digging, it looks like the slowdown is from the closeout of the innermost ExitStack used in the write method of the base block (a small illustration of that teardown pattern is below). Given that I now understand where/why it happens, and that I don't think anyone is using block.py, I am inclined to close this. @telegraphic any objections?
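
For reference, this is the shape of the problem (an illustration, not Bifrost's code): callbacks registered on an ExitStack only run when the stack closes, so a slow resource release shows up at the end of the with-block rather than inside it.

import time
from contextlib import ExitStack

with ExitStack() as stack:
    stack.callback(time.sleep, 0.5)   # stand-in for a slow span release
    t0 = time.perf_counter()
    # ...the body of write() runs here and looks fast...
print('closeout took %.3f s' % (time.perf_counter() - t0))   # ~0.5 s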