mspass-team / mspass

Massive Parallel Analysis System for Seismologists
https://mspass.org
BSD 3-Clause "New" or "Revised" License

Filter benchmarks and design implications #90

Open pavlis opened 3 years ago

pavlis commented 3 years ago

I just ran a couple of interesting benchmark tests comparing the new bandpass filter processor I built by adapting C code from seismic unix against the obspy filter wrappers. I'm going to post this as a jupyter notebook in tutorials when I finish exercising the rest of the implementation, but the results are striking enough that I want to share them now. They have implications for how we should design algorithms.

First, since I haven't checked this stuff in yet, you won't be able to run this right now. To understand the benchmark script below you need to know that my implementation uses the OOP paradigm of "creation is initialization," as opposed to the classic FORTRAN subroutine model (a call with parameters). That is, for these tests I create an instance of a Butterworth filter processor using this (admittedly ugly) construct:

bf=Butterworth(False,True,True,2,0.5,2,2.0,0.05)

This is very nonpythonic because there is no fancy wrapper to handle the ugly position-dependent arguments. This construct defines bf as a Butterworth bandpass filter operator with 2 poles and corners at 0.5 and 2 Hz. The final argument, 0.05, is the sample interval in seconds, needed because the coefficients of a filter like this depend upon the sample rate. I'm well aware we need some wrapper around this ugly construct to make it more pythonic, but that isn't the point of this entry.
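For what it's worth, a thin keyword-argument wrapper is one way such a construct could eventually be made more pythonic. This is purely a sketch: every keyword name below is my guess at what the positional arguments mean, not an existing MsPASS API.

```python
# Hypothetical wrapper - the Butterworth class and its positional
# signature come from the example above; the keyword names are
# illustrative guesses, not anything defined in MsPASS.
def make_butterworth_args(zerophase=False, enable_lo=True, enable_hi=True,
                          npoles_lo=2, f_lo=0.5, npoles_hi=2, f_hi=2.0,
                          dt=0.05):
    """Build the positional argument tuple from readable keywords."""
    return (zerophase, enable_lo, enable_hi,
            npoles_lo, f_lo, npoles_hi, f_hi, dt)

args = make_butterworth_args()  # defaults reproduce the example above
# bf = Butterworth(*args)       # would construct the actual operator
```

A wrapper like this costs nothing per data object since it is only called at operator construction time.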

Here is the first benchmark I want to show you. It is cut from the notebook, so a number of the variables and the imports are defined earlier. Let me show it and then explain what it demonstrates:

# We make these copies only as containers to use as workspaces
do=TimeSeries(d1)
dm=TimeSeries(d1)
tobspy=0.0
tmspass=0.0
number_to_process=100000
for n in range(number_to_process):
    x=np.random.randn(npts)
    for i in range(npts):
        dm.data[i]=x[i]
        do.data[i]=x[i]
    t0=time.time()
    mspass_signals.filter(do,"bandpass", freqmin=0.5, freqmax=2.0,corners=2)
    t1=time.time()
    tobspy += t1-t0
    t0=time.time()
    bf.apply(dm)
    t1=time.time()
    tmspass += t1-t0
print('Processing time for obspy filter=',tobspy)
print('Processing time for mspass filter=',tmspass)

This test yielded this output:

Processing time for obspy filter= 112.68708205223083
Processing time for mspass filter= 0.4334728717803955

Note that the mspass_signals.filter function is a call to the wrappers Jinxin wrote for obspy's filter, while the bf.apply call runs the C++ Butterworth processor.

The numbers aren't totally surprising (the ratio is roughly a factor of 260), since there is significant overhead in the obspy filter from copying between Trace and TimeSeries containers.

More surprising was the result of this second test. Here I called the scipy function obspy uses directly (signal below is scipy.signal):

# this is the scipy filter design function - we call it just once,
# which is totally analogous to calling the Butterworth constructor
sos=signal.butter(2,[0.5,2.0],btype='bandpass',output='sos',fs=20)
tnumpy=0.0
tmspass=0.0
number_to_process=100000
for n in range(number_to_process):
    x=np.random.randn(npts)
    for i in range(npts):
        dm.data[i]=x[i]
    t0=time.time()
# note this requires a workspace to contain the output, f
    f=signal.sosfilt(sos,x)
    t1=time.time()
    tnumpy += t1-t0
    t0=time.time()
    bf.apply(dm)
    t1=time.time()
    tmspass += t1-t0
print('Processing time for numpy filter=',tnumpy)
print('Processing time for mspass filter=',tmspass)

which yielded these numbers:

Processing time for numpy filter= 4.013174533843994
Processing time for mspass filter= 0.33939361572265625

I actually thought they should be comparable. I don't know for sure why they aren't, but I presume it has something to do with the copy operation implicit in f=signal.sosfilt(sos,x). In my Butterworth implementation the data are altered in place, so no copying is needed.
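One way to probe that presumption is to time the output allocation and copy separately from the full filter call. This is only an illustrative sketch, not part of the benchmark above; npts and the iteration count are arbitrary choices, and the timings will vary by machine:

```python
import time
import numpy as np
from scipy import signal

npts = 1000
x = np.random.randn(npts)
sos = signal.butter(2, [0.5, 2.0], btype='bandpass', output='sos', fs=20)

n = 10000
t0 = time.perf_counter()          # finer resolution than time.time()
for _ in range(n):
    y = signal.sosfilt(sos, x)    # full filter call, allocates output
t_filter = time.perf_counter() - t0

t0 = time.perf_counter()
for _ in range(n):
    z = x.copy()                  # just the allocation and copy
t_copy = time.perf_counter() - t0

print(f'sosfilt total: {t_filter:.3f} s, copy-only total: {t_copy:.3f} s')
```

If the copy-only loop accounts for most of the sosfilt time, the copy hypothesis holds; if not, the gap is in the filter kernel itself.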

I'm putting this down for the record as it is a good lesson in performance. It would take a lot of processors thrown at the obspy function before it beats a single-threaded version written in C; at the measured ratio you would need on the order of 260 perfectly parallel workers just to break even. That is a point we will need to be sure to put in our documentation, as it is a basic lesson in high performance computing: fancy wrappers often come at a high cost.

A secondary lesson, not as strongly made by this case, is the advantage of a creation-is-initialization model for defining an algorithm compared to the subroutine model. This isn't the best example because: (a) it is overwhelmed by the copy problem, and (b) the initialization for a Butterworth filter is pretty trivial. It would come up in spades, however, for an algorithm like deconvolution. A naive algorithm that computed an FFT independently for every record would be much slower than the way you (Ian) implemented your frequency domain method, which I adapted for MsPASS. That is, we initialize a bunch of stuff on creation that doesn't have to be recreated when the "process" method (for that algorithm) is called. I think our documentation on designing a new algorithm should address that issue head on.
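As a minimal illustration of that creation-is-initialization point (this sketch uses scipy's filter design as the "expensive" setup step, not the MsPASS deconvolution code), compare an operator class that designs the filter once in its constructor with a subroutine-style function that redoes the design on every call:

```python
import numpy as np
from scipy import signal

class BandpassOperator:
    """Creation is initialization: design the filter once, apply it many times."""
    def __init__(self, freqmin, freqmax, fs, npoles=2):
        # Setup cost is paid exactly once, at construction.
        self.sos = signal.butter(npoles, [freqmin, freqmax],
                                 btype='bandpass', output='sos', fs=fs)

    def apply(self, x):
        # Per-record cost is only the filter application.
        return signal.sosfilt(self.sos, x)

def subroutine_style(x, freqmin, freqmax, fs, npoles=2):
    """Subroutine model: the design work is repeated on every call."""
    sos = signal.butter(npoles, [freqmin, freqmax],
                        btype='bandpass', output='sos', fs=fs)
    return signal.sosfilt(sos, x)

op = BandpassOperator(0.5, 2.0, fs=20)
x = np.random.randn(1000)
# Identical results; the operator simply amortizes the design cost.
assert np.allclose(op.apply(x), subroutine_style(x, 0.5, 2.0, fs=20))
```

For a trivial design step like this the savings are small, but for something like a deconvolution operator that precomputes FFT workspaces the amortization is the whole game.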

JiaoMaWHU commented 3 years ago

Hey Gary, I just read through the two lessons, thanks for bringing them up. I totally agree with your analysis. The conversion does take a lot of time. But I didn't anticipate that the plain python processing would also take that much time... I think we need to consider whether to use C++ or Python to implement our core algorithms, and if we're going to use Python, we will have to do a lot of optimization work.

pavlis commented 3 years ago

Yes, the point we should keep in mind as this package develops is that one always needs to know what is limiting performance when it matters. As always, things done rarely matter little, but things done millions of times, like the filter operations here, matter a lot.

I should also point out that the filter operator using obspy does not use python to do the low-level filter operation; it calls the scipy filter routines, which operate on the array with optimized compiled code (or so I assume). The second benchmark shows scipy's filter operator is intrinsically slower, by about an order of magnitude, than the implementation I stole from seismic unix. I used that algorithm because it was very cleverly written by David Hale at the Colorado School of Mines to minimize operation counts in the filter application loop. I suspect the extra bells and whistles scipy adds to be generic create enough overhead to account for the order of magnitude difference.

The difference between the first and second benchmark is a measure of the conversion overhead we will face in any of the obspy-wrapped processing functions. That is, they all use your common code to convert from a TimeSeries to a Trace, do the operation, and then convert back to a TimeSeries. The core part of that conversion does not use a python copy operation but is done in C. I'm not sure where the residual overhead is, but it is presumably the dict to Metadata conversion and the basic copy operation for the data vectors.
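To make that residual-overhead guess a little more concrete without the actual wrappers, here is a minimal stand-in for the per-call work any such round trip must do at minimum. The header fields are invented and this is only a sketch, not the MsPASS converter:

```python
import numpy as np

# Illustrative stand-in: any TimeSeries <-> Trace round trip must at
# minimum copy a header mapping (the dict -> Metadata analog) and the
# sample vector, in both directions. The field names here are made up.
header = {'delta': 0.05, 'npts': 1000, 'starttime': 0.0}
data = np.random.randn(1000)

def round_trip(header, data):
    h = dict(header)           # header copy going in
    d = data.copy()            # sample vector copy going in
    return dict(h), d.copy()   # and the copies coming back out

h2, d2 = round_trip(header, data)
```

Timing a loop over round_trip against a loop over the filter call itself would show how much of the wrapper gap this minimum work can explain; anything beyond it is bookkeeping in the conversion layer.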