The issue seems to be that the following pattern
y = np.fft.fftshift(np.pad(x, pad_width=10000, mode='constant'), axes=(0,1))
becomes very inefficient for large images (memory allocations?). Assigning before shifting i.e.
xpad = np.pad(x, pad_width=10000, mode='constant')
y = np.fft.fftshift(xpad, axes=(0,1))
goes some way towards fixing the issue, but this still repeatedly allocates memory, so I have come up with custom pad_and_shift routines that reuse pre-initialised arrays. This seems to have sped things up significantly.
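Roughly, the routine does something like the following (a simplified sketch rather than the actual code: the helper name, the even-size assumption and padding at the end of each axis are illustrative):

import numpy as np

def pad_and_shift(x, xpad, xshift):
    # Zero-pad x into the pre-allocated xpad, then fftshift into the
    # pre-allocated xshift, without allocating any temporaries.
    # Assumes xpad and xshift have even sizes along both axes.
    nx, ny = x.shape
    xpad[...] = 0.0
    xpad[:nx, :ny] = x
    hx, hy = xpad.shape[0] // 2, xpad.shape[1] // 2
    # fftshift by explicit quadrant copies (shift by half the axis length)
    xshift[:hx, :hy] = xpad[hx:, hy:]
    xshift[:hx, hy:] = xpad[hx:, :hy]
    xshift[hx:, :hy] = xpad[:hx, hy:]
    xshift[hx:, hy:] = xpad[:hx, :hy]
    return xshift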
The c2r call still takes twice as long as r2c but things seem more balanced now. This should be good enough for the time being but I'll keep this one open because it's always nice to go faster.
As discussed one more thing to try (and it does not look like numpy support this) is to expressly ensure that the memory allocation starts at a multiple of the system page size - that way you don't thrash the page cache that often by striding reads over the page boundary. If I recall correctly now from my days doing this in C++ the solution is to over-allocate by at least a page worth of memory and then offset the start pointer to an address that is an integral multiple of the page boundary. This raw pointer can then be wrapped inside an ndarray structure for use in numpy.
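Something along these lines in pure numpy (sketch only; the helper name and the 4096-byte page size are assumptions, not an existing API):

import numpy as np

def page_aligned_empty(shape, dtype=np.float64, page_size=4096):
    # Over-allocate by one page, then offset into the buffer so that the
    # returned view starts at an address that is a multiple of page_size.
    nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
    buf = np.empty(nbytes + page_size, dtype=np.uint8)
    offset = (-buf.ctypes.data) % page_size
    return buf[offset:offset + nbytes].view(dtype).reshape(shape)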
The alternative (which I have seen makes a huge difference in terms of set and copy operations) is to switch the system to use hugepages -- then there is much less chance of striding the page boundary. The change can be made permanent in grub, but to get it temporarily until next reboot do: echo always > /sys/kernel/mm/transparent_hugepage/enabled
Also remember to rebuild numpy to use native instruction set to have any chance of vectorized instruction set usage: CFLAGS="-O3 -march=native" pip install --no-binary :all: numpy==1.13.3 #same as system
maybe @mreineck has some thoughts or may have written an allocator to do page aligned memory wrapping with numpy
Thanks @bennahugo. I will give that a go. I see there is a ducc0.misc.make_noncritical which may be related
In this context (heavy accesses to large arrays) I typically see two big sources of slowdown: array strides that are a multiple of the critical stride (4096 bytes here), for which ducc0 has the make_noncritical method which simply allocates arrays that are slightly too large if the strides would be problematic otherwise; and the repeated allocation of large temporary arrays, for which I set export MALLOC_TOP_PAD_=8000000000 and see significant speedups from that. (Note the trailing underscore!)
I'm not aware of problems that can happen when memory accesses cross page boundaries ... for large arrays they do that all the time, no matter whether the array starts on a page boundary or not. Can you give me a pointer to a discussion of this effect?
Thanks @mreineck -- I'm thinking of expressly avoiding too many strided accesses on large arrays, so it is good to know you have an allocator for this. Maybe @landmanbester should try this upfront before resorting to any of the more involved workarounds. Interesting, I wonder if the hugepage setting has an influence on how often the temporary arrays are given back -- writing things in numpy usually has the negative side effect of not doing things in place.
I wonder if the hugepage setting has an influence on how often the temporary arrays are given back
I think this is entirely controlled by environment variables for the allocator in use (see https://www.gnu.org/software/libc/manual/html_node/Malloc-Tunable-Parameters.html for glibc, but you may have to use something else depending on your compiler).
I'm thinking of expressly avoiding too many strided accesses on large arrays.
That is not always possible. Think about FFTing the leading dimension of a 1024x1024 array ... unless you do array transposition (which has its own complications), there will be tons of really nasty strided memory accesses during the operation. These accesses will make the FFT over the leading dimension much slower than over the last dimension:
In [1]: import numpy as np
In [2]: import ducc0
In [3]: a=np.zeros((1024,1024),dtype=np.complex128)
In [4]: %timeit ducc0.fft.c2c(a,out=a,axes=(0,))
7.04 ms ± 4.24 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [5]: %timeit ducc0.fft.c2c(a,out=a,axes=(1,))
4.29 ms ± 4.59 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [6]: a=ducc0.misc.make_noncritical(a)
In [7]: %timeit ducc0.fft.c2c(a,out=a,axes=(0,))
4.01 ms ± 2.78 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [8]: %timeit ducc0.fft.c2c(a,out=a,axes=(1,))
4.24 ms ± 3.08 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
One last thing: the stride of 4096 I mentioned above has nothing to do with the system's memory page size; the numbers are only equal by pure coincidence. Critical stride depends on cache size, cache line length and cache associativity. Unfortunately there is not much documentation available online, but a quick demo is given at http://blog.cs4u.us/2015/07/observing-cache-associativity-effects.html.
Thanks for the input @mreineck (and apologies for my tardy reply; I thought I posted this message a while ago but must have forgotten to hit the comment button). I definitely see more consistent timings with make_noncritical. I do however observe that c2r takes about double the time of r2c. Is this normal?
If you do a c2r transform that affects more than one axis, then it will be somewhat slower than the analogous r2c one, since it has to allocate a temporary array for storing intermediate results. I'm not sure how much slowdown this causes, but a factor of 2 feels somewhat high.
I'll think about this again ... maybe there is a way around this temporary array now (the FFT library has been quite extensively reworked over the last years).
You could do the following as an experiment: instead of
c2r(a, axes, out=b)
you could try something along the lines of
tmp = ducc0.misc.make_noncritical(np.empty_like(b)) # do this somewhere early, not for every transform
c2c(a, axes[:-1], out=tmp)
c2r(tmp, axes[-1], out=b)
If that improves performance, then the bottleneck is indeed the allocation of the temporary inside c2r.
... and even in the second profile, padding and shifting is still taking way too much time; almost as long as an FFT. I'm sure this can be improved, and I'm thinking of adding this sort of functionality to ducc, but it will take a bit.
Yeah this totally surprised me. It's a bit of a computationally awkward thing to do right? Here is my hacky attempt with numba
Pre-allocating and making things non-critical definitely helps a lot but I didn't expect it to take as much time as it does (theoretically this operation should have less than linear complexity). I'm not sure where the time goes, I'll have to dig a bit deeper.
I have been thinking about how cool it would be to be able to call ducc functions from within numba (this is related) but I think that may be non-trivial (pinging @sjperkins as he had some ideas on the topic). It would certainly be cool if you could do the padding and shifting inside ducc but I also understand how this kind of complexity doesn't really fit well with the keep it simple philosophy.
I did however note the convolve_axis function and I am wondering, since we are convolving with the PSF all the time, if it may make sense to have a 2D implementation of that (I don't think the Fourier transformed PSF always has an outer product representation). As always open to suggestions and very willing to test whatever you come up with.
Yeah this totally surprised me. It's a bit of a computationally awkward thing to do right?
In principle it should not be too hard, but there are many details that can go wrong efficiency-wise. Your Numba implementation looks like it should be faster than what you measure; no idea what the JIT compiler is doing there.
It would certainly be cool if you could do the padding and shifting inside ducc but I also understand how this kind of complexity doesn't really fit well with the keep it simple philosophy.
I don't think that would be an issue. This would still be standalone functions, doing a small and well-defined job, so it doesn't really clash with the simplicity goal.
I did however note the convolve_axis function and I am wondering, since we are convolving with the PSF all the time, if it may make sense to have a 2D implementation of that (I don't think the Fourier transformed PSF always has an outer product representation).
convolve_axis is specifically designed to work with 1D kernels, and practically all of its advantages would go away if it were generalized to a non-separable multi-D kernel. In such a case I think that standard convolution via FFT is still the best thing to do; perhaps with some overlap-and-add trickery, if the kernel is really small.
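(For context: "standard convolution via FFT" here is the usual zero-pad / transform / multiply / inverse-transform pattern; a minimal numpy sketch, names illustrative.)

import numpy as np

def fft_convolve2d(img, kernel):
    # Linear convolution of two real 2D arrays via zero-padded FFTs.
    ny = img.shape[0] + kernel.shape[0] - 1
    nx = img.shape[1] + kernel.shape[1] - 1
    prod = np.fft.rfft2(img, s=(ny, nx)) * np.fft.rfft2(kernel, s=(ny, nx))
    return np.fft.irfft2(prod, s=(ny, nx))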
I'm wondering: if this is about convolutions of real-valued arrays, perhaps you can avoid the entire r2c/c2r issue by just using Hartley transforms instead. Then you have real arrays even in the Fourier domain and can do more operations in-place. I haven't thought about it much yet, but replacing both r2c and c2r with r2r_genuine_hartley could "just work" (and if that works, we can perhaps even go to r2r_separable_hartley which might be even faster, but that requires some extra tricks).
I did wonder at some point if there would be anything to gain from using Hartley instead of Fourier transforms. I think the slight difference in the convolution formula put me off too quickly. I will look into this again. Thanks @mreineck
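(For reference, the "slight difference" amounts to needing the reversed-frequency terms in the Hartley-domain product; below is a minimal 1D sketch based on the standard DHT convolution theorem. The helper name is illustrative and this has not been checked against ducc0's conventions.)

import numpy as np
import ducc0

def hartley_convolve_1d(x, y):
    # Circular convolution of two real 1D arrays via Hartley transforms:
    # Z(k) = 0.5*(X(k)*(Y(k)+Y(-k)) + X(-k)*(Y(k)-Y(-k))), indices mod N.
    X = ducc0.fft.r2r_genuine_hartley(x, axes=(0,))
    Y = ducc0.fft.r2r_genuine_hartley(y, axes=(0,))
    Xr = np.roll(X[::-1], 1)   # X(-k)
    Yr = np.roll(Y[::-1], 1)   # Y(-k)
    Z = 0.5 * (X * (Y + Yr) + Xr * (Y - Yr))
    # the Hartley transform is its own inverse up to a factor of N
    return ducc0.fft.r2r_genuine_hartley(Z, axes=(0,)) / x.shape[0]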
There's one more option: I can introduce a variant of c2r which is allowed to overwrite its input array with the intermediate result; that would also eliminate the temporary variable.
Yeah this totally surprised me. It's a bit of a computationally awkward thing to do right? Here is my hacky attempt with numba
I agree that if you want very good performance, things like page and cache sizes start to matter. This means reading and writing chunks of consecutive data, probably trying to do a page of reads + writes at a time. See, e.g., the efficient memory transpose strategies used in CUDA. Also note how averaging is performed in steps of 128 baselines in the katdal averaging code.
It looks like your numba implementation is making 4 random (non-consecutive in memory) read accesses and then performing 4 random write accesses, which probably won't be great for cache performance. So to make this faster, I think we'd have to re-arrange the loops, reads and writes. This is also going to depend on whether that is possible with an fftshift. @mreineck, please correct me here as you've probably done a lot of this. Perhaps we can adapt some of ducc0's C++ code into numba?
re: calling the ducc0 code from numba, I'm aware of two strategies:
https://labs.quansight.org/blog/2021/10/cxx-numba-interoperability. Quoting from the blog post
We provide here a Python script cxx2py.py that auto-generates, from a user-supplied C++ header and source files, the C/C++ wrapper library as well as a Python ctypes wrapper module. The Python module contains ctypes definitions of C++ library functions that Numba compiled functions are able to call directly without requiring the expensive and redundant object transformations mentioned above. An alternative to using ctypes in the Python wrapper module would be to use the Numba Wrapper Address Protocol - WAP . Its usage should be considered if the ctypes "C++ expressiveness" turns out to be insufficient.
Cons:
To me, it feels like (1) is the direction forward but that depends on whether padding+fftshifting can be done with efficient read/write access patterns, especially if they can be replicated from existing ducc0 code?
There's one more option: I can introduce a variant of c2r which is allowed to overwrite its input array with the intermediate result; that would also eliminate the temporary variable.
The way I am currently doing the convolution (see here) pre-allocates that array. I am not sure what the consequences would be if I allow r2c to create a new array. Maybe it would just move the problem elsewhere? I think I will start by investigating the Hartley transform route
It looks like your numba implementation is making 4 random (non consecutive in memory) read accesses and then performing 4 random write accesses, which probably won't be great for cache performance. So to make this faster, I think we'd have to re-arrange the loops, reads and writes.
I think the code is OK as it is: there are 8 data "streams" (4 reading, 4 writing) that each access memory in a linear fashion. The memory system should be able to handle that, as long as the offsets between the individual streams are not critical strides (and I think we made sure that they aren't).
It's of course still possible that Numba will do much better on individual loops, each with a single input and output stream.
Concerning calling ducc from numba: I don't really see where this would help much. Ducc routines typically do fairly big chunks of work in a single call, and as long as this is the case, there should be no need to call them from micro-optimized, precompiled code, right?
Please let's not get cython or ctypes into the mix ... I'm more than happy that pybind11 allows me to avoid them ;-)
I think there's probably a lot that could be gained from doing the padding and shifting in numba with cache/page efficient access, and then possibly doing it in multiple threads. This view might be incorrect if the fftshift demands a random-to-random access pattern. @landmanbester + @mreineck please tell me if I'm wrong here.
fftshift has no bad memory access patterns. When array dimensions are even, the operation is completely trivial; when they are odd, it's a bit more involved but still doable. When padding and shifting happens at the same time, things actually become easier, since this is by definition an out-of-place operation.
I'm missing a lot of context here, but I'm sensing that @mreineck would prefer not to support the padding+fftshift op in ducc0.
Actually I'm fine with having this in ducc0. It's just that I'm such a perfectionist that I'll need some time before I have found the "best" possible interface for it.
Or perhaps he'd rather not support a pad+fftshift+grid op?
That's correct. I prefer having separate functions doing small tasks to having huge monolithic functions with dozens of options that can do everything. It's not always avoidable, but in this case I think it is.
Would a pad+fftshift op producing data that can be fed into a grid op be palatable?
Yes, I'm pretty sure.
The way I am currently doing the convolution (see here) pre-allocates that array. I am not sure what the consequences would be if I allow r2c to create a new array. Maybe it would just move the problem elsewhere? I think I will start by investigating the Hartley transform route
I think the new approach should work perfectly fine in this context. The only difference would be that xhat would get overwritten with garbage by the c2r call, and I don't think that would hurt anyone.
Thanks for the input @sjperkins. I certainly don't want to restrict to binary wheels; I actually thought that jitting would bypass this issue. I see pretty major speedups building ducc from source on certain architectures. The other obvious reason for wanting ducc functionality within numba is to avoid the python interpreter. For example, one reason for sticking with numpy's FFTs instead of scipy's (see https://github.com/ratt-ru/codex-africanus/issues/211) is that np.fft is available in numba. However, in most cases where I can only partially jit a function I do quite a bit of work between interpreter visits, so it may not help much. Cool as it may be to have that functionality, it seems non-trivial and may be a bit off topic for the current discussion.
Actually I'm fine with having this in ducc0. It's just that I'm such a perfectionist that I'll need some time before I have found the "best" possible interface for it.
Eager to see what you come up with, thanks.
I think the new approach should work perfectly fine in this context. The only difference would be that xhat would get overwritten with garbage by the c2r call, and I don't think that would hurt anyone.
Yeah that would be perfectly fine here. As long as the array survives I think its content after the c2r call is irrelevant.
It looks like your numba implementation is making 4 random (non consecutive in memory) read accesses and then performing 4 random write accesses, which probably won't be great for cache performance. So to make this faster, I think we'd have to re-arrange the loops, reads and writes.
I think the code is OK as it is: there are 8 data "streams" (4 reading, 4 writing) that each access memory in a linear fashion. The memory system should be able to handle that, as long as the offsets between the individual streams are not critical strides (and I think we made sure that they aren't).
I just want to check: would it not be better to read in a cache line of, say 128 bytes with spatial locality in consecutive operations than to issue 4 streams of operations, even if each stream has spatial locality?
It's of course still possible that Numba will do much better on individual loops, each with a single input and output stream.
I'm thinking of reading in a cache line of 128 bytes and writing them out in 128-byte cache lines, for example. This is important for CUDA and, from what I've seen, for general CPUs too, but you'll have more XP on this front.
Concerning calling ducc from numba: I don't really see where this would help much. Ducc routines typically do fairly big chunks of work in a single call, and as long as this is the case, there should be no need to call them from micro-optimized, precompiled code, right?
Please let's not get cython or ctypes into the mix ... I'm more than happy that pybind11 allows me to avoid them ;-)
Very much agree :-). I also love pybind11 but haven't found an excuse to use it.... thus far.
I think there's probably a lot that could be gained from doing the padding and shifting in numba with cache/page efficient access, and then possibly doing it in multiple threads. This view might be incorrect if the fftshift demands a random-to-random access pattern. @landmanbester + @mreineck please tell me if I'm wrong here.
fftshift has no bad memory access patterns. When array dimensions are even, the operation is completely trivial; when they are odd, it's a bit more involved but still doable. When padding and shifting happens at the same time, things actually become easier, since this is by definition an out-of-place operation.
Without deeply understanding what's involved here, I suspect that doing something optimal w.r.t. to cache lines and page sizes might be possible in numba.
I'm missing a lot of context here, but I'm sensing that @mreineck would prefer not to support the padding+fftshift op in ducc0.
Actually I'm fine with having this in ducc0. It's just that I'm such a perfectionist that I'll need some time before I have found the "best" possible interface for it.
Ok, I'm sure something will reveal itself after stewing in the unconcious :-)
Or perhaps he'd rather not support a pad+fftshift+grid op?
That's correct. I prefer having separate functions doing small tasks to having huge monolithic functions with dozens of options that can do everything. It's not always avoidable, but in this case I think it is.
I absolutely understand that in the context of C++. We've been spoilt by numba, which allows us to concoct unholy things like the Monolithic RIME. This is only possible in the context of a dynamic language like Python.
Would a pad+fftshift op producing data that can be fed into a grid op be palatable?
Yes, I'm pretty sure.
This also sounds like another possible avenue? To me it seems like there are two pragmatic options: pad+fftshift in numba or ducc0.
Thanks for the input @sjperkins. I certainly don't want to restrict to binary wheels; I actually thought that jitting would bypass this issue. I see pretty major speedups building ducc from source on certain architectures.
Cool. Note there's also the possibility of building ducc0 and the wrapper during a source installation -- you'd just have to insist on having a C++ compiler installed, which some purists may balk at.
Yeah that would be perfectly fine here. As long as the array survives I think its content after the c2r call is irrelevant
That's exactly what happens (the array survives, but its content is modified). I have an implementation ready, and it should be available with the next release.
I just want to check: would it not be better to read in a cache line of, say 128 bytes with spatial locality in consecutive operations than to issue 4 streams of operations, even if each stream has spatial locality?
I'm pretty sure that on the CPU there is no advantage in reading the entire line in one go. The only important thing is that the line is fully read before it is thrown out of the cache by something else.
If you have very many streams (say, beyond 20 or so), there may be unrelated problems that slow things down ... for example, the CPU will run out of integer registers to store all the stream pointers.
How about
def roll_resize_roll(in, r1, r2, out)
? This behaves as if it rolls in by r1, putting the result into tmp, then copies tmp into out, padding/truncating it as necessary, and finally rolls out in-place by r2. In reality it just does one complicated set of nested loops, possibly parallelized, and touches every array entry at most once. I think this should cover all use cases you need, @landmanbester.
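If the semantics above are read literally, a slow numpy reference for the padding case (output at least as large as the input, data placed at the start of each output axis, as in the suggested usage further down) would be roughly:

import numpy as np

def roll_resize_roll_reference(arr_in, arr_out, r1, r2):
    # Roll the input by r1, copy it into the (zeroed) larger output,
    # then roll the output by r2 -- all axes at once.
    axes = tuple(range(arr_in.ndim))
    tmp = np.roll(arr_in, r1, axis=axes)
    arr_out[...] = 0
    arr_out[tuple(slice(0, n) for n in arr_in.shape)] = tmp
    arr_out[...] = np.roll(arr_out, r2, axis=axes)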
I have a candidate at https://gitlab.mpcdf.mpg.de/mtr/ducc/-/tree/roll_experiments; the function is called ducc0.misc.roll_resize_roll. If it's not too hard to get this branch into your environment, please give it a try!
This version also has the improved c2r function with the allow_overwriting_input flag.
Thanks, will do. Sorry for the delay. I've been moving over the weekend but I'll get to it first thing tomorrow morning
This version also has the improved c2r function with the allow_overwriting_input flag.
I can confirm that c2r and r2c give near identical timings when the allow_overwriting_input flag is set. Thanks @mreineck
Sorry for this flashback, but I just noticed this comment from further above:
Also remember to rebuild numpy to use native instruction set to have any chance of vectorized instruction set usage: CFLAGS="-O3 -march=native" pip install --no-binary :all: numpy==1.13.3 #same as system
@bennahugo are you really running numpy 1.13 on that machine? I wouldn't touch that with a 10 foot pole anymore ;-)
I have a candidate at https://gitlab.mpcdf.mpg.de/mtr/ducc/-/tree/roll_experiments; the function is called ducc0.misc.roll_resize_roll. If it's not too hard to get this branch into your environment, please give it a try!
I managed to replace my clunky padding and shifting functions with roll_resize_roll for even inputs. For the record, when the array sizes are even, pad_and_shift(xin, xout) is equivalent to roll_resize_roll(xin, xout, (0, 0), (nxo-nxi//2, nyo-nyi//2)) with xin.shape = (nxi, nyi) and xout.shape = (nxo, nyo). unpad_and_unshift(xout, xin) (after a small bug fix) is equivalent to roll_resize_roll(xout, xin, (-(nxo-nxi//2), -(nyo-nyi//2)), (0, 0)). I am still trying to figure out what to do for odd sized arrays (note my numba versions only handled even sized arrays). I'll get to profiling these once I have a working solution. Thanks @mreineck
@bennahugo are you really running numpy 1.13 on that machine? I wouldn't touch that with a 10 foot pole anymore ;-)
I'm on numpy 1.22.0
@landmanbester, I don't think the amount of shift should depend on the differences between the sizes, but only on the size of the array where you do the shifting.
When you call pad_and_shift, where is the "real image center" of the input array, and where should it be after pad_and_shift?
Looking at the psf_convolve_* functions, actually it should not matter at all how much you roll the padded images, as long as you roll them back by the same amount after the convolution. If that is correct, then you only need to worry about the correct amount of rolling before padding and after unpadding.
Ah, I have been really silly. I think the shift operation is not even needed here, only the padding. The only thing that needs the shift is the convolution kernel because I need to get the orientation of the PSF with respect to the dirty image right. The following seems to do exactly the same as the previous version of the function
from ducc0.fft import r2c, c2r  # imports added here for completeness

def psf_convolve_cube(xpad,      # preallocated array to store padded image
                      xhat,      # preallocated array to store FTd image
                      xout,      # preallocated array to store output image
                      psfhat,
                      lastsize,
                      x,         # input image, not overwritten
                      nthreads=1):
    _, nx, ny = x.shape
    xpad[...] = 0.0
    xpad[:, 0:nx, 0:ny] = x
    r2c(xpad, axes=(1, 2), nthreads=nthreads,
        forward=True, inorm=0, out=xhat)
    xhat *= psfhat
    c2r(xhat, axes=(1, 2), forward=False, out=xpad,
        lastsize=lastsize, inorm=2, nthreads=nthreads,
        allow_overwriting_input=True)
    xout[...] = xpad[:, 0:nx, 0:ny]
    return xout
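For completeness, a sketch of how the pre-allocated buffers might be set up before calling this (the cube shape, the factor-of-two padding and the all-ones psfhat are illustrative only; ny_pad//2 + 1 is the r2c output length along the last transformed axis):

import numpy as np
from ducc0.misc import make_noncritical

nband, nx, ny = 8, 1024, 1024          # cube shape (illustrative)
nx_pad, ny_pad = 2 * nx, 2 * ny        # padded image size (illustrative)
lastsize = ny_pad
xpad = make_noncritical(np.zeros((nband, nx_pad, ny_pad)))
xhat = make_noncritical(np.zeros((nband, nx_pad, ny_pad // 2 + 1), dtype=np.complex128))
xout = np.zeros((nband, nx, ny))
psfhat = np.ones((nband, nx_pad, ny_pad // 2 + 1), dtype=np.complex128)  # placeholder; normally the FT of the padded, shifted PSF
x = np.random.rand(nband, nx, ny)
res = psf_convolve_cube(xpad, xhat, xout, psfhat, lastsize, x, nthreads=4)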
only now psf_convolve_cube is totally dominated by the FFTs (the two blocks under psf_convolve_cube are for r2c and c2r)
I still don't fully understand where the additional 16.74 - 6.23 - 6.30 = 4.21 seconds are going but my guess is mem_copies. Not sure how to drill down deeper into that but maybe this is deep enough (although suggestions welcome). Thanks for all the input @mreineck. I'll still need the roll_resize_roll function when I need to keep track of what things look like in Fourier space.
Great, that looks much better!
I still don't fully understand where the additional 16.74 - 6.23 - 6.30 = 4.21 seconds are going but my guess is mem_copies.
Copies, initialization of xpad and the multiplication with psfhat. That touches a lot of memory, and it's (I think) not parallelized. Still, even if that was optimized even more, you won't be able to get more than around 20% speedup, and that's probably not worth it.
Hmmm ... you could try to replace
xpad[...] = 0.0
xpad[:, 0:nx, 0:ny] = x
by
ducc0.misc.roll_resize_roll(x,xpad,(0,0,0),(0,0,0),nthreads)
and
xout[...] = xpad[:, 0:nx, 0:ny]
by
ducc0.misc.roll_resize_roll(xpad,xout,(0,0,0),(0,0,0),nthreads)
It might help a bit, but I don't know.
Yep, I tried that. It makes almost no difference. Pretty sure the PSF convolution was the original issue, and since that's fairly well optimized now I will close this one.
My clean implementation seems a bit slow when image sizes get large. I suspect that something weird happens when I go past a certain size (maybe cache size related?). This is an attempt to figure out where the time is spent. Here are the most relevant entries from cProfile
Interestingly, c2r (complex to real FFT) seems to take about twice as long as r2c (the real to complex FFT). Also, a disproportionate amount of time is spent in the fftshift and padding routines. I am not sure what to expect from them. I will try on a smaller image and see if the percentage time spent in the different routines remains the same