ramsterdam91 / QWNNM.jl

QWNNM-algorithm in Denoising

Performance Tips #3

Open GiggleLiu opened 3 years ago

GiggleLiu commented 3 years ago
  1. Julia is column major; using A[:,j] is much faster than A[i,:].
  2. Avoid copying data: A[:,j] copies the elements into a new array; using view(A, :, j) is much faster!
  3. Utilize loop fusion: https://julialang.org/blog/2017/01/moredots/ (tips 1-3 are illustrated in the sketch after this list)
  4. These variables are not used? https://github.com/ramsterdam91/QWNNM.jl/blob/462350bc39c2cbaa087249672839c25b35121ef7/src/denoising.jl#L26
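
A minimal sketch (not code from QWNNM.jl; the sizes and function names are mine) showing what tips 1-3 mean in practice, using BenchmarkTools' @btime for the timings:

using BenchmarkTools

A = rand(1000, 1000)

col_copy(A, j) = sum(A[:, j])        # tip 1: contiguous access, but allocates a copy
col_view(A, j) = sum(@view A[:, j])  # tip 2: contiguous and allocation-free
row_copy(A, i) = sum(A[i, :])        # anti-pattern: each element is a full column apart in memory

@btime col_copy($A, 500)
@btime col_view($A, 500)   # fastest of the three
@btime row_copy($A, 500)   # slowest: strided gather across columns

# tip 3: dotted operations fuse into a single loop with no temporaries
x = rand(10^6); y = similar(x)
@. y = 2x^2 + 3x + 1       # one pass; same as y .= 2 .* x.^2 .+ 3 .* x .+ 1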
johnnychen94 commented 3 years ago

I've recently been doing performance optimization on WNNM.

An early version of my implementation can be found at https://github.com/johnnychen94/WNNMDenoise.jl. Right now I get about a 2x boost compared to the original Matlab version. There are still a lot of places that can be optimized, for example, memory allocation and block matching.

Sidenote: Julia's svd uses OpenBLAS while Matlab's svd uses MKL; I've observed that Julia's svd is ~3x slower than Matlab's. So if we take this into account, I might have gotten a 4-6x performance boost.
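
For reference, on Julia >= 1.7 the BLAS backend can be swapped at runtime by loading the MKL.jl package; a minimal sketch (the speedup, if any, is machine-dependent):

using LinearAlgebra
@show BLAS.get_config()   # stock Julia: OpenBLAS

using MKL                 # loading MKL.jl switches the active BLAS/LAPACK to MKL
@show BLAS.get_config()   # should now list MKL

A = rand(512, 512)
svd(A)                    # warm up (compile first)
@time svd(A)              # now dispatches to MKL's LAPACK routines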


My WNNM implementation will be added to https://github.com/JuliaImages/ImageNoise.jl when it's finished. If you're interested, we could add a QWNNM implementation to ImageNoise to deal with color images.

johnnychen94 commented 3 years ago

> Julia is column major; using A[:,j] is much faster than A[i,:].

If you check my implementation, you'll find that almost all of the iteration uses CartesianIndices. This is particularly useful because 1) nested for loops along multiple dimensions can be made into one compact loop, and 2) it "magically" iterates in memory order, which gives the best indexing speed.
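
A toy illustration (not code from WNNMDenoise.jl) of the two styles; both visit elements in column-major order, but the CartesianIndices version stays one compact loop for any number of dimensions:

function sum_nested(A::AbstractMatrix)
    s = zero(eltype(A))
    for j in axes(A, 2), i in axes(A, 1)   # column loop must be outermost
        s += A[i, j]
    end
    return s
end

function sum_cartesian(A::AbstractArray)
    s = zero(eltype(A))
    for I in CartesianIndices(A)           # one loop, iterates in memory order
        s += A[I]
    end
    return s
end

A = rand(100, 100)
@assert sum_nested(A) ≈ sum_cartesian(A) ≈ sum(A)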

> Avoid copying data: A[:,j] copies the elements into a new array; using view(A, :, j) is much faster!

This is generally true, but there's a special case that you might need to be aware of.

A simple implementation of block matching is:

r = CartesianIndex(3, 3)
for p in CartesianIndices(img)  # boundary handling omitted for brevity
    patch_p = img[p-r:p+r] # %1
    # neighbor_hood and dist are placeholders for the search window and
    # the patch distance function
    d = map(neighbor_hood(p, search_window)) do q
        patch_q = img[q-r:q+r] # %2
        dist(patch_p, patch_q)
    end
    minimum(d)  # minimum reduces the collection (min only compares its arguments)
end

Let's talk about %1 and %2 here. There are four possibilities:

  1. Both use copy. %1: img[p-r:p+r], %2: img[q-r:q+r]
  2. Both use view. %1: view(img, p-r:p+r), %2: view(img, q-r:q+r)
  3. Mixed: copy for %1, view for %2. %1: img[p-r:p+r], %2: view(img, q-r:q+r)
  4. Mixed: view for %1, copy for %2. %1: view(img, p-r:p+r), %2: img[q-r:q+r]

Let me give the conclusion first: option 3 is the fastest in most cases. To explain this, we need to be aware of the trade-off of the copy/view strategy:

copy: pays one memory allocation per patch, but the result is contiguous in memory, so repeated reads from it are fast.

view: allocates nothing, but every read goes through an extra layer of index translation, and the data it refers to may not be contiguous.

In the block matching situation, patch_p is used repeatedly in dist(patch_p, patch_q), so it's better to make sure patch_p is memory-contiguous. patch_q, on the other hand, is only used once, so using a view is definitely better there (copy = memory allocation + one pass of indexing; view = one pass of indexing).

The code can be modified a bit to further reduce the memory allocation:

r = CartesianIndex(3, 3)
+ patch_p = zeros(eltype(img), Tuple(2r + one(r))) # %1: preallocate the (7, 7) buffer once
for p in CartesianIndices(img)
-    patch_p = img[p-r:p+r] # %1
+    patch_p .= @view img[p-r:p+r] # %1: refill the buffer in place, no allocation
    d = map(neighbor_hood(p, search_window)) do q
-        patch_q = img[q-r:q+r] # %2
+        patch_q = @view img[q-r:q+r] # %2: used once, so a view suffices (option 3)
        dist(patch_p, patch_q)
    end
    minimum(d)
end

Please do note the difference here: memory allocation for patch_p is a single (7, 7) buffer in the latter case, versus length(img) separate (7, 7) arrays in the former.
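
One way to see this is to measure a single iteration's work in isolation; a small sketch with hypothetical helpers copy_patch / fill_patch! standing in for the old and new %1:

img = rand(128, 128)
r = CartesianIndex(3, 3)
p = CartesianIndex(64, 64)
buf = zeros(Tuple(2r + one(r)))            # the preallocated (7, 7) buffer

copy_patch(img, p, r)       = img[p-r:p+r]                     # old %1
fill_patch!(buf, img, p, r) = (buf .= @view img[p-r:p+r]; buf) # new %1

copy_patch(img, p, r); fill_patch!(buf, img, p, r)  # warm up (compile first)
@show @allocated copy_patch(img, p, r)        # one fresh 7x7 Array per call
@show @allocated fill_patch!(buf, img, p, r)  # 0 bytes: the buffer is reused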

As an example of this copy/view trick, https://github.com/JuliaImages/ImageNoise.jl/pull/12 gives about a 4x performance boost for non-local means.

ramsterdam91 commented 3 years ago

Hi Johnny Chen,

Thank you for so generously giving advice on performance improvement; I really, really appreciate it. I was a little busy yesterday, so I didn't reply to you in time.

I'm still a novice at Julia. Though I am interested in contributing code to the community, that may be a little difficult for me right now; a better time will be once I have the skills. There's still a long way for me to go.

However, you provided an excellent template, WNNMDenoise.jl, for me to learn from. I'm willing to practice my skills under your guidance.

In my opinion, WNNM shares a lot with QWNNM; the only difference is the object of the SVD. QWNNM decomposes a quaternion matrix expressed by complex numbers, so only a small change is needed between them. In my previous experiments, using a pure QSVD could cause loss in the matrix. Maybe this information can help you a little.
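
For concreteness, here is my guess at what "quaternion matrix expressed by complex numbers" means (an assumption about the method, not a reading of QWNNM.jl's code): the standard complex adjoint representation of Q = A + B*j, whose 2n x 2n complex SVD carries the quaternion singular values, each repeated twice:

using LinearAlgebra

n = 4
A = randn(ComplexF64, n, n)   # "simplex" part of Q = A + B*j
B = randn(ComplexF64, n, n)   # "perplex" part

# complex adjoint representation: chi(Q) = [A B; -conj(B) conj(A)]
chi = [A B; -conj.(B) conj.(A)]

S = svdvals(chi)              # 2n singular values, sorted, in near-equal pairs
@show S[1:2:end]              # the n quaternion singular values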

I hope to see (Q)WNNM in JuliaImages soon. You are doing great work for the community.

Thanks again,

Yubing Liu