Aloxaf opened this issue 5 years ago
Just to clarify, PIL uses a C version indirectly https://github.com/python-pillow/Pillow/blob/ab9a25d623fdd7f8de3e724b538f5660eac589ae/src/libImaging/BoxBlur.c#L294
It's still somewhat embarrassingly slow.
@HeroicKatora Yes, maybe it would be a good idea to port it to Rust?
Pillow appears to use (a variant on) iterated box filtering, as defined in http://www.mia.uni-saarland.de/Publications/gwosdek-ssvm11.pdf.
There's an open imageproc ticket to implement this: https://github.com/image-rs/imageproc/issues/93
I'll have a look at implementing this over the coming weekend.
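For reference, here is a minimal single-channel sketch of the iterated box filtering idea: a few running-sum box passes per axis approximate a Gaussian, with the box radius derived from sigma. The function names and the exact radius formula below are illustrative only, not Pillow's or imageproc's actual code.

```rust
/// One horizontal box-blur pass using a running sum, so the per-pixel cost
/// does not depend on the radius (clamp-to-edge boundary handling).
fn box_blur_row(src: &[f32], dst: &mut [f32], radius: usize) {
    let w = src.len();
    let norm = 1.0 / (2 * radius + 1) as f32;
    // Window sum for x = 0.
    let mut sum = 0.0f32;
    for i in -(radius as isize)..=(radius as isize) {
        sum += src[i.clamp(0, w as isize - 1) as usize];
    }
    for x in 0..w {
        dst[x] = sum * norm;
        // Slide the window one pixel to the right.
        let leaving = (x as isize - radius as isize).clamp(0, w as isize - 1) as usize;
        let entering = (x + radius + 1).min(w - 1);
        sum += src[entering] - src[leaving];
    }
}

/// Box radius such that `passes` box blurs approximate a Gaussian of `sigma`,
/// from the relation sigma^2 ≈ passes * ((2r + 1)^2 - 1) / 12.
fn box_radius_for_sigma(sigma: f32, passes: u32) -> usize {
    let ideal_width = (12.0 * sigma * sigma / passes as f32 + 1.0).sqrt();
    ((ideal_width - 1.0) / 2.0).round().max(0.0) as usize
}

/// Horizontal half of the blur; the vertical half would run the same pass
/// per column (or transpose the image and reuse this one).
fn iterated_box_blur_rows(pixels: &mut [f32], width: usize, sigma: f32) {
    let radius = box_radius_for_sigma(sigma, 3);
    let mut tmp = vec![0.0f32; width];
    for _ in 0..3 {
        for row in pixels.chunks_exact_mut(width) {
            box_blur_row(row, &mut tmp, radius);
            row.copy_from_slice(&tmp);
        }
    }
}
```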
There is actually a linear-complexity blur already implemented in Rust, where execution time is independent of blur radius: https://github.com/fschutt/fastblur
It was copied into the resvg codebase and polished to match Chrome's blur. You can find it in this file, search for box_blur. It should be fairly easy to extract it back into a common crate and/or copy it into image.
You may want to take a look at the code here. It was originally based on a different paper, written by the author of AKAZE, ported to Rust by someone else, and then modified later by me (so it has been touched by many hands). It makes use of the following papers:
It is actually not used for Gaussian blur in akaze itself. I am also not 100% sure it can help speed up Gaussian blur, but I think it can based on what I have seen of its use in akaze. I am no expert. The way this algorithm works is that it determines a number of diffusion steps that start small and grow bigger as the stability increases (which may not apply to Gaussian blur?). See this file for how the specialized blur is being applied (with ndarray). I was able to get pretty good speed by writing my filters with ndarray like that. My situation is a bit more specific to this library, but the code is free for you to take.
Also noticed this while comparing the Wasm version of image::imageops::blur and the JS implementation of Gaussian blur from https://github.com/nodeca/glur.
At high radius values the difference becomes ridiculous - e.g. for a 3872x2592 image and radius 300px the image crate's version executes in 94 seconds, while the JS version executes in 0.6 seconds (same as for any other radius).
Having a linear implementation in the image crate would help a lot. fastblur from @Shnatsel's link looks promising, but it would be nice to integrate it into image or imageproc instead of relying on a 3rd-party dependency that is somewhat harder to discover.
If anyone is interested: I created a first draft to resolve this issue in https://github.com/image-rs/image/pull/2302 and I am willing to finalize this.
FWIW a new crate with multiple fast blur implementations was just released: https://github.com/awxkee/libblur
It relies on unsafe SIMD intrinsics, so it might not be usable as-is for image due to the no-unsafe policy, but we may be able to uplift parts of it or perhaps reuse the fallback implementations.
I can extract a part of the Gaussian blur just for the image crate with forbid(unsafe_code), if that makes sense.
I reworked the convolution part without unsafe, and the situation looks like this for a 5000x4000 RGB image:
fast_blur sigma 5: 793.836375ms
pure gaussian_blur 5 kernel size: 30.273625ms
fast_blur sigma 15: 803.334958ms
pure gaussian_blur 15 kernel size: 60.557ms
fast_blur sigma 35: 807.447583ms
pure gaussian_blur 35 kernel size: 133.914708ms
fast_blur sigma 151: 811.069792ms
pure gaussian_blur 151 kernel size: 495.628208ms
fast_blur sigma 251: 847.268125ms
pure gaussian_blur 251 kernel size: 875.879ms
pure gaussian_blur 251 kernel size with rayon: 125.079042ms
Pure Gaussian blur parallelizes quite well, so adding rayon instantly gives about a 10x speedup, and multi-threading is preferable for blurs.
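For illustration, the per-row parallelism amounts to something like this (blur_row_in_place is a placeholder for whichever single-row pass is used, not an existing function in image):

```rust
use rayon::prelude::*;

/// Rows of an interleaved buffer are independent, so a per-row blur pass can
/// be distributed across threads with rayon's parallel chunk iterator.
fn blur_rows_parallel(pixels: &mut [f32], row_len: usize, sigma: f32) {
    pixels
        .par_chunks_mut(row_len) // one contiguous row per task
        .for_each(|row| blur_row_in_place(row, sigma));
}

/// Placeholder for a single-row pass (box, stack, or true Gaussian).
fn blur_row_in_place(row: &mut [f32], _sigma: f32) {
    let _ = row; // ... blur `row` horizontally in place ...
}
```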
Perhaps I can also add stack blur, which is O(1) per pixel with respect to radius and should be roughly 5-10x faster than the CLT-based blur used here in fast_blur, as far as I can tell.
Also, CLT-based blurs converge very quickly and give little control over the blur strength, if you've noticed; images become completely blurred rather abruptly.
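To make the "O(1) per pixel" claim concrete, here is a rough single-row, single-channel sketch of the stack blur window: a triangular kernel kept up to date with two running sums, so each output pixel costs a constant amount of work regardless of the radius. This is an illustration of the idea, not the code from libblur.

```rust
/// One horizontal stack-blur pass (single channel, clamp-to-edge borders).
fn stack_blur_row(src: &[f32], dst: &mut [f32], r: usize) {
    let w = src.len() as isize;
    let at = |i: isize| src[i.clamp(0, w - 1) as usize];
    let div = ((r + 1) * (r + 1)) as f32; // sum of the triangular weights

    // Initialise the window around x = 0.
    let (mut sum, mut sum_in, mut sum_out) = (0.0f32, 0.0f32, 0.0f32);
    for i in -(r as isize)..=(r as isize) {
        let p = at(i);
        sum += p * (r as f32 + 1.0 - i.unsigned_abs() as f32);
        if i <= 0 { sum_out += p; } else { sum_in += p; }
    }

    for x in 0..src.len() {
        dst[x] = sum / div;
        let xi = x as isize;
        // Drop the outgoing half, shift it right, and grow the incoming half.
        sum -= sum_out;
        sum_out -= at(xi - r as isize);
        let entering = at(xi + r as isize + 1);
        sum_in += entering;
        sum += sum_in;
        // The pixel just right of the old centre moves from "in" to "out".
        let new_centre = at(xi + 1);
        sum_out += new_centre;
        sum_in -= new_centre;
    }
}
```

A full stack blur would run this pass over every row and then every column (and per channel), which is where the multi-threading above pays off.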
Here is the stackblur time for the same image, with SIMD and multi-threading disabled:
stackblur radius 151: 168.549ms
Even with forbid(unsafe_code) I wouldn't expect it to slow down by 8x.
Image's fast_blur was contributed very recently and has not received a great deal of attention yet. I'd be happy to see a PR migrating it to a faster algorithm.
I am somewhat concerned about the growing implementation complexity, but I think we could take it on for common and performance-sensitive operations such as resizing and blurring.
All of this actually sounds pretty unreasonable to me :)
I don't think blur is that common; it might be somewhat widespread, but I wouldn't call it common.
And as for a PR against this crate, with its pixel storage, organization, and access patterns, that is something I'd definitely rather stay away from; the amount of required code is not small, so the complexity would grow noticeably.
The blur operation isn't that common, but it usually does become a bottleneck whenever it crops up, so I think optimizing it is still worthwhile.
I may agree.
But the pixel storage, the lack of truly random access via iterators, and the lack of generic traits are still an issue.
I may publish a separate crate, but having to deal with the crate's inherent problems makes that immediately unreasonable.
If you could list the exact problems with the crate's API that prevent this kind of algorithm from being written on top of it in https://github.com/image-rs/image/issues/2358, that would be very helpful!
We're looking to improve the API and ship v1.0 sometime in the foreseeable future, so info on what exactly is missing right now would be much appreciated.
Raw row data must be fully accessible, either by each color component separately or by full pixel set, e.g. full RGBA (this part is optional; if each color component is accessible, that is completely fine), via an iterator without any checks, and this call must be #[inline(always)].
This can be done with a contiguously allocated vector, which is pretty common and very nice.
Or with row pointers, if necessary; it is very good sometimes, and sometimes it is not, and it might confuse some who are not familiar with it. If you're not familiar with it, it is an approach where, instead of contiguous memory, you can have gaps in the layout and store only a reference to the start of each row, so your real image is a slice of row references, &[&[T]].
This would at least make performant methods like this one possible, and it would actually prevent some people from doing bad things, because someone who is starting out will search for how to do this and will find this one.
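Roughly, the two layouts described above look like this (the type and method names are illustrative, not a proposed image API):

```rust
/// Contiguous layout: one allocation, row-major, interleaved RGBA components.
struct ContiguousImage {
    data: Vec<f32>,
    width: usize, // in pixels
}

impl ContiguousImage {
    /// Iterate over whole rows as plain slices; a filter holding the row slice
    /// needs no further per-pixel bounds checks.
    #[inline(always)]
    fn rows<'a>(&'a self) -> impl Iterator<Item = &'a [f32]> + 'a {
        self.data.chunks_exact(self.width * 4)
    }
}

/// Row-pointer layout: rows may live anywhere (gaps in the layout are fine);
/// the image itself is just a slice of row references, &[&[T]].
struct RowPointerImage<'a> {
    rows: Vec<&'a [f32]>, // each entry is one row of `width * 4` components
}

impl<'a> RowPointerImage<'a> {
    #[inline(always)]
    fn row(&self, y: usize) -> &'a [f32] {
        self.rows[y]
    }
}
```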
For any real algorithm, all the generic traits must also be implemented. And "all" is a very vague definition, because sometimes you'll need bitxor or a right shift, sometimes fma when the device has it, and sometimes a natural logarithm or π from a trait.
At the very least, all basic arithmetic must be covered.
The thing is that for a convolution, for example, you'll have to access each pixel many, many times: for blurring with kernel size = 151, for each pixel in the image you'll have to access at least 150+150 surrounding pixels, and if you can't optimize this, it is the end.
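As an illustration of the kind of bounds meant here, a generic 1-D convolution could be written against num-traits (an assumption on my part; image exposes its own Primitive trait rather than these exact bounds):

```rust
use num_traits::Float;

/// Single-channel 1-D convolution, generic over the scalar type.
/// `Float` bundles basic arithmetic plus things like `ln` and constants,
/// which is close to the minimum a real filter needs from the pixel type.
fn convolve_1d<T: Float>(src: &[T], kernel: &[T]) -> Vec<T> {
    let half = kernel.len() / 2;
    (0..src.len())
        .map(|x| {
            kernel
                .iter()
                .enumerate()
                .fold(T::zero(), |acc, (k, &weight)| {
                    // Clamp the tap index to the edge so every access stays in bounds.
                    let i = (x + k).saturating_sub(half).min(src.len() - 1);
                    acc + src[i] * weight
                })
        })
        .collect()
}
```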
Yes, GenericImageView not allowing viewing rows is a known issue: #2300
Thank you for the input! I've linked your message from the relevant issues on the bug tracker. We'll see what we can do about these limitations.
Yup, if this is reasonable, I can still do the blur in a separate crate.
Because fixing all these limits can take years :)
We're hoping to start a concerted effort on fixing the API sometime around January. But we won't know if they're any good until someone builds a complex image processing algorithm on top of them. Would you be available then to kick the tires of the new APIs and try to port stack blur to them?
Yes, sure; once the API looks at least worth trying, I can try to build something on top of it. I wouldn't say stack blur is complex, but OK :)
Let me know if I can help.
blur is too slow; imageproc::filter::gaussian_blur_f32 is faster than it, but still slow.

Reproduction steps: (code snippet not preserved). The Rust result is (timing not preserved), and the Pillow result is 0.1433885097503662s.

Final images: rust_1.png (it's a little strange...), rust_2.png, python.png
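The original snippet is missing above; a hypothetical re-creation of that kind of comparison could look like this (the file name and sigma are assumptions, not the reporter's actual values):

```rust
use std::time::Instant;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let img = image::open("input.png")?.to_rgb8();

    let t = Instant::now();
    let blurred = image::imageops::blur(&img, 10.0);
    println!("image::imageops::blur:       {:?}", t.elapsed());
    blurred.save("rust_1.png")?;

    let t = Instant::now();
    let blurred = imageproc::filter::gaussian_blur_f32(&img, 10.0);
    println!("imageproc gaussian_blur_f32: {:?}", t.elapsed());
    blurred.save("rust_2.png")?;

    Ok(())
}
```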