Performance - Githubissues

theotherphil commented 8 years ago

It's currently pretty poor.

Benchmarking:

Make sure we have a sensible suite of benchmarks for all functions.
Write a tool to profile vs a selection of standard libraries (vlfeat, opencv, etc.).

Profiling and improving existing functions:

Do some!
Understand the cost of Rust operations/what LLVM can optimise away. e.g as well as being syntactically horrific, something like this looks expensive: (*out.get_pixel_mut(x, y))[0] += out.get_pixel(x, y - 1)[0]; I think that in theory this should all be optimised away, but maybe it's not.
Bounds checking, save vs unsafe. If we're doing something other than just iterate through all pixels in order then presumably eliding bounds checking is a lot trickier. Do we want to use unsafe everywhere? That seems a bit... unsafe.
Is there any benefit in exposing raw scanlines via a DirectImage trait (as exposed to GenericImage that just allows access to single pixels)?
Do we need any more abstractions for iterating over pixels/regions to let us get performance benefits safely?
Actually do the work to make all the existing functions reasonably performant.

Idioms/best practises:

Write up what we discovered from the profiling. Clearly document which operations are expensive, and how to avoid common pitfalls.

theotherphil commented 8 years ago

@tafia has been making a lot of large performance improvements recently. It might eventually be nice to write a short performance tutorial/discussion working roughly through the commit history of e.g. one of the filtering functions.

tafia commented 8 years ago

Always working with GenericImage may not be the best strategy for performance.

Knowing we work with a VecBuffer (as stated on readme non-goals) instead would allow us

to use iterators => no need for unsafe in many places
no need to copy GenericImage into ImageBuffer (like here)
eventually allow implementing more iterators instead of new VecBuffers

@theotherphil Is this something I can try working on or you don't want to go in that direction ?

theotherphil commented 8 years ago

I'm conflicted on this. My initial not-really-a-plan was just to wait until specialisation landed in the compiler and have faster versions of functions specialised to images that we know consist of contiguous blocks of memory (i.e. that implemented some ReallyAChunkOfMemory trait).

I'm somewhat reluctant to just switch to requiring ImageBuffers everywhere because this precludes any possible clever future ideas involving lazy images. It also potentially stops anyone using this library with their own image types without first copying all their data. However, maybe it's reasonable to say that ImageBuffers are the only way to efficiently use the library and if you don't care that much about performance you can just implement to/from buffer for your own image types.

I guess we can always change our minds later - we need to be very explicit that we're not guaranteeing stability of any APIs for the medium term future, so that we can switch around our requirements as we find out what does or doesn't work.

For maximum performance we'd probably also want to control image strides and alignment, which I don't think ImageBuffer currently allows.

Ok, there wasn't anything very concrete there, just rambling about vague concerns.

So, more concretely:

I'm not convinced that standardising on ImageBuffer is a great idea. The best structure for image encoding/decoding might not be the best for image processing, and I want to keep the ability to vary our requirements somewhat independently of the concrete types in the image library.
However, allowing for arbitrary GenericImages does impose a huge performance restriction, as you say. I'd be happy for us to start adding extra restrictions on functions for increased performance. The best approach might be to create our own image traits with additional requirements, e.g. ability to provide a row slice, or providing raw access to underlying data, etc.

Does this sound reasonable? This is more work than just switching to ImageBuffer everywhere, and it's not me offering to do the work, so it seems a little unfair to just say "no, do something harder instead".

tafia commented 8 years ago

The best approach might be to create our own image traits with additional requirements, e.g. ability to provide a row slice, or providing raw access to underlying data, etc.

This shouldn't be too complicated, I like it.

theotherphil commented 8 years ago

Ok, great.

Anything that made functions faster on ImageBuffers would be directly useful for my hypothetical new world of custom traits, and getting the API for these traits right will involve having some experience of trying to write optimised functions for buffer-like images. So it's probably worth maintaining an ImageBuffer branch for a while to let us experiment with optimisations that are enabled by restricting everything to ImageBuffers and get a better idea of what our new traits should look like.

theotherphil commented 8 years ago

I've just created a "performance" branch for this.

theotherphil commented 8 years ago

For future reference, here are all the current benchmarks according to my laptop (ignore the "baseline" ones - they're to let us make sure that whatever we add is faster than what's already in image::imagops).

test affine::test::bench_rotate_bilinear                                ... bench:     402,929 ns/iter (+/- 12,354)
test affine::test::bench_rotate_nearest                                 ... bench:     319,127 ns/iter (+/- 52,885)
test affine::test::bench_translate                                      ... bench:     211,600 ns/iter (+/- 12,841)
test contrast::test::bench_equalize_histogram                           ... bench:   2,236,089 ns/iter (+/- 224,182)
test corners::test::bench_is_corner_fast12_12_noncontiguous             ... bench:          21 ns/iter (+/- 2)
test corners::test::bench_is_corner_fast9_9_contiguous_lighter_pixels   ... bench:          21 ns/iter (+/- 2)
test filter::test::bench_baseline_gaussian_stdev_1                      ... bench:   1,720,996 ns/iter (+/- 147,949)
test filter::test::bench_baseline_gaussian_stdev_10                     ... bench:  12,492,048 ns/iter (+/- 518,286)
test filter::test::bench_baseline_gaussian_stdev_3                      ... bench:   4,335,076 ns/iter (+/- 317,193)
test filter::test::bench_box_filter                                     ... bench:   2,813,485 ns/iter (+/- 156,651)
test filter::test::bench_filter3x3_i32_filter                           ... bench:   1,988,750 ns/iter (+/- 87,241)
test filter::test::bench_horizontal_filter                              ... bench:   3,051,673 ns/iter (+/- 227,131)
test filter::test::bench_separable_filter                               ... bench:   2,251,522 ns/iter (+/- 538,489)
test filter::test::bench_vertical_filter                                ... bench:   3,172,381 ns/iter (+/- 142,535)
test haar::test::bench_evaluate_all_filters_10x10                       ... bench:   1,919,664 ns/iter (+/- 456,883)
test integralimage::test::bench_column_running_sum                      ... bench:         778 ns/iter (+/- 123)
test integralimage::test::bench_integral_image                          ... bench:     443,570 ns/iter (+/- 63,085)
test integralimage::test::bench_row_running_sum                         ... bench:         596 ns/iter (+/- 51)
test regionlabelling::test::bench_connected_components_eight_chessboard ... bench:     923,307 ns/iter (+/- 112,500)
test regionlabelling::test::bench_connected_components_four_chessboard  ... bench:     472,306 ns/iter (+/- 147,901)
test suppress::test::bench_local_maxima_dense                           ... bench:      27,985 ns/iter (+/- 5,060)
test suppress::test::bench_local_maxima_sparse                          ... bench:      36,700 ns/iter (+/- 13,368)
test suppress::test::bench_suppress_non_maximum_decreasing_gradient     ... bench:       1,297 ns/iter (+/- 642)
test suppress::test::bench_suppress_non_maximum_increasing_gradient     ... bench:       1,759 ns/iter (+/- 329)
test suppress::test::bench_suppress_non_maximum_noise_1                 ... bench:       5,255 ns/iter (+/- 1,382)
test suppress::test::bench_suppress_non_maximum_noise_3                 ... bench:       3,218 ns/iter (+/- 2,241)
test suppress::test::bench_suppress_non_maximum_noise_7                 ... bench:       2,284 ns/iter (+/- 311)

tafia commented 8 years ago

Just read about http://bluss.github.io/rust-ndarray/master/ndarray/index.html on reddit. It might help when working with kernels etc ...

theotherphil commented 8 years ago

It looks good. I've not had the time to work on this library much recently but I've just started reading up on halide a little and think we could nick some of their ideas. I doubt we'll ever be as fast as them, but it would be interesting to see how far we could get towards allowing configurable pipelines of functions by combining eager and static evaluation of iterators of image chunks. This is obviously ridiculously vague - I've not given serious thought to how exactly we'd do this, but I think we definitely need to move away from functions that force the creation of intermediates at all steps. The n-d split_at support in ndarray looks really promising for this.

theotherphil commented 7 years ago

https://github.com/libmir/dcv This D library is based heavily on the ndslice type, so might provide some useful ideas.

ivandardi commented 7 years ago

I have a question: is this crate based on CPU computations? And if yes, would it be interesting to add CUDA and OpenCL support?

theotherphil commented 7 years ago

Yes, everything is on the CPU at the moment. It certainly would be interesting to add GPU support - do you have any ideas for how we might go about doing this?

ivandardi commented 7 years ago

Yes, actually!

There are a few crates out there that offer OpenCL and CUDA support. We'd have to research them. However, the core of the idea is PIMPL.

Basically we define a bunch of traits, instead of structs. Example:

trait VerticalSobel {
    fn vertical_sobel(&self, image: &GrayImage) -> Image<Luma<i16>>;
}

Then we'll have the different implementors:

struct CPUImpl { ... }
struct CUDAImpl { ... }

Which implement the traits:

impl VerticalSobel for CPUImpl {
    fn vertical_sobel(&self, image: &GrayImage) -> Image<Luma<i16>> {
        // Use CPU to perform calculations
    }
}

impl VerticalSobel for CUDAImpl {
    fn vertical_sobel(&self, image: &GrayImage) -> Image<Luma<i16>> {
        // Use CUDA to perform calculations
    }
}

And to use them, we'd have something like

struct Gradients<T> {
    implementation: T,
    ...
}

impl<T: VerticalSobel> VerticalSobel for Gradients<T> {
    fn vertical_sobel(&self, image: &GrayImage) -> Image<Luma<i16>> {
        self.implementation.vertical_sobel(self.image)
    }
}

That way people would have a stable API to use in their code, and then we could setup at compile time to use CUDA/OpenCL if it's available, or fall back to the CPU implementation if not.

theotherphil commented 7 years ago

I'm afraid I don't follow your examples. Could you elaborate on why there's an implementation of VerticalSobel for Gradients here, and on what this would look like for the eventual user?

ivandardi commented 7 years ago

Gradients doesn't need to have an implementation of VerticalSobel. It's just a possibility.

Actually, my example was kinda bad thought out. I think it would be better to have a Processor struct.

struct Processor<T> {
    image: GrayImage,
    implementation: T,
    ...
}

impl<T: VerticalSobel> Processor<T> {
    fn vertical_sobel(&self, &GrayImage) -> Image<Luma<i16>> {
        self.implementation.vertical_sobel(self.image)
    }
}

As for the end user, it could either look like this:

let img: GrayImage = get_gray_image();
let processor = Processor::new_with_CUDA();
let vertical_sobel = processor.vertical_sobel(&img);

Or this:

let img: GrayImage = get_gray_image();
let processor = Processor::new_with_CPU(img);
let vertical_sobel = processor.vertical_sobel();

That way all image processing would be handled via this Processor struct which has a user-chosen implementation of the algorithms.

theotherphil commented 7 years ago

This sounds like an interesting approach. Before making sweeping API changes we should have a few GPU implementations in place to experiment with (initially just defined as separate functions). However, I'm certainly open to switching to something like this if it works out.

ivandardi commented 7 years ago

Googling around for GPU code with Rust, there seems to be a few crates already:

https://github.com/arrayfire/arrayfire-rust https://github.com/autumnai/rust-cudnn https://github.com/cogciprocate/ocl

We just need to pick a good one to test out. That arrayfire one looks promising.

theotherphil commented 7 years ago

I agree! Are you planning on trying this out yourself?

ivandardi commented 7 years ago

I'd love to, but I can't promise anything. College is a huge time drain, unfortunately. But hey, the idea is there, so maybe someone else can get interested in it?

theotherphil commented 7 years ago

Ok, thanks for the suggestions!

image-rs / imageproc

Performance #3