nannou-org / nannou

A Creative Coding Framework for Rust.
https://nannou.cc/

Performance of a million lines #479

Open schulzch opened 4 years ago

schulzch commented 4 years ago

The performance of drawing one million lines seems horrible (see code below, <1fps). However, my system can easily render 6 million lines at 60fps using a custom shader (WebGL, all lines one batch). Is there some way to cache a Drawing or reduce draw calls? I would really like to avoid going wgpu for this project.

fn view(app: &App, _model: &Model, frame: Frame) {
    let draw = app.draw();
    draw.background().color(PLUM);
    let line_count = 1_000_000;
    for i in 0..line_count {
        let t = i as f32 / line_count as f32;
        draw.line()
            .weight(2.0)
            .color(STEELBLUE)
            .points(
                Point2 {
                    x: 200.0 + app.time.sin() * 50.0 + app.time.cos() * 50.0,
                    y: 0.0,
                },
                Point2 { x: 0.0, y: 0.0 },
            )
            .rotate(t * std::f32::consts::PI * 2.0);
    }
    draw.to_frame(app, &frame).unwrap();
}
mitchmindtree commented 4 years ago

Thanks for the issue!

Unfortunately the Draw API doesn't yet have the optimisation capabilities necessary to do something like this in real-time. The pipeline for the Draw API is very general and also needs to consider many other kinds of arbitrary shapes and meshes that may be drawn by the user. That said, it would not be out of scope to one day have an optimisation check where we detect the case that many, many primitives of the same type have been submitted sequentially and we generate and cache some instancing shader and pipeline to draw them efficiently. I am currently in the middle of overhauling the Draw API to enable #194 which will include a bunch of performance improvements, but nothing that will help to draw one million lines at 60FPS I'm afraid!

For now, drawing a million of something sounds like the perfect case for instancing or using a custom shader as you mention! It's not trivial setting up the necessary wgpu boilerplate to do this just yet (we only landed WGPU support a week or two ago!) but it's something we plan to continually improve. Something to keep in mind if you go down this route is that the wgpu examples on master have some builder types to assist with custom pipeline and bind group creation - these aren't yet published on crates.io, so you may prefer to depend on master in the meantime or wait till they are published.

Hope this helps to clarify the state of things a little!

schulzch commented 4 years ago

Then I'll go down the road of building yet another custom set of shaders, no big deal... I have plenty of experience with writing crazy shaders, e.g., Structure Formation in Cosmic Evolution (YouTube) done using MegaMol in case you're interested.

Thanks for the hint regarding the builders! 👍

noomly commented 4 years ago

Hey there, I think I'm currently facing the same problem, but with drawing tiny rects (as pixels) instead. Is there no way to, for example, write all the pixels to a texture and then draw that texture to the frame? I don't have much knowledge of low-level graphics, so I'm not sure whether I can go with wgpu right now.

mitchmindtree commented 4 years ago

Hi @noomly!

I'm currently adding support for drawing textures via the draw API as we speak :) See #484 - I'm aiming to have it finished and merged in the next day or two.

As for drawing to each pixel - the much more efficient way would certainly be to use a custom wgpu pipeline, however I think it makes sense for us to expose something simpler too!

One of the methods I am currently adding to Texture is upload_data, which allows the user to upload an arbitrary slice of bytes to the texture (as long as the data slice length matches the texture size in bytes). One method that we could add on top might be something like upload_from_fn, where the user passes in a function that takes (x, y) coordinates as input and returns a colour as output. Internally, upload_from_fn could generate a buffer of bytes using the given function and call upload_data. I imagine it might look like this in practice:

texture.upload_from_fn(device, encoder, |x, y| {
    // Use x and y to create a colour.
});
draw.texture(texture.view().build());

Perhaps we could provide a short-hand for this that looks something like:

draw.shade(|x, y| /* Use x and y to generate a colour */);

Keep in mind that this still means calculating the colour for every pixel on the CPU, so it will still be much slower than using a shader (where colours are calculated in parallel on the GPU), however it will also be much faster than drawing individual rectangles for each pixel. Does something like the above sound useful for your use case?
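Internally, such a helper boils down to filling a byte buffer by evaluating the closure once per pixel. A rough sketch of that buffer-generation half (upload_from_fn itself doesn't exist yet, and the tightly-packed RGBA8 row layout here is an assumption):

```rust
/// Hypothetical helper: build an RGBA8 byte buffer by evaluating a
/// per-pixel function, as the proposed `upload_from_fn` might do
/// internally before handing the bytes to `upload_data`.
fn bytes_from_fn<F>(width: u32, height: u32, mut f: F) -> Vec<u8>
where
    F: FnMut(u32, u32) -> [u8; 4],
{
    let mut bytes = Vec::with_capacity((width * height * 4) as usize);
    for y in 0..height {
        for x in 0..width {
            // One RGBA8 pixel per (x, y), rows packed top to bottom.
            bytes.extend_from_slice(&f(x, y));
        }
    }
    bytes
}
```

The resulting Vec<u8> would then be handed to the texture upload, which only needs the slice length to match the texture size in bytes.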

In the meantime before #484 lands, I'm afraid the only approach is to do something along the lines of the wgpu_image.rs example for now.

noomly commented 4 years ago

Well, can't wait for #484 then!

My use case is trying to make a reaction-diffusion simulation using Gray-Scott's model. At the moment, each time update is called I step the simulation forward by iterating through a large array (200x200, maybe larger in the future). Then, each time view is called, I iterate through the whole array again, drawing each cell as a pixel-sized rectangle using draw.rect().
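A single explicit-Euler step of the Gray-Scott update looks roughly like this (a sketch only; the diffusion, feed and kill constants are illustrative placeholders, not the exact values from my project):

```rust
/// One explicit-Euler step of the Gray-Scott reaction-diffusion model on a
/// 200x200 grid stored row-major as flat slices of `u` and `v` concentrations.
const W: usize = 200;
const H: usize = 200;

fn laplacian(g: &[f32], x: usize, y: usize) -> f32 {
    // 4-neighbour discrete Laplacian with wrap-around boundaries.
    let idx = |x: usize, y: usize| y * W + x;
    g[idx((x + W - 1) % W, y)]
        + g[idx((x + 1) % W, y)]
        + g[idx(x, (y + H - 1) % H)]
        + g[idx(x, (y + 1) % H)]
        - 4.0 * g[idx(x, y)]
}

fn step(u: &mut [f32], v: &mut [f32]) {
    // Illustrative parameters: diffusion rates, feed rate, kill rate, timestep.
    let (du, dv, feed, kill, dt) = (0.16, 0.08, 0.035, 0.065, 1.0);
    let (u0, v0) = (u.to_vec(), v.to_vec());
    for y in 0..H {
        for x in 0..W {
            let i = y * W + x;
            let uvv = u0[i] * v0[i] * v0[i];
            u[i] = u0[i] + dt * (du * laplacian(&u0, x, y) - uvv + feed * (1.0 - u0[i]));
            v[i] = v0[i] + dt * (dv * laplacian(&v0, x, y) + uvv - (feed + kill) * v0[i]);
        }
    }
}
```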

I understand now how using a shader to render the simulation would be way better performance-wise, but the wgpu_image example seems to have a lot of boilerplate code that I don't really understand. I'll try to investigate this.

noomly commented 4 years ago

Now that #484 is merged, I tried to use your work to speed up the drawing of my project, with great success! More or less at least.

Performance-wise it's night and day. Here is what I did after tinkering a little and diving into the docs:

use nannou::{image, prelude::*, wgpu};

fn view(app: &App, model: &Model, frame: Frame) {
    let draw = app.draw();

    draw.background().color(PURPLE);

    let buf =
        image::ImageBuffer::from_fn(GRID_WIDTH as u32, GRID_HEIGHT as u32, |x: u32, y: u32| {
            let cell = model.grid[(x as usize, y as usize)];
            if cell.1 > 0.25 {
                image::Rgba([0u8, 0, 0, 255])
            } else {
                image::Rgba([0, 0, 0, 0])
            }
        });
    let img = image::DynamicImage::ImageRgba8(buf);
    let tex = wgpu::Texture::from_image(frame.device_queue_pair().as_ref(), &img);

    draw.texture(&tex)
        .w_h(GRID_WIDTH as f32, GRID_HEIGHT as f32);

    draw.to_frame(app, &frame).unwrap();
}

The problem is that after a number of iterations, the program crashes with the error AllocationError(OutOfMemory(Device)). Here's the backtrace:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: AllocationError(OutOfMemory(Device))', /home/noom/.cargo/registry/src/github.com-1ecc6299db9ec823/wgpu-native-0.4.3/src/device.rs:650:22
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'main' panicked at 'failed to lock the queue: "PoisonError { inner: .. }"', /home/noom/dev/processing/nannou/src/frame/raw.rs:73:25
stack backtrace:
   0:     0x559a18482a34 - backtrace::backtrace::libunwind::trace::h90669f559fb267f0
                               at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.40/src/backtrace/libunwind.rs:88
   1:     0x559a18482a34 - backtrace::backtrace::trace_unsynchronized::hffde4e353d8f2f9a
                               at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.40/src/backtrace/mod.rs:66
   2:     0x559a18482a34 - std::sys_common::backtrace::_print_fmt::heaf44068b7eaaa6a
                               at src/libstd/sys_common/backtrace.rs:77
   3:     0x559a18482a34 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h88671019cf081de2
                               at src/libstd/sys_common/backtrace.rs:59
   4:     0x559a184a306c - core::fmt::write::h4e6a29ee6319c9fd
                               at src/libcore/fmt/mod.rs:1052
   5:     0x559a1847f9a7 - std::io::Write::write_fmt::hf06b1c86d898d7d6
                               at src/libstd/io/mod.rs:1426
   6:     0x559a18484c35 - std::sys_common::backtrace::_print::h404ff5f2b50cae09
                               at src/libstd/sys_common/backtrace.rs:62
   7:     0x559a18484c35 - std::sys_common::backtrace::print::hcc4377f1f882322e
                               at src/libstd/sys_common/backtrace.rs:49
   8:     0x559a18484c35 - std::panicking::default_hook::{{closure}}::hc172eff6f35b7f39
                               at src/libstd/panicking.rs:204
   9:     0x559a18484921 - std::panicking::default_hook::h7a68887d113f8029
                               at src/libstd/panicking.rs:224
  10:     0x559a1848529a - std::panicking::rust_panic_with_hook::hb7ad5693188bdb00
                               at src/libstd/panicking.rs:472
  11:     0x559a18484e80 - rust_begin_unwind
                               at src/libstd/panicking.rs:380
  12:     0x559a184a0f01 - core::panicking::panic_fmt::hb1f3e14b86a3520c
                               at src/libcore/panicking.rs:85
  13:     0x559a184a0d23 - core::option::expect_none_failed::he6711468044f7162
                               at src/libcore/option.rs:1199
  14:     0x559a18259cfb - nannou::frame::raw::RawFrame::submit_inner::hcf2fbb91f9435447
  15:     0x559a1825a154 - nannou::frame::Frame::submit_inner::h4522e718b4441c5d
  16:     0x559a1820343d - core::ptr::drop_in_place::h5d79738f17f954cc
  17:     0x559a18209ffa - reaction::view::h13dc43090d9188e5
  18:     0x559a18206b2d - nannou::app::run_loop::{{closure}}::he1c4035a2562af97
  19:     0x559a18205779 - winit::platform_impl::platform::x11::EventLoop<T>::run::ha111061a945d7c8d
  20:     0x559a18204e8d - winit::platform_impl::platform::EventLoop<T>::run::h1e54a82e1b50ba75
  21:     0x559a1821190c - winit::event_loop::EventLoop<T>::run::h4c9bb7642d178da1
  22:     0x559a18212352 - nannou::app::Builder<M,E>::run::h5a9a34e0a0439e52
  23:     0x559a1820976e - reaction::main::h18934480e0684874
  24:     0x559a18218483 - std::rt::lang_start::{{closure}}::h8720c41d9f63b05e
  25:     0x559a18484d63 - std::rt::lang_start_internal::{{closure}}::hb26e39676675046f
                               at src/libstd/rt.rs:52
  26:     0x559a18484d63 - std::panicking::try::do_call::he4701ab6e48d80c0
                               at src/libstd/panicking.rs:305
  27:     0x559a18489387 - __rust_maybe_catch_panic
                               at src/libpanic_unwind/lib.rs:86
  28:     0x559a18485800 - std::panicking::try::hd3de25f3cb7024b8
                               at src/libstd/panicking.rs:281
  29:     0x559a18485800 - std::panic::catch_unwind::h86c02743a24e3d92
                               at src/libstd/panic.rs:394
  30:     0x559a18485800 - std::rt::lang_start_internal::h9cf8802361ad86c2
                               at src/libstd/rt.rs:51
  31:     0x559a1820a032 - main
  32:     0x7f9b4369d023 - __libc_start_main
  33:     0x559a181e41be - _start
  34:                0x0 - <unknown>
thread panicked while panicking. aborting.
[1]    815727 illegal hardware instruction (core dumped)  cargo run --release

Did I do it the wrong way? Or is it a bug with nannou's wgpu integration?

mitchmindtree commented 4 years ago

Great news about performance improvements! And thanks for the bug report!

I have also been noticing some out-of-memory errors if I resize the window too much. Interesting that you are also getting one, but when creating a new texture each frame. I think either we or wgpu-rs are leaking resources somewhere. I'm not sure exactly where the cause is yet - I think this will require using some GPU debugging tools to see what resources are accumulating. I'll try to do some tests with renderdoc after I finish what I'm currently working on, but feel free to beat me to it! Hopefully it turns out to be an easy patch :)

schulzch commented 4 years ago

I went the hard way and used a custom shader (faking geometry-shader behaviour with instancing). It updates the storage buffer and bind group as required, then issues a draw call for 6 vertices (2 triangles) and one instance per line:

fn draw(&self, frame: &Frame) {
    let mut encoder = frame.command_encoder();
    let mut render_pass = wgpu::RenderPassBuilder::new()
        .color_attachment(frame.texture_view(), |color| {
            color.load_op(wgpu::LoadOp::Load)
        })
        .begin(&mut encoder);
    render_pass.set_bind_group(0, &self.lines_bind_group, &[]);
    render_pass.set_pipeline(&self.lines_render_pipeline);
    render_pass.draw(0..6, 0..(self.lines_count as u32));
}
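The CPU side fills the lines storage buffer with pairs of vertices. A sketch of the matching #[repr(C)] type, which must mirror the std430 layout of the LineVertex struct declared in the shader below (how you get the bytes into the buffer, e.g. via bytemuck, is up to you):

```rust
/// CPU-side mirror of the GLSL std430 `LineVertex` struct. The explicit
/// `_padding0` field keeps `color` (a vec4) aligned to a 16-byte boundary,
/// matching the shader's layout, for a total size of 32 bytes.
#[repr(C)]
#[derive(Clone, Copy)]
struct LineVertex {
    position: [f32; 2], // vec2 position
    width: f32,         // float width
    _padding0: f32,     // float padding0
    color: [f32; 4],    // vec4 color
}
```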

Note how the following vertex shader uses gl_InstanceIndex to address line points and gl_VertexIndex to sort of emit (in the geometry shader sense) vertices:

#version 450

layout(set = 0, binding = 0) uniform Uniforms {
    mat4 transform;
} uniforms;

struct LineVertex {
    vec2 position;
    float width;
    float padding0;
    vec4 color;
};

layout(set = 0, binding = 1, std430) buffer LineVertices {
    LineVertex in_vertices[];
};

layout(location = 1) out vec2 out_line_size;
layout(location = 2) out vec2 out_line_coord;
layout(location = 3) out LineVertex out_vertex;

void main(void) {
    // Fetch line vertices.
    const LineVertex v0 = in_vertices[gl_InstanceIndex * 2];
    const LineVertex v1 = in_vertices[gl_InstanceIndex * 2 + 1];
    const vec2 p0 = v0.position;
    const vec2 p1 = v1.position;

    // Compute line parameters.
    const vec2 line_tangent = normalize(p1 - p0);
    const vec2 line_normal = vec2(-line_tangent.y, line_tangent.x);
    const float line_length = length(p1 - p0);

    // Construct six triangle vertices by indexing four quad vertices.
    vec2 t;
    switch (gl_VertexIndex) {
    case 0: {
        t = (p0 - line_tangent * v0.width) + line_normal * v0.width;
        out_line_size = vec2(v0.width, line_length);
        out_line_coord = vec2(-v0.width, v0.width);
        out_vertex = v0;
        break;
    }
    case 1:
    case 3: {
        t = (p0 - line_tangent * v0.width) - line_normal * v0.width;
        out_line_size = vec2(v0.width, line_length);
        out_line_coord = vec2(-v0.width, -v0.width);
        out_vertex = v0;
        break;
    }
    case 2:
    case 5: {
        t = (p1 + line_tangent * v1.width) + line_normal * v1.width;
        out_line_size = vec2(v0.width, line_length);
        out_line_coord = vec2(line_length + v1.width, v1.width);
        out_vertex = v1;
        break;
    }
    case 4: {
        t = (p1 + line_tangent * v1.width) - line_normal * v1.width;
        out_line_size = vec2(v1.width, line_length);
        out_line_coord = vec2(line_length + v1.width, -v1.width);
        out_vertex = v1;
        break;
    }
    }

    // Transform to OpenGL NDC.
    gl_Position = uniforms.transform * vec4(t, 0.0, 1.0);

    // Convert from OpenGL NDC to WGPU NDC.
    gl_Position.z = 0.5 * (gl_Position.z + gl_Position.w);
}
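The last line of the vertex shader remaps clip-space depth from OpenGL's NDC z range of [-w, w] to WGPU's [0, w]. Expressed as a plain function (a sketch of the same arithmetic, not shader code):

```rust
/// Remap an OpenGL clip-space z (range [-w, w] before the perspective
/// divide) into WGPU/Direct3D convention (range [0, w]).
fn gl_z_to_wgpu_z(z: f32, w: f32) -> f32 {
    0.5 * (z + w)
}
```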

The fragment shader does some trivial 2D signed-distance-function work to fake a circular box kernel. This thing can render about 12 million lines per second on my GF1080.
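That kernel is essentially the signed distance to a thickened segment (a capsule). A sketch of the equivalent maths in Rust (an assumption about what the fragment shader computes, not the actual shader code):

```rust
/// Signed distance from point `p` to the capsule formed by thickening the
/// segment a-b by `radius`: negative inside, positive outside. A fragment
/// shader can threshold or smoothstep this to get round caps and
/// antialiased edges.
fn capsule_sdf(p: [f32; 2], a: [f32; 2], b: [f32; 2], radius: f32) -> f32 {
    let pa = [p[0] - a[0], p[1] - a[1]];
    let ba = [b[0] - a[0], b[1] - a[1]];
    let dot = |u: [f32; 2], v: [f32; 2]| u[0] * v[0] + u[1] * v[1];
    // Project p onto the segment, clamping the parameter to [0, 1].
    let h = (dot(pa, ba) / dot(ba, ba)).clamp(0.0, 1.0);
    let d = [pa[0] - ba[0] * h, pa[1] - ba[1] * h];
    dot(d, d).sqrt() - radius
}
```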

mitchmindtree commented 4 years ago

Wicked! Thanks for sharing @schulzch :)

Any pics?

schulzch commented 4 years ago

Sure! However, I can only share the naive version for now (I want to avoid trouble with publishers):

[image: graph]

It is a node-link diagram of the world trade graph between 1992 and 2017.