redox-os / orbtk

The Rust UI-Toolkit.
MIT License

Improve performance #408

Open ngortheone opened 3 years ago

ngortheone commented 3 years ago

Since https://github.com/redox-os/orbtk/issues/394 was closed prematurely and https://github.com/redox-os/orbtk/issues/392 is not exactly the right topic, I decided to open a new issue and consolidate all the information here.

Describe the bug

The showcase application runs very slowly and consumes 100% CPU in both debug and release builds. Input lag in the debug build is more than 1 minute; in the release build it is somewhere between 5 and 30 seconds. Mouse clicks and key presses in the application window are processed slowly, and Ctrl-C from the terminal that started the app is processed with the same lag.

It also looks like the input lag and CPU load depend on the number of widgets in the app (even those not visible). If I remove almost all widgets from the showcase application, it lags less severely.

To Reproduce

cargo run --example showcase 
# OR
cargo run --example showcase --release

Desktops:

Hardware

Intel CPU with integrated graphics (i915)

Screenshot

This screenshot demonstrates the lag and CPU load. The highlighted "up" arrow was pressed a few minutes ago and only now we see the animation.


(EDITED)

release build with debug symbols built

RUSTFLAGS=-g cargo build  --example showcase --release 

perf report recorded

perf record -F 99 -a  --call-graph dwarf -- target/release/examples/showcase

perf.data.zip

Please see the attached perf report. It shows that there are issues with rendering performance.

ngortheone commented 3 years ago

Attached interactive flamegraph based on perf output.

orbtk-showcase-flamegraph.zip

Unfortunately, GitHub disallows SVG attachments, hence the zip.

FloVanGH commented 3 years ago

Hey, thank you very much for your investigation. There is a wip render port of tiny-skia https://github.com/sandmor/orbtk/tree/tinyskia. Can you check if you have the same issues with it, please?

ngortheone commented 3 years ago

Unfortunately, the tiny-skia port exhibits the same behavior. The debug build is as unusable as on the main branch. The release build feels about 10% faster than the main branch (subjective, I have no real measurements to back this up), but the lag is still so bad that no user would ever tolerate it. If you find it useful, I can create similar perf records and flamegraph for skia port.

Is there anything else I can do to help debugging this?

FloVanGH commented 3 years ago

I can create similar perf records and flamegraph for skia port.

Yes sure thank you.

Is there anything else I can do to help debugging this?

Can you check this example? https://gitlab.redox-os.org/redox-os/orbclient/-/blob/master/examples/simple.rs

FloVanGH commented 3 years ago

Maybe the problem is connected to our OrbClient sdl2-based window backend. It would not be the first time we have had CPU issues with it on Linux.

ngortheone commented 3 years ago

https://gitlab.redox-os.org/redox-os/orbclient/-/blob/master/examples/simple.rs Looks to be OK. There is no CPU load, and I see a stream of events in the console in real time.

At position (0, 553) pixel color is : 0xFFB6B6B6
Key(KeyEvent { character: 'q', scancode: 16, pressed: true })
TextInput(TextInputEvent { character: 'q' })
Key(KeyEvent { character: 'q', scancode: 16, pressed: false })
Key(KeyEvent { character: '\u{0}', scancode: 71, pressed: true })
Key(KeyEvent { character: '\u{0}', scancode: 71, pressed: false })
Focus(FocusEvent { focused: false })
Focus(FocusEvent { focused: true })
ClipboardUpdate(ClipboardUpdateEvent)
Focus(FocusEvent { focused: false })
At position (10, 617) pixel color is : 0xFFBDBDBD
At position (14, 615) pixel color is : 0xFFBDBDBD
...

Although this does not seem to be an interactive application, just an image of some sort.

ngortheone commented 3 years ago

Interesting fact: the calculator example runs much better. There must be something in the code that makes this bug appear. I will keep cutting down the showcase example to find out what it is.

ngortheone commented 3 years ago

I didn't get far by removing widgets; this doesn't seem to have any significant impact. What does help is setting a smaller size() for the window. The calculator has a much smaller window size, and that hides the issue.

            Window::new()
                .title("OrbTk - showcase example")
                .position((100, 100))
                .size(400, 400)  // <--- THAT

My observations

So considering all of the above I want to make a few suggestions:

I hope this helps.

ngortheone commented 3 years ago

Minimal example to demonstrate the problem

use orbtk::prelude::*;

fn main() {
    Application::new()
        .window(|ctx| {
            Window::new()
                .title("OrbTk - showcase example")
                .position((100, 100))
                // .size(2000, 2000) // SLOW
                .size(150, 50) // FAST
                .child(ButtonView::new().build(ctx))
                .build(ctx)
        })
        .run();
}

widget!(ButtonView {});

impl Template for ButtonView {
    fn template(self, _id: Entity, ctx: &mut BuildContext) -> Self {
        let slider = Slider::new().min(0.0).max(1.0).build(ctx);
        self.child(
            Stack::new()
                .spacing(8)
                .child(slider)
                .child(ProgressBar::new().val(slider).build(ctx))
                .build(ctx),
        )
    }
}

Try running with both window sizes and feel the difference.

How to reproduce: click and hold the left mouse button on the slider and move the mouse rapidly left and right many times. Observe the CPU load and how quickly the slider and progress bar follow the mouse.
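A rough back-of-the-envelope sketch of why the window size could matter this much. It assumes that the whole frame is redrawn and copied on every update and that each pixel takes 4 bytes; neither assumption is confirmed by the measurements above, and the helper names are made up for illustration.

```rust
/// Bytes needed for one full frame.
/// Assumption: 4 bytes (one 32-bit color value) per pixel.
fn frame_bytes(width: u64, height: u64) -> u64 {
    const BYTES_PER_PIXEL: u64 = 4;
    width * height * BYTES_PER_PIXEL
}

/// How many times more pixel data the large window carries than the small one.
fn size_ratio(large: (u64, u64), small: (u64, u64)) -> f64 {
    frame_bytes(large.0, large.1) as f64 / frame_bytes(small.0, small.1) as f64
}
```

Under these assumptions, `frame_bytes(2000, 2000)` is 16,000,000 bytes per frame and `frame_bytes(150, 50)` is 30,000 bytes, a factor of roughly 533. Any per-pixel work in the update loop would scale the same way.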

kivimango commented 3 years ago

In #392, the perf report says the function in which the CPU spent the second-most time is ::blit_span. I searched for it; it seems orbtk does not use it directly, but servo and sw-composite do.

FloVanGH commented 3 years ago

First, @ngortheone, thank you very much for your help in tracking down the performance issue.

What I can see in the perf output is that the render method of the orbclient-based window is the most expensive entry.

I checked the code and this little piece of code:

let color_data: Vec<orbclient::Color> = self
    .render_context
    .data()
    .iter()
    .map(|v| orbclient::Color { data: *v })
    .collect();

takes 64ms on my machine. That's a lot. I use it to convert the u8 frame buffer of raqote to a Vec that is used by orbclient as its framebuffer. I think we need a better solution there.
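One possible direction, sketched under assumptions: if the color type is a transparent wrapper around a single 32-bit value, the buffer can be viewed as a slice of colors without allocating or copying. The `Color` type below is a local stand-in defined only for this example, not orbclient's own type, and it treats each buffer element as one `u32` pixel (as the `Color { data: *v }` mapping suggests). Whether `orbclient::Color` has a compatible layout would need to be checked, and this is not necessarily how the code ends up being changed.

```rust
/// Hypothetical stand-in for a color type that wraps one u32 per pixel.
/// `repr(transparent)` guarantees the same layout as the wrapped u32.
#[derive(Clone, Copy, Debug, PartialEq)]
#[repr(transparent)]
pub struct Color {
    pub data: u32,
}

/// Per-pixel conversion in the style of the snippet above:
/// allocates a new Vec and copies every pixel.
pub fn convert_by_copy(pixels: &[u32]) -> Vec<Color> {
    pixels.iter().map(|v| Color { data: *v }).collect()
}

/// View the same memory as a slice of `Color` without copying.
pub fn view_as_colors(pixels: &[u32]) -> &[Color] {
    // SAFETY: `Color` is `repr(transparent)` over `u32`, so both types have
    // identical size and alignment. Length and lifetime are carried over
    // unchanged from the input slice.
    unsafe { std::slice::from_raw_parts(pixels.as_ptr() as *const Color, pixels.len()) }
}
```

Both functions yield the same colors for the same input; the second one only reinterprets the existing buffer, so its cost does not grow with the window size.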

FloVanGH commented 3 years ago

I replaced this piece of code and now this part takes 0ms. @ngortheone, can you check whether it is now a little better on your machine, please?

ngortheone commented 3 years ago

@FloVanGH thanks!

Overall there is an improvement, but I don't think we are there yet.

What improved:

What didn't improve / other observations:

  1. RUSTFLAGS="-Ctarget-cpu=skylake" cargo run --example showcase --release
  2. cargo run --example showcase

ngortheone commented 3 years ago

I will do another perf record soon with a flamegraph. Also I'll try to record a video from my screen to show better what I experience. Is there anything else I can do to help?

FloVanGH commented 3 years ago

OK, thank you. There are some other parts that can cause this problem. Layout is not yet optimized: on each iteration, every widget's size and position is recalculated (I'm currently working on an update). I also know that raqote is not the fastest render backend. A new backend like tiny-skia will help.

But I think there is more, and we have to solve these pieces step by step until the debug build is usable with large windows.
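To illustrate the kind of layout optimization meant here, a minimal sketch of caching a widget's arranged bounds and recalculating only when its inputs change. `LayoutNode`, `Rect`, and every method name are hypothetical and are not taken from orbtk's layout code.

```rust
/// Final bounds of a widget (hypothetical type for this sketch).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Rect {
    pub x: f64,
    pub y: f64,
    pub width: f64,
    pub height: f64,
}

/// A layout node that remembers its last result.
pub struct LayoutNode {
    desired: (f64, f64),
    cached: Option<Rect>,
    /// Counts how often the bounds were actually recomputed.
    pub recalculations: u32,
}

impl LayoutNode {
    pub fn new(width: f64, height: f64) -> Self {
        LayoutNode { desired: (width, height), cached: None, recalculations: 0 }
    }

    /// Changing the desired size invalidates the cache; setting the same size does not.
    pub fn set_desired_size(&mut self, width: f64, height: f64) {
        if self.desired != (width, height) {
            self.desired = (width, height);
            self.cached = None;
        }
    }

    /// Returns the cached bounds if nothing changed, otherwise recomputes them.
    pub fn arrange(&mut self, x: f64, y: f64) -> Rect {
        if let Some(rect) = self.cached {
            if rect.x == x && rect.y == y {
                return rect;
            }
        }
        self.recalculations += 1;
        let rect = Rect { x, y, width: self.desired.0, height: self.desired.1 };
        self.cached = Some(rect);
        rect
    }
}
```

Calling `arrange` repeatedly with unchanged inputs increments `recalculations` only once, so idle widgets cost a comparison per iteration instead of a full recalculation.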

Your perf record will help, thank you. It was much easier to find the issue in the render method of the window backend.

ngortheone commented 3 years ago

Awesome. Some time during my day I will record perf data (likely closer to the end of my day, and it is morning here now)

ngortheone commented 3 years ago

@FloVanGH Attached is perf data from today's build of the develop branch: perf.data.zip

ngortheone commented 3 years ago

@FloVanGH have you had a chance to look into this?

FloVanGH commented 3 years ago

@ngortheone unfortunately not. But I hope soon.

SamuraiCrow commented 2 years ago

I've been tinkering on paper with a new algorithm for doing high-speed grid layouts by folding constants and combining gadget renders via inlining render code. The basic idea is that the "pivot points" of the x and y coordinates can be converted into 2 vec structures and the remaining fixed grids can be combined into a smaller number of grid layouts. Since the grid offsets don't change during a window resize relative to the pivot coordinates, only the coordinates of the pivot points referred to by the vecs would need to change.

It's kind of like vertex shaders in a graphics layout for polygons. The list of vertices is figured out in a batch, but the 3D model only keeps track of which vertices are used rather than their positions. In 2D we could go a step further by figuring out the x pivots and y pivots separately, so that the rows and columns of the grid could be stored as a 2D array of enumerated values for determining which grid cell contains which gadget.
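One possible reading of this idea as code, offered as a sketch only: cells refer to pivot indices, and a resize touches nothing but the two pivot vectors. The names `PivotGrid` and `Gadget` are hypothetical, and the equal-width columns and equal-height rows are a simplification chosen for this example, not part of the proposal.

```rust
/// Enumerated cell contents (hypothetical gadget kinds).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Gadget {
    Empty,
    Slider,
    ProgressBar,
}

/// Grid whose cells are addressed through pivot indices, not absolute coordinates.
pub struct PivotGrid {
    x_pivots: Vec<f64>,      // column boundaries, length = columns + 1
    y_pivots: Vec<f64>,      // row boundaries, length = rows + 1
    cells: Vec<Vec<Gadget>>, // cells[row][col], unchanged by a resize
}

impl PivotGrid {
    pub fn new(cells: Vec<Vec<Gadget>>, width: f64, height: f64) -> Self {
        let mut grid = PivotGrid { x_pivots: Vec::new(), y_pivots: Vec::new(), cells };
        grid.resize(width, height);
        grid
    }

    /// A resize only recomputes the pivots; the cell table is not touched.
    /// Simplification: all columns (and all rows) get an equal share.
    pub fn resize(&mut self, width: f64, height: f64) {
        let rows = self.cells.len();
        let cols = self.cells.first().map_or(0, |row| row.len());
        self.x_pivots = (0..=cols).map(|i| width * i as f64 / cols.max(1) as f64).collect();
        self.y_pivots = (0..=rows).map(|i| height * i as f64 / rows.max(1) as f64).collect();
    }

    /// Bounds of a cell as (x, y, width, height), derived from the pivots.
    pub fn cell_rect(&self, row: usize, col: usize) -> (f64, f64, f64, f64) {
        let (x0, x1) = (self.x_pivots[col], self.x_pivots[col + 1]);
        let (y0, y1) = (self.y_pivots[row], self.y_pivots[row + 1]);
        (x0, y0, x1 - x0, y1 - y0)
    }

    pub fn gadget_at(&self, row: usize, col: usize) -> Gadget {
        self.cells[row][col]
    }
}
```

With a one-row, two-column grid, resizing from 200x50 to 400x100 changes only three x pivots and two y pivots, while the mapping from cells to gadgets stays as it was.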