SwipeGestureHandler cause a panic in `timers.rs`

slint-ui / slint

Slint is a declarative GUI toolkit to build native user interfaces for Rust, C++, or JavaScript apps.

https://slint.dev

Other

17.66k stars 611 forks source link

SwipeGestureHandler cause a panic in `timers.rs` #6332

Open yanshay opened 2 months ago

yanshay commented 2 months ago

In an app running on MCU (esp32s3) with software-renderer, when adding SwipeGestureHandler it's causing panics to happen, sometimes but not always, when touching the screen, not even when doing any swipes.

When it doesn't panic the swiping work.

It is enough to include the SwipeGestureHandler, even with all swiped disabled, without any code in the swiped callback (and w/o any animation code like there is in the example in the swipe or anywhere else in the app). So it's enough it is there.

I removed all Timers from my slint code and it still happens. And my rust code doesn't use slint timers.

My app is async and it's my event-loop, but that event loop code is working flawlessly for a long time on two applications, including with animations (that probably use timers), and it's working great until the SwipeGestureHandler is introduced.

So it looks to me like it has to do with SwiptGestureHandler timing touch events, but I don't really know.

This is the Backtrace I'm getting:

panicked at /Users/my_user/.cargo/git/checkouts/slint-8153123e5dffa129/746100b/internal/core/timers.rs:257:13:
Recursion in timer code

Backtrace:

0x42060dc6
0x42060dc6 - i_slint_core::platform::update_timers_and_animations
    at /Users/my_user/.cargo/git/checkouts/slint-8153123e5dffa129/746100b/internal/core/platform.rs:236
0x420c3d29
0x420c3d29 - <i_slint_core::window::WindowRedrawTracker as i_slint_core::properties::PropertyDirtyHandler>::notify
    at /Users/my_user/.cargo/git/checkouts/slint-8153123e5dffa129/746100b/internal/core/window.rs:367
0x420bb9a9
0x420bb9a9 - i_slint_core::properties::mark_dependencies_dirty::{{closure}}
    at /Users/my_user/.cargo/git/checkouts/slint-8153123e5dffa129/746100b/internal/core/properties.rs:721
0x420bb9c1
0x420bb9c1 - i_slint_core::properties::dependency_tracker::DependencyListHead<T>::for_each
    at /Users/my_user/.cargo/git/checkouts/slint-8153123e5dffa129/746100b/internal/core/properties.rs:195

tronical commented 2 months ago

Thanks for the report. Could you please provide additional frames in the stack trace?

Could you provide us with some details about how you've implemented the event loop and integrated it with your asynchronous executor?

yanshay commented 1 month ago

Thanks for the report. Could you please provide additional frames in the stack trace?

I've already asked on the esp-hal repo why I don't see the entire trace up to my code, this is what I have for now. Will update if I manage to get more.

Could you provide us with some details about how you've implemented the event loop and integrated it with your asynchronous executor?

As I started explaining and dived into my code I came to remember one thing special about my event loop and identified it as causing this behavior. When I remove it I couldn't reproduce the issue.

It's a bit complex, I'll try to explain since it's something I implemented almost a year ago due to a what I believe is a limitation of Slint, at least was at that time, when it comes to async apps, maybe it got resolved by now.

The original issue, which I discussed back then with @ogoffart on https://github.com/slint-ui/slint/discussions/3994#discussioncomment-7680584 , in short, is that if the event-loop is async and wakes up only when really needed (so could not wake up for a long time if nothing is touched), then if a slint property is modified by the rust back-end due to some background proces in an async task, and if that property has any animations tied to it, those animations won't run, even if the event loop is woken up immediately at that point.

That's because the time recorded as the animation start would be the last time when the event-loop ran which could be much earlier than when the backend task caused the property to change, and therefore when the event-loop is reached the animation is considered to have already completed.

I hope I explained the issue well enough.

To solve that what I had to do the following:

    fn request_redraw(&self) {
        self.needs_redraw.set(true);
        self.redraw_signal.signal(1);  // <- this cause the event loop task to wake up 
        // This is required for rust driven animated properties (when an animated property is set
        // by rust). Without this, when no user interaction, animated properties will not animate
        // and only jump straight to end value
        // https://github.com/slint-ui/slint/discussions/3994#discussioncomment-7667717
        slint::platform::update_timers_and_animations(); // <- this cause taking timer mark which animation will use as animation start
    }

I just had to call update_timers_and_animations() from my implementation of request_redraw(). Or I had to manually before modifying that property call update_timers_and_animations() in the application code which would be cumbersome and push slint code all over the app.

This solution worked perfectly well all the time until I added the SwipeGestureHandler. I don't understand why it should affect my code because my understanding is that SwipeGestureHandler activate a timer but so do other features of slint, so what's unique about it to cause this behavior?

So... the issue is probably not the bug I reported, and maybe the discussion should move elsewhere, but:

Is there a different way to resolve my backend property triggering animations in another way? It's an important limitation for an app I wrote.
Or is there a way to check in the code above if now I can't call update_timers_and_animations() because we are already inside one and not call it at that time? In such cases it seems like it's not required.

tronical commented 1 month ago

I hope I explained the issue well enough.

Yes, great write-up :)

I just had to call update_timers_and_animations() from my implementation of request_redraw(). Or I had to manually before modifying that property call update_timers_and_animations() in the application code which would be cumbersome and push slint code all over the app.

This solution worked perfectly well all the time until I added the SwipeGestureHandler. I don't understand why it should affect my code because my understanding is that SwipeGestureHandler activate a timer but so do other features of slint, so what's unique about it to cause this behavior?

SGH uses a timer internally, just like Flickable.

What would be best is to call slint::platform::update_timers_and_animations(); not in request_redraw() but on the other end of self.redraw_signal, for example right before you call draw_if_needed.

Does that help and/or make sense?

yanshay commented 1 month ago

SGH uses a timer internally, just like Flickable.

What would be best is to call slint::platform::update_timers_and_animations(); not in request_redraw() but on the other end of self.redraw_signal, for example right before you call draw_if_needed.

Does that help and/or make sense?

Do you mean to call it before self.redraw_signal.signal(1)??

So just change the method to:

fn request_redraw(&self) {
    self.needs_redraw.set(true);
    slint::platform::update_timers_and_animations(); // <- this cause taking timer mark which animation will use as animation start
    self.redraw_signal.signal(1);  // <- this cause the event loop task to wake up 
}

This won't change anything since it is a single threaded app a, so it's practically the same thing and will panic just the same. I need a way to update slint on time in this method. It's just that this method is sometimes called when this can't be done and I don't know how I can know when it is such a case.

Or do you mean to call it in the event loop right before draw_if_needed? This is what I tried in the past and it didn't work (actually that call is there anyway in my code), it was too late for animations since the animations didn't take place because the time recorded as animation start seem to have been the time slint knows about when the request_redraw is called. I commented on that in the other discussion I referenced from back then (https://github.com/slint-ui/slint/discussions/3994#discussioncomment-7680584).

tronical commented 1 month ago

Or do you mean to call it in the event loop right before draw_if_needed?

Yes.

This is what I tried in the past and it didn't work (actually that call is there anyway in my code), it was too late for animations since the animations didn't take place because the time recorded as animation start seem to have been the time slint knows about when the request_redraw is called.

What is the latency between self.redraw_signal.signal(1) and the event loop waking up?

yanshay commented 1 month ago

What is the latency between self.redraw_signal.signal(1) and the event loop waking up?

It depends on the scheduler and how it schedules things (I have several other tasks running) but I guess it should be quicker than the animations I tried back then (I tested with animations of seconds to figure out why it wasn't working).

I'll try to reproduce the issues I had then in current application and see if maybe now what you said solve it. I had those issues close to a year ago and no longer remember all the tiny details. Will report back once I test it some more.

yanshay commented 1 month ago

Ok, I managed to reproduce the issue, however, both my solution as well as your suggestion, even together don't work!

But it's real more bizarre than that.

I did many many tests. The more I tested the less things made sense.

Here is one example: I tried animating three properties (at the same time) based on a boolean I change from the backend.

y position of vertical layout
color of a rectangle
width of the same rectangle

In all tests below your suggestion was applied.

The y position of vertical layout alone worked, but ONLY if my solution was there.
When I added the rectangle background color animation, it didn't work (so changed without animation) unless I added update_timers_and_animations() prior to changing the property (with my solution included).

Here starts the bizarre part:

When I added the rectangle width change based on that same property (without animation yet), suddenly the background animation started working. Again, ONLY if my solution was there.
When on top of that I added the animation to the rectangle width change (that I added in 3 only w/o the animation), the width change got animated but the background color stopped animating.

So seems like I stumbled upon issues in slint properties/animation/backend-properties-change that are more complex than what I had thought, where in short - with a single property change from the rust backend, that has several other animated properties dependent on it, some of those animations work and some don't, and to get anything to work even sometimes in such case it requires my solution which fails the SwipeGestureHandler. All that of course in an async environment with an optimized event loop that doesn't run exhausting CPU like in non async.

For now my only option is to remove my solution, and whenever I change a property that I know triggers animation later on I'll call slint update_timers_and_animations() before. Not elegant, but for now mandatory.

And to add to that, as I wrote the last paragraph slint panicked with the following backtrace, probably related to this in some way since it includes properties there:

====================== PANIC ======================
panicked at /Users/my_user/.cargo/git/checkouts/slint-8153123e5dffa129/746100b/internal/core/graphics/border_radius.rs:455:25:
called `Option::unwrap()` on a `None` value

Backtrace:

0x420a7da1
0x420a7da1 - core::option::unwrap_failed
    at /Users/my_user/.rustup/toolchains/esp/lib/rustlib/src/rust/library/core/src/option.rs:1985
0x420f0eb7
0x420f0eb7 - core::option::Option<T>::unwrap
    at /Users/my_user/.rustup/toolchains/esp/lib/rustlib/src/rust/library/core/src/option.rs:935
0x420e9648
0x420e9648 - i_slint_core::properties::PropertyTracker<DirtyHandler>::evaluate
    at /Users/my_user/.cargo/git/checkouts/slint-8153123e5dffa129/746100b/internal/core/properties.rs:1451
0x420bb7d9
0x420bb7d9 - <i_slint_core::items::BasicBorderRectangle as i_slint_core::items::Item_vtable_mod::Item>::render
    at /Users/my_user/.cargo/git/checkouts/slint-8153123e5dffa129/746100b/internal/core/items.rs:393
0x420cd5fe
0x420cd5fe - i_slint_core::items::Item_vtable_mod::ItemTO::render
    at /Users/my_user/.cargo/git/checkouts/slint-8153123e5dffa129/746100b/internal/core/items.rs:108
0x42071414
0x42071414 - i_slint_core::item_tree::ItemVisitor_vtable_mod::ItemVisitorTO::visit_item
    at /Users/my_user/.cargo/git/checkouts/slint-8153123e5dffa129/746100b/internal/core/item_tree.rs:989
0x42019cf4
0x42019cf4 - i_slint_core::item_tree::ItemTree_vtable_mod::ItemTreeVTable::new::visit_children_item
    at /Users/my_user/.cargo/git/checkouts/slint-8153123e5dffa129/746100b/internal/core/item_tree.rs:44
0x420cd625
0x420cd625 - i_slint_core::item_tree::ItemTree_vtable_mod::ItemTreeTO::visit_children_item
    at /Users/my_user/.cargo/git/checkouts/slint-8153123e5dffa129/746100b/internal/core/item_tree.rs:44
0x42071414
0x42071414 - i_slint_core::item_tree::ItemVisitor_vtable_mod::ItemVisitorTO::visit_item
    at /Users/my_user/.cargo/git/checkouts/slint-8153123e5dffa129/746100b/internal/core/item_tree.rs:989
0x42019cf4
0x42019cf4 - i_slint_core::item_tree::ItemTree_vtable_mod::ItemTreeVTable::new::visit_children_item
    at /Users/my_user/.cargo/git/checkouts/slint-8153123e5dffa129/746100b/internal/core/item_tree.rs:44

ogoffart commented 1 month ago

What is your setup exactly? Are you enabling the unsafe-single-threaded feature? Is there a possibility that there are actually several threads or task or interrupt that would use Slint apis? what kind of scheduler do you use?

yanshay commented 1 month ago

What is your setup exactly?

I'm developing on ESP32S3 using esp-hal and Embassy.

what kind of scheduler do you use? I think you refer to schedule with that, right?

Are you enabling the unsafe-single-threaded feature?

Yes, this is what I have in cargo.toml: slint = { version = "1.8.0", default-features = false, features = ["libm", "unsafe-single-threaded", "compat-1-2", "renderer-software"] }

Is there a possibility that there are actually several threads or task or interrupt that would use Slint apis?

Embassy doesn't support threads, so no threading, and I haven't activated the 2nd core on the esp32s3 in this app so no parallel execution from that. I didn't write any interrupt handlers and all my code is in under Embassy scheduled tasks so I don't see any chance that there would be multithreading.

The only 'standard' thing I can think of is that my code 're-enter' slint while slint is mid internal calls is due to events triggered by slint or event's triggered by my code updating things in slint. But I'm not doing anything special in my callback event handlers.

That said, it seems like the real cause of the original crash is due to a solution I wrote to solve a problem I had. The problem is with animated properties not functioning properly when they change due to rust modifying those properties driven by backend processing (e.g. when receiving of network packet), when there is no interaction with the app and the event loop didn't work for a while. I believe the core reason to that problem is slint assumption of busy loop event loop on MCU which isn't practical in async app. If the event loop is built so it wakes up only when needed, and it's not needed, then when the backend makes property change that change records a start time that could be minutes ago, and when the event loop wakes up, even if immediately after, it is considered as completed. So I solved it by calling update_timers_and_animations() from inside my implementation of request_redraw() and that, together with the SwipeGestureHandler sometimes cause for this call to be made in the middle of some other call that borrowed the timers structure. I could also solve it by just having some task that wakes up quite frequently and calling update_timers_and_animations() to update slint on time, but it's not elegant. Or I can call that function before any property change that is tied to an animation. I think It would be most elegant for slint to solve that by taking real time rather than by using the last time it was notified when animation start.

The strange thing is that on top of the original crash, I noticed all kinds of inconsistencies described above, unrelated to the crash issues, that animation didn't work got to work in a way I can't explain, so seems that unrelated to this my solution does not always solve the issue, probably due to internals of properties dependencies that I don't understand. And then came that last panic out of the blue due to animations being triggered continuously and with incorrect setup of some sort (since they did't function right).

Hope all this information helps.

tronical commented 1 month ago

If you have the concept of tasks, wouldn't it be possible to organize the "slint busy event loop" as a task?

loop {
    // one event loop cycle start; No .await allowed!
update_timers_and_animations();
    handle_touch_input_if_present();
    redraw_if_needed();
    // event loop cycle end
    let timer = duration_until_next_timer()
        .map(|duration| new_timer_task(duration));

    select!(timer, touch_input_signal, redraw_signal).await;
}

This is somewhat pseudo code, but I'm curious if there is anything fundamentally missing in Slint to permit this.

We could, and perhaps should, offer sugar APIs for this, such as a Slint event loop task, with an interface that allows retrieving signals as needed.

yanshay commented 1 month ago

When I refer to a busy loop I refer to a loop that runs and takes up CPU all the time, and slint gets updated that way about time continuously through update_times_and_animations(). This I can't have in an async environment.

What you suggest above includes an await, and this is exactly what I have in my code as a task dedicated to the event loop. It works most of the time, but not in the case I describe here, I'll explain:

So at high level I'm waiting for either (exactly as you write):

The duration of time slint tells me until the next step it wants me to wake it up due of animations or whatever it knows when it enters a loop
touch input coming from the user
a signal that something else changed externally and need to redraw (that's what external backend events will triger for whatever reason)

Let.'s tart in a case when the code enter that wait and no touch is taking place and no animation taking place and nothing external takes place, that's the time slint knows about last when it enters the select!. Let's say it is at 01:00 .

Then no one touches the touch screen for an hour. Nothing happened on the backend. Since no animations were pending the select.await doesn't return. An hour pass and slint still knows that current time is 01:00 even though it is 02:00 (because no call made to update_timers_and_animations()).

Then, at 02:00, a packet arrives on the network, the backend changes a property in slint, one that has an animation of let's say 2 seconds tied to it. That property change triggers the 'redraw_signal(which is called from therequest_redraw` implementation) and the loop wakes up.

Not comes the issue:

From my experience with slint what happens next is that even though you immediately call update_timers_and_animations in the loop, slint had already recorded the animation start time prior to that as 01:00 because that's the last time it knows about (even though we are already at 02:00), so on the first step to check what's the next step of that animation it sees that those 2 seconds of animation passed a long time ago. It sees that animation started at 01:00, it was supposed to run for 2 seconds and and at 01:00:02 and we are already at 02:00, already close to an hour passed since the animated should have completed, and so brings it to the final state w/o any animation taking place.

When I checked that back then to validate it, I even timed it so the redraw_signal would occur at the time of when the middle of the animation should be based on the start time slint knew about, and slint skipped over half of the animation and continued from there.

When I added the call to update_timers_and_animations' in therequest_redraw` implementation, it seemed to notify slint about the updated time BEFORE it took the start time of the animation and that made it work.

(For some reason, now in my tests this was required but not enough when I implemented multiple animations, or maybe more complex property dependencies than I tested in the past. And it sometimes caused panics with SwipeGesureHandler.)

IMO, for supporting propertly asnyc apps, when slint understand an animation is triggered, instead of taking as start time the time it knows about from the last call made to update_timers_and_animations, it should reach out to the platform implementation to request the time. It actually has a method to call 'duration_since_start' which it probably calls when invoked by 'update_timers_and_animations', it can just call it even if not requested to at that time by the event-loop.

ogoffart commented 1 month ago

Then, at 02:00, a packet arrives on the network, the backend changes a property in slint, one that has an animation of let's say 2 seconds tied to it.

I see. In order to fix the bug, one must call update_timers_and_animation right before setting that property. (the code that sets the property because a network packet comes in should also call update_timers_and_animations.

When I added the call to update_timers_and_animations in the request_redraw implementation, it seemed to notify slint about the updated time BEFORE it took the start time of the animation and that made it work.

That doesn't seem right. In the request_redraw implementation it is already too late. (although i can think of reason that this might work if the bindings that gets animated get dirty, by luck, is notified after the screen.)

IMO, for supporting propertly asnyc apps, when slint understand an animation is triggered, instead of taking as start time the time it knows about from the last call made to update_timers_and_animations, it should reach out to the platform implementation to request the time. It actually has a method to call 'duration_since_start' which it probably calls when invoked by 'update_timers_and_animations', it can just call it even if not requested to at that time by the event-loop.

I guess this can indeed be improved. It would be less efficient since every starting animation would re-update animations, but it may be doable. So that means refreshing the current time when an animation starts. I am not sure if this is actually possible, but I need to think about it.

Regardless, for this panic, I see the following solution

Improve the panic message and panic earlier when we detect a recursion
return an Err from update_timers_and_animations if we detect a recursion (not possible, until Slint 2.0)
detect the recursion and ignore it.

I guess ignoring it may be alright.

yanshay commented 1 month ago

I see. In order to fix the bug, one must call update_timers_and_animation right before setting that property. (the code that sets the property because a network packet comes in should also call update_timers_and_animations.

Yes, that's what I initially did, and probably need to go back to that. The issue with that is that either you do it everywhere where accessing the UI from rust, or need to have the backend code aware of internal UI details and changes made along time. The latter isn't really manageable, difficult to remember which property has potentially indirect influence on animations.

I guess this can indeed be improved. It would be less efficient since every starting animation would re-update animations, but it may be doable. So that means refreshing the current time when an animation starts. I am not sure if this is actually possible, but I need to think about it.

Is it maybe possible to add in the rust code generated by slint a call to update_timers_and_animations in every call? Maybe even recognize which call could affect animations or timers and make the call only there (sounds difficult)?

yanshay commented 1 month ago

The panic from above in https://github.com/slint-ui/slint/issues/6332#issuecomment-2377600033 happened to me again, unrelated to this issue since I removed my special code already, so it's just a standard slint app. It happened during an animation (that got triggered from a backend change of a property)

Enyium commented 1 month ago

So, together with #6187, timers.rs now caused the following panic messages:

Recursion in timer code
assertion failed: !timer.being_activated
invalid key

May the test coverage for the timer implementation not be sufficient, if all of these (and more?) can happen in production?