Refine the safe window handle API

notgull commented 1 year ago

In #110 I added safer versions of the window handles. However, we might need to make sure they work with all platforms, present and future, before we release them.

Platform specific issues that I'm aware of:

The Android application can be randomly invalidated, making it undefined behavior to use the handles. However, we use Active to work around that by only allowing usage of the handles while the application is active.
Windows in X11 (and Wayland, I think?) can be deleted at any point using 100% safe code. I don't think this is an issue, since I don't think any undefined behavior can result from this; worst comes to worse you get a BadWindow error. However, I'm not sure if there are any implications on the GPU end that we should be aware of.

kchibisov commented 1 year ago

and Wayland, I think

Not really, it'll raise a protocol error if you do that before the role object and it is not exposed. So while you can issue a drop, you'll kill the entire app.

ids1024 commented 1 year ago

Well, on X you could just delete a window in safe code, in a different process, on a different machine connected to the same X server, right? You can't really rely on anything behaving particularly consistently on X unless all clients are cooperative, but it should still be safe at least.

On Wayland, yeah, if you destroy the role object and other things the protocol states should be destroyed first. Then I think you could get send requests with an object ID that is no longer valid? Which probably means a protocol error, unless it has been re-used for another surface.

The documentation should probably specifically call out these things the API cannot guarantee.

ids1024 commented 1 year ago

The Active part of the API is a bit ugly and awkward, but I'm not familiar enough with this aspect of Android to know precisely what this is necessary for (without the restriction, would this violate soundness somehow, or just cause other issues?), whether this correctly addresses it, or if there's a better solution...

Lokathor commented 1 year ago

On Windows, someone can just plain delete your window on you. You don't need to have even shared the HWND with any other code, there's functions to enumerate all visible windows for example. From a separate process even. (If you use a window handle to a deleted window it's supposed to just give "invalid handle" errors, but it's a hard to test edge case so i wouldn't be surprised if possible bugs occurred.)

I'm not sure how this "active" thing interacts with that

ids1024 commented 1 year ago

Ah, that's a thing on Windows too? Even from another process? I guess it makes sense since the core of the win32 API for dealing with windows is largely the same as the win16 API included in Windows 1.01 in 1985, so it's not particularly more modern than X...

So basically this API does not guarantee that the handle is a valid handle for a window that still exists (it should be, but there are these edge cases, on many platforms), but does guarantee it doesn't include an invalid pointer on platforms like Wayland where the RawWindowHandle contains a pointer. And using it won't cause undefined behavior, but may error or cause certain unexpected behavior (like if the underlying handle has been freed and re-used for another window).

The last point here is the subtle one. This API is mostly only useful if we can be sure that it's safe to use wgpu, glutin, and softbuffer with the handle, for all platforms and all API backends of wgpu/glutin. So what soundness requirements do we need to be concerned with for using WGL, GLX, CGL, EGL, Metal, DX12, etc. with the window and display handle...

Lokathor commented 1 year ago

I can say that Vulkan builds into it the idea that the surface could explode at any time, and so there's an error code for that which can be appropriately handled by wgpu's vk backend.

The other backends, I dunno.

notgull commented 1 year ago

I think that, as a safety requirement, we should say that any window that consists of an ID (XID on X11, HWND on Windows, the ID on web platforms) could be deleted by an external process at any time and they should be prepared for that. I would think that other system libraries (The GL family, graphics APIs) would need to be ready for this as well; otherwise, subtle bugs could arise.

The Active trick mostly applies to Android, where the Window can disappear at any time. For other platforms, this isn't an issue (in fact, you can create a new Active safely at any time using Active::new() on non-Android platforms). If anyone has a better solution for this, I'm all ears. In fact, I'm not even sure if this solves the problem. @rib What do you think, since it seems like you're more familiar with the Android backend?

ids1024 commented 1 year ago

In fact, I'm not even sure if this solves the problem.

Definitely something to confirm before releasing this. We wouldn't want to release an awkward API that doesn't even succeed at enforcing the safety property its trying to enforce.

In particular, if the window can disappear "at any time", that means in some future iteration of the main loop, and not while our local code block is running? Because if it really were "at any time" in the strict sense (between any two CPU instructions where the process is pre-empted, or at exactly the same time on another core) this type isn't able to enforce much. And using the handle in this case is UB, and not just going to produce errors?

retep998 commented 1 year ago

(like if the underlying handle has been freed and re-used for another window).

Modern Windows uses generational indices, where some of the bits are the actual index, and the rest of the bits are just a counter that goes up each time the index is used. That way it would take an extremely long time before an HWND would actually be used, thus mitigating the risk of use after free. Using an invalid HWND is not dangerous as you'll just get an error that it's invalid. If you really want to be extra safe, then your safe handles need to have a way to self destruct after you get an error about the handle being invalid so they cannot be reused if somehow enough handles are burned through to get a repeat.

kchibisov commented 1 year ago

On Wayland, yeah, if you destroy the role object and other things the protocol states should be destroyed first. Then I think you could get send requests with an object ID that is no longer valid? Which probably means a protocol error, unless it has been re-used for another surface.

The thing is that winit is passing the wl_surface, and from what I remember if you destroy it before the top-level/shell it's a protocol error. Anyway, the handle is owned value, so it doesn't matter, you can't really free anything here...

notgull commented 1 year ago

I decided to take a run at integrating the new safer API into softbuffer, just to see how it would work. See rust-windowing/softbuffer#82.

The main issue here is that Window can't soundly implement HasDisplayHandle in its current form. Window isn't tied to event handling in any way, so you could just leak it and use the resulting reference to get an Active<'static>, which wouldn't be sound.

My knee jerk reaction would be to split HasDisplayHandle into the DisplayHandle provider as well as a trait that would provide Active references (maybe named Activator?). This would require the APIs involved to take references to an Active/Activator, though, which wouldn't be the most convenient thing to adopt.

notgull commented 1 year ago

Actually, thinking about it, it might be possible to just do this at the start of the program:

Box::leak(EventLoop::new()).active().unwrap()

to get a free Active<'static>. We may want to just have an is_valid() method on the window, for convenience and possibly because the current method is broken.

ids1024 commented 1 year ago

Good point. Box::leak makes it hard to enforce properties like this with lifetimes, outside of very restricted contexts.

Another possible safety concern: what happens when a WindowHandle is used with a DisplayHandlethat doesn't match? You can have multiple display server connections with back-ends like Wayland. I think with Wayland it would cause issues but not be unsound?

notgull commented 1 year ago

The idea is that an Active object doesn't live between event handlers; however, it's tricky to do that.

My goal with this API is that downstream users shouldn't have to change their externals APIs, aside from replacing the impl HasRaw[Window/Display]Handle with impl Has[Window/Display]Handle and unsafe fn with fn. Soundness concerns that are mentioned above aside, this would mean either passing an Active to every function call or bundling the active getter into HasDisplayHandle, both of which pose issues.

An alternative idea is to have an is_valid on HasWindowHandle. This would be a flat true on non-Android, and maybe we could use this to avoid the window ID issues above. However, it would be tough to use in multithreaded environments; before any call to a system function, you would need to call is_valid, since another thread might've yanked out the window handle from under you. As this would be a substantial, annoying change (and probably have multithreading implications to boot) I'd like to avoid this.

A third option, and one that I like better that the other two, would be to treat the WindowHandle like an RAII lock. When you call window_handle(), it "acquires" the "read" end of, say, an RwLock. Trying to delete the handle on the event handling side blocks until it has the write end of the RwLock. This would ensure that, on Android, the window is only suspended when it's deleted. We may also be able to make sure it compiles down to nothing but no-ops on other platforms. The main downsides that I see here are that it means WindowHandle can't safely be Copy and that this introduces a new, subtle vector for deadlocks.

Any other potential strategies?

ids1024 commented 1 year ago

The idea is that an Active object doesn't live between event handlers; however, it's tricky to do that.

I think for that to work it would need to be provided in a lifetime-restricted argument of the callback passed to something like winit's EventLoop::run. If it's provided in a method on an owned type, you run into this Box::leak issue, and probably various ways others.

This is similar to how GhostCell works, which also uses a lifetime-bound callback argument. That trick is the only way to enforce certain properties with lifetimes.

Also, would this active state would be global to the process on Android? With things like Wayland there's no reason you can't have several different Wayland connections in process (each with any number of windows) and no way to distinguish at a type level which one Active is associated with (but you don't need to on Wayland, at least, since this is just an Android thing).

A third option, and one that I like better that the other two, would be to treat the WindowHandle like an RAII lock. When you call window_handle(), it "acquires" the "read" end of, say, an RwLock. Trying to delete the handle on the event handling side blocks until it has the write end of the RwLock. This would ensure that, on Android, the window is only suspended when it's deleted.

If WindowHandle isn't Send or Sync, then "blocking" like that isn't possible (without async) since it could only do that in the same thread and would just error or deadlock from another lock acquisition in the same thread, right?

That kind of thing can also simply work with refcounting, where the resource isn't destroyed until all (strong) references are dropped. But I'm not familiar with exactly how this works in Android.

notgull commented 1 year ago

I created #116, which implements the ref-counting mechanism I described here.

notgull commented 1 year ago

On Windows, someone can just plain delete your window on you. You don't need to have even shared the HWND with any other code, there's functions to enumerate all visible windows for example. From a separate process even.

I wrote a blog post about this and posted it to Reddit, where @fleabitdev pointed out a section in the DestroyWindow function's documentation:

A thread cannot use DestroyWindow to destroy a window created by a different thread.

So I don't think we have to worry about this on Windows, thankfully.

retep998 commented 1 year ago

Any process can send your window a WM_CLOSE message (which you can decide how to respond to), but only your thread can actually call DestroyWindow at which point you have a last chance to do any cleanup (including any invalidation of safe handles) in the WM_DESTROY handler. Just make sure that if you do invalidate any safe handles, you do so in a lock free manner so you don't get a deadlock.

ogoffart commented 1 year ago

Slint is a UI toolkit, it may use different backend to implement the actual windowing (for example, winit, but could be something else, including no backend on bare metal). Ideally, i'd like the slint::Window to implement HasWindowHandle and HasDisplayHandle, but the problem is that i wouldn't know what to return when there is no backend.

Since the only way to construct a DisplayHandle (DisplayHandle::from_raw()) is unsafe, it is not even possible to construct an empty handle safely.

For that reason, i think there should either be a way to construct a empty (or invalid) DisplayHandle safely. Or the HasDisplayHandle::display_handle should return an Option

notgull commented 1 year ago

For that reason, i think there should either be a way to construct a empty (or invalid) DisplayHandle safely. Or the HasDisplayHandle::display_handle should return an Option

I already have the methods set to return a Result<*Handle, HandleError> so that it can indicate if the handle isn't available due to the application being inactive. Ideally, I think HandleError could be defined as:

pub enum HandleError {
    /// Application is inactive.
    Inactive,

    /// Raw window handle error.
    Raw(RawHandleError)
}

...and then, if we move forwards with this, I think that it should return Result<Raw*Handle, RawHandleError>.

However, it might be nice to have an 0.5.x release with borrowed handles before we break and move onto 0.6.

ogoffart commented 1 year ago

However, it might be nice to have an 0.5.x release with borrowed handles before we break and move onto 0.6.

What prevents HasDisplayHandle::display_handle and HasWindowHandle::window_handle to return a Result in 0.5.x ? These traits haven't been released yet. So now is the time to make the change.

In addition to HandleError::Inactive there should be a HandleError::Unsupported or something like that, that specify that no handle can be returned because it is not supported or not implemented by the backend.

Lokathor commented 1 year ago

0.5.1 is out already, 0.6 is the next breaking update.

ogoffart commented 1 year ago

But HasDisplayHandle and HasWindowHandle are not in 0.5.1, so the Result can be added in 0.5.2 and that will not be a breaking change

Lokathor commented 1 year ago

Ah, right right. I misunderstood.

notgull commented 1 year ago

But HasDisplayHandle and HasWindowHandle are not in 0.5.1, so the Result can be added in 0.5.2 and that will not be a breaking change

At the current point in time, they do return Result and the HandleError enum is non exhaustive, so this shouldn't be an issue.

notgull commented 1 year ago

Just realized something; this is a breaking change on Android, since it fails to compile there if the std feature isn't enabled. Best way around that issue would be to just make it so the borrowed handles are cfg(any(feature = "std", not(target_os = "android"))), or to add a new borrowed feature to gate them behind. Thoughts?

notgull commented 1 year ago

I feel like we should be ready for release. @Lokathor When you have a chance, can you publish an 0.5.x version?

Lokathor commented 1 year ago

I should be able to later today.

Lokathor commented 1 year ago

Okay, what I meant was "i can do it tomorrow", 0.5.2 is out

notgull commented 1 year ago

Closing this issue now, since I think we can now say "mission accomplished".

rib commented 1 year ago

I think that, as a safety requirement, we should say that any window that consists of an ID (XID on X11, HWND on Windows, the ID on web platforms) could be deleted by an external process at any time and they should be prepared for that. I would think that other system libraries (The GL family, graphics APIs) would need to be ready for this as well; otherwise, subtle bugs could arise.

The Active trick mostly applies to Android, where the Window can disappear at any time. For other platforms, this isn't an issue (in fact, you can create a new Active safely at any time using Active::new() on non-Android platforms). If anyone has a better solution for this, I'm all ears. In fact, I'm not even sure if this solves the problem. @rib What do you think, since it seems like you're more familiar with the Android backend?

Sorry for missing this, and the delayed reply - I was on holiday last week and not actively following up on github stuff.

I think I may have skipped a beat here and might need to follow the bread crumbs to understand the issue here.

There are a few things that make this stuff a little confusing on Android:

very old versions of Android used to require you to recreate your full graphics context after suspend + resume and it's sometimes assumed that's still a thing that apps need to worry about.
The Android NDK confusingly refers to surface swap chain state as a "window", which isn't consistent with the Java SDK.
The android_native_app_glue (C glue layer for native apps) will "terminate" your "window" on suspend, and for some kind of consistent inconsistency we have used the same terminology for all of this in ndk-glue and also android-activity for native Rust apps.

It's maybe a bit of a semantics discussion but on Android the window is a thing that's stable across suspend and resume and is generally full screen. It's also not directly referenced when constructing a graphics API surface.

Where Android's graphics stack is a bit unusual is that it effectively passes around a swapchain / BufferQueue that is something that can hand out gralloc buffers for you to render to and present to the screen / surface flinger. That source of buffers gets referred to as an ANativeWindow in the NDK and is a reference counted object.

When we get notified that the Android ANativeWindow has been "terminated" that should effectively just tell us that we "won't be able to get any more buffers to render to this surface". I believe that the actual API should remain safe to use so long as we continue to hold a reference to the ANativeWindow though (I don't think there should be a soundness issue in terms of having some pointer be invalidated under our nose).

What this means, on Android, is that you have to re-create graphics API surfaces each time the app resumes. (Such as an EGLSurface, or WGPU surface, or Vulkan VkSurface etc).

In this Glutin example I handle that by clearing surface state when suspending, and lazily-recreate when resuming: https://github.com/rust-mobile/rust-android-examples/blob/41e72ca9de45e2add4776ec004f916ad979cf781/na-winit-glutin/src/lib.rs#L386

If you don't you should just get left with a surface that's associated with a stale BufferQueue - which should be safe but you won't be able to do anything with it any more because the buffer queue will stop handing out buffers you can render to. (i.e a black screen)

I might be overlooking something but as far as I knew there wasn't any safety / soundness issue with Android here currently so I'm not currently sure what issue was being addressed here for Android.

There used to be a synchronization problem with ndk-glue for tracking the window create/terminate events between Java and native Rust code but that was addressed by the change in how events are dispatched in android-activity (following the same basic design from android_native_app_glue.c)

notgull commented 1 year ago

Hi!

I believe that the actual API should remain safe to use so long as we continue to hold a reference to the ANativeWindow though

Are you sure? The documentation for onNativeWindowDestroyed seems to indicate otherwise.

You MUST ensure that you do not touch the window object after returning from this function

I'm aware that there may be some subtlety under the hood here; maybe the window object is still valid, but the buffer queue isn't, like you said. Or maybe the docs just haven't been updated. However, I'd like to avoid going against the docs unless there is an authoritative source that says otherwise.

The docs also say:

in the common case of drawing to the window from another thread, that means the implementation of this callback must properly synchronize with the other thread to stop its drawing before returning from here.

This is what the Active structure does. It synchronizes with the currently open window handles to make sure that none are left alive before the application suspends and we can't access the window anymore. You ask some more in-depth questions on other PRs/issues related to this issue, so I'll clarify more there.

What this means, on Android, is that you have to re-create graphics API surfaces each time the app resumes. (Such as an EGLSurface, or WGPU surface, or Vulkan VkSurface etc).

The idea is that the GPU implementation would check the window_handle to make sure that it's not dealing with an invalid window. The practice is that I forgot to do that in rust-windowing/glutin#1582, so thanks for reminding me.

MarijnS95 commented 1 year ago

I tested this a while ago in https://github.com/rust-windowing/raw-window-handle/issues/84#issuecomment-1413542994 but it means nothing when considering the onNativeWindowDestroyed documentation. In short I could keep using the ANativeWindow object after onNativeWindowDestroyed if acquiring an extra reference on it (in Rust terms, the thing you get from onNativeWindowCreated is effectively a borrow with a lifetime lasting until returning from onNativeWindowDestroyed, and the NDK crate aptly allows you to clone() it which will turn this into an owned reference with increased refcount). I might even have gotten buffers from the BufferQueue; they just wouldn't get presented anywhere because another buffer queue is now used: that is why you do not just recreate a Surface and Swapchain (in the context of Vulkan) after "resuming", you create it on the new ANativeWindow handle from onNativeWindowCreated.

(I haven't actually verified if you can/may get the same ANativeWindow handle back...)

Fwiw it isn't exactly true that android-activity is safer about ANativeWindow than ndk-glue. They just handle its lifetime in a different way. Both acquire an extra reference on NativeWindow despite already having another mechanism to clean up the handle before returning from onNativeWindowDestroyed. Then android-activity may be a bit more convenient here, but both crates wrap it in a lock/condvar of sorts, and wait until that lock is cleared before returning from onNativeWindowDestroyed. ndk-glue expects the user (winit in this case, which did so during the ndk-glue days) to have read the documentation and grab the lock on Resumed and release it on Suspended to not block indefinitely in onNativeWindowDestroyed, which you may consider as being less safe (in case it is forgotten) compared to android-activity which "holds" it behind your back and uses messages and convars to block in both onNativeWindowCreated and onNativeWindowDestroyed until the new or removed window has been adopted [^1].

[^1]: @rib on that note perhaps the current while loop should use wait_while() - I'm more used to seeing condvars used with a callback of sorts to validate the condition (otoh it seems the implementation nowadays translates to a while loop anyways).

Either way both implementations trivially allow the user to keep their Vulkan/GL surfaces/swapchains alive when created from the original handles (but users will find out soon enough that that doesn't work after a resume, if it didn't crash already).

rib commented 1 year ago

In short I could keep using the ANativeWindow object after onNativeWindowDestroyed if acquiring an extra reference on it (in Rust terms, the thing you get from onNativeWindowCreated is effectively a borrow with a lifetime lasting until returning from onNativeWindowDestroyed, and the NDK crate aptly allows you to clone() it which will turn this into an owned reference with increased refcount).

yeah, I think this is a good way of squaring why the NativeActivity docs would say "MUST". Without being more specific then they don't know that you might have taken your own reference to the ANativeWindow - and if you haven't taken your own reference then you'd suddenly risk a use-after-free segfault from accessing an invalid pointer.

rib commented 1 year ago

in the common case of drawing to the window from another thread, that means the implementation of this callback must properly synchronize with the other thread to stop its drawing before returning from here.

This is what the Active structure does. It synchronizes with the currently open window handles to make sure that none are left alive before the application suspends and we can't access the window anymore. You ask some more in-depth questions on other PRs/issues related to this issue, so I'll clarify more there.

I'm not sure this is really the result you get from the Active structure (or rather, it doesn't account for how the handle can continue to be effectively held outside of the Active struct synchronization protocol).

E.g. as noted here: https://github.com/rust-windowing/raw-window-handle/pull/110#issuecomment-1493413935 any surface you create is also going to hold a reference to the ANativeWindow and this protocol isn't going to guarantee that all surfaces are destroyed.

rib commented 1 year ago

in the common case of drawing to the window from another thread, that means the implementation of this callback must properly synchronize with the other thread to stop its drawing before returning from here.

Just to add a bit here, similar to @MarijnS95's comments; Half of what this is referring to is also something that android_native_app_glue, ndk-glue and android-activity go some way to help with - they all implement some way of synchronizing the native thread with the Java thread when it's notifying us that the surface has been take away.

With android-activity, similar to android_native_app_glue then the way to fully honor this requirement is to ensure that you drop all graphics API surfaces before the the callback that dispatches a MainEvent::TerminateWindow event returns (The Java thread will be blocked from returning from onNativeWindowDestroyed while the native thread handles this event).

I tend to think that this needs to be done at an engine/middleware level though. I'm not sure that we can come up with a low level protocol for tearing down graphics surfaces (like EGLSurface / VkSurface etc) automatically in sync with onNativeWindowDestroyed. I think we probably still have to let high layers of the stack remain responsible for this.

notgull commented 1 year ago

Hmm, so it seems like the two prescient points are:

The current Active API doesn't prevent the ANativeWindow from being used illegally by graphics APIs. Personally, I think that it can if the activity/equality is checked before any GPU calls are made, but that would be irrelevant if:
Passing an ANativeWindow handle to the graphics UI after onNativeWindowDestroyed may not actually be unsound behavior, if the window handle is explicitly retained. This may be vendor dependent, though.

If the latter point holds true, then it might be prudent to just remove the Active/ActiveHandle mechanism entirely. It'd be a breaking change, but it might be worth it.

@rib and @MarijnS95, I’ll reopen this issue since it seems that there’s more to discuss here. In order to prevent discussion from being scattered across a bunch of PRs and issues, could we keep it here?

notgull commented 1 year ago

As of #118, I think we're all good here. We've now gotten a healthy selection of platform maintainers to take a look at this system and I think we've patched all the holes.

I'll give it a week for anyone else to raise any concerns, just to be safe.

MarijnS95 commented 1 year ago

@notgull FYI I haven't played with the the new borrowed handles myself yet, will attempt to do so but no promises that I can do that within a week :)

Lokathor commented 1 year ago

Given the uh, "other events" going on in Rust(tm) lately I think we should extend the time here by a fair bit.

notgull commented 1 year ago

Given the uh, "other events" going on in Rust(tm) lately I think we should extend the time here by a fair bit.

Ach, fair point. I'll wait until the trademark issue reaches a reasonable conclusion.

notgull commented 1 year ago

Given the uh, "other events" going on in Rust(tm) lately I think we should extend the time here by a fair bit.

I think these issues have been resolved, and that we've had plenty of time for discussion here. Would you think that a release is in order?

kchibisov commented 1 year ago

I'd still like to address the https://github.com/rust-windowing/raw-window-handle/pull/104 so we would have a breaking changes packed into a single release.

notgull commented 1 year ago

Finalized in #126

rust-windowing / raw-window-handle

Refine the safe window handle API #111