Open thejh opened 4 years ago
Oh, and in case someone encountering this just wants a quick workaround: Try putting the following into ~/.drirc:
<driconf>
<device screen="0" driver="i965">
<application name="chrome" executable="chrome">
<option name="bo_reuse" value="0" />
</application>
</device>
</driconf>
At least in a quick test, this seems to prevent the problem from triggering (probably at the cost of making Chromium a little bit slower).
Hm... normally, when buffer objects are exported via brw_bo_gem_export_to_prime(), bo->reusable is set to false. But Chromium's CreateBufferForBO() calls GetPlaneFdForBo(), which directly calls drmPrimeHandleToFD() (just an example call chain, not necessarily the culprit):
77929 <... ioctl resumed>) = 0
> /lib/x86_64-linux-gnu/libc-2.30.so(ioctl+0x7) [0xf43e7]
> /usr/lib/x86_64-linux-gnu/libdrm.so.2.4.0(drmIoctl+0x27) [0x5cc7]
> /usr/lib/x86_64-linux-gnu/libdrm.so.2.4.0(drmPrimeHandleToFD+0x36) [0x8e06]
> /home/jann/chromium/src/out/mybuild_stable/libozone.so(gbm_wrapper::CreateBufferForBO(gbm_bo*, unsigned int, gfx::Size const&, unsigned int)+0x24e) [0x9effe]
> /home/jann/chromium/src/out/mybuild_stable/libozone.so(gbm_wrapper::Device::CreateBufferWithModifiers(unsigned int, gfx::Size const&, unsigned int, std::__Cr::vector<unsigned long, std::__Cr::allocator<unsigned long> > const&)+0xd2) [0x9fd82]
> /home/jann/chromium/src/out/mybuild_stable/libozone.so(ui::GbmPixmapWayland::InitializeBuffer(gfx::Size, gfx::BufferFormat, gfx::BufferUsage)+0xfc) [0xca4dc]
> /home/jann/chromium/src/out/mybuild_stable/libozone.so(ui::WaylandSurfaceFactory::CreateNativePixmap(int, VkDevice_T*, gfx::Size, gfx::BufferFormat, gfx::BufferUsage, base::Optional<gfx::Size>)+0x5c) [0xac51c]
> /home/jann/chromium/src/out/mybuild_stable/libgpu_ipc_service.so(gpu::GpuMemoryBufferFactoryNativePixmap::CreateAnonymousImage(gfx::Size const&, gfx::BufferFormat, gfx::BufferUsage, int, bool*)+0xad) [0x4890d]
> /home/jann/chromium/src/out/mybuild_stable/libgpu_ipc_service.so(non-virtual thunk to gpu::GpuMemoryBufferFactoryNativePixmap::CreateAnonymousImage(gfx::Size const&, gfx::BufferFormat, gfx::BufferUsage, int, bool*)+0x19) [0x48c69]
> /home/jann/chromium/src/out/mybuild_stable/libgles2.so(gpu::SharedImageBackingFactoryGLTexture::CreateSharedImageInternal(gpu::Mailbox const&, viz::ResourceFormat, int, gfx::Size const&, gfx::ColorSpace const&, unsigned int, base::span<unsigned char const, 18446744073709551615ul>)+0x580) [0x23a390]
> /home/jann/chromium/src/out/mybuild_stable/libgles2.so(gpu::SharedImageBackingFactoryGLTexture::CreateSharedImage(gpu::Mailbox const&, viz::ResourceFormat, int, gfx::Size const&, gfx::ColorSpace const&, unsigned int, bool)+0x41) [0x239e01]
> /home/jann/chromium/src/out/mybuild_stable/libgles2.so(gpu::SharedImageFactory::CreateSharedImage(gpu::Mailbox const&, viz::ResourceFormat, gfx::Size const&, gfx::ColorSpace const&, int, unsigned int)+0x91) [0x23d031]
> /home/jann/chromium/src/out/mybuild_stable/libservice.so(viz::SkiaOutputDeviceBufferQueue::Image::Initialize(gfx::Size const&, gfx::ColorSpace const&, viz::ResourceFormat, viz::SkiaOutputSurfaceDependency*, unsigned int)+0x4d) [0x144b5d]
> /home/jann/chromium/src/out/mybuild_stable/libservice.so(viz::SkiaOutputDeviceBufferQueue::Reshape(gfx::Size const&, float, gfx::ColorSpace const&, gfx::BufferFormat, gfx::OverlayTransform)+0x1d6) [0x144a16]
> /home/jann/chromium/src/out/mybuild_stable/libservice.so(viz::SkiaOutputSurfaceImplOnGpu::Reshape(gfx::Size const&, float, gfx::ColorSpace const&, gfx::BufferFormat, bool, gfx::OverlayTransform)+0xcd) [0x150c8d]
> /home/jann/chromium/src/out/mybuild_stable/libservice.so(base::internal::Invoker<base::internal::BindState<viz::SkiaOutputSurfaceImpl::ScheduleGpuTask(base::OnceCallback<void ()>, std::__Cr::vector<gpu::SyncToken, std::__Cr::allocator<gpu::SyncToken> >)::$_2, base::OnceCallback<void ()> >, void ()>::RunOnce(base::internal::BindStateBase*)+0x49) [0x14faa9]
> /home/jann/chromium/src/out/mybuild_stable/libgpu.so(gpu::Scheduler::RunNextTask()+0x404) [0x87854]
> /home/jann/chromium/src/out/mybuild_stable/libbase.so(base::TaskAnnotator::RunTask(char const*, base::PendingTask*)+0x12a) [0x18656a]
> /home/jann/chromium/src/out/mybuild_stable/libbase.so(base::sequence_manager::internal::ThreadControllerWithMessagePumpImpl::DoWorkImpl(base::sequence_manager::LazyNow*, bool*)+0x17d) [0x1990fd]
> /home/jann/chromium/src/out/mybuild_stable/libbase.so(base::sequence_manager::internal::ThreadControllerWithMessagePumpImpl::DoSomeWork()+0x9f) [0x198e8f]
> /home/jann/chromium/src/out/mybuild_stable/libbase.so(base::MessagePumpDefault::Run(base::MessagePump::Delegate*)+0x59) [0x13e9b9]
> /home/jann/chromium/src/out/mybuild_stable/libbase.so(base::sequence_manager::internal::ThreadControllerWithMessagePumpImpl::Run(bool, base::TimeDelta)+0x74) [0x199904]
> /home/jann/chromium/src/out/mybuild_stable/libbase.so(base::RunLoop::Run()+0x18d) [0x169e1d]
> /home/jann/chromium/src/out/mybuild_stable/libcontent.so(content::GpuMain(content::MainFunctionParams const&)+0x6f7) [0x84ea87]
> /home/jann/chromium/src/out/mybuild_stable/libcontent.so(content::RunZygote(content::ContentMainDelegate*)+0xae7) [0x13c7ff7]
> /home/jann/chromium/src/out/mybuild_stable/libcontent.so(content::ContentMainRunnerImpl::Run(bool)+0x114) [0x13c9264]
> /home/jann/chromium/src/out/mybuild_stable/libembedder.so(service_manager::Main(service_manager::MainParams const&)+0xd33) [0xf753]
> /home/jann/chromium/src/out/mybuild_stable/libcontent.so(content::ContentMain(content::ContentMainParams const&)+0x80) [0x13c74d0]
> /home/jann/chromium/src/out/mybuild_stable/chrome(ChromeMain+0xcd) [0xf11cfd]
> /lib/x86_64-linux-gnu/libc-2.30.so(__libc_start_main+0xea) [0x26e0a]
> /home/jann/chromium/src/out/mybuild_stable/chrome(_start+0x29) [0xf11b39]
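To make the suspected mechanism concrete, here is a toy model in C of the two export paths described above. Everything in it (`struct toy_bo`, the `toy_*` functions) is hypothetical and only mirrors the idea, not mesa's real data structures. It assumes, as I understand the i965 buffer manager, that a reusable buffer is marked purgeable when it is put into the reuse cache.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a driver buffer object; not mesa's real struct. */
struct toy_bo {
    int handle;
    bool reusable;   /* may the driver recycle this buffer after the last unref? */
    bool exported;   /* has another process been handed a reference (an fd)? */
    bool purgeable;  /* models the I915_MADV_DONTNEED state */
};

/* Export through the driver: the buffer is taken out of the reuse scheme. */
static void toy_export_via_driver(struct toy_bo *bo)
{
    bo->exported = true;
    bo->reusable = false;
}

/* Export behind the driver's back (models a direct drmPrimeHandleToFD()
 * call): the driver never learns that the buffer is shared. */
static void toy_export_direct(struct toy_bo *bo)
{
    bo->exported = true;
}

/* Last driver-side reference dropped: a reusable buffer goes into the
 * cache and is marked purgeable (assumption, see the lead-in). */
static void toy_unreference(struct toy_bo *bo)
{
    if (bo->reusable)
        bo->purgeable = true;
}

/* True if an importer could end up submitting a purgeable buffer. */
static bool toy_importer_at_risk(const struct toy_bo *bo)
{
    return bo->exported && bo->purgeable;
}
```

In this model, a buffer that goes through `toy_export_via_driver()` and is then unreferenced is never at risk, while one that goes through `toy_export_direct()` is. That would also fit with the bo_reuse workaround helping, since disabling reuse means nothing is ever put into the cache.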
I guess it's time for me to file a Chromium bug...
FWIW, I've filed https://bugs.chromium.org/p/chromium/issues/detail?id=1063680. From what I can tell, a proper fix is going to be annoying to implement because it requires extending the GBM interface and adding plumbing in mesa.
Shouldn't Chromium keep the buffer alive till wl_buffer.release is received?
Oh, I didn't realize that that callback existed... I guess I'll take another look.
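For anyone following along, the wl_buffer.release idea can be sketched as a small state machine in C. This is a toy model that does not use libwayland: `struct toy_buffer` and the `toy_*` functions are made-up names, and `toy_on_release()` merely stands in for whatever a client would run when the compositor sends wl_buffer.release.

```c
#include <stdbool.h>

/* Hypothetical client-side bookkeeping for one buffer. */
struct toy_buffer {
    bool busy;            /* the compositor may still be reading from it */
    bool destroy_pending; /* the client wanted to free it while it was busy */
    bool destroyed;       /* the backing storage has really been freed */
};

/* The client attaches and commits the buffer: from now on the compositor
 * owns it until a release event arrives. */
static void toy_commit(struct toy_buffer *b)
{
    b->busy = true;
}

/* The client is done with the buffer. If the compositor still holds it,
 * only remember the request instead of freeing the storage. */
static void toy_destroy(struct toy_buffer *b)
{
    if (b->busy) {
        b->destroy_pending = true;
        return;
    }
    b->destroyed = true;
}

/* Stand-in for the handler of the release event: the compositor no longer
 * uses the buffer, so a deferred destroy can go ahead. */
static void toy_on_release(struct toy_buffer *b)
{
    b->busy = false;
    if (b->destroy_pending) {
        b->destroy_pending = false;
        b->destroyed = true;
    }
}
```

Calling `toy_commit()`, then `toy_destroy()`, leaves the buffer alive; it only goes away once `toy_on_release()` runs. Whether this matches what Chromium would need is an open question in this thread.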
I had similar crashing on Wayfire, so this is a wlroots issue, right? (Doesn't seem to affect the iris driver, though)
My chromium randomly crashes and drags down everything returning me to tty1. Happens when I am e.g. in tabbed layout and then switching TO chromium window. How can I verify I am facing exactly this issue you are describing here?
@mindrunner you could try the bo_reuse workaround I posted further up in the thread
yeah, just did that. Sometimes it does not happen for days, tho.
I understand, this is a chrome bug (possibly). However, I wonder whether there is a way for a client that fucks things up to crash on its own without pulling down the whole rest of the system. Isn't that kind of the idea of having a user space and separating things? Especially in the wayland world, where it is a pain in the ass to share a screen for security reasons, but then some buggy code is able to make everything crash. Is there no way for sway to check whether that piece of memory is valid and allowed to be accessed?
@mindrunner AFAIU sway can't really do that on its own (in a non-racy way) because the decision to exit the whole process when the kernel signals an error happens in mesa. So to fix that part, you'd probably have to go change mesa to add some kinda retry logic for the error case to figure out what went wrong and remove problematic parts.
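Roughly what I mean by retry logic, as a toy sketch in C. None of this is mesa code: `struct toy_entry` and the `toy_*` functions are hypothetical, and `toy_submit()` only imitates a submission failing with EFAULT ("Bad address") when the list contains a purgeable buffer.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical validation-list entry; a real driver tracks far more state. */
struct toy_entry {
    int handle;
    bool purgeable; /* models a buffer in I915_MADV_DONTNEED state */
};

/* Stand-in for the kernel's check at submission time: fails with -EFAULT
 * if any entry refers to a purgeable buffer. */
static int toy_submit(const struct toy_entry *list, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (list[i].purgeable)
            return -EFAULT;
    return 0;
}

/* Compacts the list in place, dropping purgeable entries.
 * Returns the new length. */
static size_t toy_drop_bad(struct toy_entry *list, size_t n)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++)
        if (!list[i].purgeable)
            list[kept++] = list[i];
    return kept;
}

/* Submit once; on -EFAULT drop the offending entries and retry one time
 * instead of exiting the whole process. */
static int toy_submit_with_retry(struct toy_entry *list, size_t *n)
{
    int ret = toy_submit(list, *n);
    if (ret != -EFAULT)
        return ret;
    *n = toy_drop_bad(list, *n);
    return toy_submit(list, *n);
}
```

The hard part that this sketch skips is everything around it: the batch itself still references the dropped buffers, so a real implementation would also have to stop using them (and probably disconnect the client that supplied them).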
Makes sense. Would be a killer feature for me tho. Having a stable environment in daily business has so much value. Guess, it's not easy to implement :(
The bo_reuse workaround seems to mitigate the issue for me. Didn't have a crash with chromium since then. However, the instability of sway due to badly programmed 3rd party software is really giving me doubts if this is suitable as a daily driver setup. I just had a full sway/wayland crash after coming out of a screenshare/huddle session in slack. Slack UI was unresponsive and I was not able to end the call. After killing the client, the whole Desktop went down and I found myself back on tty1 :(
I am aware, this is hard to debug and super pain in the ass. Bug can be anywhere. kernel/drivers/slack/chromium/electron/wlroots/sway. Uff, what a long list...
Still I wonder if
1) there is anything I can do? Am I supposed to have debug symbols attached all the time to be able to send stack-traces? Does that have a significant performance impact?
2) is there effort put into this at all? Or is the only solution to figure out, it's not a sway bug, so we have to report somewhere else?
Sorry for being bit off topic and polluting here.
I'm still in the process of debugging this, but figured I'd file an issue here for now to track things, in case anyone else is concurrently trying to figure out the same issue:
When running a build of Chromium git master with native wayland support (built with use_ozone = true, invoked with --ozone-platform=wayland) on a system with integrated Intel graphics, sway occasionally dies with the error message i965: Failed to submit batchbuffer: Bad address, which comes from mesa's submit_batch(). The cause of this error message appears to be that the execbuffer's validation list contains entries pointing to objects in I915_MADV_DONTNEED state.
My suspicion so far is that the core problem is somewhere in Chromium, and mesa is making things worse by not being able to handle this condition gracefully; but I'm not sure yet.
To be able to reproduce this more reliably, you can hack a new API for testing the status of an object into the kernel:
and then teach mesa to use that when adding things to an execbuffer's validation list (I really hope this is actually correct and doesn't just trigger false warnings):
The assert triggers as soon as I launch Chromium.
Based on some debugging I've done with strace, the handles referenced by these validation list entries seem to be for relocations, and seem to be allocated via DRM_IOCTL_PRIME_FD_TO_HANDLE, from here:
and in at least one case, the handle actually initially was I915_MADV_WILLNEED, but then became I915_MADV_DONTNEED at a later point without any relevant-looking syscalls from sway.
The file descriptor from which the handle was created was received from Chromium over a unix domain socket; in Chromium, the file descriptor comes from this spot: