kenz-gelsoft / gecko-dev

Read-only Git mirror of the Mercurial gecko repositories at https://hg.mozilla.org. How to contribute: https://firefox-source-docs.mozilla.org/contributing/contribution_quickref.html
https://firefox-source-docs.mozilla.org/setup/index.html

Firefox crashes in short period #34

Open kenz-gelsoft opened 4 weeks ago

kenz-gelsoft commented 4 weeks ago

The KDLs (#30) were fixed in hrev57971. Since then, Firefox crashes easily instead of entering KDL. (To reproduce: just wait for a while, or activate/deactivate the window repeatedly.)

I think I'm now facing the real cause of the abnormal state that previously led to the KDL.

Probably related to https://github.com/kenz-gelsoft/gecko-dev/issues/33

waddlesplash commented 4 weeks ago

The KDL was caused by the following sequence of events:

At this point, the mmap'ed region should act like another open FD to the file, but the bug was that it wasn't and instead the RAMFS just removed the underlying storage. Then the kernel VM system tried to merge the page stores rather than keep extra ones around, but of course they were of incompatible types, leading to the KDL.
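
The "another open FD" expectation can be illustrated without mmap. This is only a minimal sketch, assuming a POSIX-like filesystem: Rust's standard library has no mmap wrapper, so it shows the open-descriptor half of the analogy, and the function name read_after_unlink is made up for illustration.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Read, Seek, SeekFrom, Write};
use std::path::Path;

/// Write `data` to `path`, keep the handle open, unlink the file, then read
/// the data back through the still-open handle.
fn read_after_unlink(path: &Path, data: &[u8]) -> io::Result<Vec<u8>> {
    let mut file = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .truncate(true)
        .open(path)?;
    file.write_all(data)?;

    // Remove the directory entry while the handle is still open.
    fs::remove_file(path)?;

    // On POSIX-like systems the storage has to stay alive until the last
    // reference (an open FD, or a mapping) goes away, so this read succeeds.
    file.seek(SeekFrom::Start(0))?;
    let mut buf = Vec::new();
    file.read_to_end(&mut buf)?;
    Ok(buf)
}
```

A filesystem that frees the storage at unlink time, as the RAMFS bug described here did for mapped regions, breaks exactly this expectation.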

I don't think this was an abnormal state at all; it was just a bug in the RAMFS and the VM system. Since it's been fixed, whatever was using that should behave correctly now.

I do note, though, that before I made the fix, I at least saw the "Your tab crashed" message appear, while after the fix it didn't seem to. But perhaps this was just random?

kenz-gelsoft commented 4 weeks ago

Thank you for the detailed explanation of the situation.

I suspect this may be related to the following wayland-server problem.

Yes, I also encounter that. X512 said it is wayland-server's problem, but it is not clear he encounters same problem or not.

Input is wayland-server problem. It get confused by subsurface created by Firefox. Input should be delivered to root surface, but is currently misdelivered to subsurface.

I have draft fix locally.

https://github.com/kenz-gelsoft/gecko-dev/issues/30#issuecomment-2289844012

As far as I know, Haiku's Wayland uses RAMFS through shm. I believe Firefox also uses shm for IPC, so it may be a mistake or inconsistency in the ported IPC code.

So I will try X512's WIP wayland-server again tonight.

waddlesplash commented 4 weeks ago

The mappings in the crash were ones allocated by nspr as I mentioned before, so yes it sounds IPC related not Wayland related.

kenz-gelsoft commented 3 weeks ago

I didn't succeed in debugging the multi-process part of Firefox with Qt Creator and GDB.

It may be a bug in GDB or Qt Creator (or a part that was skipped during porting), or I may just not know how. I'll try printf-debugging for now, and read the official docs later if I have no luck with that.

waddlesplash commented 3 weeks ago

Did you try with the newer/fixed GDB? https://github.com/haikuports/haikuports/pull/10799

waddlesplash commented 3 weeks ago

It may be worth noting that one can attach to multiple processes at once in Qt Creator with separate GDB sessions (or just start Qt Creator multiple times.)

kenz-gelsoft commented 3 weeks ago

I could catch the crash with ./mach run --debugger=gdb (without a GUI frontend). Let's dig in...

~/src/gecko-dev> ./mach run --debugger=gdb
 0:00.34 /bin/gdb -q --args /boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/firefox -no-remote -profile /boot/home/src/gecko-dev/obj-ff-dbg/tmp/profile-default
Reading symbols from /boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/firefox...
(gdb) r

(snip)

Thread 43 "WRWorker#0" received signal SIGILL, Illegal instruction.
[Switching to team 262174 (firefox) thread 262232 (WRWorker#0)]
Fontconfig warning: ignoring en_US_POSIX.UTF-8: not a valid region tag
MOZ_Crash (aFilename=<optimized out>, aLine=<optimized out>, 
    aReason=0x7f0002f54934 "called `Result::unwrap()` on an `Err` value: NulError(59, [47, 98, 111, 111, 116, 47, 115, 121, 115, 116, 101, 109, 47, 100, 97, 116, 97, 47, 102, 111, 110, 116, 115, 47, 116, 116, 102, 111, 110, 116,"...) at /boot/home/src/gecko-dev/obj-ff-dbg/dist/include/mozilla/Assertions.h:317
317       MOZ_REALLY_CRASH(aLine);
(gdb) bt
#0  MOZ_Crash (aFilename=<optimized out>, aLine=<optimized out>, 
    aReason=0x7f0002f54934 "called `Result::unwrap()` on an `Err` value: NulError(59, [47, 98, 111, 111, 116, 47, 115, 121, 115, 116, 101, 109, 47, 100, 97, 116, 97, 47, 102, 111, 110, 116, 115, 47, 116, 116, 102, 111, 110, 116,"...) at /boot/home/src/gecko-dev/obj-ff-dbg/dist/include/mozilla/Assertions.h:317
#1  RustMozCrash (aFilename=<optimized out>, aLine=<optimized out>, 
    aReason=0x7f0002f54934 "called `Result::unwrap()` on an `Err` value: NulError(59, [47, 98, 111, 111, 116, 47, 115, 121, 115, 116, 101, 109, 47, 100, 97, 116, 97, 47, 102, 111, 110, 116, 115, 47, 116, 116, 102, 111, 110, 116,"...) at wrappers.cpp:18
#2  0x000000000f769aff in mozglue_static::panic_hook (info=<optimized out>) at mozglue/static/rust/lib.rs:98
#3  0x000000000f76975c in core::ops::function::Fn::call<fn(&core::panic::panic_info::PanicInfo), (&core::panic::panic_info::PanicInfo)> ()
    at /build/rust/library/core/src/ops/function.rs:79
#4  0x000000001007b783 in std::panicking::rust_panic_with_hook () from /boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so
#5  0x000000001006d214 in std::panicking::begin_panic_handler::{{closure}} () from /boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so
#6  0x000000001006d009 in std::sys_common::backtrace::__rust_end_short_backtrace () from /boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so
#7  0x000000001007b327 in rust_begin_unwind () from /boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so
#8  0x00000000100a1573 in core::panicking::panic_fmt () from /boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so
#9  0x000000001009bca6 in core::result::unwrap_failed () from /boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so
#10 0x000000000f554894 in core::result::Result<alloc::ffi::c_str::CString, alloc::ffi::c_str::NulError>::unwrap<alloc::ffi::c_str::CString, alloc::ffi::c_str::NulError> (
    self=...) at /build/rust/library/core/src/result.rs:1077
#11 wr_glyph_rasterizer::platform::unix::font::FontCache::add_font (
    self=0x106c2470 <<wr_glyph_rasterizer::platform::unix::font::FONT_CACHE as core::ops::deref::Deref>::deref::__stability::LAZY+24>, template=...)
    at gfx/wr/wr_glyph_rasterizer/src/platform/unix/font.rs:225
#12 0x000000000f555e72 in wr_glyph_rasterizer::platform::unix::font::FontContext::add_native_font (self=0x100104986970, font_key=0x7f0002f55080, native_font_handle=...)
    at gfx/wr/wr_glyph_rasterizer/src/platform/unix/font.rs:397
#13 0x000000000f553c18 in wr_glyph_rasterizer::rasterizer::{impl#33}::add_font (self=0x100104986970, font_key=0x7f0002f55080, template=0x7f0002f55060)
    at gfx/wr/wr_glyph_rasterizer/src/rasterizer.rs:1670
#14 0x000000000f54815e in wr_glyph_rasterizer::rasterizer::{impl#32}::add_font::{closure#0} (context=...) at gfx/wr/wr_glyph_rasterizer/src/rasterizer.rs:1574
#15 wr_glyph_rasterizer::rasterizer::{impl#31}::async_for_each::{closure#0}<wr_glyph_rasterizer::rasterizer::{impl#32}::add_font::{closure_env#0}> ()
    at gfx/wr/wr_glyph_rasterizer/src/rasterizer.rs:1483
#16 core::panic::unwind_safe::{impl#23}::call_once<(), wr_glyph_rasterizer::rasterizer::{impl#31}::async_for_each::{closure_env#0}<wr_glyph_rasterizer::rasterizer::{impl#32}::add_font::{closure_env#0}>> (self=...) at /build/rust/library/core/src/panic/unwind_safe.rs:272
#17 std::panicking::try::do_call<core::panic::unwind_safe::AssertUnwindSafe<wr_glyph_rasterizer::rasterizer::{impl#31}::async_for_each::{closure_env#0}<wr_glyph_rasterizer::rasterizer::{impl#32}::add_font::{closure_env#0}>>, ()> (data=<error reading variable: Cannot access memory at address 0x0>) at /build/rust/library/std/src/panicking.rs:559
#18 std::panicking::try<(), core::panic::unwind_safe::AssertUnwindSafe<wr_glyph_rasterizer::rasterizer::{impl#31}::async_for_each::{closure_env#0}<wr_glyph_rasterizer::rasterizer::{impl#32}::add_font::{closure_env#0}>>> (f=...) at /build/rust/library/std/src/panicking.rs:523
#19 std::panic::catch_unwind<core::panic::unwind_safe::AssertUnwindSafe<wr_glyph_rasterizer::rasterizer::{impl#31}::async_for_each::{closure_env#0}<wr_glyph_rasterizer::rasterizer::{impl#32}::add_font::{closure_env#0}>>, ()> (f=<error reading variable: Cannot access memory at address 0x0>) at /build/rust/library/std/src/panic.rs:149
#20 rayon_core::unwind::halt_unwinding<wr_glyph_rasterizer::rasterizer::{impl#31}::async_for_each::{closure_env#0}<wr_glyph_rasterizer::rasterizer::{impl#32}::add_font::{closure_env#0}>, ()> (func=<error reading variable: Cannot access memory at address 0x0>) at third_party/rust/rayon-core/src/unwind.rs:17
#21 rayon_core::registry::Registry::catch_unwind<wr_glyph_rasterizer::rasterizer::{impl#31}::async_for_each::{closure_env#0}<wr_glyph_rasterizer::rasterizer::{impl#32}::add_font::{closure_env#0}>> (self=0x1001015cd880, f=...) at third_party/rust/rayon-core/src/registry.rs:366
BMessage('CLCH') {
}
[Child 262310, StreamTrans #1] WARNING: Failed to retrieve memory telemetry for ResidentPeak: file /boot/home/src/gecko-dev/xpcom/base/MemoryTelemetry.cpp:344
#22 rayon_core::spawn::spawn_job::{closure#0}<wr_glyph_rasterizer::rasterizer::{impl#31}::async_for_each::{closure_env#0}<wr_glyph_rasterizer::rasterizer::{impl#32}::add_font::{closure_env#0}>> () at third_party/rust/rayon-core/src/spawn/mod.rs:97
#23 rayon_core::job::{impl#6}::execute<rayon_core::spawn::spawn_job::{closure_env#0}<wr_glyph_rasterizer::rasterizer::{impl#31}::async_for_each::{closure_env#0}<wr_glyph_rasterizer::rasterizer::{impl#32}::add_font::{closure_env#0}>>> (this=0x100103e092b0) at third_party/rust/rayon-core/src/job.rs:169
#24 0x00000000100057d5 in rayon_core::job::JobRef::execute () at src/job.rs:64
#25 rayon_core::registry::WorkerThread::execute (self=<optimized out>, self=<optimized out>) at src/registry.rs:859
#26 rayon_core::registry::WorkerThread::wait_until_cold (self=0x7f0002f55280, latch=0x100101510d50) at src/registry.rs:793
#27 0x00000000100030ed in rayon_core::registry::WorkerThread::wait_until<rayon_core::latch::OnceLatch> (self=0x7f0002f55280, latch=<optimized out>) at src/registry.rs:768
#28 rayon_core::registry::WorkerThread::wait_until_out_of_work (self=0x7f0002f55280) at src/registry.rs:817
#29 rayon_core::registry::main_loop (thread=...) at src/registry.rs:922
#30 rayon_core::registry::ThreadBuilder::run (self=...) at src/registry.rs:52
#31 0x0000000010000f9a in rayon_core::registry::{impl#2}::spawn::{closure#0} () at src/registry.rs:97
#32 std::sys_common::backtrace::__rust_begin_short_backtrace<rayon_core::registry::{impl#2}::spawn::{closure_env#0}, ()> (f=...)
    at /build/rust/library/std/src/sys_common/backtrace.rs:155
#33 0x0000000010001741 in std::thread::{impl#0}::spawn_unchecked_::{closure#2}::{closure#0}<rayon_core::registry::{impl#2}::spawn::{closure_env#0}, ()> ()
    at /build/rust/library/std/src/thread/mod.rs:542
#34 core::panic::unwind_safe::{impl#23}::call_once<(), std::thread::{impl#0}::spawn_unchecked_::{closure#2}::{closure_env#0}<rayon_core::registry::{impl#2}::spawn::{closure_env#0}, ()>> (self=<error reading variable: Cannot access memory at address 0x10>) at /build/rust/library/core/src/panic/unwind_safe.rs:272
#35 std::panicking::try::do_call<core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#2}::{closure_env#0}<rayon_core::registry::{impl#2}::spawn::{closure_env#0}, ()>>, ()> (data=<error reading variable: Cannot access memory at address 0x0>) at /build/rust/library/std/src/panicking.rs:559
#36 std::panicking::try<(), core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#2}::{closure_env#0}<rayon_core::registry::{impl#2}::spawn::{closure_env#0}, ()>>> (f=<error reading variable: Cannot access memory at address 0x10>) at /build/rust/library/std/src/panicking.rs:523
#37 std::panic::catch_unwind<core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#2}::{closure_env#0}<rayon_core::registry::{impl#2}::spawn::{closure_env#0}, ()>>, ()> (f=<error reading variable: Cannot access memory at address 0x10>) at /build/rust/library/std/src/panic.rs:149
#38 std::thread::{impl#0}::spawn_unchecked_::{closure#2}<rayon_core::registry::{impl#2}::spawn::{closure_env#0}, ()> () at /build/rust/library/std/src/thread/mod.rs:541
#39 core::ops::function::FnOnce::call_once<std::thread::{impl#0}::spawn_unchecked_::{closure_env#2}<rayon_core::registry::{impl#2}::spawn::{closure_env#0}, ()>, ()> ()
    at /build/rust/library/core/src/ops/function.rs:250
#40 0x000000001007dceb in std::sys::pal::unix::thread::Thread::new::thread_start () from /boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so
#41 0x0000000000916528 in pthread_thread_entry(void*, void*) () from /boot/system/lib/libroot.so
#42 0x00007fffb04f0258 in ?? () from commpage
#43 0x0000000000000000 in ?? ()
(gdb) 
kenz-gelsoft commented 3 weeks ago

Did you try with the newer/fixed GDB? haikuports/haikuports#10799

No, not yet, but I haven't needed it so far, as the current crash occurs in the parent process.

Thank you for the info. I'll try it later.

kenz-gelsoft commented 3 weeks ago

I can set a breakpoint on the function that is going to crash, like this. I have to debug with GDB's CUI, so it's time to study it...

~/src/gecko-dev> ./mach run --debugger=gdb
 0:00.36 /bin/gdb -q --args /boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/firefox -no-remote -profile /boot/home/src/gecko-dev/obj-ff-dbg/tmp/profile-default
Reading symbols from /boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/firefox...
(gdb) b wr_glyph_rasterizer::platform::unix::font::FontCache::add_font
Function "wr_glyph_rasterizer::platform::unix::font::FontCache::add_font" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (wr_glyph_rasterizer::platform::unix::font::FontCache::add_font) pending.
(gdb) r

(snip)

Thread 43 "WRWorker#0" hit Breakpoint 1, wr_glyph_rasterizer::platform::unix::font::FontCache::add_font (
    self=0x106c2470 <<wr_glyph_rasterizer::platform::unix::font::FONT_CACHE as core::ops::deref::Deref>::deref::__stability::LAZY+24>, template=...)
    at gfx/wr/wr_glyph_rasterizer/src/platform/unix/font.rs:208
208             if let Some(cached) = self.fonts.get(&template) {
(gdb) l
203                 lcd_filter_uses: 0,
204             }
205         }
206     
207         fn add_font(&mut self, template: FontTemplate) -> Result<Arc<Mutex<CachedFont>>, FT_Error> {
208             if let Some(cached) = self.fonts.get(&template) {
209                 return Ok(cached.clone());
210             }
211             unsafe {
212                 let mut face: FT_Face = ptr::null_mut();
(gdb) 
kenz-gelsoft commented 3 weeks ago

It seems Qt Creator should attach to a running process, but either Qt Creator or GDB doesn't implement process listing yet (or it isn't working correctly). Alternatively, there may be a way to run gdbserver with Firefox and attach to that gdbserver from Qt Creator, but I haven't found it.

[Screenshots: "process list" (screenshot17), "attach to gdbserver" (screenshot18)]
waddlesplash commented 3 weeks ago

Likely Qt Creator needs to be adjusted to support listing running processes, yes.

kenz-gelsoft commented 3 weeks ago

In the Rust crash, the passed font path looks fine, so why does the (Rust) CString conversion fail?

/boot/system/data/fonts/ttfonts/NotoSansDisplay-Regular.ttf
[9222] Hit MOZ_CRASH(called `Result::unwrap()` on an `Err` value: NulError(59, [47, 98, 111, 111, 116, 47, 115, 121, 115, 116, 101, 109, 47, 100, 97, 116, 97, 47, 102, 111, 110, 116, 115, 47, 116, 116, 102, 111, 110, 116, 115, 47, 78, 111, 116, 111, 83, 97, 110, 115, 68, 105, 115, 112, 108, 97, 121, 45, 82, 101, 103, 117, 108, 97, 114, 46, 116, 116, 102, 0])) at gfx/wr/wr_glyph_rasterizer/src/platform/unix/font.rs:226
    fn add_font(&mut self, template: FontTemplate) -> Result<Arc<Mutex<CachedFont>>, FT_Error> {
        if let Some(cached) = self.fonts.get(&template) {
            return Ok(cached.clone());
        }
        unsafe {
            let mut face: FT_Face = ptr::null_mut();
            let result = match template {
                FontTemplate::Raw(ref bytes, index) => {
                    FT_New_Memory_Face(
                        self.lib,
                        bytes.as_ptr(),
                        bytes.len() as FT_Long,
                        index as FT_Long,
                        &mut face,
                    )
                }
                FontTemplate::Native(NativeFontHandle { ref path, index }) => {
                    let str = path.as_os_str().to_str().unwrap();
                    println!("{}", &str);
                    let cstr = CString::new(str).unwrap();
waddlesplash commented 3 weeks ago

https://doc.rust-lang.org/std/ffi/c_str/struct.NulError.html

An error indicating that an interior nul byte was found.

There isn't a NUL byte in the middle of the string but there is one at the end. Perhaps that's the problem, and the string needs to be treated as one byte shorter?
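
A minimal sketch of why the trailing byte matters: CString::new appends its own terminator, so it rejects any NUL byte in its input, including one at the very end. The helper name first_nul_position is made up for illustration; CString::new and NulError::nul_position are standard library calls.

```rust
use std::ffi::CString;

/// Try to build a C string from `s`. On failure, return the index of the
/// first NUL byte, as reported by `NulError::nul_position`.
fn first_nul_position(s: &str) -> Option<usize> {
    match CString::new(s) {
        Ok(_) => None,
        Err(e) => Some(e.nul_position()),
    }
}
```

The font path in the crash message is 59 bytes long, so a NUL appended directly after it sits at index 59, which matches the NulError(59, ...) in the log.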

kenz-gelsoft commented 3 weeks ago

Yes, the path (OsStr) shouldn't contain a trailing NUL byte, but it does. The root cause is probably a platform-specific API that behaves differently from the already supported platforms.

But for now I'll work around those crashes.

kenz-gelsoft commented 3 weeks ago

With the following patch, I can operate the parent process window without an immediate crash. (Tabs still crash immediately for some reason.)

This is not a true fix, but a workaround until I figure out the root cause of this crash.

diff --git a/gfx/wr/wr_glyph_rasterizer/src/platform/unix/font.rs b/gfx/wr/wr_glyph_rasterizer/src/platform/unix/font.rs
index 5885d8f9a270..9e120698d5e0 100644
--- a/gfx/wr/wr_glyph_rasterizer/src/platform/unix/font.rs
+++ b/gfx/wr/wr_glyph_rasterizer/src/platform/unix/font.rs
@@ -221,8 +221,17 @@ impl FontCache {
                     )
                 }
                 FontTemplate::Native(NativeFontHandle { ref path, index }) => {
-                    let str = path.as_os_str().to_str().unwrap();
-                    let cstr = CString::new(str).unwrap();
+                    let str = path.as_os_str().to_str();
+                    if str == None {
+                        return Err(0);
+                    }
+                    let str = str.unwrap();
+                    println!("add_font={}", &str);
+                    let mut v = Vec::from(str);
+                    while v.ends_with(&[0]) {
+                        v.pop();
+                    }
+                    let cstr = CString::from_vec_unchecked(v);
                     FT_New_Face(
                         self.lib,
                         cstr.as_ptr(),
@@ -394,9 +403,11 @@ impl FontContext {
     pub fn add_native_font(&mut self, font_key: &FontKey, native_font_handle: NativeFontHandle) {
         if !self.fonts.contains_key(font_key) {
             let path = native_font_handle.path.to_string_lossy().into_owned();
+            println!("add_native_font={}", &path);
             match FONT_CACHE.lock().unwrap().add_font(FontTemplate::Native(native_font_handle)) {
                 Ok(font) => self.fonts.insert(*font_key, font),
-                Err(result) => panic!("adding native font failed: file={} err={:?}", path, result),
+                Err(result) => None,
+//                     panic!("adding native font failed: file={} err={:?}", path, result),
             };
         }
     }
diff --git a/netwerk/base/nsStandardURL.cpp b/netwerk/base/nsStandardURL.cpp
index fac8e4ca7fb8..5e6f4a0f805d 100644
--- a/netwerk/base/nsStandardURL.cpp
+++ b/netwerk/base/nsStandardURL.cpp
@@ -319,7 +319,7 @@ struct DumpLeakedURLs {
 };

 DumpLeakedURLs::~DumpLeakedURLs() {
-  MOZ_ASSERT(NS_IsMainThread());
+//  MOZ_ASSERT(NS_IsMainThread());
   StaticMutexAutoLock lock(gAllURLsMutex);
   if (!gAllURLs.isEmpty()) {
     printf("Leaked URLs:\n");
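
The trailing-NUL part of the workaround can also be written without unsafe code. This is only a sketch under the same assumption as the patch (the path arrives with one or more trailing NUL bytes); path_to_cstring is a hypothetical helper, not a function in wr_glyph_rasterizer.

```rust
use std::ffi::CString;

/// Drop any trailing NUL bytes, then build a checked `CString`.
/// Returns `None` if a NUL byte remains in the middle of the path.
fn path_to_cstring(path: &str) -> Option<CString> {
    let trimmed = path.trim_end_matches('\0');
    CString::new(trimmed).ok()
}
```

Compared with CString::from_vec_unchecked, the checked constructor still reports an interior NUL instead of silently producing a truncated C string.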

It somewhat works, for example for Settings or other browser UI that doesn't require a child content process (tabs).

screenshot21

Changing the theme without accessing addons.mozilla.org seems to work.

screenshot22

kenz-gelsoft commented 3 weeks ago

We need to use wayland-server from X512's WIP branch.

I'm using the following change for this in my haikuports tree:

diff --git a/dev-libs/wayland-server/wayland_server-0.1.20230326.recipe b/dev-libs/wayland-server/wayland_server-0.1.20230326.recipe
index 092f40caf..86718484c 100644
--- a/dev-libs/wayland-server/wayland_server-0.1.20230326.recipe
+++ b/dev-libs/wayland-server/wayland_server-0.1.20230326.recipe
@@ -7,11 +7,10 @@ HOMEPAGE="https://github.com/X547/wayland-server"
 COPYRIGHT="2022-2023 X512"
 LICENSE="GNU LGPL v2.1
        MIT"
-REVISION="1"
-srcGitRev="1e3eb35b40bc54438594bd959b553ecd619333fc"
+REVISION="2"
+srcGitRev="ceea7c3655a4059c02bc2c1461a2278481a363fc"
 SOURCE_URI="https://github.com/X547/wayland-server/archive/$srcGitRev.tar.gz"
-CHECKSUM_SHA256="bdd6a16864ebfecc97e8896f67bf91daa0bcd5e3d8854f7146e15136bf496947"
-PATCHES="wayland_server-$portVersion.patchset"
+CHECKSUM_SHA256="f0a723ed38c0eb3d3ee838d89d063ab9ea60c6efdc65256b5de4da20c5c91e17"
 SOURCE_DIR="wayland-server-$srcGitRev"

 ARCHITECTURES="all !x86_gcc2"

I also rm dev-libs/wayland-server/patches/wayland_server-0.1.20230326.patchset, as it no longer applies cleanly.

kenz-gelsoft commented 3 weeks ago

Next, I'll investigate where the NUL-terminated Rust string comes from. (This is invalid: a Rust string used as a path should not carry a terminating NUL, and CString::new rejects input that contains a NUL byte.)

I guess it comes from FreeType or GTK+'s font management code.



waddlesplash commented 3 weeks ago

It looks like an off-by-one. Should be an easy fix once we determine where the path is coming from, yes.
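
One way such an off-by-one could arise, sketched under the assumption that the path reaches Rust as a C buffer whose length counts the terminator. The thread has not established where the path actually comes from, so both function names here are hypothetical.

```rust
use std::ffi::{CStr, OsStr};
use std::os::unix::ffi::OsStrExt;
use std::path::PathBuf;

/// Off-by-one variant: treats the whole buffer, terminator included,
/// as the path, so the result ends with a NUL byte.
fn path_including_terminator(buf: &[u8]) -> PathBuf {
    PathBuf::from(OsStr::from_bytes(buf))
}

/// Checked variant: requires exactly one NUL, at the end, and excludes it.
fn path_excluding_terminator(buf: &[u8]) -> Option<PathBuf> {
    let c = CStr::from_bytes_with_nul(buf).ok()?;
    Some(PathBuf::from(OsStr::from_bytes(c.to_bytes())))
}
```

With a 60-byte buffer holding the 59-byte font path plus its terminator, the first variant yields a 60-byte path and the second yields the expected 59-byte one.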

waddlesplash commented 3 weeks ago

There are a few minor patches for C89 support: https://github.com/haikuports/haikuports/blob/master/media-libs/fontconfig/patches/fontconfig-2.13.96.patchset

I also note our Fontconfig is 2 years old...

kenz-gelsoft commented 3 weeks ago

I haven't found the root cause of the misbehavior yet. As far as I have investigated, fontconfig and its code path in Gecko are not special-cased for Haiku. So something in IPC may be broken.

Tabs crash immediately for some reason.

[Parent 10398, Socket Thread] WARNING: 'NS_FAILED(rv)', file /boot/home/src/gecko-dev/netwerk/base/nsUDPSocket.cpp:1464
add_native_font=/boot/system/data/fonts/ttfonts/NotoSansDisplay-Regular.t
add_font=/boot/system/data/fonts/ttfonts/NotoSansDisplay-Regular.t
add_native_font=/boot/system/data/fonts/ttfonts/NotoSansDisplay-Regular.t
add_font=/boot/system/data/fonts/ttfonts/NotoSansDisplay-Regular.t
add_native_font=/boot/system/data/fonts/ttfonts/NotoSansDisplay-Regular.t
add_font=/boot/system/data/fonts/ttfonts/NotoSansDisplay-Regular.t
add_native_font=/boot/system/data/fonts/ttfonts/NotoSansDisplay-Regular.t
add_font=/boot/system/data/fonts/ttfonts/NotoSansDisplay-Regular.t

Either my patch or the preceding UDP socket error (I'm not sure whether it is related) results in a broken font file name (font file descriptor) being sent here.

[WARN  neqo_transport::ecn] ECN validation failed, no ECN counts in ACK frame
add_native_font=/boot/system/data/fonts/otfonts/NotoSansCJKjp-Bold.otf
add_font=/boot/system/data/fonts/otfonts/NotoSansCJKjp-Bold.otf
add_native_font=/boot/system/data/fonts/otfonts/NotoSansCJKjp-Bold.otf
add_native_font=/boot/system/data/fonts/otfonts/NotoSansCJKjp-Bold.otf
add_native_font=/boot/system/data/fonts/otfonts/NotoSansCJKjp-Bold.otf
add_native_font=/boot/system/data/fonts/ttfonts/NotoSansDisplay-Bold.ttf
add_font=/boot/system/data/fonts/ttfonts/NotoSansDisplay-Bold.ttf
add_native_font=/boot/system/data/fonts/ttfonts/NotoSansDisplay-Bold.ttf
add_native_font=/boot/system/data/fonts/ttfonts/NotoSansDisplay-Bold.ttf
add_native_font=/boot/system/data/fonts/ttfonts/NotoSansDisplay-Bold.ttf
console.error: (new ReferenceError("WebAssembly is not defined", "resource://gre/actors/TranslationsParent.sys.mjs", 2737))
[Parent 10398, IPC I/O Parent] WARNING: Message needs unreceived descriptors channel:116534405b60 message-type:3538947 header()->num_handles:1 num_fds:1 fds_i:1: file /boot/home/src/gecko-dev/ipc/chromium/src/chrome/common/ipc_channel_posix.cc:467
[Parent 10398, IPC I/O Parent] WARNING: [1.1]: Dropping message '<null>'; no connection to unknown peer 9ADAF0623116EE71.73CA4320F3B7D720: file /boot/home/src/gecko-dev/ipc/glue/NodeController.cpp:365
[Child 10577, IPC I/O Child] WARNING: [9ADAF0623116EE71.73CA4320F3B7D720]: Dropping message '<null>'; no connection to unknown peer 1.1: file /boot/home/src/gecko-dev/ipc/glue/NodeController.cpp:365
Exiting due to channel error.

Then the child process seems to exit (crash) here.

After that, the parent process fails to send messages to the child, which results in the tab crash screen.

[Parent 10398, Main Thread] WARNING: IPC Connection Error: [Parent][PContentParent] Send(msgname=PVsync::Msg_Notify) Channel error: cannot send/recv: file /boot/home/src/gecko-dev/ipc/glue/MessageChannel.cpp:1943
[Parent 10398, Main Thread] WARNING: IPC Connection Error: [Parent][PContentParent] RunMessage(msgname=PBrowser::Msg_DidUnsuppressPainting) Channel error: cannot send/recv: file /boot/home/src/gecko-dev/ipc/glue/MessageChannel.cpp:1943
kenz-gelsoft commented 3 weeks ago

In another session

[Parent 11130, Socket Thread] WARNING: 'NS_FAILED(rv)', file /boot/home/src/gecko-dev/netwerk/base/nsUDPSocket.cpp:1464
[WARN  neqo_transport::ecn] ECN validation failed, no ECN counts in ACK frame
add_native_font=/boot/system/data/fonts/ttfonts/NotoSansDisplay-Regular.
add_font=/boot/system/data/fonts/ttfonts/NotoSansDisplay-Regular.
add_native_font=/boot/system/data/fonts/ttfonts/NotoSansDisplay-Regular.
add_font=/boot/system/data/fonts/ttfonts/NotoSansDisplay-Regular.
add_native_font=/boot/system/data/fonts/ttfonts/NotoSansDisplay-Regular.
add_font=/boot/system/data/fonts/ttfonts/NotoSansDisplay-Regular.
add_native_font=/boot/system/data/fonts/ttfonts/NotoSansDisplay-Regular.
add_font=/boot/system/data/fonts/ttfonts/NotoSansDisplay-Regular.
add_native_font=ttf
add_font=ttf
add_native_font=ttf
add_font=ttf
add_native_font=ttf
add_font=ttf
add_native_font=ttf
add_font=ttf
add_native_font=/boot/system/data/fonts/otfonts/NotoSansCJKjp-Regular.otf
add_font=/boot/system/data/fonts/otfonts/NotoSansCJKjp-Regular.otf
add_native_font=/boot/system/data/fonts/otfonts/NotoSansCJKjp-Regular.otf
add_native_font=/boot/system/data/fonts/otfonts/NotoSansCJKjp-Regular.otf
add_native_font=/boot/system/data/fonts/otfonts/NotoSansCJKjp-Regular.otf

The UDP error occurred and a broken font descriptor was sent, but the tab hasn't crashed yet at this point.

console.error: (new ReferenceError("WebAssembly is not defined", "resource://gre/actors/TranslationsParent.sys.mjs", 2737))
[Parent 11130, IPC I/O Parent] WARNING: Message needs unreceived descriptors channel:1035c36c9440 message-type:3538947 header()->num_handles:1 num_fds:1 fds_i:1: file /boot/home/src/gecko-dev/ipc/chromium/src/chrome/common/ipc_channel_posix.cc:467
[Parent 11130, IPC I/O Parent] WARNING: [1.1]: Dropping message '<null>'; no connection to unknown peer 9204DC16FF8193D8.62E6D3ACD5469EF: file /boot/home/src/gecko-dev/ipc/glue/NodeController.cpp:365
[Child 11392, IPC I/O Child] WARNING: [9204DC16FF8193D8.62E6D3ACD5469EF]: Dropping message '<null>'; no connection to unknown peer 1.1: file /boot/home/src/gecko-dev/ipc/glue/NodeController.cpp:365
Exiting due to channel error.
[Parent 11130, Main Thread] WARNING: IPC Connection Error: [Parent][PContentParent] Send(msgname=PVsync::Msg_Notify) Channel error: cannot send/recv: file /boot/home/src/gecko-dev/ipc/glue/MessageChannel.cpp:1943
[Parent 11130, Main Thread] WARNING: IPC Connection Error: [Parent][PContentParent] RunMessage(msgname=PNecko::Msg_PredLearn) Channel error: cannot send/recv: file /boot/home/src/gecko-dev/ipc/glue/MessageChannel.cpp:1943
[Parent 11130, Main Thread] WARNING: IPC Connection Error: [Parent]

Suppose the UDP error is the root cause. In these two sessions I entered www.haiku-os.org in the address bar. I don't believe the Haiku website uses QUIC (a UDP-based, high-efficiency transfer protocol), but Firefox accesses Google to show incremental suggestions as I type.

If so, disabling QUIC may change the behavior.

Something file-descriptor related (fd_num or so) may also be suspicious.

waddlesplash commented 3 weeks ago

[Parent 10398, IPC I/O Parent] WARNING: Message needs unreceived descriptors channel:116534405b60 message-type:3538947 header()->num_handles:1 num_fds:1 fds_i:1: file /boot/home/src/gecko-dev/ipc/chromium/src/chrome/common/ipc_channel_posix.cc:467

This doesn't look like a UDP problem; it looks like it wanted to send an FD across to the other process but it didn't work for some reason. Is there an #ifdef we are missing in that code?
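
A toy model of what the counters in that warning might mean. This is an assumption read off the field names in the log (num_handles, num_fds, fds_i), not a transcription of ipc_channel_posix.cc, and has_enough_descriptors is a made-up name.

```rust
/// Toy model: a message declaring `num_handles` descriptors can only be
/// decoded if that many unconsumed descriptors remain. `num_fds` is how many
/// descriptors have arrived so far and `fds_i` is how many of those earlier
/// messages already consumed (both meanings are assumptions).
fn has_enough_descriptors(num_handles: usize, num_fds: usize, fds_i: usize) -> bool {
    num_fds.saturating_sub(fds_i) >= num_handles
}
```

Under that reading, the logged values (num_handles:1 num_fds:1 fds_i:1) describe a message that expects one more descriptor after the only received one was already used, which fits the idea that an FD did not make it across.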

kenz-gelsoft commented 3 weeks ago

In another session, I didn't type any URL before the tab crashed.

[Parent 527, Socket Thread] WARNING: 'NS_FAILED(rv)', file /boot/home/src/gecko-dev/netwerk/base/nsUDPSocket.cpp:1464
[WARN  neqo_common::log] Logging initialization error SetLoggerError(())
console.error: (new ReferenceError("WebAssembly is not defined", "resource://gre/actors/TranslationsParent.sys.mjs", 2737))
[Parent 527, Socket Thread] WARNING: 'NS_FAILED(rv)', file /boot/home/src/gecko-dev/netwerk/base/nsUDPSocket.cpp:1464
[Parent 527, Main Thread] WARNING: Failed to retarget HTML data delivery to the parser thread.: file /boot/home/src/gecko-dev/parser/html/nsHtml5StreamParser.cpp:1215
[WARN  neqo_transport::ecn] ECN validation failed, no ECN counts in ACK frame
add_native_font=/boot/system/data/fonts/ttfonts/NotoSansMono-ExtraBold.ttf
add_font=/boot/system/data/fonts/ttfonts/NotoSansMono-ExtraBold.ttf
add_native_font=/boot/system/data/fonts/ttfonts/NotoSansMono-ExtraBold.ttf
add_native_font=/boot/system/data/fonts/ttfonts/NotoSansMono-ExtraBold.ttf
add_native_font=/boot/system/data/fonts/ttfonts/NotoSansMono-ExtraBold.ttf
WaylandServer::fDisplay: 0x115705e36520
Error parsing B_ARGV_RECEIVED message. Message:
BMessage('_ARG') {
        argc = int32(0x12 or 18)
        argv[0] = string("/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/firefox", 53 bytes)
        argv[1] = string("-contentproc", 13 bytes)
        argv[2] = string("{b2c7e39d-69e6-4ddb-8de9-d3bd9bfc5b0d}", 39 bytes)
        argv[3] = string("527", 4 bytes)
        argv[4] = string("tab", 4 bytes)
        cwd = string("/boot/home/src/gecko-dev", 25 bytes)
}
wl_ips_client_connected
display: 0x115705e36520
client: 0x115705df5360
[Child 656, Main Thread] WARNING: Failed to create file monitor for /boot/home/config/settings/glib-2.0/settings/keyfile: Unable to find default local file monitor type: 'glib warning', file /boot/home/src/gecko-dev/toolkit/xre/nsSigHandlers.cpp:187

(/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/firefox:656): GLib-GIO-WARNING **: 06:12:05.551: Failed to create file monitor for /boot/home/config/settings/glib-2.0/settings/keyfile: Unable to find default local file monitor type
console.error: (new ReferenceError("WebAssembly is not defined", "resource://gre/actors/TranslationsParent.sys.mjs", 2737))
Fontconfig warning: ignoring en_US_POSIX.UTF-8: not a valid region tag
[WARN  neqo_transport::ecn] ECN validation failed, no ECN counts in ACK frame
[Child 656, IPC I/O Child] WARNING: Message needs unreceived descriptors channel:115705e4fb60 message-type:8126467 header()->num_handles:1 num_fds:1 fds_i:1: file /boot/home/src/gecko-dev/ipc/chromium/src/chrome/common/ipc_channel_posix.cc:467
[Child 656, IPC I/O Child] WARNING: [97AD972BABF5DA23.4341559340BDB7AC]: Dropping message '<null>'; no connection to unknown peer 1.1: file /boot/home/src/gecko-dev/ipc/glue/NodeController.cpp:365
[Parent 527, IPC I/O Parent] WARNING: [1.1]: Dropping message '<null>'; no connection to unknown peer 97AD972BABF5DA23.4341559340BDB7AC: file /boot/home/src/gecko-dev/ipc/glue/NodeController.cpp:365
Exiting due to channel error.
[Parent 527, Main Thread] WARNING: IPC Connection Error: [Parent][PContentParent] Send(msgname=PVsync::Msg_Notify) Channel error: cannot send/recv: file /boot/home/src/gecko-dev/ipc/glue/MessageChannel.cpp:1943
[Parent 527, Main Thread] WARNING: IPC Connection Error: [Parent][PContentParent] Send(msgname=PVsync::Msg_Notify) Channel error: cannot send/recv: file /boot/home/src/gecko-dev/ipc/glue/MessageChannel.cpp:1943
kenz-gelsoft commented 3 weeks ago

This doesn't look like a UDP problem; it looks like it wanted to send a FD across to the other process but it didn't work for some reason. Is there an #ifdef we are missing that code?

Yes, disabling HTTP3 in about:config didn't help.

I probably have an #ifdef that has not been verified for Haiku.

kenz-gelsoft commented 3 weeks ago

https://github.com/kenz-gelsoft/gecko-dev/commit/6900b9b2872ea2fcf0ca14eb81fdf50dfde02b51#diff-e43b173be5a075de9c388531bbd5c0975ce5a4344f7e35fbc87fed5402de8e10

This change may be incorrect. I shouldn't have chosen a code path just because it compiles.

waddlesplash commented 3 weeks ago

We don't have an FD dir at all, but that shouldn't be necessary? Anyway it looks like the code checks RLIMIT_NOFILE so this should not matter.
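
For reference, a fallback that works without an FD directory could look roughly like the sketch below. It is not the Gecko implementation: `CloseFdRange` and `MaxFdFromRlimit` are hypothetical names, and the only assumption taken from the comment above is that RLIMIT_NOFILE bounds the descriptor numbers to visit.

```cpp
#include <sys/resource.h>
#include <unistd.h>
#include <climits>

// Hypothetical helper: upper bound for descriptor numbers taken from
// RLIMIT_NOFILE, or `fallback` if the limit is unavailable or unusable.
int MaxFdFromRlimit(int fallback) {
  rlimit lim;
  if (getrlimit(RLIMIT_NOFILE, &lim) != 0 ||
      lim.rlim_cur == RLIM_INFINITY ||
      lim.rlim_cur > static_cast<rlim_t>(INT_MAX)) {
    return fallback;
  }
  return static_cast<int>(lim.rlim_cur);
}

// Hypothetical helper: closes every descriptor in [first_fd, limit) unless
// should_preserve(ctx, fd) returns true. No allocation and no locks, so it
// stays usable between fork() and exec().
void CloseFdRange(int first_fd, int limit, void* ctx,
                  bool (*should_preserve)(void*, int)) {
  for (int fd = first_fd; fd < limit; ++fd) {
    if (should_preserve && should_preserve(ctx, fd)) {
      continue;
    }
    close(fd);  // EBADF for numbers that were never open is ignored.
  }
}
```

A caller would typically run `CloseFdRange(3, MaxFdFromRlimit(1024), ctx, pred)` so that stdin, stdout and stderr are left alone.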

kenz-gelsoft commented 3 weeks ago

I haven't read the code, but I confirmed that it is used by replacing CloseSuperfluousFds() with MOZ_CRASH().

[4151] Hit MOZ_CRASH(CloseSuperfluousFds() is not implemented) at /boot/home/src/gecko-dev/ipc/chromium/src/base/process_util_posix.cc:124
#01: CERT_GetFirstEmailAddress[/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so +0x464ab1c]
#02: CERT_GetFirstEmailAddress[/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so +0x463f856]
#03: CERT_GetFirstEmailAddress[/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so +0x46931f9]
#04: CERT_GetFirstEmailAddress[/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so +0x4691659]
#05: CERT_GetFirstEmailAddress[/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so +0x46a3225]
#06: CERT_GetFirstEmailAddress[/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so +0x3d9d19d]
#07: CERT_GetFirstEmailAddress[/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so +0x3dbfa0d]
#08: CERT_GetFirstEmailAddress[/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so +0x3db6c4e]
#09: CERT_GetFirstEmailAddress[/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so +0x3dbc921]
#10: CERT_GetFirstEmailAddress[/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so +0x46b8928]
#11: CERT_GetFirstEmailAddress[/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so +0x46462df]
#12: CERT_GetFirstEmailAddress[/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so +0x3db2cba]
#13: PR_Select[/boot/system/lib/libnspr4.so +0x2a50b]
#14: pthread_exit[/boot/system/lib/libroot.so +0x4e528]
#15: ??? (???:???)
diff --git a/ipc/chromium/src/base/process_util_posix.cc b/ipc/chromium/src/base/process_util_posix.cc
index a33b94c74d55..bd2b3540c41d 100644
--- a/ipc/chromium/src/base/process_util_posix.cc
+++ b/ipc/chromium/src/base/process_util_posix.cc
@@ -120,17 +120,20 @@ void CloseSuperfluousFds(void* aCtx, bool (*aShouldPreserve)(void*, int)) {
   // DANGER: no calls to malloc (or locks, etc.) are allowed from now on:
   // https://crbug.com/36678
   // Also, beware of STL iterators: https://crbug.com/331459
+#ifdef XP_HAIKU
+  MOZ_CRASH("CloseSuperfluousFds() is not implemented");
+#else // !XP_HAIKU
 #if defined(ANDROID)
   static const rlim_t kSystemDefaultMaxFds = 1024;
   static const char kFDDir[] = "/proc/self/fd";
@@ -195,6 +198,7 @@ void CloseSuperfluousFds(void* aCtx, bool (*aShouldPreserve)(void*, int)) {
       }
     }
   }
+#endif // !XP_HAIKU
 }

 #ifdef MOZ_ENABLE_FORKSERVER
kenz-gelsoft commented 3 weeks ago

[Child 656, IPC I/O Child] WARNING: Message needs unreceived descriptors channel:115705e4fb60 message-type:8126467 header()->num_handles:1 num_fds:1 fds_i:1: file /boot/home/src/gecko-dev/ipc/chromium/src/chrome/common/ipc_channel_posix.cc:467

The error message comes from here; it also looks related to CloseSuperfluousFds().

bool Channel::ChannelImpl::ProcessIncomingMessages() {
// (snip)
  for (;;) {
// (snip)
    // a pointer to an array of |num_wire_fds| file descriptors from the read
    const int* wire_fds = NULL;
    unsigned num_wire_fds = 0;

    // walk the list of control messages and, if we find an array of file
    // descriptors, save a pointer to the array
// (snip)
    // A pointer to an array of |num_fds| file descriptors which includes any
    // fds that have spilled over from a previous read.
    const int* fds;
    unsigned num_fds;
    unsigned fds_i = 0;  // the index of the first unused descriptor

    if (input_overflow_fds_.empty()) {
      fds = wire_fds;
      num_fds = num_wire_fds;
    } else {
      // This code may look like a no-op in the case where
      // num_wire_fds == 0, but in fact:
      //
      // 1. wire_fds will be nullptr, so passing it to memcpy is
      // undefined behavior according to the C standard, even though
      // the memcpy length is 0.
      //
      // 2. prev_size will be an out-of-bounds index for
      // input_overflow_fds_; this is undefined behavior according to
      // the C++ standard, even though the element only has its
      // pointer taken and isn't accessed (and the corresponding
      // operation on a C array would be defined).
      //
      // UBSan makes #1 a fatal error, and assertions in libstdc++ do
      // the same for #2 if enabled.
      if (num_wire_fds > 0) {
        const size_t prev_size = input_overflow_fds_.size();
        input_overflow_fds_.resize(prev_size + num_wire_fds);
        memcpy(&input_overflow_fds_[prev_size], wire_fds,
               num_wire_fds * sizeof(int));
      }
      fds = &input_overflow_fds_[0];
      num_fds = input_overflow_fds_.size();
    }
// (snip)
      Message& m = *incoming_message_;

      if (m.header()->num_handles) {
        // the message has file descriptors
        const char* error = NULL;
        if (m.header()->num_handles > num_fds - fds_i) {
          // the message has been completely received, but we didn't get
          // enough file descriptors.
          error = "Message needs unreceived descriptors";
        }
waddlesplash commented 3 weeks ago

What is num_handles, and what have we received instead?

I see the same message on my end and poked around a bit. It looks like kControlBufferMaxFds is too high for Haiku (internally we only support about 32, and actually 16 on x86_64 due to an oversight that I will fix), but I added some panic/dprintf around this and it appears that Firefox only tries to send 3 FDs at most, so we are nowhere near the limit and that should not be the problem.

waddlesplash commented 3 weeks ago

Ah, I see the log message includes this. All the numbers are 1, so we get 1>1-1 which is true.
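
The arithmetic can be replayed in isolation. The function below only mirrors the comparison from the ProcessIncomingMessages() excerpt; its name is made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Same comparison as in the excerpt: the message needs num_handles
// descriptors, and (num_fds - fds_i) received descriptors are still unused.
bool NeedsUnreceivedDescriptors(uint32_t num_handles, unsigned num_fds,
                                unsigned fds_i) {
  assert(fds_i <= num_fds);  // otherwise the unsigned subtraction wraps
  return num_handles > num_fds - fds_i;
}
```

With the values from the log (num_handles 1, num_fds 1, fds_i 1) this returns true: one descriptor was received, but the index of the first unused descriptor already points past it, so none is left for this message.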

waddlesplash commented 3 weeks ago

Anyway, I fixed the few problems I saw in sending FDs across sockets in hrev57990, but that doesn't fix this problem.

kenz-gelsoft commented 3 weeks ago

I now interpret that error message as meaning that the IPC message requires one more new FD so that the parent and child can communicate with each other simultaneously (cf. a control channel and a data channel).

I don't know yet why it fails.

waddlesplash commented 3 weeks ago

The error message appears to mean that it expected at least one FD from the socket but it didn't actually receive it, or thinks it didn't receive it, yes. Investigating what we got back from cmsghdr should reveal whether we got the FD at all, or whether the kernel somehow didn't send it properly.
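
A minimal sketch of such a check on the receiving side, using only the standard CMSG_ macros. The helper names are hypothetical, and where exactly to call them in ipc_channel_posix.cc is left open.

```cpp
#include <sys/socket.h>
#include <cstddef>

// Counts the descriptors delivered as SCM_RIGHTS ancillary data in a msghdr
// that recvmsg() has already filled in.
size_t CountReceivedFds(msghdr* msg) {
  size_t count = 0;
  for (cmsghdr* cmsg = CMSG_FIRSTHDR(msg); cmsg != nullptr;
       cmsg = CMSG_NXTHDR(msg, cmsg)) {
    if (cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_RIGHTS) {
      count += (cmsg->cmsg_len - CMSG_LEN(0)) / sizeof(int);
    }
  }
  return count;
}

// True if the control buffer was too small and ancillary data was cut off.
bool ControlDataWasTruncated(const msghdr& msg) {
  return (msg.msg_flags & MSG_CTRUNC) != 0;
}
```

Logging both values right after recvmsg() would show whether the descriptor never arrived or arrived and was dropped later.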

kenz-gelsoft commented 3 weeks ago

Thanks for digging into this!

I will check this tonight by grepping for the XP_ macros of other OSes nearby.

Is there an #ifdef we are missing that code?

kenz-gelsoft commented 3 weeks ago

[Child 656, IPC I/O Child] WARNING: [97AD972BABF5DA23.4341559340BDB7AC]: Dropping message ''; no connection to unknown peer 1.1: file /boot/home/src/gecko-dev/ipc/glue/NodeController.cpp:365

For this message, we have no (message) broker and message

https://searchfox.org/mozilla-esr128/source/ipc/glue/NodeController.cpp#365

waddlesplash commented 3 weeks ago

All those messages start after the first "Message needs unreceived descriptors", yes? They may just be subsequent errors caused by that first one.

kenz-gelsoft commented 3 weeks ago

Yes, I started reading the code, and it does indeed seem to be a subsequent error. Thanks.

waddlesplash commented 3 weeks ago

It looks like the message type is always 8126467 (in your messages and in my local testing.) What message-type is that? Perhaps that's where the missing ifdef is?

waddlesplash commented 3 weeks ago

Ah, I've now gotten one with 3670076, so it doesn't always happen that way.

waddlesplash commented 3 weeks ago

I added some hacks to the UNIX domain socket implementation to return all FDs in the buffer up front, even if not reading their associated data yet (since the IPC code in Gecko seems to tolerate receiving them early), but the error still happens.

I haven't added any tracing to see what FDs are being written to the socket, only those being read on the output side. I guess you may have an easier time adding send-side tracing than I will.

waddlesplash commented 3 weeks ago

Added even more tracing in the kernel. The child in question doesn't seem to be sent any FDs; it doesn't receive any and there are no sent ones unaccounted for. Example:

[Child 363, IPC I/O Child] WARNING: Message needs unreceived descriptors channel:12342658f6e0 message-type:8126467 header()->num_handles:1 num_fds:1 fds_i:1: file /boot/home/src/gecko-dev/ipc/chromium/src/chrome/common/ipc_channel_posix.cc:467
[Child 363, IPC I/O Child] WARNING: [893615A1F5272DF7.42737031AAC5C5E2]: Dropping message '<null>'; no connection to unknown peer 1.1: file /boot/home/src/gecko-dev/ipc/glue/NodeController.cpp:365
Exiting due to channel error.

Debugging log up until just after 363 exited:

[257] sending FD: 43 (0xffffffff90906540)
[257] sending FD: 42 (0xffffffff909062d0)
[324] receive FD: 8 (0xffffffff90906540)
[257] sending FD: 46 (0xffffffff909065a0)
[324] receive FD: 9 (0xffffffff909062d0)
[257] sending FD: 47 (0xffffffff909064e0)
[324] receive FD: 16 (0xffffffff909065a0)
[324] receive FD: 17 (0xffffffff909064e0)
[257] sending FD: 45 (0xffffffff90906570)
[257] sending FD: 49 (0xffffffff909065d0)
[324] receive FD: 18 (0xffffffff90906570)
[324] receive FD: 19 (0xffffffff909065d0)
[248] sending FD: 45 (0xffffffff903d6338)
[324] receive FD: 16 (0xffffffff903d6338)
[248] sending FD: 47 (0xffffffff90906240)
[324] receive FD: 16 (0xffffffff90906240)
runtime_loader: Cannot open file /boot/home/Desktop/dist/bin/libosclientcerts.so (needed by /boot/system/lib/libnspr4.so): No such file or directory
[248] sending FD: 73 (0xffffffff99b23c38)
[324] receive FD: 17 (0xffffffff99b23c38)
slab memory manager: created area 0xffffffff9a801000 (34042)
[257] sending FD: 76 (0xffffffff90906540)
[257] sending FD: 75 (0xffffffff909062d0)
[364] receive FD: 8 (0xffffffff90906540)
[364] receive FD: 9 (0xffffffff909062d0)
[257] sending FD: 79 (0xffffffff99b23c38)
[364] receive FD: 16 (0xffffffff99b23c38)
[257] sending FD: 81 (0xffffffff90906240)
[257] sending FD: 82 (0xffffffff909064e0)
[364] receive FD: 17 (0xffffffff90906240)
[364] receive FD: 18 (0xffffffff909064e0)
[248] sending FD: 81 (0xffffffff99b23d58)
[248] sending FD: 81 (0xffffffff99b23d58)
[324] receive FD: 18 (0xffffffff99b23d58)
[364] receive FD: 29 (0xffffffff99b23d58)
[248] sending FD: 82 (0xffffffff909068a0)
[324] receive FD: 19 (0xffffffff909068a0)
[248] sending FD: 82 (0xffffffff909068a0)
[364] receive FD: 30 (0xffffffff909068a0)
[248] sending FD: 83 (0xffffffff909068d0)
[248] sending FD: 83 (0xffffffff909068d0)
[364] receive FD: 31 (0xffffffff909068d0)
[324] receive FD: 33 (0xffffffff909068d0)
[248] sending FD: 84 (0xffffffff90906900)
[324] receive FD: 34 (0xffffffff90906900)
[248] sending FD: 84 (0xffffffff90906900)
[364] receive FD: 32 (0xffffffff90906900)
[248] sending FD: 85 (0xffffffff90906930)
[324] receive FD: 35 (0xffffffff90906930)
[248] sending FD: 85 (0xffffffff90906930)
[364] receive FD: 33 (0xffffffff90906930)
[315] sending FD: 35 (0xffffffff903d6338)
[315] sending FD: 37 (0xffffffff90906660)
[315] sending FD: 38 (0xffffffff99b23cf8)
[257] receive FD: 84 (0xffffffff903d6338)
[257] receive FD: 85 (0xffffffff90906660)
[257] receive FD: 87 (0xffffffff99b23cf8)
[363] sending FD: 36 (0xffffffff99b23638)
[363] sending FD: 37 (0xffffffff90906960)
[257] receive FD: 90 (0xffffffff99b23638)
[257] receive FD: 91 (0xffffffff90906960)

(The address is that of the kernel's file_descriptor data structure; the FD number is of course process-specific but the address will be the same across processes for the same FDs.)

So, it looks like the FDs were never passed to sendmsg somehow; or if they were, they got lost after that.

waddlesplash commented 3 weeks ago

It looks like the num_handles is in the message header structure at offset (sizeof(int32) * 4). So, I added a hack to the kernel's _user_sendmsg to issue a panic if this was ever > 0 but we were not passed any control data for FDs. And the panic fired; the presumed value for num_handles is 1 but there are no FDs passed with the sendmsg.

So, unless I'm mistaken here (or wrote the code wrong), we just have a case where for some reason we are not being passed the FDs at all. This should be easier to debug on the Gecko side by adding an assert() before sendmsg() to make sure that the message being sent's header's num_handles indicates FDs actually being sent.

(Let me know if there's anything else I can do to assist debugging this.)
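
The check described above could be sketched on the Gecko side as follows. This is an illustration only: the offset of num_handles is taken from the comment above and not verified against the header definition, the function name is hypothetical, and the test is only meaningful when the first iovec starts at the beginning of a message rather than in the middle of a partially sent one.

```cpp
#include <sys/socket.h>
#include <sys/uio.h>
#include <cstddef>
#include <cstdint>
#include <cstring>

// ASSUMPTION from the discussion: num_handles is a 32-bit field located at
// byte offset sizeof(int32_t) * 4 of the IPC message header.
constexpr size_t kNumHandlesOffset = sizeof(int32_t) * 4;

// Returns true for the suspicious case: the header claims descriptors, but
// the sendmsg() arguments carry no SCM_RIGHTS control message.
bool ClaimsFdsButSendsNone(msghdr* msg) {
  if (msg->msg_iovlen == 0) {
    return false;
  }
  const iovec& first = msg->msg_iov[0];
  if (first.iov_len < kNumHandlesOffset + sizeof(uint32_t)) {
    return false;  // too short to contain the field
  }
  uint32_t num_handles = 0;
  memcpy(&num_handles,
         static_cast<const char*>(first.iov_base) + kNumHandlesOffset,
         sizeof(num_handles));
  if (num_handles == 0) {
    return false;
  }
  for (cmsghdr* c = CMSG_FIRSTHDR(msg); c != nullptr;
       c = CMSG_NXTHDR(msg, c)) {
    if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_RIGHTS &&
        c->cmsg_len > CMSG_LEN(0)) {
      return false;  // descriptors are attached
    }
  }
  return true;
}
```

Asserting `!ClaimsFdsButSendsNone(message)` just before sendmsg() would catch the case the kernel-side panic pointed at.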

kenz-gelsoft commented 3 weeks ago

8126467=0x7C0003
3670076=0x38003C

They seem to be some kind of masked enum values. I can't find their definitions.
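
One way to read these numbers, assuming (this is not confirmed anywhere in this thread) that the value packs two 16-bit fields, for example a protocol identifier in the upper half and a per-protocol message index in the lower half:

```cpp
#include <cstdint>

struct SplitType {
  uint32_t high;  // upper 16 bits
  uint32_t low;   // lower 16 bits
};

// ASSUMPTION: the message type is two 16-bit fields packed into 32 bits.
constexpr SplitType SplitMessageType(uint32_t type) {
  return {type >> 16, type & 0xFFFFu};
}
```

Under that assumption 8126467 splits into 0x7C and 0x0003, and 3670076 into 0x38 and 0x003C, which would narrow the search to a definition built from a shifted base value plus a small offset.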

So, unless I'm mistaken here (or wrote the code wrong), we just have a case where for some reason we are not being passed the FDs at all. This should be easier to debug on the Gecko side by adding an assert() before sendmsg() to make sure that the message being sent's header's num_handles indicates FDs actually being sent.

Anyway, I will try this later!

threedeyes commented 3 weeks ago

Likely Qt Creator needs to be adjusted to support listing running processes, yes.

Done (https://github.com/haikuports/haikuports/commit/4252d8fcb357b84a695e16d862b2e8c5f9fb2650)

[Image: 2024-08-20_23-04]

kenz-gelsoft commented 3 weeks ago

Yesterday I didn't manage to add the assertions, as I didn't fully understand what the expectations were.

I tried dumping the pointer values of the sendmsg() arguments in the IPC code, learned a bit about UNIX sockets, and silenced some (un)related runtime warnings. Now I need to grasp what waddlesplash pointed out and what I should assert.

All those messages start after the first "Message needs unreceived descriptors", yes? They may just be subsequent errors caused by that first one.

I confirmed this while searching around.

And when we get this log message:

[Child 656, IPC I/O Child] WARNING: Message needs unreceived descriptors channel:115705e4fb60 message-type:8126467 header()->num_handles:1 num_fds:1 fds_i:1: file /boot/home/src/gecko-dev/ipc/chromium/src/chrome/common/ipc_channel_posix.cc:467

ProcessIncomingMessages() does indeed fail like this:

https://searchfox.org/mozilla-esr128/source/ipc/chromium/src/chrome/common/ipc_channel_posix.cc#476

The only caller of sendmsg() in ipc/ is

https://searchfox.org/mozilla-esr128/source/ipc/chromium/src/chrome/common/ipc_channel_posix.cc#131

and it is called only from here

https://searchfox.org/mozilla-esr128/source/ipc/chromium/src/chrome/common/ipc_channel_posix.cc#665

I can refer to other code that uses the CMSG_ macros.

waddlesplash commented 3 weeks ago

What is going on here is that file descriptors are being sent across the UNIX domain socket. This means that the sending application specifies the FDs it wants to send with SCM_RIGHTS when it calls sendmsg, and then the kernel sends those FDs along with the data written to the socket. When the other end of the connection calls recvmsg (at the same point in the stream that the FDs were sent along with, that is), the kernel dequeues the FD information from the socket buffer and creates new FDs in the receiving process's I/O context.

What's sent isn't the FD numbers but the FD's underlying state: so what is (for example) FD 10 in the sending process might become FD 5 or 32 or some other number in the receiving process. But the FD pointed to will be the same. In simpler terms, it's a way of doing dup() across processes, which is of course very useful in a multi-process environment.
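
As a self-contained illustration of the mechanism (plain POSIX, not Gecko code; `SendFd` and `RecvFd` are names made up for this sketch), one descriptor can be passed over a UNIX domain socket like this:

```cpp
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>
#include <cstring>

// Sends one byte of payload with `fd` attached as SCM_RIGHTS control data.
bool SendFd(int sock, int fd) {
  char byte = 'x';
  iovec iov = {&byte, 1};
  alignas(cmsghdr) char control[CMSG_SPACE(sizeof(int))] = {};
  msghdr msg = {};
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = control;
  msg.msg_controllen = sizeof(control);
  cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
  cmsg->cmsg_level = SOL_SOCKET;
  cmsg->cmsg_type = SCM_RIGHTS;
  cmsg->cmsg_len = CMSG_LEN(sizeof(int));
  memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
  return sendmsg(sock, &msg, 0) == 1;
}

// Receives the payload byte and returns the descriptor that the kernel
// created in this process, or -1 if none came with the data.
int RecvFd(int sock) {
  char byte = 0;
  iovec iov = {&byte, 1};
  alignas(cmsghdr) char control[CMSG_SPACE(sizeof(int))] = {};
  msghdr msg = {};
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = control;
  msg.msg_controllen = sizeof(control);
  if (recvmsg(sock, &msg, 0) != 1) {
    return -1;
  }
  for (cmsghdr* c = CMSG_FIRSTHDR(&msg); c != nullptr;
       c = CMSG_NXTHDR(&msg, c)) {
    if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_RIGHTS &&
        c->cmsg_len >= CMSG_LEN(sizeof(int))) {
      int fd = -1;
      memcpy(&fd, CMSG_DATA(c), sizeof(int));
      return fd;
    }
  }
  return -1;
}
```

The number returned by `RecvFd` generally differs from the one passed to `SendFd`, yet both refer to the same open file, which matches the sending/receive pairs with identical kernel addresses in the tracing log above.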

The problem here is that the IPC message header indicates that FDs were sent with the message, but none were received. My tracing so far seems to indicate that the sending side of the connection indicated it was sending FDs but did not actually send any (i.e. sendmsg was called with no SCM_RIGHTS specified.) So debugging when or why that occurs (if indeed that's what's occurring) should help determine the problem.

kenz-gelsoft commented 3 weeks ago

Thank you for the detailed explanation! I'm new to UNIX domain sockets, so it helps me very much.

kenz-gelsoft commented 3 weeks ago

I added the following assertions and a dump of the specified FD numbers.

There's no case that

So I think the sendmsg() side correctly specifies FDs when SCM_RIGHTS is specified.

The patch for that part is:

diff --git a/ipc/chromium/src/chrome/common/ipc_channel_posix.cc b/ipc/chromium/src/chrome/common/ipc_channel_posix.cc
index 19e777a52af7..de1bea4e26fd 100644
--- a/ipc/chromium/src/chrome/common/ipc_channel_posix.cc
+++ b/ipc/chromium/src/chrome/common/ipc_channel_posix.cc
@@ -128,6 +128,28 @@ static inline ssize_t corrected_sendmsg(int socket,
   MOZ_DIAGNOSTIC_ASSERT(bytes_written < kBadValue);
   return bytes_written;
 #else
+  if (message->msg_controllen > 0)
+  {
+    cmsghdr *cmsg = CMSG_FIRSTHDR(message);
+    if (cmsg->cmsg_type == SCM_RIGHTS)
+    {
+      size_t data_len = cmsg->cmsg_len - CMSG_LEN(0);
+      MOZ_ASSERT(data_len % sizeof(int) == 0);
+      unsigned fd_count = data_len / sizeof(int);
+      MOZ_ASSERT(fd_count > 0);
+#if 1
+      pid_t pid = getpid();
+      int *data = reinterpret_cast<int *>(CMSG_DATA(cmsg));
+      printf("[%d] Sending fds=", pid);
+      for (unsigned i = 0; i < fd_count; ++i)
+      {
+        int fd = data[i];
+        printf("%s%d", i == 0 ? "[" : ", ", fd);
+      }
+      printf("]\n");
+#endif
+    }
+  }
   return sendmsg(socket, message, flags);
 #endif
 }
kenz-gelsoft commented 3 weeks ago

Here is a sample of the dump. In most cases we send a single FD; sometimes we send multiple FDs.

[GFX3-]: Creating null Skia image from null SourceSurface
[GFX3-]: Creating null Skia image from null SourceSurface
[GFX3-]: Creating null Skia image from null SourceSurface
[GFX3-]: Creating null Skia image from null SourceSurface
[GFX3-]: Creating null Skia image from null SourceSurface
[GFX3-]: Creating null Skia image from null SourceSurface
console.error: (new ReferenceError("WebAssembly is not defined", "resource://gre/actors/TranslationsParent.sys.mjs", 2737))
[Parent 50400, DOMCacheThread] WARNING: QM_TRY failure (WARNING): '"ToResult(file->Remove( false))" failed with resultCode 0x80520012, resultName NS_ERROR_FILE_NOT_FOUND', file dom/cache/FileUtils.cpp:809
[Parent 50400, DOMCacheThread] WARNING: QM_TRY failure (WARNING): '"ToResult(file->Remove( false))" failed with resultCode 0x80520012, resultName NS_ERROR_FILE_NOT_FOUND', file dom/cache/FileUtils.cpp:809
[50400] Sending fds=[135]
[50619] Sending fds=[17]
[50619] Sending fds=[17]
[Parent 50400, IPC I/O Parent] WARNING: [1.1]: GetUserData call for port '7013089CF33203DE.31233A6B68EFD9FD' failed: file /boot/home/src/gecko-dev/ipc/glue/NodeController.cpp:425
[50619] Sending fds=[17]
[50400] Sending fds=[136, 135]
[50400] Sending fds=[139]
[50400] Sending fds=[152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164]
[50400] Sending fds=[141]
WaylandServer::fDisplay: 0x101025c9c520
Error parsing B_ARGV_RECEIVED message. Message:
BMessage('_ARG') {
        argc = int32(0x12 or 18)
        argv[0] = string("/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/firefox", 53 bytes)
        argv[1] = string("-contentproc", 13 bytes)
        argv[2] = string("{b11e8d3e-ae62-4c0d-9736-a589ce2337ad}", 39 bytes)
        argv[3] = string("50400", 6 bytes)
        argv[4] = string("tab", 4 bytes)
        cwd = string("/boot/home/src/gecko-dev", 25 bytes)
}
wl_ips_client_connected
display: 0x101025c9c520
client: 0x101025c5b360
[Child 50801, Main Thread] WARNING: Failed to create file monitor for /boot/home/config/settings/glib-2.0/settings/keyfile: Unable to find default local file monitor type: 'glib warning', file /boot/home/src/gecko-dev/toolkit/xre/nsSigHandlers.cpp:187

(/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/firefox:50801): GLib-GIO-WARNING **: 05:06:55.284: Failed to create file monitor for /boot/home/config/settings/glib-2.0/settings/keyfile: Unable to find default local file monitor type
[50619] Sending fds=[17]
Fontconfig warning: ignoring en_US_POSIX.UTF-8: not a valid region tag
[50400] Sending fds=[135]
[50619] Sending fds=[17]
[50619] Sending fds=[17]
[Child 50738, StreamTrans #1] WARNING: Failed to retrieve memory telemetry for ResidentPeak: file /boot/home/src/gecko-dev/xpcom/base/MemoryTelemetry.cpp:344
add_native_font=���
��-��,3��-��
add_native_font=���
��-��,3��-��
add_native_font=���
��-��,3��-��
add_native_font=���
��-��,3��-��
add_native_font=�C�
                   D
                    �?�?��
                          �?�?
add_native_font=�C�
                   D
                    �?�?��
                          �?�?
add_native_font=�C�
                   D
                    �?�?��
                          �?�?
add_native_font=�C�
                   D
                    �?�?��
                          �?�?
[50616] Sending fds=[71]
[50616] Sending fds=[37]
[Parent 50400, IPC I/O Parent] WARNING: Message needs unreceived descriptors channel:10fc6034e570 message-type:3538947 header()->num_handles:1 num_fds:1 fds_i:1: file /boot/home/src/gecko-dev/ipc/chromium/src/chrome/common/ipc_channel_posix.cc:489
[Parent 50400, IPC I/O Parent] WARNING: [1.1]: Dropping message '<null>'; no connection to unknown peer ECE0FC342A04E152.81D3D8DB45291808: file /boot/home/src/gecko-dev/ipc/glue/NodeController.cpp:365
[Child 50616, IPC I/O Child] WARNING: [ECE0FC342A04E152.81D3D8DB45291808]: Dropping message '<null>'; no connection to unknown peer 1.1: file /boot/home/src/gecko-dev/ipc/glue/NodeController.cpp:365
Exiting due to channel error.

BTW, in this case (and occasionally in others) haiku-os.org was rendered correctly

[Image: screenshot25]

but when navigating to the forum, I got the tab crash above.

In this session, the font file path passed (in the font descriptor) seems completely broken. I suspect memory corruption caused by a buffer overrun or something similar.

waddlesplash commented 3 weeks ago

So I think the sendmsg() side correctly specifies FDs when SCM_RIGHTS is specified.

The case we want to assert() on is when header()->num_handles > 0 but there are no SCM_RIGHTS specified.