kenz-gelsoft / gecko-dev

Read-only Git mirror of the Mercurial gecko repositories at https://hg.mozilla.org. How to contribute: https://firefox-source-docs.mozilla.org/contributing/contribution_quickref.html
https://firefox-source-docs.mozilla.org/setup/index.html

Firefox crashes in short period #34

Open kenz-gelsoft opened 1 month ago

kenz-gelsoft commented 1 month ago

The KDLs (#30) were fixed in hrev57971. Since then, Firefox crashes easily instead of entering KDL. (For example: just wait for a while, or activate and deactivate the window repeatedly.)

I think I’m now facing the real cause of the abnormal state that previously led to the KDL.

Probably related to https://github.com/kenz-gelsoft/gecko-dev/issues/33

waddlesplash commented 3 weeks ago

BTW, you will want to change the kControlBufferMaxFds to be 32 on Haiku (it looks like none of the logs send that many, but it seems technically possible.)

kenz-gelsoft commented 3 weeks ago

I added the following assertion and change. I hit a segv in a child content process, but I don't think it is hitting the newly added assertion.

@@ -661,6 +683,12 @@ bool Channel::ChannelImpl::ProcessOutgoingMessages() {
     msgh.msg_iov = iov;
     msgh.msg_iovlen = iov_count;

+    if (msg->header()->num_handles > 0) {
+      struct cmsghdr* cmsg = CMSG_FIRSTHDR(&msgh);
+      // When IPC::Message's num_handles > 0,
+      // we must specify SCM_RIGHTS cmsg_type.
+      MOZ_ASSERT(cmsg->cmsg_type == SCM_RIGHTS);
+    }
     ssize_t bytes_written =
         HANDLE_EINTR(corrected_sendmsg(pipe_, &msgh, MSG_DONTWAIT));

diff --git a/ipc/chromium/src/chrome/common/ipc_channel_posix.h b/ipc/chromium/src/chrome/common/ipc_channel_posix.h
index b70640d04e12..0cf0c70ae1aa 100644
--- a/ipc/chromium/src/chrome/common/ipc_channel_posix.h
+++ b/ipc/chromium/src/chrome/common/ipc_channel_posix.h
@@ -153,7 +153,8 @@ class Channel::ChannelImpl : public MessageLoopForIO::Watcher {
   // here. Consequently, we pick a number here that is at least CMSG_SPACE(0) on
   // all platforms. We assert at runtime, in Channel::ChannelImpl::Init, that
   // it's big enough.
-  static constexpr size_t kControlBufferMaxFds = 200;
+//  static constexpr size_t kControlBufferMaxFds = 200;
+  static constexpr size_t kControlBufferMaxFds = 32;
   static constexpr size_t kControlBufferHeaderSize = 32;
   static constexpr size_t kControlBufferSize =
       kControlBufferMaxFds * sizeof(int) + kControlBufferHeaderSize;
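
For reference, the size formula in the diff above can be worked through as a minimal sketch. `ControlBufferSize`, `FitsFds`, and `CheckControlBuffer` are hypothetical helpers introduced here for illustration and are not part of the Gecko sources. With a 4-byte `int`, 32 fds give 32 * 4 + 32 = 160 bytes, and the previous value of 200 fds gives 200 * 4 + 32 = 832 bytes.

```cpp
#include <sys/socket.h>

#include <cassert>
#include <cstddef>

// Same header allowance as in the diff above.
constexpr std::size_t kControlBufferHeaderSize = 32;

// Same formula as kControlBufferSize in the diff, parameterized on the fd
// count so that both candidate values can be compared.
constexpr std::size_t ControlBufferSize(std::size_t max_fds) {
  return max_fds * sizeof(int) + kControlBufferHeaderSize;
}

// Hypothetical check: does the buffer hold one SCM_RIGHTS control message
// carrying max_fds descriptors? CMSG_SPACE includes the cmsghdr and padding.
bool FitsFds(std::size_t max_fds) {
  return CMSG_SPACE(max_fds * sizeof(int)) <= ControlBufferSize(max_fds);
}

// Hypothetical runtime check, in the spirit of the "assert at runtime" note
// in the header comment quoted in the diff.
void CheckControlBuffer(std::size_t max_fds) {
  assert(FitsFds(max_fds));
}
```

Calling `CheckControlBuffer(32)` passes as long as the platform's `cmsghdr` plus padding fits in the 32-byte header allowance, which is an assumption about the target platform rather than something stated in this thread.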

new_crash_reports.zip

kenz-gelsoft commented 3 weeks ago

@waddlesplash Did you investigate https://github.com/kenz-gelsoft/gecko-dev/issues/34#issuecomment-2297851601 (or thereabouts) with the binary I uploaded previously?

If so, note that I ran everything with the workaround from https://github.com/kenz-gelsoft/gecko-dev/issues/34#issuecomment-2295327516 applied. That may be why my assertions didn't fail as you expected. Should I try without the workaround?

waddlesplash commented 3 weeks ago

With the previously uploaded binary. But the problems here seem unrelated to the font code; in fact, they appear to be occurring in totally separate processes. I see you still have the same error messages in your logs, so reverting those changes shouldn't be necessary.

> I hit some segv in child content process, but I don't think this is hitting newly added assertion.

It actually could be. In the case where num_handles > 0 but SCM_RIGHTS are unspecified, CMSG_FIRSTHDR may evaluate to NULL, and thus you get a crash instead of the assertion. Adding a NULL check to the MOZ_ASSERT before reading cmsg->... may reveal the problem.
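
The NULL case described above can be sketched in isolation. This is a minimal illustration, not the Gecko code: `FirstCmsgIsScmRights` and `FillWithOneFd` are hypothetical helpers, and the sketch assumes a POSIX `<sys/socket.h>` where `CMSG_FIRSTHDR` yields a null pointer when `msg_controllen` is too small to hold a `cmsghdr`.

```cpp
#include <sys/socket.h>

#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: true only if the first control message exists and is
// SCM_RIGHTS. The null check comes first, so a header without any control
// data yields false instead of dereferencing a null pointer.
bool FirstCmsgIsScmRights(msghdr* msgh) {
  cmsghdr* cmsg = CMSG_FIRSTHDR(msgh);
  return cmsg != nullptr && cmsg->cmsg_level == SOL_SOCKET &&
         cmsg->cmsg_type == SCM_RIGHTS;
}

// Hypothetical helper: attaches one file descriptor to msgh as an SCM_RIGHTS
// control message, using caller-provided storage for the control buffer.
void FillWithOneFd(msghdr* msgh, char* control, std::size_t control_len,
                   int fd) {
  assert(control_len >= CMSG_SPACE(sizeof(int)));
  std::memset(control, 0, control_len);
  msgh->msg_control = control;
  msgh->msg_controllen = CMSG_SPACE(sizeof(int));

  cmsghdr* cmsg = CMSG_FIRSTHDR(msgh);
  cmsg->cmsg_level = SOL_SOCKET;
  cmsg->cmsg_type = SCM_RIGHTS;
  cmsg->cmsg_len = CMSG_LEN(sizeof(int));
  std::memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
}
```

With a zero-initialized `msghdr`, `FirstCmsgIsScmRights` returns false. After `FillWithOneFd` is called with a suitably aligned buffer, it returns true. An assertion written as `cmsg->cmsg_type == SCM_RIGHTS` without the null check would instead crash on the first case.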

kenz-gelsoft commented 3 weeks ago

> It actually could be. In the case where num_handles > 0 but SCM_RIGHTS are unspecified, CMSG_FIRSTHDR may evaluate to NULL, and thus you get a crash instead of the assertion. Adding a NULL check to the MOZ_ASSERT before reading cmsg->... may reveal the problem.

Yes, it finally occurred.

MOZ_ASSERT(cmsg->cmsg_type == SCM_RIGHTS) didn't fail, but MOZ_ASSERT(cmsg && cmsg->cmsg_type == SCM_RIGHTS) failed once. (I couldn't reproduce it a second time, perhaps because the NulError occurs earlier.)

Next, I'll re-apply the NulError workaround patch so that I can reproduce this assertion failure reliably, and then try using GDB to investigate it.

[23397] Hit MOZ_CRASH(called `Result::unwrap()` on an `Err` value: NulError(59, [47, 98, 111, 111, 116, 47, 115, 121, 115, 116, 101, 109, 47, 100, 97, 116, 97, 47, 102, 111, 110, 116, 115, 47, 116, 116, 102, 111, 110, 116, 115, 47, 78, 111, 116, 111, 83, 97, 110, 115, 68, 105, 115, 112, 108, 97, 121, 45, 82, 101, 103, 117, 108, 97, 114, 46, 116, 116, 102, 0])) at gfx/wr/wr_glyph_rasterizer/src/platform/unix/font.rs:225
[23512] Assertion failure: cmsg && cmsg->cmsg_type == 0x01, at /boot/home/src/gecko-dev/ipc/chromium/src/chrome/common/ipc_channel_posix.cc:690
#01: XRE_GetBootstrap[/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so +0xa956f06]
#02: XRE_GetBootstrap[/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so +0xa956cff]
#03: XRE_GetBootstrap[/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so +0xa95695c]
#04: XRE_GetBootstrap[/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so +0xb268983]
#05: XRE_GetBootstrap[/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so +0xb25a414]
#06: XRE_GetBootstrap[/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so +0xb25a209]
#07: XRE_GetBootstrap[/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so +0xb268527]
#08: XRE_GetBootstrap[/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so +0xb28e773]
#09: XRE_GetBootstrap[/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so +0xb288ea6]
#10: XRE_GetBootstrap[/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so +0xa741a94]
#11: XRE_GetBootstrap[/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so +0xa743072]
#12: XRE_GetBootstrap[/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so +0xa740e18]
#13: XRE_GetBootstrap[/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so +0xa73535e]
#14: XRE_GetBootstrap[/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so +0xb1f29d5]
#15: XRE_GetBootstrap[/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so +0xb1f02ed]
#16: XRE_GetBootstrap[/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so +0xb1ee19a]
#17: XRE_GetBootstrap[/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so +0xb1ee941]
#18: XRE_GetBootstrap[/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so +0xb26aeeb]
#19: pthread_exit[/boot/system/lib/libroot.so +0x4e528]
#20: ??? (???:???)
[Socket 23424, IPC I/O Child] WARNING: [725965FA15D2C5CC.2BA5027518A6BF7F]: Dropping message '<null>'; no connection to unknown peer 1.1: file /boot/home/src/gecko-dev/ipc/glue/NodeController.cpp:365
[Child 23469, IPC I/O Child] WARNING: [DCB1A33DA51823F6.CF1DB246219EF9DC]: Dropping message '<null>'; no connection to unknown peer 1.1: file /boot/home/src/gecko-dev/ipc/glue/NodeController.cpp:365
[Socket 23424, Main Thread] WARNING: Shutting down Socket process early due to a crash!: file /boot/home/src/gecko-dev/netwerk/ipc/SocketProcessChild.cpp:229
Exiting due to channel error.
#01: CERT_GetFirstEmailAddress[/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so +0x4653ae5]
#02: CERT_GetFirstEmailAddress[/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so +0x465579c]
#03: CERT_GetFirstEmailAddress[/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so +0x4658bea]
#04: CERT_GetFirstEmailAddress[/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so +0x46465ac]
#05: CERT_GetFirstEmailAddress[/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so +0x4646d28]
#06: CERT_GetFirstEmailAddress[/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so +0x4646fbb]
#07: CERT_GetFirstEmailAddress[/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so +0x4648241]
#08: CERT_GetFirstEmailAddress[/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so +0x464621f]
#09: CERT_GetFirstEmailAddress[/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so +0x4651558]
#10: CERT_GetFirstEmailAddress[/boot/home/src/gecko-dev/obj-ff-dbg/dist/bin/libxul.so +0x464d4da]
#11: pthread_exit[/boot/system/lib/libroot.so +0x4e528]
#12: ??? (???:???)
diff --git a/ipc/chromium/src/chrome/common/ipc_channel_posix.cc b/ipc/chromium/src/chrome/common/ipc_channel_posix.cc
index 19e777a52af7..e7c96602dad6 100644
--- a/ipc/chromium/src/chrome/common/ipc_channel_posix.cc
+++ b/ipc/chromium/src/chrome/common/ipc_channel_posix.cc
@@ -661,6 +683,13 @@ bool Channel::ChannelImpl::ProcessOutgoingMessages() {
     msgh.msg_iov = iov;
     msgh.msg_iovlen = iov_count;

+    if (msg->header()->num_handles > 0) {
+      struct cmsghdr* cmsg = CMSG_FIRSTHDR(&msgh);
+      // When IPC::Message's num_handles > 0,
+      // we must specify SCM_RIGHTS cmsg_type.
+      MOZ_ASSERT(cmsg && cmsg->cmsg_type == SCM_RIGHTS);
+    }
     ssize_t bytes_written =
         HANDLE_EINTR(corrected_sendmsg(pipe_, &msgh, MSG_DONTWAIT));

BTW, I investigated the unix/font.rs error further. I suspect that the shared memory (SHM) the parent and child processes use to pass FontDescriptor info is broken, or that something is corrupting it. How can I check its behavior precisely? Would strace work, or is there a more useful tool?
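
As a side note on the NulError in the log above, the byte list it prints can be decoded directly. The following is a minimal sketch with hypothetical helper names (`FirstNulIndex`, `DecodeUntilNul`); the bytes are copied from the log. It shows that the payload is an intact font path followed by a single NUL at index 59, which matches the position reported by the error. Rust's `CString::new` rejects any input containing a NUL byte, including a trailing one, so this is consistent with a path that was already NUL-terminated when it was converted. The thread does not confirm that this is the cause.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Bytes copied from the NulError payload in the log above.
const std::vector<unsigned char> kNulErrorBytes = {
    47,  98,  111, 111, 116, 47,  115, 121, 115, 116, 101, 109,
    47,  100, 97,  116, 97,  47,  102, 111, 110, 116, 115, 47,
    116, 116, 102, 111, 110, 116, 115, 47,  78,  111, 116, 111,
    83,  97,  110, 115, 68,  105, 115, 112, 108, 97,  121, 45,
    82,  101, 103, 117, 108, 97,  114, 46,  116, 116, 102, 0};

// Index of the first NUL byte, or bytes.size() if there is none.
std::size_t FirstNulIndex(const std::vector<unsigned char>& bytes) {
  for (std::size_t i = 0; i < bytes.size(); ++i) {
    if (bytes[i] == 0) return i;
  }
  return bytes.size();
}

// Text before the first NUL byte, interpreted as ASCII.
std::string DecodeUntilNul(const std::vector<unsigned char>& bytes) {
  const std::size_t end = FirstNulIndex(bytes);
  assert(end <= bytes.size());
  return std::string(bytes.begin(), bytes.begin() + end);
}
```

`DecodeUntilNul(kNulErrorBytes)` yields `/boot/system/data/fonts/ttfonts/NotoSansDisplay-Regular.ttf`, and `FirstNulIndex(kNulErrorBytes)` yields 59.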

waddlesplash commented 3 weeks ago

It depends on precisely how it's sending information across the shared memory area. If it's really shared memory then strace won't be helpful with showing what's going across the connection.

kenz-gelsoft commented 3 weeks ago

Thank you for the response.

I will upload a new binary with my local hack, which is relatively stable (tabs crash frequently, but the parent process survives for at least a few minutes). This time it will be a smaller, symbol-stripped build, since you (and others) didn't use a debugger.

Would you take a look at it? I will try enabling the IPC logging described in

https://firefox-source-docs.mozilla.org/ipc/processes.html#debugging-with-ipdl-logging

waddlesplash commented 3 weeks ago

I'm not sure I'd be able to look into it without a debug build...

Did you get any further investigating the above assertion failures?

waddlesplash commented 3 weeks ago

And, do you have any other ideas about what might be happening with all the apparently corrupt data? Where does this data come from, is it sent across the IPC socket, or is it from a shared memory area?

kenz-gelsoft commented 3 weeks ago

> I'm not sure I'd be able to look into it without a debug build...

Uploaded. If you need an unstripped build, I'll upload that tonight.

https://discuss.haiku-os.org/t/progress-on-porting-firefox/13493/192?u=kenz

> Did you get any further investigating the above assertion failures?

I spent some time on this, but I couldn't get to the core of the problem. So I'm now going through the suspicious log output lines one by one.

> And, do you have any other ideas about what might be happening with all the apparently corrupt data? Where does this data come from, is it sent across the IPC socket, or is it from a shared memory area?

I hope that enabling more IPC logging will help identify what is happening. I will try that tonight.