timja / jenkins-gh-issues-poc-06-18


[JENKINS-65911] Jenkins shuts down instead of restarting on Mac M1 #1559

Open timja opened 3 years ago

timja commented 3 years ago

This issue appeared after migrating our Jenkins instance from a Macbook to a new Mac mini (M1). I could also reproduce it with a clean install on the M1 machine.

It has been present in the last few versions (at least).

Steps to reproduce:

  1. Go to http://[JENKINS_URL]/updateCenter/
  2. Check "Restart Jenkins when installation is complete and no jobs are running" (regardless of whether there's an upgrade or not)
  3. Observe what happens in the terminal: after 10 seconds, the Jenkins process is killed instead of restarted.

This is what the log looks like:

2021-06-17 08:27:47.284+0000 [id=1066]  INFO    jenkins.model.Jenkins$20#run: Restart in 10 seconds
2021-06-17 08:27:57.287+0000 [id=1066]  INFO    jenkins.model.Jenkins$20#run: Restarting VM as requested by admin
2021-06-17 08:27:57.287+0000 [id=1066]  INFO    jenkins.model.Jenkins#cleanUp: Stopping Jenkins
2021-06-17 08:27:57.339+0000 [id=1066]  INFO    jenkins.model.Jenkins$16#onAttained: Started termination
2021-06-17 08:27:57.370+0000 [id=1066]  INFO    jenkins.model.Jenkins$16#onAttained: Completed termination
2021-06-17 08:27:57.370+0000 [id=1066]  INFO    jenkins.model.Jenkins#_cleanUpDisconnectComputers: Starting node disconnection
2021-06-17 08:27:57.379+0000 [id=1066]  INFO    jenkins.model.Jenkins#_cleanUpShutdownPluginManager: Stopping plugin manager
2021-06-17 08:27:57.440+0000 [id=1066]  INFO    jenkins.model.Jenkins#_cleanUpPersistQueue: Persisting build queue
2021-06-17 08:27:57.443+0000 [id=1066]  INFO    jenkins.model.Jenkins#_cleanUpAwaitDisconnects: Waiting for node disconnection completion
2021-06-17 08:27:57.443+0000 [id=1066]  INFO    jenkins.model.Jenkins#cleanUp: Jenkins stopped
[1]    72539 killed     jenkins

Jenkins runs fine when started again manually (with the "jenkins" command).


Originally reported by transcendd, imported from: Jenkins shuts down instead of restarting on Mac M1
  • status: Open
  • priority: Minor
  • resolution: Unresolved
  • imported: 2022/01/10
timja commented 3 years ago

JIRAUSER132902:

This seems to be fixed in version 2.299, though it's not documented.

timja commented 3 years ago

markewaite:

It is documented in the 2.297 changelog and in the 2.289.2 changelog

timja commented 2 years ago

timja:

This was not fixed; it was another issue.

timja commented 2 years ago

timja:

I've been debugging it.

It fails on this line:
https://github.com/jenkinsci/jenkins/blob/195dacbf556fdf0a393d5d139df27ffba3a82b11/core/src/main/java/hudson/lifecycle/UnixLifecycle.java#L80

The file descriptor it fails on changes between runs, normally somewhere between 820 and 840.

Catching the error doesn't help:

try {} catch (Error e)

Not sure if basil would have any ideas on what to try from here =/

timja commented 2 years ago

basil:

timja Unfortunately I don't have access to a macOS system to do real debugging on, and debugging by telegram isn't very efficient. But I can try to help.

The logic makes sense at a high level: we are about to exec, so to avoid leaking open file handles we need to set the close-on-exec flag for all open file descriptors. That this is not only failing in the first place but actually crashing the process is quite surprising.
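
For reference, a condensed sketch of that section of UnixLifecycle.restart(), simplified from the linked source (LIBC and the constants come from hudson.util.jna.GNUCLibrary; args holds the current JVM's command line from JavaVMArguments):

// Simplified excerpt: mark every descriptor above stdin/stdout/stderr as
// close-on-exec, then exec the current JVM again with the same arguments.
int sz = LIBC.getdtablesize();
for (int i = 3; i < sz; i++) {
    int flags = LIBC.fcntl(i, F_GETFD);
    if (flags < 0) continue;                     // not an open descriptor
    LIBC.fcntl(i, F_SETFD, flags | FD_CLOEXEC);  // the call that crashes on the M1
}

// exec to self
String exe = args.get(0);
LIBC.execvp(exe, new StringArray(args.toArray(new String[0])));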

The most direct form of analysis would be to work our way backwards from the crash. Do you have a stack trace, registers, or any other diagnostic information for the crash? I don't really know where macOS puts these things, but this page seems to have some answers.

Another piece of interesting information might be which file descriptor we are trying to set the close-on-exec flag on prior to the crash. If you can put a breakpoint at the line you mentioned, run lsof to list the process's open file descriptors, keep going until it crashes, and then tell us what type of file descriptor caused the crash, that would be interesting. E.g., was it a socket or a regular file? If it was a regular file, what was its path?

timja commented 2 years ago

timja:

Thanks, I'll add some more debug info. Currently I've just got the ID; there's no stack trace.

timja commented 2 years ago

timja:

Interestingly, there was a crash log in 'Crash reports':

Exception Type:        EXC_GUARD
Exception Codes:       GUARD_TYPE_FD

Looks like some Apple protection on one of the files.

Full log at: https://gist.github.com/timja/0167c76924c5e44a09c64a545c87ef8b

timja commented 2 years ago

timja:

The thread always crashes at:

Thread 64 Crashed:: Java: restart thread
0   libsystem_kernel.dylib  0x000000018c105964 __fcntl + 8
1   libsystem_kernel.dylib  0x000000018c105930 fcntl + 88
timja commented 2 years ago

timja:

Apple docs: https://developer.apple.com/documentation/xcode/understanding-the-exception-types-in-a-crash-report

timja commented 2 years ago

timja:

This looks to be the correct fix; there's no need to set `FD_CLOEXEC` on a file descriptor that already has it set:

diff --git a/core/src/main/java/hudson/lifecycle/UnixLifecycle.java b/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
index 391998cdff..97288d5fad 100644
--- a/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
+++ b/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
@@ -76,7 +76,9 @@ public class UnixLifecycle extends Lifecycle {
         int sz = LIBC.getdtablesize();
         for(int i=3; i<sz; i++) {
             int flags = LIBC.fcntl(i, F_GETFD);
-            if(flags<0) continue;
+            if (flags < 0 || flags == FD_CLOEXEC) {
+                continue;
+            }
             LIBC.fcntl(i, F_SETFD,flags| FD_CLOEXEC);
         }

It gets further after the above; it now fails at:

2021-10-04 21:48:05.292+0000 [id=1] INFO    winstone.Logger#logInternal: Jetty shutdown successfully
java.io.IOException: Failed to start Jetty
    at winstone.Launcher.<init>(Launcher.java:194)
    at winstone.Launcher.main(Launcher.java:369)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at Main._main(Main.java:311)
    at Main.main(Main.java:114)
Caused by: java.io.IOException: Failed to bind to 0.0.0.0/0.0.0.0:8080
    at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:349)
    at org.eclipse.jetty.server.ServerConnector.open(ServerConnector.java:310)
    at org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80)
    at org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:234)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
    at org.eclipse.jetty.server.Server.doStart(Server.java:401)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
    at winstone.Launcher.<init>(Launcher.java:192)
    ... 7 more
Caused by: java.net.BindException: Address already in use
    at java.base/sun.nio.ch.Net.bind0(Native Method)
    at java.base/sun.nio.ch.Net.bind(Net.java:455)
    at java.base/sun.nio.ch.Net.bind(Net.java:447)
    at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:227)
    at java.base/sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:80)
    at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:344)
    ... 14 more
2021-10-04 21:48:05.292+0000 [id=1] SEVERE  winstone.Logger#logInternal: Container startup failed
java.net.BindException: Address already in use
    at java.base/sun.nio.ch.Net.bind0(Native Method)
    at java.base/sun.nio.ch.Net.bind(Net.java:455)
    at java.base/sun.nio.ch.Net.bind(Net.java:447)
    at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:227)
    at java.base/sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:80)
    at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:344)
Caused: java.io.IOException: Failed to bind to 0.0.0.0/0.0.0.0:8080
    at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:349)
    at org.eclipse.jetty.server.ServerConnector.open(ServerConnector.java:310)
    at org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80)
    at org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:234)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
    at org.eclipse.jetty.server.Server.doStart(Server.java:401)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
    at winstone.Launcher.<init>(Launcher.java:192)
Caused: java.io.IOException: Failed to start Jetty
    at winstone.Launcher.<init>(Launcher.java:194)
    at winstone.Launcher.main(Launcher.java:369)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at Main._main(Main.java:311)
    at Main.main(Main.java:114)
timja commented 2 years ago

timja:

Any thoughts on where to check for that?

This might be where it's supposed to be shut down: https://github.com/jenkinsci/winstone/blob/825889d1a93a49feabca297308fff7ecafba70cd/src/main/java/winstone/ShutdownHook.java#L15

timja commented 2 years ago

basil:

timja Good insight! I think you're almost there, but your fix isn't quite right. The flags value is a bit field, so you can't check whether FD_CLOEXEC is set with flags == FD_CLOEXEC – that check is probably missing a lot of descriptors, including open sockets, which is probably why you then get the "Address already in use" error. You want to use bitwise arithmetic to check whether the FD_CLOEXEC bit is set in the flags while ignoring the other bits that you don't care about.
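
For concreteness, a minimal sketch of the check being described (illustrative only, not a committed fix), reusing the LIBC and FD_CLOEXEC names from the existing loop:

int flags = LIBC.fcntl(i, F_GETFD);
// FD_CLOEXEC is a single bit in the flags word, so test that bit with a mask
// rather than comparing the whole value, which may contain other bits too.
if (flags < 0 || (flags & FD_CLOEXEC) != 0) {
    continue; // invalid descriptor, or close-on-exec already set
}
LIBC.fcntl(i, F_SETFD, flags | FD_CLOEXEC);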

timja commented 2 years ago

basil:

timja Come to think of it, you might not be almost there after all, because I see no evidence that whatever file descriptor macOS is freaking out about has FD_CLOEXEC already set on it – that seems like an assumption on your part based on a faulty understanding of how the bit is represented in the flags. I don't see any reason why your theory would be false, but I also don't see any reason why it would be true. So back to the drawing board...

What is clear to me is that this is some weird Apple thing where certain file descriptors are "guarded resources", where "the operating system marks a file descriptor as guarded so that normal file descriptor APIs can't modify them." So when we try to set FD_CLOEXEC on it, the OS "helpfully" crashes the process rather than trusting that we know what we're doing. I had initially been worried that we were hitting some JNA issue, but this seems unlikely. We're bumping into some macOS restriction. My main question is: "what file is so special that macOS forbids us from closing it?" Can you run lsof -p ${PID} and let me know what file handle is so sacred that Apple does not allow developers to close it?

timja commented 2 years ago

timja:

It's not very informative...

java    37417 timja  844   NPOLICY

The only reference I found was someone asking what it is:
https://developer.apple.com/forums/thread/659475

The evidence I saw came from logging the flags returned by 'F_GETFD' and seeing that the failing descriptor had '1' returned, which is FD_CLOEXEC from what I understand, but I haven't had to do bitwise arithmetic in a long time, so that's likely a very faulty assumption.

(There was also one other descriptor with flag 1, which was the preceding FD.)

timja commented 2 years ago

basil:

Maybe you'd have more luck figuring out what this weird file descriptor is with DTrace. sudo opensnoop -p ${PID} will log every file as it's opened, giving you a bunch of information about it, such as its file descriptor number and its path. It's based on DTrace, so you'll have to disable System Integrity Protection (SIP) in order to use it. (It's this kind of thing that drove me nuts, which is why I eventually quit using macOS. Now I run Linux and can do system-level tracing to my heart's content.)

timja commented 2 years ago

basil:

timja You're absolutely sure that the file descriptor we're crashing trying to set FD_CLOEXEC on is of type NPOLICY? Check out this page:

A major addition to Darwin's network stack are Network Extension Control Policies (NECPs), added to Darwin 14. NECPs are described in bsd/net/necp.c as "..high-level policy sessions, which are ingested into low-level kernel policies that control and tag traffic at the application, socket, and IP layers". In other words, NECPs enable user mode programs to control the kernel's network routing and scheduling decisions. Naturally, these also allow QoS.

[…]

Starting with Darwin 16, just about every network-enabled process in the system uses NECPs, oftentimes without the developer even knowing what they are. This is because libnetwork.dylib calls necp_open() (#501) as part of its initialization (specifically, from nw_endpoint_handler_start). This creates a necp client, a file descriptor of type NPOLICY, which is readily visible in the output of lsof(1) or procexp ..fds. The descriptor does not offer the traditional operations (read(2)/write(2)/ioctl(2)), and only supports select(2), or use in a kqueue.

[…]

Darwin 17 extends the idea of NECP client descriptors, and adds the NECP session (also an NPOLICY file descriptor^*^). These descriptors are created with necp_session_open (#522), and support just the close(2) operation (which deletes the associated session).

So it looks like an NPOLICY file descriptor could be either an NECP client or an NECP session. Now, if trying to set FD_CLOEXEC on one of these is triggering a EXC_GUARD/GUARD_TYPE_FD "guarded resource protection", then my next question would be: who created the NECP client/session, and why is it guarded? More importantly, how can we get it to release the resource before we exec?

timja commented 2 years ago

timja:

> You're absolutely sure that the file descriptor we're crashing trying to set FD_CLOEXEC on is of type NPOLICY?

'Sure' is a strong word here; this sort of thing is quite outside my comfort zone, but yes, it appears so.

These are the related entries from the last restart I just did:

java    40095 timja  840u     IPv6 0x7a007b86cdf33161       0t0 TCP localhost:http-alt->localhost:58463 (ESTABLISHED)
java    40095 timja  841u     IPv6 0x7a007b86c2092461       0t0 TCP *:ndmp (LISTEN)
java    40095 timja  842u    systm 0x7a007b936831da81       0t0     [ctl com.apple.netsrc id 6 unit 26]
java    40095 timja  843u     unix 0x7a007b86a4514369       0t0     ->0x7a007b86a450efd1
java    40095 timja  844   NPOLICY
java    40095 timja  845u     IPv6 0x7a007b86c208eb01       0t0 TCP 192.168.4.151:58480->20.190.160.136:https (ESTABLISHED)
java    40095 timja  846u     IPv6 0x7a007b86c208f7c1       0t0 TCP 192.168.4.151:58484->20.190.160.136:https (ESTABLISHED)
java    40095 timja  847u     IPv6 0x7a007b86c2091141       0t0 TCP 192.168.4.151:58488->40.126.41.96:https (ESTABLISHED)
java    40095 timja  848u     IPv6 0x7a007b86cdf36461       0t0 TCP localhost:http-alt->localhost:58490 (ESTABLISHED)
java    40095 timja  849u     IPv6 0x7a007b86cdf337c1       0t0 TCP localhost:http-alt->localhost:58491 (ESTABLISHED)
java    40095 timja  850u     IPv6 0x7a007b86cdf34481       0t0 TCP localhost:http-alt->localhost:58493 (ESTABLISHED)
java    40095 timja  851u     IPv6 0x7a007b86cdaebe21       0t0 TCP localhost:http-alt->localhost:58494 (ESTABLISHED)
java    40095 timja  852u     IPv6 0x7a007b86cdaed7a1       0t0 TCP localhost:http-alt->localhost:58495 (ESTABLISHED)

and the relevant part of the crash log:

Exception Type:        EXC_GUARD
Exception Codes:       0x400000040000034c, 0xe20c7f41b3616e04
Exception Subtype:     GUARD_TYPE_FD, id=0xe20c7f41b3616e04, fd=844, flavor=0x00000004 (NOCLOEXEC)
timja commented 2 years ago

timja:

This is the result from opensnoop: https://gist.github.com/timja/65e67174af7a3a3f2e721bbeef2a9c5f

On this run the FD that caused it to crash was 847, and it was of type 'NPOLICY'.

timja commented 2 years ago

timja:

Interestingly, I found that the file descriptor is only created after logging in to Jenkins.

If you disable auth and restart, it doesn't get killed, and after some more digging it seems the azure-ad plugin is creating that file descriptor somehow. The built-in security realm doesn't have the problem.

But the restart doesn't work either way; it still hits the following:

2021-10-05 07:26:21.441+0000 [id=1] SEVERE  winstone.Logger#logInternal: Container startup failed
java.net.BindException: Address already in use
    at java.base/sun.nio.ch.Net.bind0(Native Method)
    at java.base/sun.nio.ch.Net.bind(Net.java:455)
    at java.base/sun.nio.ch.Net.bind(Net.java:447)
    at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:227)
    at java.base/sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:80)
    at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:344)
Caused: java.io.IOException: Failed to bind to 0.0.0.0/0.0.0.0:8080
    at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:349)
    at org.eclipse.jetty.server.ServerConnector.open(ServerConnector.java:310)
    at org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80)
    at org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:234)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
    at org.eclipse.jetty.server.Server.doStart(Server.java:401)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
    at winstone.Launcher.<init>(Launcher.java:192)
Caused: java.io.IOException: Failed to start Jetty
    at winstone.Launcher.<init>(Launcher.java:194)
    at winstone.Launcher.main(Launcher.java:369)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at Main._main(Main.java:311)
    at Main.main(Main.java:114)
timja commented 2 years ago

basil:

timja Setting aside the Azure AD issue, it is interesting that with the built-in security realm the restart fails with java.io.IOException: Failed to bind to 0.0.0.0/0.0.0.0:8080. I looked into how this works (correctly) on my Linux system. Before restarting Jenkins, here are the open TCP file descriptors:

$ sudo lsof -P -p 3916237 | grep TCP
java    3916237 basil    5u     IPv6    5821933       0t0     TCP localhost:8080->localhost:49530 (ESTABLISHED)
java    3916237 basil    6u     IPv4    5807038       0t0     TCP localhost:5005->localhost:48452 (ESTABLISHED)
java    3916237 basil  120u     IPv6    5812861       0t0     TCP *:8080 (LISTEN)

The call to LIBC.fcntl(i, F_SETFD,flags | FD_CLOEXEC) marks them as CLOEXEC, then the call to LIBC.execvp execs to self, during which the file descriptors are closed and the socket released. During startup there are no open TCP file descriptors, so Jetty has no problem binding to the port.

At what point is this going off the rails on M1?

Note that winstone.Launcher#shutdown isn't executed in this process, since this isn't a "regular" JVM shutdown where the shutdown hooks are executed. Whatever Winstone/Jetty might be doing to gracefully shut down, I can't see how closing all file descriptors and calling exec would be less than that - there's really not much left to the process once its program text and open file handles are gone. Unless there's some weird Apple resource that winstone.Launcher#shutdown releases which isn't a file descriptor or in the process's address space - which would be quite strange.
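
As a small illustration of that point (hypothetical demo class, not Jenkins code): shutdown hooks such as Winstone's only run on a normal JVM exit, and an exec-to-self never reaches that code path.

public class HookDemo {
    public static void main(String[] args) {
        // Winstone registers Launcher#shutdown via a JVM shutdown hook like this one.
        Runtime.getRuntime().addShutdownHook(
                new Thread(() -> System.out.println("graceful shutdown")));
        System.exit(0); // a normal exit runs the hook; LIBC.execvp() replaces the
                        // process image immediately, so the hook never fires there
    }
}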

timja commented 2 years ago

basil:

I can reproduce the problem on an M1 system with a stock Jenkins installation. Tracing the process with dtruss(1) I cannot see any calls to necp_session_open, but I see a call to necp_open:

  927/0x364b:      3733      12      8 necp_open(0x0, 0x0, 0x0)  = 3 0

      0x18e4d1b78
      0x1926e07ec
      0x1926df85c
      0x18f1429c8
      0x102ee0450
      0x102ede014
      0x102eddf68
      0x102eddea0
      0x18e25c1d8
      0x18e25abf8
      0x18e34ac6c
      0x18e366f68
      0x18e352208
      0x18e367cb8
      0x18e35d708
      0x18e505304
      0x18e504018

For some reason DTrace's ustack() isn't resolving symbols, but I got the symbols from lldb(1):

target create /opt/homebrew/opt/openjdk@11/bin/java
process launch --stop-at-entry -- -jar war/target/jenkins.war
process handle -p true -s false SIGBUS
process handle -p true -s false SIGSEGV
process handle -p true -s false SIGSTOP
break set -n necp_open -s libsystem_kernel.dylib
continue
bt
* thread #53, name = 'Java: GrapeHack.hack', stop reason = breakpoint 1.1
  * frame #0: 0x000000018e4d1b70 libsystem_kernel.dylib`necp_open
    frame #1: 0x0000000192715b2c libnetwork.dylib`nw_path_shared_necp_fd + 220
    frame #2: 0x00000001926e07ec libnetwork.dylib`nw_path_evaluator_evaluate + 540
    frame #3: 0x00000001926df85c libnetwork.dylib`nw_path_create_evaluator_for_endpoint + 108
    frame #4: 0x0000000192e1e8cc libnetwork.dylib`nw_nat64_v4_address_requires_synthesis + 252
    frame #5: 0x000000018e5298e4 libsystem_info.dylib`si_addrinfo + 2308
    frame #6: 0x000000018e528f34 libsystem_info.dylib`getaddrinfo + 168
    frame #7: 0x00000001021702c4 libnet.dylib`Java_java_net_Inet6AddressImpl_lookupAllHostAddr + 144
    frame #8: 0x000000011f90bbfc
    frame #9: 0x000000011f906fc8
    frame #10: 0x000000011f906fc8
    frame #11: 0x000000011f906ee0
    frame #12: 0x000000011f906fc8
    frame #13: 0x000000011f906ee0
    frame #14: 0x000000011f906ee0
    frame #15: 0x000000011f906ee0
    frame #16: 0x000000011f90719c
    frame #17: 0x000000011f90719c
    frame #18: 0x000000011f90719c
    frame #19: 0x000000011f900144
    frame #20: 0x0000000102647570 libjvm.dylib`JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, Thread*) + 736
    frame #21: 0x00000001028d4898 libjvm.dylib`invoke(InstanceKlass*, methodHandle const&, Handle, bool, objArrayHandle, BasicType, objArrayHandle, bool, Thread*) + 1796
    frame #22: 0x00000001028d4d8c libjvm.dylib`Reflection::invoke_constructor(oopDesc*, objArrayHandle, Thread*) + 392
    frame #23: 0x00000001026d0554 libjvm.dylib`JVM_NewInstanceFromConstructor + 232
    frame #24: 0x00000001274e1360
    frame #25: 0x00000001203c8c84
    frame #26: 0x000000011f906ee0
    frame #27: 0x000000011f906ee0
    frame #28: 0x000000011f906ee0
    frame #29: 0x000000011f90719c
    frame #30: 0x000000011f9071f4
    frame #31: 0x000000011f9071f4
    frame #32: 0x00000001200a6434
    frame #33: 0x000000011f906ab0
    frame #34: 0x000000011f9067e4
    frame #35: 0x000000011f90678c
    frame #36: 0x000000011f9071f4
    frame #37: 0x000000011f90719c
    frame #38: 0x000000011f90719c
    frame #39: 0x000000011f90719c
    frame #40: 0x000000011f90719c
    frame #41: 0x000000011f90719c
    frame #42: 0x000000011f90719c
    frame #43: 0x000000011f90719c
    frame #44: 0x000000011f90719c
    frame #45: 0x000000011f906fc8
    frame #46: 0x000000011f906ee0
    frame #47: 0x000000011f906ee0
    frame #48: 0x000000011f906fc8
    frame #49: 0x000000011f900144
    frame #50: 0x0000000102647570 libjvm.dylib`JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, Thread*) + 736
    frame #51: 0x00000001028d4898 libjvm.dylib`invoke(InstanceKlass*, methodHandle const&, Handle, bool, objArrayHandle, BasicType, objArrayHandle, bool, Thread*) + 1796
    frame #52: 0x00000001028d4d8c libjvm.dylib`Reflection::invoke_constructor(oopDesc*, objArrayHandle, Thread*) + 392
    frame #53: 0x00000001026d0554 libjvm.dylib`JVM_NewInstanceFromConstructor + 232
    frame #54: 0x00000001274e1360
    frame #55: 0x00000001203c8c84
    frame #56: 0x000000011f906ee0
    frame #57: 0x000000011f900144
    frame #58: 0x0000000102647570 libjvm.dylib`JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, Thread*) + 736
    frame #59: 0x00000001028d4898 libjvm.dylib`invoke(InstanceKlass*, methodHandle const&, Handle, bool, objArrayHandle, BasicType, objArrayHandle, bool, Thread*) + 1796
    frame #60: 0x00000001028d4138 libjvm.dylib`Reflection::invoke_method(oopDesc*, Handle, objArrayHandle, Thread*) + 276
    frame #61: 0x00000001026d038c libjvm.dylib`JVM_InvokeMethod + 508
    frame #62: 0x00000001276a1270
    frame #63: 0x000000012097c16c
    frame #64: 0x000000011f90719c
    frame #65: 0x000000011f9071f4
    frame #66: 0x000000011f90719c
    frame #67: 0x000000011f90719c
    frame #68: 0x000000011f9071f4
    frame #69: 0x000000011f9071f4
    frame #70: 0x000000011f9071f4
    frame #71: 0x000000011f90719c
    frame #72: 0x000000011f9071f4
    frame #73: 0x000000011f900144
    frame #74: 0x0000000102647570 libjvm.dylib`JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, Thread*) + 736
    frame #75: 0x00000001026468d0 libjvm.dylib`JavaCalls::call_virtual(JavaValue*, Klass*, Symbol*, Symbol*, JavaCallArguments*, Thread*) + 236
    frame #76: 0x0000000102646998 libjvm.dylib`JavaCalls::call_virtual(JavaValue*, Handle, Klass*, Symbol*, Symbol*, Thread*) + 100
    frame #77: 0x00000001026cd0dc libjvm.dylib`thread_entry(JavaThread*, Thread*) + 120
    frame #78: 0x0000000102973134 libjvm.dylib`JavaThread::thread_main_inner() + 124
    frame #79: 0x0000000102972f64 libjvm.dylib`JavaThread::run() + 376
    frame #80: 0x0000000102970c88 libjvm.dylib`Thread::call_run() + 120
    frame #81: 0x0000000102874b94 libjvm.dylib`thread_native_entry(Thread*) + 316
    frame #82: 0x000000018e509240 libsystem_pthread.dylib`_pthread_start + 148

So some thread named GrapeHack.hack is calling Java_java_net_Inet6AddressImpl_lookupAllHostAddr, which eventually calls necp_open, which creates a file descriptor with NPOLICY, which we then set FD_CLOEXEC on, which we then crash on when exec'ing.
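
To reproduce just that step in isolation, something like the following should be enough (a hypothetical demo class; the assumption is that the first hostname lookup on a dual-stack macOS JVM goes through Inet6AddressImpl.lookupAllHostAddr, and therefore necp_open, as in the stack above):

import java.net.InetAddress;

public class NecpDemo {
    public static void main(String[] args) throws Exception {
        // Any hostname will do; it is the resolution path, not the host, that matters.
        System.out.println(InetAddress.getByName("updates.jenkins.io"));
        Thread.sleep(60_000); // keep the process alive for `lsof -p <pid> | grep NPOLICY`
    }
}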

I looked into disabling IPv6 in Java and tried running Java with -Djava.net.preferIPv4Stack=true. That eliminates the calls to necp_open and the NPOLICY FD:

$ sudo lsof -p `pgrep java` | grep NPOLICY
$

That eliminates the crash, but Jenkins still doesn't restart properly. Now I'm getting:

2022-02-22 06:38:34.228+0000 [id=1]    SEVERE  winstone.Logger#logInternal: Container startup failed
java.net.BindException: Address already in use
    at java.base/sun.nio.ch.Net.bind0(Native Method)
    at java.base/sun.nio.ch.Net.bind(Net.java:455)
    at java.base/sun.nio.ch.Net.bind(Net.java:447)
    at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:227)
    at java.base/sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:80)
    at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:344)
Caused: java.io.IOException: Failed to bind to 0.0.0.0/0.0.0.0:8080
    at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:349)
    at org.eclipse.jetty.server.ServerConnector.open(ServerConnector.java:310)
    at org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80)
    at org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:234)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
    at org.eclipse.jetty.server.Server.doStart(Server.java:401)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
    at winstone.Launcher.<init>(Launcher.java:198)
Caused: java.io.IOException: Failed to start Jetty
    at winstone.Launcher.<init>(Launcher.java:200)
    at winstone.Launcher.main(Launcher.java:376)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at Main._main(Main.java:304)
    at Main.main(Main.java:108)

A mystery for another day.

timja commented 2 years ago

basil:

More late night debugging with java -Djava.net.preferIPv4Stack=true -jar war/target/jenkins.war... after instrumenting Winstone with

diff --git a/src/main/java/winstone/Launcher.java b/src/main/java/winstone/Launcher.java
index 2657c71..2c0b72c 100644
--- a/src/main/java/winstone/Launcher.java
+++ b/src/main/java/winstone/Launcher.java
@@ -195,6 +195,7 @@ public class Launcher implements Runnable {
         }
 
         try {
+            Thread.sleep(10 * 1000L);
             server.start();
         } catch (Exception e) {
             throw new IOException("Failed to start Jetty", e);

I could confirm that after the restart not only was the port still open but all the other files were also still open. It appears as if nothing got closed before the exec. Instrumenting Jenkins with

diff --git a/core/src/main/java/hudson/lifecycle/UnixLifecycle.java b/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
index ca8f50b9f5..b6d28ea313 100644
--- a/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
+++ b/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
@@ -82,6 +82,7 @@ public class UnixLifecycle extends Lifecycle {
         }
 
         // exec to self
+        Thread.sleep(10 * 1000L);
         String exe = args.get(0);
         LIBC.execvp(exe, new StringArray(args.toArray(new String[0])));
         throw new IOException("Failed to exec '" + exe + "' " + LIBC.strerror(Native.getLastError()));

and running sudo lsof +f g -p $PID I could confirm that no file descriptor had the CX (close-on-exec) flag. We are simply failing to set the close-on-exec flag.

Watching the fcntl system calls with dtruss -t fcntl I see:

fcntl(0x3, 0x1, 0x0)     = 0 0
fcntl(0x3, 0x2, 0xFFFFFFFFC5EC5B50)      = 0 0
fcntl(0x4, 0x1, 0x0)     = 0 0
fcntl(0x4, 0x2, 0x29ECA00)       = 0 0
fcntl(0x5, 0x1, 0x0)     = 0 0
fcntl(0x5, 0x2, 0xFFFFFFFFC5EC5AB0)      = 0 0
fcntl(0x6, 0x1, 0x0)     = 0 0
fcntl(0x6, 0x2, 0xFFFFFFFFC5EC5AB0)      = 0 0
fcntl(0x7, 0x1, 0x0)     = 0 0
fcntl(0x7, 0x2, 0xFFFFFFFFC5EC5AB0)      = 0 0
fcntl(0x8, 0x1, 0x0)     = 0 0
fcntl(0x8, 0x2, 0xFFFFFFFFC5EC5AB0)      = 0 0
fcntl(0x9, 0x1, 0x0)     = 0 0
fcntl(0x9, 0x2, 0xFFFFFFFFC5EC5AB0)      = 0 0
fcntl(0xA, 0x1, 0x0)     = 0 0
fcntl(0xA, 0x2, 0xFFFFFFFFC5EC5AB0)      = 0 0

and on and on. The calls where the second argument is 0x2 correspond to F_SETFD. Most of them have 0xFFFFFFFFC5EC5B50 as the third argument, including the one for the TCP socket.

For comparison I wrote a simple Python program:

import fcntl
import time

with open('foo.txt') as foo:
    time.sleep(10)
    print('about to fcntl')
    flags = fcntl.fcntl(foo.fileno(), fcntl.F_GETFD)
    fcntl.fcntl(foo.fileno(), fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
    print('finished fcntl')
    time.sleep(10)

I confirmed that after the F_SETFD the file descriptor had CX set in lsof +f g as expected. The dtruss -t fcntl output here was:

fcntl(0x3, 0x1, 0x0)        = 0 0
fcntl(0x3, 0x2, 0x1)         = 0 0

So clearly Jenkins is in the wrong here by calling F_SETFD with 0xFFFFFFFFC5EC5AB0 as the flags. It should be 0x1.

Funnily enough I got this working very easily with JNR (which I am trying to get rid of in core...):

diff --git a/core/src/main/java/hudson/lifecycle/UnixLifecycle.java b/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
index ca8f50b9f5..a68bc4ef7b 100644
--- a/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
+++ b/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
@@ -26,18 +26,19 @@ package hudson.lifecycle;

 import static hudson.util.jna.GNUCLibrary.FD_CLOEXEC;
 import static hudson.util.jna.GNUCLibrary.F_GETFD;
-import static hudson.util.jna.GNUCLibrary.F_SETFD;
 import static hudson.util.jna.GNUCLibrary.LIBC;

 import com.sun.jna.Native;
 import com.sun.jna.StringArray;
 import hudson.Platform;
+import hudson.os.PosixAPI;
 import java.io.IOException;
 import java.util.List;
 import java.util.logging.Level;
 import java.util.logging.Logger;
 import jenkins.model.Jenkins;
 import jenkins.util.JavaVMArguments;
+import jnr.constants.platform.Fcntl;

 /**
  * {@link Lifecycle} implementation when Hudson runs on the embedded
@@ -78,7 +79,7 @@ public class UnixLifecycle extends Lifecycle {
         for (int i = 3; i < sz; i++) {
             int flags = LIBC.fcntl(i, F_GETFD);
             if (flags < 0) continue;
-            LIBC.fcntl(i, F_SETFD, flags | FD_CLOEXEC);
+            PosixAPI.jnr().fcntlInt(i, Fcntl.F_SETFD, flags | FD_CLOEXEC);
         }
 
         // exec to self

With that, java -Djava.net.preferIPv4Stack=true -jar war/target/jenkins.war is fully working and the restart succeeds.

To summarize, there are two separate bugs here:

  1. When running without -Djava.net.preferIPv4Stack=true, calls to Java_java_net_Inet6AddressImpl_lookupAllHostAddr call necp_open, which creates a file descriptor with NPOLICY, and if we then set FD_CLOEXEC on that file descriptor the process crashes. I am not sure offhand what the best way to deal with this is. Perhaps we should just skip those file descriptors? But we would need to make sure they don't cause problems after exec'ing. Needs more investigation. In any case, -Djava.net.preferIPv4Stack=true works around the problem.
  2. Our call to fcntl(fd, F_SETFD, ...) has a bogus flags argument. The JNR code above shows one way to fix this. I'm sure this can also be done with JNA; I am just too tired right now to come up with the JNA version.
timja commented 2 years ago

basil:

Of note is that fcntl is a variadic function whose flags argument is an unsigned 64-bit integer. There is probably something wrong with our JNA signatures. Ironically, when I do a Google search for "fcntl jna", all I can find is the (incorrect) code in Jenkins.
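
As a sketch of what a corrected binding might look like (a hypothetical interface, not the actual hudson.util.jna.GNUCLibrary declaration): declaring the third parameter as a plain int should make JNA marshal the F_SETFD flags as 0x1 rather than the garbage value seen in the dtruss output above.

import com.sun.jna.Library;
import com.sun.jna.Native;

public interface CLibrary extends Library {
    CLibrary INSTANCE = Native.load("c", CLibrary.class);

    int F_GETFD = 1;     // <fcntl.h> values on macOS and Linux
    int F_SETFD = 2;
    int FD_CLOEXEC = 1;

    // Fixed-arity declaration: the flags argument is marshalled as a 32-bit int,
    // avoiding whatever widening happens with a variadic-style signature.
    int fcntl(int fd, int command, int flags);
}

With a binding like that, the restart loop could call CLibrary.INSTANCE.fcntl(i, F_SETFD, flags | FD_CLOEXEC) and the kernel would see the same 0x1 the Python test above produced.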

timja commented 2 years ago

[Originally related to: JENKINS-65195]