You could try starting it manually (./bin/signaling -config /path/to/your/server.conf) and check for any output.
Also depending on your previous version, please check the signaling.service file; there have been changes in the past (e.g. https://github.com/strukturag/nextcloud-spreed-signaling/pull/322) that fixed some startup problems.
Thanks again for the quick reply @fancycode.
root@signaling:/opt/nextcloud-spreed-signaling# ./bin/signaling --config /etc/signaling/server.conf
main.go:133: Starting up version 1.2.2/go1.21.5 as pid 1503
main.go:142: Using a maximum of 10 CPUs
natsclient.go:108: Connection established to nats://localhost:4222 (NBDVP4CI3SVEUCD3WR4WNMUQCQVAAPH625SK7AD434JYUGCQ7TWZ7EJU)
grpc_common.go:167: WARNING: No GRPC server certificate and/or key configured, running unencrypted
grpc_common.go:169: WARNING: No GRPC CA configured, expecting unencrypted connections
hub.go:190: WARNING: No shared secret has been set for internal clients.
backend_storage_static.go:72: Backend backend-1 added for https://client1/
backend_storage_static.go:72: Backend backend-2 added for https://client2/
backend_storage_static.go:72: Backend backend-3 added for https://client3/
backend_storage_static.go:72: Backend backend-4 added for https://client4/
backend_storage_static.go:72: Backend backend-5 added for https://client5/
hub.go:202: Using a maximum of 8 concurrent backend connections per host
hub.go:209: Using a timeout of 10s for backend connections
hub.go:262: Not using GeoIP database
mcu_janus.go:292: Connected to Janus WebRTC Server 1.1.2 by Meetecho s.r.l.
mcu_janus.go:298: Found JANUS VideoRoom plugin 0.0.9 by Meetecho s.r.l.
mcu_janus.go:303: Data channels are supported
mcu_janus.go:307: Full-Trickle is enabled
mcu_janus.go:309: Maximum bandwidth 1048576 bits/sec per publishing stream
mcu_janus.go:310: Maximum bandwidth 2097152 bits/sec per screensharing stream
mcu_janus.go:316: Created Janus session 6387500080897936
mcu_janus.go:323: Created Janus handle 394968486920516
main.go:263: Using janus MCU
hub.go:387: Using a timeout of 10s for MCU requests
backend_server.go:98: Using configured TURN API key
backend_server.go:99: Using configured shared TURN secret
backend_server.go:101: Adding "turn:signaling.hbps.co.za:5349?transport=udp" as TURN server
backend_server.go:101: Adding "turn:signaling.hbps.co.za:5349?transport=tcp" as TURN server
backend_server.go:114: No IPs configured for the stats endpoint, only allowing access from 127.0.0.1
main.go:339: Listening on 127.0.0.1:8080
client.go:283: Client from CLIENT1_IP has RTT of 194 ms (194.062168ms)
client.go:302: Error reading from CLIENT1_IP: read tcp 127.0.0.1:8080->127.0.0.1:45182: use of closed network connection
client.go:283: Client from CLIENT1_IP has RTT of 19 ms (19.460148ms)
client.go:302: Error reading from CLIENT1_IP: read tcp 127.0.0.1:8080->127.0.0.1:53558: use of closed network connection
client.go:283: Client from CLIENT1_IP has RTT of 60 ms (60.369351ms)
client.go:302: Error reading from CLIENT1_IP: read tcp 127.0.0.1:8080->127.0.0.1:53572: use of closed network connection
client.go:283: Client from CLIENT2_IP has RTT of 7 ms (7.669592ms)
client.go:283: Client from CLIENT1_IP has RTT of 568 ms (568.05427ms)
client.go:302: Error reading from CLIENT1_IP: read tcp 127.0.0.1:8080->127.0.0.1:60018: use of closed network connection
client.go:302: Error reading from CLIENT2_IP: read tcp 127.0.0.1:8080->127.0.0.1:60010: use of closed network connection
client.go:283: Client from CLIENT2_IP has RTT of 8 ms (8.494554ms)
client.go:283: Client from CLIENT1_IP has RTT of 127 ms (127.452376ms)
client.go:302: Error reading from CLIENT2_IP: read tcp 127.0.0.1:8080->127.0.0.1:60020: use of closed network connection
client.go:283: Client from CLIENT2_IP has RTT of 7 ms (7.927286ms)
client.go:302: Error reading from CLIENT1_IP: read tcp 127.0.0.1:8080->127.0.0.1:60026: use of closed network connection
client.go:283: Client from CLIENT1_IP has RTT of 35 ms (35.520615ms)
client.go:302: Error reading from CLIENT2_IP: read tcp 127.0.0.1:8080->127.0.0.1:41474: use of closed network connection
client.go:283: Client from CLIENT2_IP has RTT of 7 ms (7.991057ms)
client.go:302: Error reading from CLIENT1_IP: read tcp 127.0.0.1:8080->127.0.0.1:41490: use of closed network connection
client.go:302: Error reading from CLIENT2_IP: read tcp 127.0.0.1:8080->127.0.0.1:41502: use of closed network connection
client.go:283: Client from CLIENT1_IP has RTT of 49 ms (49.926458ms)
client.go:283: Client from CLIENT2_IP has RTT of 8 ms (8.148702ms)
client.go:302: Error reading from CLIENT1_IP: read tcp 127.0.0.1:8080->127.0.0.1:41508: use of closed network connection
client.go:302: Error reading from CLIENT2_IP: read tcp 127.0.0.1:8080->127.0.0.1:41524: use of closed network connection
client.go:283: Client from CLIENT2_IP has RTT of 7 ms (7.989346ms)
hub.go:884: Invalid message {Marshaler:<nil> Unmarshaler:<nil> Id: Type: Hello:<nil> Bye:<nil> Room:<nil> Message:<nil> Control:<nil> Internal:<nil> TransientData:<nil>} from CLIENT2_IP: type missing
client.go:302: Error reading from CLIENT2_IP: read tcp 127.0.0.1:8080->127.0.0.1:36020: use of closed network connection
client.go:283: Client from CLIENT1_IP has RTT of 3 ms (3.772848ms)
client.go:283: Client from CLIENT2_IP has RTT of 7 ms (7.808306ms)
client.go:302: Error reading from CLIENT2_IP: read tcp 127.0.0.1:8080->127.0.0.1:36028: use of closed network connection
client.go:302: Error reading from CLIENT1_IP: read tcp 127.0.0.1:8080->127.0.0.1:36024: use of closed network connection
client.go:283: Client from CLIENT2_IP has RTT of 8 ms (8.436012ms)
hub.go:884: Invalid message {Marshaler:<nil> Unmarshaler:<nil> Id: Type: Hello:<nil> Bye:<nil> Room:<nil> Message:<nil> Control:<nil> Internal:<nil> TransientData:<nil>} from CLIENT2_IP: type missing
client.go:283: Client from CLIENT1_IP has RTT of 7 ms (7.80567ms)
client.go:302: Error reading from CLIENT2_IP: read tcp 127.0.0.1:8080->127.0.0.1:36040: use of closed network connection
client.go:283: Client from CLIENT2_IP has RTT of 8 ms (8.061372ms)
client.go:302: Error reading from CLIENT1_IP: read tcp 127.0.0.1:8080->127.0.0.1:36056: use of closed network connection
client.go:302: Error reading from CLIENT2_IP: read tcp 127.0.0.1:8080->127.0.0.1:36058: use of closed network connection
client.go:283: Client from CLIENT2_IP has RTT of 8 ms (8.769925ms)
client.go:283: Client from CLIENT1_IP has RTT of 104 ms (104.228874ms)
client.go:302: Error reading from CLIENT2_IP: read tcp 127.0.0.1:8080->127.0.0.1:48866: use of closed network connection
client.go:302: Error reading from CLIENT1_IP: read tcp 127.0.0.1:8080->127.0.0.1:48868: use of closed network connection
client.go:283: Client from CLIENT2_IP has RTT of 7 ms (7.212955ms)
^Cmain.go:362: Interrupted
janus_client.go:404: conn.NextReader: read tcp 127.0.0.1:53648->127.0.0.1:8188: use of closed network connection
mcu_janus.go:269: Connection to Janus gateway was interrupted, reconnecting in 1s
This is what I get. I rebooted the server to try to get rid of stale connections, but I still get the above errors. I have not made any config changes anywhere, and this is the first time my upgrade process has broken like this, so I am at a bit of a loss.
hub.go:884: Invalid message {Marshaler:<nil> Unmarshaler:<nil> Id: Type: Hello:<nil> Bye:<nil> Room:<nil> Message:<nil> Control:<nil> Internal:<nil> TransientData:<nil>} from CLIENT2_IP: type missing
This looks like #180, please try a make clean && make build.
Also depending on your previous version, please check the signaling.service file; there have been changes in the past (e.g. #322) that fixed some startup problems.
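For reference, a full clean rebuild from a source checkout would look something like this (the checkout path is illustrative):

cd /opt/nextcloud-spreed-signaling   # wherever the repository was cloned
git pull                             # update to the latest sources
make clean                           # drop stale build artifacts (the suspected cause in #180)
make build                           # produces the new ./bin/signaling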
My previous update was 2 months ago, and the old bin still starts up if I use that instead of the newly compiled one, which makes me think there is something wrong with my new bin file or my setup as a whole.
Yeah, makes sense. Did you try the make clean && make build to check if the Invalid message log entry goes away?
Hi @fancycode. Sorry for the delayed response. My issue seems to persist. Today I did a git pull, make clean && make build, and when I start it up with ./bin/signaling --config /etc/signaling/server.conf it all starts up without any issues and there are no errors. However, when I copy this bin into the /usr/bin/ folder, the service throws the following error:
Dec 31 11:26:23 signaling systemd[1]: signaling.service: Main process exited, code=dumped, status=31/SYS
Dec 31 11:26:23 signaling systemd[1]: signaling.service: Failed with result 'core-dump'.
Dec 31 11:26:24 signaling systemd[1]: signaling.service: Scheduled restart job, restart counter is at 3.
Dec 31 11:26:24 signaling systemd[1]: Stopped Nextcloud Talk signaling server.
Dec 31 11:26:24 signaling systemd[1]: Started Nextcloud Talk signaling server.
Dec 31 11:26:24 signaling kernel: [1726194.342198] audit: type=1326 audit(1704014784.256:48): auid=4294967295 uid=997 gid=997 ses=4294967295 pid=48825 comm="signaling" exe="/usr/bin/signaling" sig=31 arch=c000003e syscall=160 compat=0 ip=0x40802e code=0x80000000
Dec 31 11:26:24 signaling systemd[1]: signaling.service: Main process exited, code=dumped, status=31/SYS
Dec 31 11:26:24 signaling systemd[1]: signaling.service: Failed with result 'core-dump'.
Dec 31 11:26:24 signaling systemd[1]: signaling.service: Scheduled restart job, restart counter is at 4.
Dec 31 11:26:24 signaling systemd[1]: Stopped Nextcloud Talk signaling server.
Dec 31 11:26:24 signaling systemd[1]: Started Nextcloud Talk signaling server.
Dec 31 11:26:24 signaling kernel: [1726194.842208] audit: type=1326 audit(1704014784.756:49): auid=4294967295 uid=997 gid=997 ses=4294967295 pid=48836 comm="signaling" exe="/usr/bin/signaling" sig=31 arch=c000003e syscall=160 compat=0 ip=0x40802e code=0x80000000
Dec 31 11:26:24 signaling systemd[1]: signaling.service: Main process exited, code=dumped, status=31/SYS
Dec 31 11:26:24 signaling systemd[1]: signaling.service: Failed with result 'core-dump'.
Dec 31 11:26:25 signaling systemd[1]: signaling.service: Scheduled restart job, restart counter is at 5.
Dec 31 11:26:25 signaling systemd[1]: Stopped Nextcloud Talk signaling server.
Dec 31 11:26:25 signaling systemd[1]: signaling.service: Start request repeated too quickly.
Dec 31 11:26:25 signaling systemd[1]: signaling.service: Failed with result 'core-dump'.
Dec 31 11:26:25 signaling systemd[1]: Failed to start Nextcloud Talk signaling server.
Dec 31 11:26:40 signaling kernel: [1726211.023187] [UFW BLOCK] IN=eth0 OUT= MAC=33:33:00:00:00:01:a6:b4:70:22:07:5e:86:dd SRC=fe80:0000:0000:0000:a4b4:70ff:fe22:075e DST=ff02:0000:0000:0000:0000:0000:0000:0001 LEN=192 TC=0 HOPLIMIT=1 FLOWLBL=496516 PROTO=UDP SPT=5678 DPT=5678 LEN=152
Any ideas as to why this is still happening? The old bin file works 100% when invoked via the .service file; only the newly compiled one is the issue.
Do you have SELinux or something similar enabled? The "kernel: audit ..." log entries seem to indicate that something was blocked and the signaling server was killed because of that. Do you have other logs with more details?
These rules might only be active for applications run from /usr/bin, which is why it works if you run it from your build directory.
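A quick way to check which mandatory access control system is active (assuming the usual tools are installed; Ubuntu ships AppArmor rather than SELinux by default):

getenforce   # prints the SELinux mode, if SELinux is in use
aa-status    # lists loaded AppArmor profiles (run as root)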
Nope. On Ubuntu 20.04. No SELinux enabled. And nothing fancy done to a minimal server install. No idea which logs to look at honestly, as I have never encountered a "blocking" in Ubuntu before.
Another guess: the syscall from the audit log is 160 (syscall=160), which according to https://filippo.io/linux-syscall-table/ is setrlimit. The service seems to run with uid/gid 997/997, which might not be allowed to execute this call while your user is, which would explain why executing the service manually works for you.
Could you try running as root (or your user) from the systemd service and check if this works?
Also, could you please check /var/log/syslog, /var/log/auth.log and dmesg for any messages when the process crashes?
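The syscall number from such an audit line can also be resolved locally (assuming the auditd userspace tools are installed):

ausyscall 160                        # should print "setrlimit" on x86_64 (arch=c000003e)
journalctl -u signaling.service -e   # jump to the most recent entries for the service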
This looks like a potential cause for the setrlimit call during startup with newer builds: https://github.com/golang/go/issues/46279
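For context: newer Go runtimes (since roughly Go 1.19) raise the soft RLIMIT_NOFILE limit to the hard limit during startup, which issues a setrlimit syscall before main() even runs. A minimal Go sketch of that same call, just to illustrate what a seccomp filter that denies @resources would be killing:

package main

import (
	"fmt"
	"syscall"
)

func main() {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		panic(err)
	}
	// Raise the soft limit to the hard limit -- roughly what newer Go
	// runtimes do automatically at startup. If setrlimit is denied by a
	// seccomp filter, the process dies with signal 31 (SIGSYS), matching
	// the "status=31/SYS" in the systemd log above.
	lim.Cur = lim.Max
	if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		panic(err)
	}
	fmt.Printf("soft RLIMIT_NOFILE raised to %d\n", lim.Cur)
}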
Could you please provide your .service file? Maybe you need to apply the changes from https://github.com/strukturag/nextcloud-spreed-signaling/pull/322 (as mentioned above).
In https://github.com/restic/rest-server/pull/249 a similar problem was reported where changing the .service file fixed the startup crash.
@hanserasmus did you get a chance to look at your .service file yet?
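If in doubt which unit file is actually in effect, the following should print it, including any drop-in overrides:

systemctl cat signaling.service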
I am sorry for the late reply @fancycode, but here are the contents of my service file:
[Unit]
Description=Nextcloud Talk signaling server
After=janus.service
[Service]
ExecStart=/usr/bin/signaling --config /etc/signaling/server.conf
User=signaling
Group=signaling
Restart=on-failure
# Makes sure that /etc/signaling is owned by this service
ConfigurationDirectory=signaling
# Hardening - see systemd.exec(5)
CapabilityBoundingSet=
ExecPaths=/usr/bin/signaling /usr/lib
LockPersonality=yes
MemoryDenyWriteExecute=yes
NoExecPaths=/
NoNewPrivileges=yes
PrivateDevices=yes
PrivateTmp=yes
PrivateUsers=yes
ProcSubset=pid
ProtectClock=yes
ProtectControlGroups=yes
ProtectHome=yes
ProtectHostname=yes
ProtectKernelLogs=yes
ProtectKernelModules=yes
ProtectKernelTunables=yes
ProtectProc=invisible
ProtectSystem=strict
RemoveIPC=yes
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
RestrictNamespaces=yes
RestrictRealtime=yes
RestrictSUIDSGID=yes
SystemCallArchitectures=native
SystemCallFilter=@system-service
SystemCallFilter=~ @privileged @resources
[Install]
WantedBy=multi-user.target
Looking at the PR you mentioned, should I remove the ~@resources part in my service file?
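For what it's worth, setrlimit does belong to the @resources syscall group that the current filter denies; this can be confirmed on the host (assuming systemd-analyze is available):

systemd-analyze syscall-filter @resources | grep setrlimit   # setrlimit is listed in the group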
I think this worked!!!
I was able to compile a new version, and my service file now looks like this:
[Unit]
Description=Nextcloud Talk signaling server
After=janus.service
[Service]
ExecStart=/usr/bin/signaling --config /etc/signaling/server.conf
User=signaling
Group=signaling
Restart=on-failure
# Makes sure that /etc/signaling is owned by this service
ConfigurationDirectory=signaling
# Hardening - see systemd.exec(5)
CapabilityBoundingSet=
ExecPaths=/usr/bin/signaling /usr/lib
LockPersonality=yes
MemoryDenyWriteExecute=yes
NoExecPaths=/
NoNewPrivileges=yes
PrivateDevices=yes
PrivateTmp=yes
PrivateUsers=yes
ProcSubset=pid
ProtectClock=yes
ProtectControlGroups=yes
ProtectHome=yes
ProtectHostname=yes
ProtectKernelLogs=yes
ProtectKernelModules=yes
ProtectKernelTunables=yes
ProtectProc=invisible
ProtectSystem=strict
RemoveIPC=yes
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
RestrictNamespaces=yes
RestrictRealtime=yes
RestrictSUIDSGID=yes
SystemCallArchitectures=native
SystemCallFilter=@system-service
#SystemCallFilter=~ @privileged @resources
SystemCallFilter=~ @privileged
[Install]
WantedBy=multi-user.target
It seems the service stays up and no longer fails!
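For anyone hitting the same thing: after editing the unit file, something like the following is needed for the change to take effect (assuming the unit is named signaling.service):

systemctl daemon-reload       # pick up the edited unit file
systemctl restart signaling   # restart with the relaxed syscall filter
systemctl status signaling    # confirm the service stays up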
Thank you @fancycode and I am really sorry it took me so long to respond and fix this. It has been insanely busy on my end. Not an excuse I know. But thank you again! I hope this will help someone else as well!
Perfect, thanks for confirming. So this is fixed already in the repository then.
No worries about the response delay, I'm not quick to answer in some cases either :-)
Hi all
After doing a
my signaling service does not want to start up. It fails with the following message:
My build did not throw out any errors:
Any ideas where I can search further for log entries as to why this happened and hopefully get a fix for it?
Thanks in advance for any help!