libretro / Lakka-LibreELEC

Lakka is a lightweight Linux distribution that transforms a small computer into a full blown game console.
https://www.lakka.tv

Continuous cycle of core dumps after headless reboot #548

Closed scarabdesign closed 5 years ago

scarabdesign commented 5 years ago

- Which version of Lakka are you using? Rockchip.TinkerBoard.arm-2.1.1

- What system hardware are you using? ASUS Tinker Board: quad-core ARM SoC at 1.8 GHz, 2 GB of RAM, Rockchip RK3288 SoC with Mali-T764 GPU

- What did you do? "reboot" command while connected with ssh and no display attached

- What did you expect to happen? Reboot will happen and Lakka will start and be stable

- What happened instead? Upon reboot completion, Lakka segfaults, dumps a core, and repeats the segfault/dump cycle in a loop until the SD card is full; after that it keeps dumping cores of 0 bytes. It stops crashing and runs fine when I switch the HDMI display on.

I mentioned this issue in the forums a couple of days ago and it has not been responded to (https://forums.libretro.com/t/lakka-continuous-cycle-of-core-dumps-after-headless-reboot/18340), so I'm trying here. I will try to get logs about why it's crashing, if needed, but I'll have to figure out how to make logging survive a reboot. I'm assuming it's because it can't find a valid display.

I honestly don't mind that it crashes until it finds a valid display, but I don't like it filling up the space. At the bare minimum, how do I turn off core dumps?

ToKe79 commented 5 years ago

Lakka does not wait until a display is connected; it tries to start retroarch right away, which segfaults because it does not find an output device.

Idea how to solve this: instead of starting /usr/bin/retroarch directly (via systemd), we could have a script (e.g. /usr/bin/retroarch-start.sh) that loops until a display is connected (sleeping between iterations) and, once it detects a connected display, executes /usr/bin/retroarch.

To find out if display is connected on Generic platform: cat /sys/class/drm/*/status

I do not know where to look on other platforms and I think it is different on each platform, but we could parse /etc/release to find out on which platform we are and use the platform specific path to look for connected display device (or even skip checking for connected display).

I will prepare something, however input from others would help - where to look for the status of the display connector on various platforms?
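
For illustration, the wait loop described above could look roughly like the following. This is only a minimal sketch, not the script from any commit in this thread: the names DRM_DIR, POLL_SECONDS, display_connected and wait_for_display are made up here, and it assumes the Generic-style /sys/class/drm/*/status files mentioned above, which may not exist on other platforms.

```shell
#!/bin/sh
# Hypothetical wait-for-display wrapper (illustrative names throughout).
# DRM_DIR defaults to the sysfs path mentioned in the comment above.
DRM_DIR="${DRM_DIR:-/sys/class/drm}"
POLL_SECONDS="${POLL_SECONDS:-2}"

# Return 0 if any connector status file reads exactly "connected".
# An exact comparison matters: "disconnected" also contains "connected".
display_connected() {
    for status_file in "$DRM_DIR"/*/status; do
        [ -r "$status_file" ] || continue
        if [ "$(cat "$status_file")" = "connected" ]; then
            return 0
        fi
    done
    return 1
}

# Block until some connector reports a connected display.
wait_for_display() {
    until display_connected; do
        sleep "$POLL_SECONDS"
    done
}

# Intended use at the end of the wrapper (left commented out in this sketch):
# wait_for_display && exec /usr/bin/retroarch
```

On a platform without these status files the loop would wait forever, so a real script would probably need the platform check (or a skip) discussed above.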

ToKe79 commented 5 years ago

This is a first draft of that feature: https://github.com/ToKe79/Lakka-LibreELEC/commit/97bb6a26ffa8e338c2c32b1248d0f057e6b6cd62

natinusala commented 5 years ago

The systemd service already takes care of restarting RetroArch when it crashes. The issue here is that it spams dumps and fills the SD card, right? What you propose is a good idea, but what if a future version starts to crash when no audio device is available? Would you adapt your script to check for both video and audio?

IMO the correct fix would be to disable crash logging to a file unless it's explicitly enabled (kernel cmdline?).
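
To make the opt-in idea concrete, here is a rough sketch of how a boot-time script might decide on a core pattern from the kernel command line. The parameter name lakka.coredump=1 and the function names are invented for this example; the dump path and the |/dev/null value are simply the ones that appear elsewhere in this thread, not a recommendation.

```shell
#!/bin/sh
# Sketch: keep core dumps off unless a boot parameter asks for them.
# "lakka.coredump=1" is a made-up parameter name used only for illustration.
CMDLINE_FILE="${CMDLINE_FILE:-/proc/cmdline}"

# Return 0 if the hypothetical opt-in flag appears as a whole argument
# on the kernel command line.
coredumps_requested() {
    tr ' ' '\n' < "$CMDLINE_FILE" | grep -qxF 'lakka.coredump=1'
}

# Print the core_pattern value this sketch would apply.
core_pattern_for_boot() {
    if coredumps_requested; then
        # Dump location quoted elsewhere in this thread.
        echo '/storage/.cache/cores/core.%E.%t.%p'
    else
        # Discard dumps by piping them to a non-program.
        echo '|/dev/null'
    fi
}

# Intended use (requires root, so left commented out in this sketch):
# sysctl -w kernel.core_pattern="$(core_pattern_for_boot)"
```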

scarabdesign commented 5 years ago

> The systemd service already takes care of restarting RetroArch when it crashes.

True, but then you are assuming that continuous crashing is acceptable behavior. If so, OK, but the loop/check that Tomáš Kelemen suggests (even with an audio check added) is at least more graceful, right?

> IMO the correct fix would be to disable crash logging to a file unless it's explicitly enabled

I would agree with this either way, because I've not been able to get any useful info from the core dumps with GDB, presumably because the binary was built without debug flags. Maybe I'm doing something wrong here, but I'd just prefer not to get the dumps.

natinusala commented 5 years ago

> True, but then you are assuming that continuous crashing is acceptable behavior.

Well, I don't assume that continuous crashing is acceptable, but if it were to crash continuously, I'd rather not have two mechanisms to take care of that (systemd service + launch script). This is not about fixing the crash, it is about handling it properly: it will crash, so let's take care of it elegantly.

Having RA crash and the service relaunch it forever is equivalent to having the script wait forever IMO. One case adds a script, another doesn't.

ToKe79 commented 5 years ago

This https://github.com/ToKe79/Lakka-LibreELEC/commit/c51d46b64abafbc0c66fa1e4c20cda4ef46b60f9 should probably disable the coredumps (for now only for TinkerBoard).

@scarabdesign can you test with the image here? http://nightly.builds.lakka.tv/special_builds/tkb-disable-coredumps/

@natinusala I think that with a start-script we could create logs automatically, e.g. the user would place an empty file debug_me.txt in the root of the FAT32 partition, the start-script would notice the file and create the necessary logs on the FAT32 partition, which could then be provided to the devs. The systemd unit is there to restart RetroArch not only after a crash, but also after quitting RetroArch (e.g. by mistake after pressing ESC, or on purpose, since some settings take effect only after a restart of RA).
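
As a rough illustration of the debug_me.txt idea, a start-script could check for the marker file and only then add logging arguments. Everything here is an assumption made for the sketch: the BOOT_DIR default of /flash as the mount point of the FAT32 partition, the LOG_DIR variable, the log file name and the function name. The --verbose and --log-file options are RetroArch command-line options, but whether they produce the logs the devs would want is not established in this thread.

```shell
#!/bin/sh
# Sketch: opt-in debug logging triggered by a marker file.
# BOOT_DIR (assumed mount point of the FAT32 partition) and LOG_DIR are
# illustrative; LOG_DIR must be writable for logging to work.
BOOT_DIR="${BOOT_DIR:-/flash}"
LOG_DIR="${LOG_DIR:-$BOOT_DIR}"

# Print extra RetroArch arguments: verbose logging to a file when
# debug_me.txt exists in BOOT_DIR, nothing otherwise.
retroarch_debug_args() {
    if [ -e "$BOOT_DIR/debug_me.txt" ]; then
        echo "--verbose --log-file=$LOG_DIR/retroarch.log"
    fi
}

# Intended use in a start-script (left commented out in this sketch;
# the unquoted expansion is deliberate so the arguments are split):
# exec /usr/bin/retroarch $(retroarch_debug_args)
```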

scarabdesign commented 5 years ago

Ok, looks like it's working as expected. Here is the journalctl of the loop just before switching HDMI back on:


Oct 12 02:20:18 Lakka systemd[1]: retroarch.service: Main process exited, code=killed, status=11/SEGV
Oct 12 02:20:18 Lakka systemd[1]: retroarch.service: Unit entered failed state.
Oct 12 02:20:18 Lakka systemd[1]: retroarch.service: Failed with result 'signal'.
Oct 12 02:20:20 Lakka systemd[1]: retroarch.service: Service hold-off time over, scheduling restart.
Oct 12 02:20:20 Lakka systemd[1]: Stopped Retroarch.
Oct 12 02:20:20 Lakka systemd[1]: Starting Retroarch...
Oct 12 02:20:20 Lakka systemd[1]: Started Retroarch.
Oct 12 02:20:21 Lakka kernel: Core dump to |/dev/null pipe failed
Oct 12 02:20:21 Lakka systemd[1]: retroarch.service: Main process exited, code=killed, status=11/SEGV
Oct 12 02:20:21 Lakka systemd[1]: retroarch.service: Unit entered failed state.
Oct 12 02:20:21 Lakka systemd[1]: retroarch.service: Failed with result 'signal'.
Oct 12 02:20:23 Lakka systemd[1]: retroarch.service: Service hold-off time over, scheduling restart.
Oct 12 02:20:23 Lakka systemd[1]: Stopped Retroarch.
Oct 12 02:20:23 Lakka systemd[1]: Starting Retroarch...
Oct 12 02:20:23 Lakka systemd[1]: Started Retroarch.
Oct 12 02:20:23 Lakka kernel: Core dump to |/dev/null pipe failed
Oct 12 02:20:23 Lakka systemd[1]: retroarch.service: Main process exited, code=killed, status=11/SEGV
Oct 12 02:20:23 Lakka systemd[1]: retroarch.service: Unit entered failed state.
Oct 12 02:20:23 Lakka systemd[1]: retroarch.service: Failed with result 'signal'.
Oct 12 02:20:25 Lakka kernel: rockchip-vop ff930000.vop: [drm:vop_crtc_enable] Update mode to 1024*768, close all win
Oct 12 02:20:25 Lakka systemd[1]: retroarch.service: Service hold-off time over, scheduling restart.
Oct 12 02:20:25 Lakka systemd[1]: Stopped Retroarch.
Oct 12 02:20:25 Lakka systemd[1]: Starting Retroarch...
Oct 12 02:20:25 Lakka systemd[1]: Started Retroarch.
Oct 12 02:20:26 Lakka kernel: rockchip-vop ff930000.vop: [drm:vop_crtc_enable] Update mode to 1440*900, close all win
Oct 12 02:20:26 Lakka kernel: dwhdmi-rockchip ff980000.hdmi: Rate 106000000 missing; compute N dynamically

scarabdesign commented 5 years ago

No dump, RetroArch running smoothly. Thanks!

Is there a switch I can use to re-enable the dumps if I want to try debugging crashes with some game cores?

ToKe79 commented 5 years ago

From the log it looks like it still wants to write the coredump, but the |/dev/null prevents it (maybe the pipe is unnecessary). So I guess with sysctl -w kernel.core_pattern=/storage/.cache/cores/core.%E.%t.%p you should get the original behavior back, i.e. the coredumps will be saved. And maybe ulimit -c unlimited is needed too.
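
Put together, the two commands above might be wrapped like this. It is an untested sketch under the same guesses as the comment itself; CORE_DIR, core_pattern_for and enable_coredumps are names invented here, and the sysctl call needs root.

```shell
#!/bin/sh
# Sketch of re-enabling core dumps with the pattern quoted above.
CORE_DIR="${CORE_DIR:-/storage/.cache/cores}"

# Build the core_pattern for a given directory.
# %E = executable path with '/' replaced by '!', %t = dump time (epoch
# seconds), %p = PID of the dumped process.
core_pattern_for() {
    printf '%s/core.%%E.%%t.%%p\n' "$1"
}

# Create the target directory, point the kernel at it, and lift the
# core size limit for this shell and the processes it starts.
enable_coredumps() {
    mkdir -p "$CORE_DIR" || return 1
    sysctl -w kernel.core_pattern="$(core_pattern_for "$CORE_DIR")" || return 1
    ulimit -c unlimited
}

# Usage (as root): enable_coredumps
```

Note that ulimit only affects the current shell and its children, so for RetroArch started by systemd the limit would have to be raised in the service itself (LimitCORE= is the systemd directive for that).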