Cinnamon crash on start

linuxmint / cinnamon

A Linux desktop featuring a traditional layout, built from modern technology and introducing brand new innovative features.

GNU General Public License v2.0


Cinnamon crash on start #11635

Closed Soapux closed 1 year ago

Soapux commented 1 year ago

 * Cinnamon version (5.6.8)
 * Distribution - Linux From Scratch
 * Nvidia 1070 Ti with 525.105.17
 * 64 bit
 * GCC 12.2
 * Glibc 2.37
 * Binutils 2.40
 * Attach ~/.xsession-errors, or /var/log/syslog

Backtrace

#0  0x00007f137da449c7 in g_closure_unref (closure=0xed158b4c08245489) at ../gobject/gclosure.c:621
#1  0x00007f137d1b43e5 in gjs_marshal_callback_release(JSContext*, GjsArgumentCache*, GjsFunctionCallState*, GIArgument*, GIArgument*) (in_arg=0x211aa90, out_arg=<optimized out>) at ../gi/arg-cache.cpp:939
#2  0x00007f137d1c1d92 in Gjs::Function::finish_invoke(JSContext*, JS::CallArgs const&, GjsFunctionCallState*, _GIArgument*) (this=0x211f350, cx=0x17cf4f0, args=..., state=0x7ffc93238840, r_value=0x0) at ../gi/function.cpp:1040
#3  0x00007f137d1c3528 in Gjs::Function::invoke(JSContext*, JS::CallArgs const&, JS::Handle<JSObject*>, _GIArgument*) (this=<optimized out>, context=0x17cf4f0, args=<optimized out>, this_obj=..., r_value=<optimized out>) at ../gi/function.cpp:1003
#4  0x00007f137d1c406f in Gjs::Function::call(JSContext*, unsigned int, JS::Value*) (context=context@entry=0x17cf4f0, js_argc=<optimized out>, vp=<optimized out>) at /usr/include/mozjs-78/js/RootingAPI.h:596
#5  0x00007f137a32b40a in CallJSNative(JSContext*, bool (*)(JSContext*, unsigned int, JS::Value*), js::CallReason, JS::CallArgs const&) (cx=0x17cf4f0, native=0x7f137d1c3f90 <Gjs::Function::call(JSContext*, unsigned int, JS::Value*)>, reason=js::CallReason::Call, args=...) at /home/user/sources/general_libs/firefox-78.15.0/js/src/vm/Interpreter.cpp:493
#6  js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct, js::CallReason) (cx=cx@entry=0x17cf4f0, args=..., construct=construct@entry=js::NO_CONSTRUCT, reason=reason@entry=js::CallReason::Call) at /home/user/sources/general_libs/firefox-78.15.0/js/src/vm/Interpreter.cpp:565
#7  0x00007f137a3243ad in InternalCall(JSContext*, js::AnyInvokeArgs const&, js::CallReason) (cx=0x17cf4f0, args=..., reason=js::CallReason::Call) at /home/user/sources/general_libs/firefox-78.15.0/js/src/vm/Interpreter.cpp:648
#8  js::CallFromStack(JSContext*, JS::CallArgs const&) (cx=0x17cf4f0, args=...) at /home/user/sources/general_libs/firefox-78.15.0/js/src/vm/Interpreter.cpp:652
#9  Interpret(JSContext*, js::RunState&) (cx=<optimized out>, cx@entry=0x17cf4f0, state=...) at /home/user/sources/general_libs/firefox-78.15.0/js/src/vm/Interpreter.cpp:3312
#10 0x00007f137a31b9ef in js::RunScript(JSContext*, js::RunState&) (cx=0x17cf4f0, state=...) at /home/user/sources/general_libs/firefox-78.15.0/js/src/vm/Interpreter.cpp:465
#11 0x00007f137a32c07b in js::ExecuteKernel(JSContext*, JS::Handle<JSScript*>, JS::Handle<JSObject*>, JS::Handle<JS::Value>, js::AbstractFramePtr, JS::MutableHandle<JS::Value>) (cx=0xed158b4c08245489, script=..., envChainArg=..., newTargetValue=..., evalInFrame=..., result=...) at /home/user/sources/general_libs/firefox-78.15.0/js/src/vm/Interpreter.cpp:840
#12 0x00007f137a412b58 in EvaluateSourceBuffer<char16_t>(JSContext*, js::ScopeKind, JS::Handle<JSObject*>, JS::ReadOnlyCompileOptions const&, JS::SourceText<char16_t>&, JS::MutableHandle<JS::Value>) (cx=cx@entry=0x17cf4f0, scopeKind=js::ScopeKind::NonSyntactic, env=env@entry=..., optionsArg=..., srcBuf=..., rval=rval@entry=...) at /home/user/sources/general_libs/firefox-78.15.0/js/src/vm/CompilationAndEvaluation.cpp:498
#13 0x00007f137a412c92 in JS::Evaluate(JSContext*, JS::Handle<JS::StackGCVector<JSObject*, js::TempAllocPolicy> >, JS::ReadOnlyCompileOptions const&, JS::SourceText<char16_t>&, JS::MutableHandle<JS::Value>) (cx=0x17cf4f0, envChain=envChain@entry=..., options=..., srcBuf=..., rval=rval@entry=...) at /home/user/sources/general_libs/firefox-78.15.0/js/src/vm/CompilationAndEvaluation.cpp:529
#14 0x00007f137d1f0321 in GjsContextPrivate::eval_with_scope(JS::Handle<JSObject*>, char const*, long, char const*, JS::MutableHandle<JS::Value>) (this=0x17c28f0, scope_object=..., script=<optimized out>, script_len=<optimized out>, filename=<optimized out>, retval=...) at /usr/include/mozjs-78/js/RootingAPI.h:903
#15 0x00007f137d1f05e8 in GjsContextPrivate::eval(char const*, long, char const*, int*, _GError**) (this=0x17c28f0, script=0x7f137dcebf50 "imports.ui.environment.init();imports.ui.main.start();", script_len=-1, filename=0x7f137dcebee2 "<main>", exit_status_p=0x7ffc93239cd4, error=0x7ffc93239cd8) at /usr/include/mozjs-78/js/RootingAPI.h:596
#16 0x00007f137d1f0798 in gjs_context_eval(GjsContext*, char const*, gssize, char const*, int*, GError**) (js_context=js_context@entry=0x17c2a50 [GjsContext], script=script@entry=0x7f137dcebf50 "imports.ui.environment.init();imports.ui.main.start();", script_len=script_len@entry=-1, filename=filename@entry=0x7f137dcebee2 "<main>", exit_status_p=exit_status_p@entry=0x7ffc93239cd4, error=error@entry=0x7ffc93239cd8) at ../cjs/context.cpp:1192
#17 0x00007f137dcd990d in cinnamon_plugin_start (plugin=<optimized out>) at ../src/cinnamon-plugin.c:127
#18 0x00007f137d58f9c7 in meta_plugin_manager_new (compositor=compositor@entry=0x1a76bd0 [MetaCompositorX11]) at ../src/compositor/meta-plugin-manager.c:113
#19 0x00007f137d5887ff in meta_compositor_manage (compositor=0x1a76bd0 [MetaCompositorX11]) at ../src/compositor/compositor.c:619
#20 0x00007f137d5a473b in enable_compositor (display=0x1af8ea0 [MetaDisplay]) at ../src/core/display.c:628
#21 meta_display_open () at ../src/core/display.c:949
#22 0x00007f137d5aedbc in meta_run () at ../src/core/main.c:660
#23 0x0000000000402772 in main (argc=<optimized out>, argv=<optimized out>) at ../src/main.c:388

xsession-errors:

discover_other_daemon: 1discover_other_daemon: 1discover_other_daemon: 1mutter-Message: 18:20:54.824: Enabling experimental feature 'x11-randr-fractional-scaling'
Gjs-Message: 18:20:55.078: Profiler is disabled. Not setting up signals.
Gjs-Message: 18:20:55.200: JS LOG: About to start Cinnamon
Gjs-Message: 18:20:55.220: JS LOG: [LookingGlass/info] Cinnamon.AppSystem.get_default() started in 18 ms
Gjs-Message: 18:20:55.222: JS LOG: [LookingGlass/info] loading user theme: /usr/share/themes/Mint-Y-Dark-Sand/cinnamon/cinnamon.css
Gjs-Message: 18:20:55.234: JS LOG: [LookingGlass/info] added icon directory: /usr/share/themes/Mint-Y-Dark-Sand/cinnamon
Exception in thread Thread-2 (wait_for_process):
Traceback (most recent call last):
  File "/usr/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.11/threading.py", line 975, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/bin/cinnamon-launcher", line 88, in wait_for_process
    os.execvp(FALLBACK_COMMAND, (FALLBACK_COMMAND,) + FALLBACK_ARGS)
  File "<frozen os>", line 574, in execvp
  File "<frozen os>", line 616, in _execvpe
  File "<frozen os>", line 607, in _execvpe
FileNotFoundError: [Errno 2] No such file or directory
Failed to play sound: Sound disabled
** Message: 18:20:59.646: nemo-desktop: session is cinnamon, establishing proxy
cinnamon-session[383]: WARNING: t+5.48871s: Detected that screensaver has appeared on the bus

** (nemo-desktop:595): WARNING **: 18:21:04.943: nemo-desktop: Desktop failsafe timeout reached, applying fallback behavior

(xapp-sn-watcher:590): GLib-GObject-CRITICAL **: 18:21:29.937: g_object_set: assertion 'G_IS_OBJECT (object)' failed

(xapp-sn-watcher:590): GLib-GIO-CRITICAL **: 18:21:29.937: g_dbus_interface_skeleton_flush: assertion 'G_IS_DBUS_INTERFACE_SKELETON (interface_)' failed

(xapp-sn-watcher:590): GLib-GObject-CRITICAL **: 18:21:29.937: g_object_set: assertion 'G_IS_OBJECT (object)' failed

(xapp-sn-watcher:590): GLib-GIO-CRITICAL **: 18:21:29.937: g_dbus_interface_skeleton_flush: assertion 'G_IS_DBUS_INTERFACE_SKELETON (interface_)' failed

(xapp-sn-watcher:590): GLib-GObject-CRITICAL **: 18:21:29.937: invalid (NULL) pointer instance

(xapp-sn-watcher:590): GLib-GObject-CRITICAL **: 18:21:29.937: g_signal_emit_by_name: assertion 'G_TYPE_CHECK_INSTANCE (instance)' failed
cinnamon-session[383]: WARNING: t+35.49135s: Detected that screensaver has left the bus
cinnamon-session[383]: WARNING: t+1257.15547s: Application 'cinnamon-settings-daemon-automount.desktop' killed by signal 15
cinnamon-session[383]: WARNING: t+1257.15550s: Application 'cinnamon-settings-daemon-housekeeping.desktop' killed by signal 15
cinnamon-session[383]: WARNING: t+1257.15551s: Application 'cinnamon-settings-daemon-screensaver-proxy.desktop' killed by signal 15
cinnamon-session[383]: WARNING: t+1257.15552s: Application 'cinnamon-settings-daemon-a11y-settings.desktop' killed by signal 15
cinnamon-session[383]: WARNING: t+1257.15553s: Application 'cinnamon-settings-daemon-power.desktop' killed by signal 15
cinnamon-session[383]: WARNING: t+1257.15554s: Application 'cinnamon-settings-daemon-clipboard.desktop' killed by signal 15
cinnamon-session[383]: WARNING: t+1257.15674s: Application 'cinnamon-settings-daemon-keyboard.desktop' killed by signal 15
cinnamon-session[383]: WARNING: t+1257.15677s: Application 'cinnamon-settings-daemon-media-keys.desktop' killed by signal 15
cinnamon-session[383]: WARNING: t+1257.16166s: Application 'cinnamon-settings-daemon-xsettings.desktop' killed by signal 15
cinnamon-session[383]: WARNING: t+1257.16173s: Application 'cinnamon-settings-daemon-background.desktop' killed by signal 15
cinnamon-session[383]: WARNING: t+1257.16176s: Application 'cinnamon-settings-daemon-color.desktop' killed by signal 15
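
A note on the Python traceback in the log above: it comes from cinnamon-launcher trying to `os.execvp` its `FALLBACK_COMMAND` after Cinnamon itself went down, and failing because that command is not installed or not on PATH. It is a secondary symptom rather than the cause of the crash. The following is only a minimal sketch of how such a fallback could be checked before exec'ing it; the function names are hypothetical and this is not the launcher's actual code.

```python
import shutil


def resolve_fallback(command, search_path=None):
    """Return the absolute path of `command`, or None if it cannot be found.

    `search_path` defaults to the PATH environment variable.
    """
    return shutil.which(command, path=search_path)


def launch_plan(command, args=(), search_path=None):
    """Build the argv an exec call would receive, or an error message.

    Hypothetical helper: returns (argv, None) on success and
    (None, message) when the fallback command is missing.
    """
    resolved = resolve_fallback(command, search_path)
    if resolved is None:
        return None, f"fallback command {command!r} not found on PATH"
    return (resolved, *args), None


if __name__ == "__main__":
    # "no-such-window-manager" is a made-up name used to show the failure path.
    argv, error = launch_plan("no-such-window-manager", ("--replace",))
    print(argv, error)
```

On a Linux From Scratch system this kind of check would simply report that the fallback window manager was never built, which matches the `FileNotFoundError` above.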

Nightwing0815 commented 1 year ago

Hey there, I've been getting crashes for a few days, too. @Soapux: sorry for hijacking your thread, but I think it's almost the same error?! :smile:

My HP laptop crashes, but my ASUS with NVIDIA doesn't; it's weird... I just created a new user and updated some Mesa drivers that had been on hold. With those updates it crashes 100% of the time, but when I roll back with Timeshift it still crashes two or three times a day after a restart. I'm attaching the full mintreport with a stack trace; maybe it's a useful hint for you.

ToM

cinnamon crash.tar.gz

Note: the kernel in use isn't the problem; 5.15 / 6.10oem / 6.2 / 6.3 all behave the same.

Gaturinho commented 1 year ago

I've also been having catastrophic cinnamon-session crashes for 40 days now, and they are more or less random: sometimes right at the start of the session, sometimes after 6 hours, there's a total Cinnamon crash, and restarting Cinnamon doesn't fix it (it doesn't even restart; it just stops responding to ANY command). Icons disappear and launchers are unresponsive. The terminal and the Shutdown button only respond (when they respond at all) with "input/output error", and after a few minutes the screen goes completely black. The only way to turn the machine off is with the power button, a hard power-off without unmounting the disks. The scariest thing is the syslog, which has a "hole" for the entire time the "phenomenon" lasts: no data, no stack trace, no journaling. In the moments leading up to the crash, the syslog always shows just something like:

Apr 27 16:36:01 Valentine at-spi-bus-launcher[5151]: dbus-daemon[5151]: Activating service name='org.a11y.atspi.Registry' requested by ':1.0' (uid=1000 pid=4767 comm="cinnamon-session --session cinnamon " label="unconfined")
Apr 27 16:36:01 Valentine at-spi-bus-launcher[5151]: dbus-daemon[5151]: Successfully activated service 'org.a11y.atspi.Registry'
Apr 27 16:36:01 Valentine at-spi-bus-launcher[5154]: SpiRegistry daemon is running with well-known name - org.a11y.atspi.Registry
Apr 27 16:36:01 Valentine cinnamon-session[4767]: WARNING: t+0.02787s: Could not read /home/invernos/.config/autostart/mainline-notify.desktop: Key file has no key “Name” in the “Desktop Entry” group
Apr 27 16:36:01 Valentine cinnamon-session[4767]: WARNING: t+0.03145s: Could not read /home/invernos/.config/autostart/MOC player.desktop: Key file contains line “�# 032meta#001��t#011��#005�� `W%x��:��#003��#002z��#004��\EF \BF

That is, random D-Bus warnings (no indication of failure, but always some D-Bus warning, which may come from dbus-broker, dbus-daemon or at-spi-bus-launcher) followed by a string of invalid characters that say nothing (the invalid characters can go on for as long as it takes me to shut down and restart the machine). I can't find out what the "input/output error" could be. Has the shell lost its connection to the kernel? Is it a disk I/O failure? Only Saint Torvalds knows... I've already changed the kernel and reinstalled the whole system, but the common factor is always Mint 21.1 Vera. I have also found similar complaints on other forums (random catastrophic crashes of the Cinnamon session, with no relevant information in syslog, dmesg, the journal or a stack trace), always with one thing in common: the most recent version of Cinnamon, not only on Mint Vera but also on Manjaro, Debian, Gentoo, Arch and other distros that may or may not use Cinnamon. So, following my philosophy that developers should always be the first to know, and have the right to be kept informed, to advise and to be heard, I decided to report it here. Honestly, in the 15 years I've been using Cinnamon on Mint (and I've used it on Elementary OS, when Pantheon broke), this is the first time I've seen a serious flaw in Cinnamon. It has always worked flawlessly, and the developers' work has always been excellent. But it was really to be expected that something could go wrong in the migration to GNOME 3...
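
The two "Could not read ... .desktop" warnings in the log excerpt above point at autostart entries that cinnamon-session cannot parse, one with no Name key and one that contains binary data. As a rough aid for finding such files offline, here is a minimal sketch that checks the raw bytes of a .desktop file. It uses Python's configparser as an approximation of the key-file format, which is an assumption, and `check_desktop_entry` is a hypothetical helper name.

```python
import configparser


def check_desktop_entry(raw_bytes):
    """Return None if the entry looks usable, else a short problem description.

    Approximation only: configparser is not a full implementation of the
    desktop-entry key-file format, but it catches the two cases seen in
    the log (binary garbage and a missing "Name" key).
    """
    try:
        text = raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return "not valid UTF-8 text"

    parser = configparser.ConfigParser(
        interpolation=None,       # '%' has no special meaning in .desktop files
        strict=False,             # tolerate duplicate keys
        delimiters=("=",),
        comment_prefixes=("#",),
    )
    parser.optionxform = str      # keys are case-sensitive
    try:
        parser.read_string(text)
    except configparser.Error as exc:
        return f"unparseable: {type(exc).__name__}"

    if not parser.has_section("Desktop Entry"):
        return 'missing "Desktop Entry" group'
    if not parser.has_option("Desktop Entry", "Name"):
        return 'missing key "Name"'
    return None


if __name__ == "__main__":
    good = b"[Desktop Entry]\nType=Application\nName=MOC player\nExec=mocp\n"
    print(check_desktop_entry(good))
    print(check_desktop_entry(b"[Desktop Entry]\nType=Application\nExec=mocp\n"))
    print(check_desktop_entry(b"\xff\xfe\x00\x01 binary junk"))
```

Running it over every file in ~/.config/autostart (read in binary mode) would list the entries worth moving aside.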

By the way, the configuration here is an i5, 64-bit, with 16 GB of DDR4, Tuned set to throughput-performance, and a session with Clementine, 3 Nemo windows and 20 Chrome tabs open. Swappiness is set to just 5, and cpulimit and ulimit are set fine. The only abnormality is the kernel parameters: GRUB_CMDLINE_LINUX_DEFAULT="plymouth:debug splash zswap.enabled=1 zswap.max_pool_percent=20 zswap.compressor=lz4 acpi = force quiet vga=current fsck.repair=yes" which even I find extravagant, and sometimes I get warnings that ACPI is misconfigured.

Gaturinho commented 1 year ago

Excuse me! I completely forgot one important detail: in the months leading up to the issue I just reported, there was another issue I thought I'd never see again since Bionic Beaver: the gpe6F interrupt. Despite my heavy use (as I reported above, I always work with an exorbitant number of heavy applications open), the 4 cores of my processor almost never exceed 8%. I need to open another heavy application (such as GIMP or KolourPaint) for it to reach 16%, and from there, and ONLY from there, CPU consumption goes up quickly. But I had been noticing the system monitor reporting CPU spikes of 98% on a single core, then jumping to another core while the first one went back to normal. I checked and saw that the processor was receiving a gpe6F interrupt about 200 times in a single cycle. To the inexperienced it looks like an overheating problem, although thermald and psensor report that both the CPU and the disks are at normal temperatures (on my machine, "normal" temperature is really low: 30 to 38°C for the cores, and the SSDs are in the 86°C range). But running the command

grep . -r /sys/firmware/acpi/interrupts/

shows several Kworks in operation: 5 for gpe66, sci and gpe_all, but about 200 for the gpe13 switch, which is an indication of a driver malfunction, so it would be a kernel problem. I adopted the same solution as in the days of Bionic-based Mint and included the line

echo "disable" | sudo tee /sys/firmware/acpi/interrupts/gpe6F

in my rc.local to automatically disable gpe6F on startup, and that way there are no more spikes in CPU consumption.
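
To make the output of the grep command above easier to read, the counters can be sorted so that the noisiest interrupt stands out. This is a minimal sketch that assumes each line of the grep output has the form `path: count flags...`; the sample numbers are made up for illustration and are not taken from the reporter's machine.

```python
import os

# Made-up sample in the assumed "path: count flags" layout.
SAMPLE = """\
/sys/firmware/acpi/interrupts/gpe6F:  204811  EN STS enabled      unmasked
/sys/firmware/acpi/interrupts/gpe66:       5  EN     enabled      unmasked
/sys/firmware/acpi/interrupts/gpe01:       0         invalid      unmasked
"""


def parse_gpe_counts(grep_output):
    """Map interrupt name -> count from `grep . -r` style output."""
    counts = {}
    for line in grep_output.splitlines():
        path, sep, rest = line.partition(":")
        fields = rest.split()
        # Skip anything that does not look like "path: <integer> ...".
        if not sep or not fields or not fields[0].isdigit():
            continue
        counts[os.path.basename(path)] = int(fields[0])
    return counts


def noisiest(counts, top=3):
    """Return the `top` interrupts with the highest counts, largest first."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:top]


if __name__ == "__main__":
    for name, count in noisiest(parse_gpe_counts(SAMPLE)):
        print(f"{name:8s} {count}")
```

Comparing two snapshots taken a few seconds apart would show which counter is actually growing, which is more telling than the absolute value.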

I don't know if this has relevance or anything to do with the cinnamon-session crashes or the Dbus warnings, but I thought it was good to report. The current kernel is 5.19.0-41-generic, and the previous ones (when the problems started) were 5.19.0-32-generic and -35 and I noticed the following line appearing in the older logs with some frequency:

g_dbus_connection_call_internal: assertion 'bus_name == NULL || g_dbus_is_name (bus_name)' failed

They are always Dbus warnings, and the system monitor always indicates duplicate bus processes: 2 dbus-broker, 2 dbus-broker-launch, 2 dbus-daemon, but all with normal resource consumption. But I never had the opportunity to check if the consumption of these processes continues normally during the crashes, because I can't even open the terminal to check with the top and iotop commands, and because the crashes don't generate logs, as I reported.

Gaturinho commented 1 year ago

Well, news... but not so good. Because of the problem with the syslogs, I changed tactics: when there was a new occurrence at startup, 2 days later, I didn't insist; I went into a live session to retrieve /var/log/syslog and /home/.xsessions and copy them to a safe device, because they would be overwritten if I rebooted successfully. Then I rebooted. (It almost always reboots after running fsck, which is weird, because I do this regularly and I don't delete lost+found, so there's no reason to require fsck again and again. So it's definitely not a problem of bad blocks on the 3 SSDs and the backup HD. But it is a fact that the Linux kernel is increasingly intolerant of any disk errors, and nothing less than perfect disks will do. Pure stupidity of the kernel developers, but pointing that out is not helpful.)

Checking the logs, there were errors in several Mint autostart services (/home/"user"/.config/autostart). The strange thing is that almost all of them are disabled autostart services, some left over from several previous installations, such as pulseaudio (today I use pipewire exclusively, with success), but for some reason the system was reading ALL of them, and crashing. I removed all the unnecessary autostart files. Or rather, I did NOT delete them; I created a "garbage" folder inside the autostart folder and dragged the unnecessary files there, which is safer. This greatly improved stability and eliminated a bunch of error warnings that polluted syslog and .xsessions.

It took 2 days for a new crash on boot. This time there were warnings involving gio and pixbuf, AFTER a series of previously unnoticed kernel warnings, and finally the D-Bus collapse. The set looked like this (I cleaned it up a bit, leaving only the relevant lines):

May 4 17:14:27 Valentine kernel: [ 41.694053] ata1: link is slow to respond, please be patient (ready=0)
May 4 17:14:27 Valentine kernel: [ 42.276605] [UFW BLOCK] IN=wlan0 OUT= MAC=01:00:5e:00:00:01:98:7e:ca:1a:4d:90:08:00 SRC=192.168.15.1 DST=224.0.0.1 LEN=28 TOS=0x00 PREC=0x00 TTL=1 ID=0 DF PROTO=2
May 4 17:14:27 Valentine kernel: [ 42.276997] [UFW BLOCK] IN=wlan0 OUT= MAC=01:00:5e:00:00:01:98:7e:ca:1a:4d:90:08:00 SRC=192.168.15.1 DST=224.0.0.1 LEN=28 TOS=0x00 PREC=0x00 TTL=1 ID=0 DF PROTO=2
May 4 17:14:27 Valentine kernel: [ 42.277335] [UFW BLOCK] IN=wlan0 OUT= MAC=01:00:5e:00:00:01:98:7e:ca:1a:4d:90:08:00 SRC=192.168.15.1 DST=224.0.0.1 LEN=28 TOS=0x00 PREC=0x00 TTL=1 ID=0 DF PROTO=2
May 4 17:14:27 Valentine dbus-daemon[4932]: [session uid=1000 pid=4932] Activating via systemd: service name='org.freedesktop.Tracker3.Miner.Extract' unit='tracker-extract-3.service' requested by ':1.23' (uid=1000 pid=5354 comm="/usr/libexec/tracker-miner-fs-3 " label="unconfined")

The actual failure came right after the following lines:

May 4 17:14:33 Valentine kernel: [ 48.222044] ata4.00: exception Emask 0x50 SAct 0x80 SErr 0x4090800 action 0xe frozen
May 4 17:14:33 Valentine kernel: [ 48.222050] ata4.00: irq_stat 0x00400040, connection status changed
May 4 17:14:33 Valentine kernel: [ 48.222052] ata4: SError: { HostInt PHYRdyChg 10B8B DevExch }

Looking at old logs, it always started with the line

ata1: link is slow to respond, please be patient (ready=0)

followed by several failure warnings until

ata4.00: exception Emask 0x50 SAct 0x80 SErr 0x4090800 action 0xe frozen

The ATA port (all SSD) could vary: 1, 2, 3 or 4. The message, however, was always "exception Emask 0x50 SAct 0x80 SErr 0x4090800 action 0xe frozen".
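
The pattern described above ("link is slow to respond" first, a frozen exception later, on varying ports) can be tallied per port from saved logs. The following is a minimal sketch, assuming syslog lines in the same layout as the excerpts in this comment; `summarize_ata_events` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches e.g. "kernel: [ 48.222044] ata4.00: exception Emask ..."
ATA_EVENT = re.compile(r"kernel: \[\s*[\d.]+\] (ata\d+)(?:\.\d+)?: (.*)")


def summarize_ata_events(lines):
    """Count slow-link and frozen-exception messages per ATA port."""
    slow = Counter()
    frozen = Counter()
    for line in lines:
        match = ATA_EVENT.search(line)
        if not match:
            continue
        port, message = match.groups()
        message = message.strip()
        if "link is slow to respond" in message:
            slow[port] += 1
        elif message.startswith("exception Emask") and message.endswith("frozen"):
            frozen[port] += 1
    return slow, frozen


if __name__ == "__main__":
    sample = [
        "May 4 17:14:27 Valentine kernel: [ 41.694053] ata1: link is slow to respond, please be patient (ready=0)",
        "May 4 17:14:33 Valentine kernel: [ 48.222044] ata4.00: exception Emask 0x50 SAct 0x80 SErr 0x4090800 action 0xe frozen",
        "May 4 17:14:33 Valentine kernel: [ 48.222050] ata4.00: irq_stat 0x00400040, connection status changed",
    ]
    slow, frozen = summarize_ata_events(sample)
    print("slow links:", dict(slow))
    print("frozen exceptions:", dict(frozen))
```

If the same port shows up in both counters across many boots, that points at one cable, port or drive; if the ports vary, as reported here, a controller or power issue becomes more plausible.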

Searching, I found an enlightening article on a forum:

https://serverfault.com/questions/749433/hard-resetting-link-exception-emask-0x50-sact-0x0-serr-0x4090800-action-0xe-froz

The machine is new, all the disks are only 1 year old, and SMART finds no faults, but this passage caught my attention:

"very high-latency IOPS operations (eg: caused by SSD controller's garbage collection) resulting in SATA command timeout. Do your drive supports SATA Trim command? If so, try running fstrim /. Does it change anything?"

The SSDs are Kingston, Phison-driven (SSD III), with a garbage collector, which indeed generates "very high-latency IOPS operations", as the iotop command confirmed. But what was most enlightening was a comment a few paragraphs earlier:

"Notice that the system is renegotiating at 1.5 Gbps. Try forcing 1.5 Gbps and see if that makes the system stable. It's a data point. Try askubuntu.com/a/146290/11751 for a short writeup on how to".

In brief, there is a sudden change (a catastrophic slowdown) in I/O speed at a critical time. Normally I/O scheduling is not an issue, because everything is limited by CPU capacity, which is a bigger bottleneck than read and write access. But the SSD's garbage collection and Trim are performed largely by the SSD's own internal systems, with disproportionately little demand on the CPU, while greatly limiting the read and write capacity of an SSD III. The CPU does not reduce its read and write demands in the same proportion, so the bottleneck becomes the I/O scheduling, and the kernel is not up to it: it lets the scheduling buffer overflow and the system crashes. That is the idea, at least. While researching this I found this page on Archlinux.org, a very strong reference site for Linux users:

https://wiki.archlinux.org/title/Improving_performance

Its "Kernel's I/O schedulers" item covers what seems to be responsible for the I/O overflow I noticed. I'll try the optimization tip mentioned in item 2.5.4, "Changing I/O scheduler". So far I've noticed an apparent drop in I/O load at startup, but it's hard to open iotop at the moment of maximum CPU load at startup (even with Tuned). So I tried adding the iotop command to startup

gnome-terminal -- sh -c "iotop"

successfully. It remains now to see if it solved the problem or if it's just lost hope...
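
Before and after changing the I/O scheduler, it helps to record which one each disk is actually using. On Linux the file /sys/block/<device>/queue/scheduler lists the available schedulers with the active one in square brackets. The following is a minimal sketch for reading it; the function names are hypothetical.

```python
import re
from pathlib import Path


def parse_scheduler(text):
    """Return (active, available) from the contents of a scheduler file.

    Example contents: "mq-deadline kyber [bfq] none"
    """
    match = re.search(r"\[([^\]]+)\]", text)
    active = match.group(1) if match else None
    available = text.replace("[", "").replace("]", "").split()
    return active, available


def read_schedulers(sys_block="/sys/block"):
    """Map each block device to its (active, available) schedulers."""
    result = {}
    for queue_file in sorted(Path(sys_block).glob("*/queue/scheduler")):
        device = queue_file.parent.parent.name
        try:
            result[device] = parse_scheduler(queue_file.read_text())
        except OSError:
            # Device disappeared or is not readable; skip it.
            continue
    return result


if __name__ == "__main__":
    for device, (active, available) in read_schedulers().items():
        print(f"{device}: active={active} available={', '.join(available)}")
```

Saving this output next to the copied syslog gives a record of which scheduler was in effect when a given crash happened.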

But, of course, the most immediate prophylactic measure is to space the processes that start automatically via autostart, along 1, instead of cramming them to start them all in zero seconds. This I did there at the beginning, but without good results

For my part, I think it would be more productive if the kernel waited until CPU consumption dropped before starting the SSDs' garbage collection service.

Nightwing0815 commented 1 year ago

Interesting findings; that took some time to research, I guess? I've done nothing so far except update to mainline kernel 6.3.1 yesterday. Since then I've had a bunch of (re)starts without Cinnamon crashing, so that's better, because I don't have the time to research; lots of work to do...

ToM

Gaturinho commented 1 year ago

Nightwing, I'm not one of those radicals who are against Mainline, Timeshift, proprietary drivers, PPAs and grub-customizer and who want a return to the days of wood-burning computers (all those "right-wing Linux radicals", like the security fanatics; I don't even want to sound like one)... but... a piece of advice for you: be careful with Mainline, because in some distros like Linux Mint the kernel is optimized for the distro, since the distro does some things differently (Mint, for example, uses tmpfs and usrmerge by default, which Ubuntu does not). An Ubuntu kernel on it can therefore produce problems that appear sooner or later (usually it takes only a few days for instabilities to appear), so it is better to stay with official Mint kernel releases. I speak from experience: I've been saved by Timeshift more than once after problems caused by Mainline. From Mainline, only the stable or oldstable versions are safe (that is, the old ones). If the official release for Mint Vera is 5.19.0-41-generic, you risk a lot trying kernel 5.30 or 6.0.

As for my research (6 hours of research per day, for 5 days), what I wanted to pass on was not a solution (I'm still testing it) but a test methodology: don't restart; go into a live session, search the logs, and take advantage of the live session to run fsck on all partitions (sudo fsck.ext4 -yf /dev/sdx) if the partition is ext4. This solves the problem, and you can reboot afterwards with certainty that the system will come back fine.

Gaturinho commented 1 year ago

One more tip: another "problem" is Timeshift. By default it doesn't restore boot in Mint, because even though the "boot" directory appears in the root of the system, it's not really there; it's on another partition, and what is in the boot directory is just a symlink to that partition (usually partition 2), and Timeshift doesn't read symlinks. You have to include "boot" in Timeshift manually (Timeshift > configuration > filters > manually add the path to "boot"; it will show as -, so change it to +). Extra tip: to copy the path, just select the entire boot folder, click "copy", and paste into the field for adding the path. Instead of pasting the folder into the field, which is impossible, the system pastes the path to it. It's easier that way. Just in case, you can also duplicate the boot folder in the system folder, but you'll have to update it (delete it and then duplicate the original again) every time you do a kernel update.

Nightwing0815 commented 1 year ago

@Gaturinho thanks for your tips, I'll have a look at it over the weekend! Question: how do I get the newest drivers from a kernel into a Mint-optimized kernel from the Mint devs? I need to get SPDIF working on both machines. Sound in Linux drives me crazy, that's the only thing...

ToM

leigh123linux commented 1 year ago

@Gaturinho thanks for your tips, I'll have a look at it over the weekend! Question: how do I get the newest drivers from a kernel into a Mint-optimized kernel from the Mint devs? I need to get SPDIF working on both machines. Sound in Linux drives me crazy, that's the only thing...

ToM

There is no such thing as a mint-optimized kernel, mint uses the ubuntu kernel packages.

Gaturinho commented 1 year ago

Tom's comment was very correct and timely. To clarify, then: by "optimized" I just mean that certain features are "up" (loaded) by default in Mint, such as tmpfs and usrmerge. Pipewire, zram and zswap are available, but their kernel modules are not loaded by default. I just thought saying "optimized" was more intuitive, and I still do, but Tom is right to point out that this is misleading. All features available in a Linux kernel version are available to all distros using that same kernel version, such as tmpfs, smart, usrmerge, pipewire, zram and zswap. But some distros are optimized to use these features and some are not; i.e. it's more the distro that is optimized than the kernel, not that it makes much difference, because in the end the kernel reflects the distro. For example, pipewire in Debian requires following some long-suffering tutorials. Pipewire (a much better and more stable sound driver than pulseaudio) in Mint 21.1 Vera just needs to be enabled. If you install the Easy Effects equalizer, it enables pipewire automatically. For zram and zswap, just add them to the kernel parameters. In short, kernel and system are not separate things, much less independent: the system enables the kernel and its features, and the kernel provides those features to the system. It would be possible to debate this subject for hours, but in the end it would be the same as debating whether the chicken or the egg came first.

As for your problem with sound, maybe you should look into Pipewire. PulseAudio is hell! Unstable and baroque. It has to do with the very way pulseaudio was developed. At first it was just a new ALSA dependency, but it grew chaotically and without organization, and is the closest thing to a slum that exists in Linux. Rather than revamp it yet again, the developers abandoned it and created something new from scratch, better suited to digital audio processing.

If you type linux*pipewire*SPDIF into Google (written that way, with asterisks) you'll find lots of links and forum discussions, and lots of solutions to problems if any occur. I can't point you to the tutorial I used because it was for Mint 18 and doesn't work anymore. I didn't need it on Mint 20 or 21, because the settings were inherited from 18 and I only needed to install the necessary packages. Then it just worked by itself, and I don't even know why it went so well (maybe because I'm used to fighting with pulseaudio). If you type linux/mint/21*pipewire you'll find installation tips and tutorials. All you need to do is install pipewire-pulse and wireplumber (sudo apt install pipewire pipewire-pulse wireplumber pavucontrol), and take the opportunity to also install pavucontrol and Easy Effects, although the latter can only be installed through Flatpak (the program center), since the apt version is buggy. The Flatpak version has its own problem: there is a buggy runtime producing false negative warnings, but just apply the command "runtime/org.freedesktop.Platform/x86_64/19.08 -y" and then "flatpak --user repair" and restart the machine (the restart is mandatory, or it will keep repeating the error message no matter how many times you reinstall the damn runtime!) and everything will be available. To confirm, open the sound settings and then the system monitor, and you will see the "pipewire", "pipewire-pulse" and "wireplumber" processes running.

If you'll accept an additional tip: if, instead of VLC or SMPlayer, you want a lightweight audio (not video) player, try Clementine or Strawberry. VLC and SMPlayer are very heavy, consume a lot of machine resources and also have frequent bugs (1 time every 2 years). I have both, to use one in place of the other at those times, but it has happened that both of them went wrong at the same time (.m3u playlists have been giving them problems for some time now...).

Nightwing0815 commented 1 year ago

Cool, thx. That's a lot of input for after work :smile: Okay, I understand everything you wrote, but I have a few questions:

Which kernel components are active only in Mint (and not in e.g. OEM or mainline), and how can I figure this out?
If I can figure that out: can I activate them, for example with a parameter?
I'd check this on mainline 6.3.1...
I installed pipewire months ago, and from time to time I research digital output (PCM signal via HDMI and/or SPDIF) to my Teufel receiver. I've had no success so far...
I don't like Flatpak, so I don't use it. Simple as that :smile:

ToM

Gaturinho commented 1 year ago

Honestly, I don't know all of Mint's optimizations, just the ones that required a lot of discussion (it takes years of discussion before they are implemented; usrmerge took about 10 years, pipewire almost as long), but you're wrong to think they are "unique optimizations" of Mint. They are only on by default in Mint; they are not literally "exclusive". But being the default has some important implications. In the case of tmpfs, which I keep quoting (you'll see why), there are old tutorials (from about 8 to 10 years ago, with names like "10 things to do after installing Mint 19", or Ubuntu 20) that taught how to install zram, zswap and tmpfs. However, since Mint 20, tmpfs has been standard and dynamically optimized. (tmpfs moves the most volatile temporary files from the system's tmp directory into RAM, to "speed up" the system. Mint stands out for dynamically adjusting the amount of memory used for this, so that tmpfs doesn't take up an unnecessary amount of memory.) What those tutorials teach is to enable tmpfs in the kernel with a fixed amount of memory assigned to it, which is the most technical way to get it wrong and make tmpfs lower your system's performance instead of raising it. This goes for all optimizations.

Returning to the kernel issue, note that its updates come from the Mint repository, not from Ubuntu, precisely to keep the kernels adjusted to Mint's default optimizations. So, yes, the kernel is optimized (in the parameters that define which modules are loaded by default and which are not). For example, one of pulseaudio's problems was precisely that it didn't load some modules by default. This forced the user to create a file called "pa" in the pulseaudio folder (/home/.config/pulseaudio, I think) listing all the pulseaudio modules that should be loaded by default. I learned this one from Mestre Pinduvoz, of the Viva o Linux forum, who in turn learned it from the Grand Master Mr. Ein-sama, one of the Mint Forums administrators. But you had to use commands to list all active modules in a pulseaudio session, copy their names one by one (there were 83!) into that pa file, and run a few more commands. It was easy enough, compared to the hell that is dealing with networks, but you still felt like you had to be a NASA engineer. And all this because pulse tended to "switch off" in the middle of the session, which forced you to restart the machine, or kill the pulseaudio process in the system monitor and restart the media player... That's why I say that Pipewire is better: its failures are rare, one per month at most!

By the way, not all optimizations are applied by adding kernel parameters. Preload, for example, just needs to be installed (but it's already installed by default in Mint). There is a tutorial here about zram and zswap:

 https://linuxdicasesuporte.blogspot.com/2018/06/usar-zram-e-zswap-no-lugar-da-swap.html

It's in Portuguese, but Google Translate can handle it. There are English versions, but the instructions are crude.

Gaturinho commented 1 year ago

As for pipewire, it depends on your version of Mint. On Mint 21 it was running much better than on Mint 20 and 19. As for Flatpak, it's great that you don't use it: it fills the var directory with garbage (30 GB of garbage in temporary and cache files in mine, and some report 40!), so I moved the flatpak folder to another partition and left only a symlink in its place (as far as the system is concerned the flatpak folder is still in var, but it takes up space elsewhere). The problem is that everything in the Software Center is flatpak. Flatpaks are good for sensitive applications that break with some updates, or whose development is slow and doesn't keep up with daily Linux updates, so they are better off shipping their own dependencies, separate from the system's. Besides, sooner or later a version of some application may be buggy, and you can then fall back (even if only temporarily) on the flatpak or Snap version. For example, Kolourpaint is buggy in the color palette options (Kolourpaint > colors > KDE colors) in the regular and flatpak versions, but works fine in the Snap version. As for Easy Effects, there is a bug in the GitHub version, there is no deb, PPA or snap option, and only the flatpak remains. For Connman, only the GitHub version works well. And is that

Gaturinho commented 1 year ago

My experiment failed (changing I/O schedulers), and the system crashed again on boot. So this was not the solution. I will now try to restrict the Ulimit and CPUlimit limits further. I'll be back when I have news. The working hypothesis now will be that the task scheduler ( scheduler ) is doing something silly and messing up the processes.

Gaturinho commented 1 year ago

I've just made some amazing progress: trying ulimit didn't work (changing the rtprio parameter has no effect), but on examining the problem I found that an update had deleted my Cpulimit. I restored it, and now I'll wait and see. Anyway, it's safer to mess with that than with rtprio or the scheduler settings. And it's simpler too!

By the way: for the Cpulimit settings to take effect at startup, you need to register the service with the command:

sudo update-rc.d cpulimit defaults

Nightwing0815 commented 1 year ago

@Gaturinho what should be the response of the command? I'm sure, not this:

sudo update-rc.d cpulimit defaults
update-rc.d: error: unable to read /etc/init.d/cpulimit

ToM

Gaturinho commented 1 year ago

This type of error notice happens a lot. It just means you don't have a cpulimit file in /etc/init.d/cpulimit. CPUlimit involves 2 files: cpulimit_daemon.sh which is in /usr/bin; and cpulimit which lives in /etc/init.d, and is the initiator of cpulimit_daemon.sh. You can launch the cpulimit with the command

 sudo service cpulimit start

Check it with the command

 sudo service cpulimit status

and stop it with the command

  sudo service cpulimit stop

Because it is enabled automatically via init.d, it is not necessary to enable it via systemd. In fact, if you use the usual systemd command, "sudo systemctl enable cpulimit", you will only get an error message like this:

 update-rc.d: error: cpulimit Default-Start contains no runlevels, aborting

which does no damage, thanks to a safety mechanism. To enable it at startup, just use the cpulimit initiator file in /etc/init.d.

Running "sudo update-rc.d cpulimit defaults" is only needed to enable the service in init.d. The answer you got is the standard one when the launcher file to be enabled is missing. The phrase "contains no runlevels" in the error message above refers to the fact that starting a service via init.d requires the launcher to define the runlevels in which the application starts. That's how we did it back in the days before systemd. This is why we say that init.d is for starting "legacy services", that is, things inherited from previous versions of the current system. In English, I think, "legacy" and "inherited" are almost synonymous, but in other languages, like mine (Brazilian Portuguese, Lusitanian Portuguese and Castilian), these words have very different meanings. That's why I decided to explain the term, for those who don't speak English well. (I haven't mastered it myself; I use Google Translate, but it makes a lot of context errors, so I have to watch for ambiguities that cause translation errors. This means a lot of manual correction in my texts. That's why I edit so much.)
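To avoid the "unable to read /etc/init.d/cpulimit" error quoted earlier, it may help to check that the launcher exists before registering it. This is only a minimal sketch: the helper name init_script_ready and its optional directory argument are my own, not part of cpulimit or update-rc.d.

```shell
# Hypothetical helper: report whether a SysV init script exists and is executable.
# Usage: init_script_ready NAME [INIT_DIR]   (INIT_DIR defaults to /etc/init.d)
init_script_ready() {
    name=$1
    dir=${2:-/etc/init.d}
    if [ -x "$dir/$name" ]; then
        echo "ok: $dir/$name"
        return 0
    fi
    echo "missing or not executable: $dir/$name"
    return 1
}

# Example (not run here): only register the service if its script is present.
# init_script_ready cpulimit && sudo update-rc.d cpulimit defaults
```

If the helper reports the file as missing, registering the service cannot work until the launcher is put back in place.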

The cpulimit_daemon.sh script can be found, with instructions, at

https://ubuntuforums.org/showthread.php?t=992706

The script should be named "cpulimit_daemon.sh" and placed in /usr/bin.

And the cpulimit.sh launcher file in init.d is installed automatically when you install cpulimit. If that fails, try

 sudo apt install --reinstall cpulimit

Gaturinho commented 1 year ago

Irritating. Cpulimit is dead and buried. First it stopped starting with the system, though it could still be launched manually. Then it died completely, and any attempt to activate it just gave USELESS error warnings! It's the same problem I had with Tuned ages ago: all of a sudden, it stopped being accepted by systemd.

I decided not to fight with systemd: I gave up on cpulimit and replaced it with Ananicy (affectionately nicknamed "BANANACY"). Best tutorial at: https://www.routech.ro/pt-br/como-controlar-as-prioridades-do-aplicativo-com-ananicy-no-linux/

Gaturinho commented 1 year ago

Now things start to make more sense, and the problem was more insane than expected. I was right to point to the change in I/O speed as the cause, but the reason was not the I/O itself. It was the sudden start of Tuned, which seems to hit the I/O scheduler with a sudden increase in I/O requests, and the scheduler reacts by dropping the I/O speed (probably, it crashes). Only now do I understand why: until a few months ago I had to activate Tuned manually, because systemd did not accept its automatic startup. Now that it starts automatically again, the startup hangs. At first cpulimit held things together, if you were lucky, but then cpulimit stopped starting automatically before Tuned, and everything started crashing. The mid-session crashes were due to the absence of cpulimit, which started randomly, so it was present in some sessions and not in others. The final test now is to see whether Ananicy solves the problem (or I'll have to go back to starting Tuned manually).

The annoying thing is that "bananacy" only acts on specific applications, or at least on ones that fall into general categories such as web browsers or audio servers (pulseaudio, pipewire). It already comes with rules files (.rules) for practically all the most used applications on Linux (there are rules for VLC, Clementine, Thunar, and so on!), but for my use, at least, some important ones that cannot go unmonitored are missing. Stacer is a CPU hog: if I need to use it, it can crash the whole system, even with 4 cores available. The initramfs has an old niceness problem, and after some updates it can become very slow, leaving the system vulnerable every time an update is performed. It's a security issue: there's no point in something that can break all dependencies having a priority of just zero! It sometimes takes 30 minutes for apt to finish installing new dependencies, because the initramfs has lower priority than a music player. That puts the integrity of the system at great risk, so I change the initramfs to maximum priority, nice -20, using a script. And easyeffects is missing; it needs the same priority as pipewire, so I duplicated the pipewire file and renamed the copy easyeffects. By default, Ananicy gives pulseaudio and pipewire nice -11. The configuration file ("pipewire.rules") doesn't seem to contain any nice rule, but it has a type rule, "type": "audio-server", and everything with this type receives nice -11 by default, which is enough for real-time applications.
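Duplicating the pipewire rule for easyeffects could be scripted roughly like this. It is a sketch under assumptions: I am assuming each rule is a one-line JSON object with a "name" field next to the "type": "audio-server" entry mentioned above, and the helper name clone_rule is made up for illustration. Check the real contents of pipewire.rules before relying on it.

```shell
# Hypothetical helper: copy an Ananicy rule file and swap the process name inside it.
# Assumes rules look like: { "name": "pipewire", "type": "audio-server" }
# Usage: clone_rule SRC_FILE DST_FILE OLD_NAME NEW_NAME
clone_rule() {
    sed "s/\"name\"[[:space:]]*:[[:space:]]*\"$3\"/\"name\": \"$4\"/" "$1" > "$2"
}

# Example (not run here; the directory is an assumption about where the rules live):
# clone_rule /etc/ananicy.d/pipewire.rules /etc/ananicy.d/easyeffects.rules pipewire easyeffects
```

Only the name changes, so the copy keeps whatever type and priority the original rule carries.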

Gaturinho commented 1 year ago

I think it's time for a definitive answer, unless something surprising happens. I'll say it straight from the start: the input/output scheduler in kernel 5.19.0-41 is buggy, and this is likely to be the problem many people using newer versions of Mint have had. But the input/output scheduler seems to misbehave on any 5.xx.x kernel on mixed-disk systems (SSD plus mechanical HDD). I found the following Wikipedia page: https://www.wikiwand.com/en/Native_Command_Queuing. The section "Hard disk drives" - "Performance" says it all, with an explicit reference to the input/output scheduler, so I was on the right track, no doubt. There are many reports involving mainly Samsung hard drives, and I have also seen cases of problems with Hitachi and even older Seagates, always on Linux; they are certainly problems with the disk driver. NCQ can be disabled by adding the parameter libata.force=noncq in GRUB, right after "splash":

 GRUB_CMDLINE_LINUX_DEFAULT="quiet splash libata.force=noncq"

and this does not affect SSDs, but it does affect any other mechanical disks, possibly making them very slow, especially on laptops. Some forums say that it is possible to disable NCQ for just one specific hard disk, by running this in the terminal:

 $ sudo su

 # echo "1" > /sys/block/sdX/device/queue_depth

Where "X" is the letter of the drive whose NCQ is to be disabled. But this has not worked in Linux Mint, because the /sys/block/sdc/device/queue_depth file is not permanent but generated on the fly, so the instruction echo "1" > /sys/block/sdc/device/queue_depth has to be included in a bash script that runs at every boot. Not a solution for everyone...
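Before and after writing to queue_depth, it is worth reading the value back to see whether it stuck. A small sketch follows; the function name ncq_state and the optional second argument (an alternative sysfs root, only there so it can be tried on a copy) are my own inventions.

```shell
# Hypothetical helper: print the queue depth of a disk and whether NCQ is effectively off.
# A queue depth of 1 means commands are no longer queued.
# Usage: ncq_state DEVICE [SYSFS_ROOT]   (SYSFS_ROOT defaults to /sys)
ncq_state() {
    file="${2:-/sys}/block/$1/device/queue_depth"
    if [ ! -r "$file" ]; then
        echo "$1: no queue_depth file"
        return 1
    fi
    depth=$(cat "$file")
    if [ "$depth" -le 1 ]; then
        echo "$1: queue_depth=$depth (NCQ off)"
    else
        echo "$1: queue_depth=$depth (NCQ on)"
    fi
}

# Example (not run here): ncq_state sdc
```

Running it again after a reboot shows whether the setting was lost, which is the behaviour described above.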

However, it is possible to mitigate the problems (except on kernel 5.19.0-41) using the tools I described earlier; for me they were a success (except on the buggy kernel). Use Ananicy or cpulimit. Tuned can be problematic, but the tuned.service file in

  /etc/systemd/system/multi-user.target.wants/tuned.service

has a time delay of 20 seconds by default in the section

 [Timer]
 # start this 20 Sec after boot:
 OnBootSec=20

which is enough for a safe boot even on slow Linux systems. For heavier or heavily customized systems, you can change the parameter OnBootSec=20 (seconds) to 25 or even 30, to ensure Tuned doesn't start before Ananicy, and that should fix it. Ananicy works better than cpulimit, but be warned: the Chrome browser still produces high CPU usage spikes. They're brief, but the core hits 100% before Ananicy cuts it off, and the next spike then goes to the next core. Chrome doesn't seem good at distributing tasks across multiple cores; one of the cores will always spike.

But for kernel 5.19.0-41 these "fixes" are useless, and I'll explain why: with it, there will always be problems with any bad blocks on the disks, even after running fsck and repairing the disk. The kernel simply ignores fsck and the block corrections, and insists on reading and rereading any bad blocks, over and over, until the input/output scheduler crashes and takes the entire system down. Only new, perfect disks will work with it, a statistical impossibility. Hence the insistent, repeated errors I found. Something like:

 Apr 30 01:00:28 Valentine kernel: [ 10.669621] ata4.00: exception Emask 0x50 SAct 0x78c06003 SErr 0x4090800 action 0xe frozen
 Apr 30 01:00:28 Valentine kernel: [ 10.669626] ata4.00: irq_stat 0x00400040, connection status changed
 Apr 30 01:00:28 Valentine kernel: [ 10.669628] ata4: SError: { HostInt PHYRdyChg 10B8B DevExch }
 Apr 30 01:00:28 Valentine kernel: [ 10.669631] ata4.00: failed command: READ FPDMA QUEUED
 Apr 30 01:00:28 Valentine kernel: [ 10.669633] ata4.00: cmd 60/10:00:60:08:00/00:00:0c:00:00/40 tag 0 ncq dma 8192 in
 Apr 30 01:00:28 Valentine kernel: [ 10.669633] res 40/00:d8:08:08:80/00:00:0d:00:00/40 Emask 0x50 (ATA bus error)
 Apr 30 01:00:28 Valentine kernel: [ 10.669638] ata4.00: status: { DRDY }
 Apr 30 01:00:28 Valentine kernel: [ 10.669640] ata4.00: failed command: READ FPDMA QUEUED
 Apr 30 01:00:28 Valentine kernel: [ 10.669641] ata4.00: cmd 60/30:08:00:08:00/00:00:0c:00:00/40 tag 1 ncq dma 24576 in
 Apr 30 01:00:28 Valentine kernel: [ 10.669641] res 40/00:d8:08:08:80/00:00:0d:00:00/40 Emask 0x50 (ATA bus error)
 Apr 30 01:00:28 Valentine kernel: [ 10.669646] ata4.00: status: { DRDY }
 Apr 30 01:00:28 Valentine kernel: [ 10.669648] ata4.00: failed command: READ FPDMA QUEUED
 Apr 30 01:00:28 Valentine kernel: [ 10.669649] ata4.00: cmd 60/20:68:50:08:40/00:00:0d:00:00/40 tag 13 ncq dma 16384 in
 Apr 30 01:00:28 Valentine kernel: [ 10.669649] res 40/00:d8:08:08:80/00:00:0d:00:00/40 Emask 0x50 (ATA bus error)
 Apr 30 01:00:28 Valentine kernel: [ 10.669653] ata4.00: status: { DRDY }
 Apr 30 01:00:28 Valentine kernel: [ 10.669655] ata4.00: failed command: READ FPDMA QUEUED
 Apr 30 01:00:28 Valentine kernel: [ 10.669656] ata4.00: cmd 60/28:70:00:08:40/00:00:0d:00:00/40 tag 14 ncq dma 20480 in
 Apr 30 01:00:28 Valentine kernel: [ 10.669656] res 40/00:d8:08:08:80/00:00:0d:00:00/40 Emask 0x50 (ATA bus error)
 Apr 30 01:00:28 Valentine kernel: [ 10.669660] ata4.00: status: { DRDY }
 Apr 30 01:00:28 Valentine kernel: [ 10.669662] ata4.00: failed command: READ FPDMA QUEUED
 Apr 30 01:00:28 Valentine kernel: [ 10.669663] ata4.00: cmd 60/10:b0:48:08:80/00:00:0c:00:00/40 tag 22 ncq dma 8192 in
 Apr 30 01:00:28 Valentine kernel: [ 10.669663] res 40/00:d8:08:08:80/00:00:0d:00:00/40 Emask 0x50 (ATA bus error)
 Apr 30 01:00:28 Valentine kernel: [ 10.669667] ata4.00: status: { DRDY }
 Apr 30 01:00:28 Valentine kernel: [ 10.669669] ata4.00: failed command: READ FPDMA QUEUED
 Apr 30 01:00:28 Valentine kernel: [ 10.669670] ata4.00: cmd 60/08:b8:00:08:80/00:00:0c:00:00/40 tag 23 ncq dma 4096 in
 Apr 30 01:00:28 Valentine kernel: [ 10.669670] res 40/00:d8:08:08:80/00:00:0d:00:00/40 Emask 0x50 (ATA bus error)
 Apr 30 01:00:28 Valentine kernel: [ 10.669674] ata4.00: status: { DRDY }
 Apr 30 01:00:28 Valentine kernel: [ 10.669676] ata4.00: failed command: READ FPDMA QUEUED
 Apr 30 01:00:28 Valentine kernel: [ 10.669677] ata4.00: cmd 60/08:d8:08:08:80/00:00:0d:00:00/40 tag 27 ncq dma 4096 in
 Apr 30 01:00:28 Valentine kernel: [ 10.669677] res 40/00:d8:08:08:80/00:00:0d:00:00/40 Emask 0x50 (ATA bus error)
 Apr 30 01:00:28 Valentine kernel: [ 10.669681] ata4.00: status: { DRDY }
 Apr 30 01:00:28 Valentine kernel: [ 10.669682] ata4.00: failed command: READ FPDMA QUEUED
 Apr 30 01:00:28 Valentine kernel: [ 10.669684] ata4.00: cmd 60/08:e0:48:08:40/00:00:0c:00:00/40 tag 28 ncq dma 4096 in
 Apr 30 01:00:28 Valentine kernel: [ 10.669684] res 40/00:d8:08:08:80/00:00:0d:00:00/40 Emask 0x50 (ATA bus error)
 Apr 30 01:00:28 Valentine kernel: [ 10.669688] ata4.00: status: { DRDY }
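To see at a glance which disk is producing errors like these, the "failed command" lines can be counted per ATA device. This is a rough sketch: the function name count_ata_failures is hypothetical, and it assumes one log entry per line, as in /var/log/syslog.

```shell
# Hypothetical helper: count "failed command" lines per ATA device in a kernel log.
# Usage: count_ata_failures LOGFILE
count_ata_failures() {
    awk '/failed command:/ {
        # Find the field that names the device, e.g. "ata4.00:"
        for (i = 1; i <= NF; i++)
            if ($i ~ /^ata[0-9]+(\.[0-9]+)?:$/) {
                dev = $i
                sub(/:$/, "", dev)
                count[dev]++
                break
            }
    }
    END { for (dev in count) print count[dev], dev }' "$1" | sort -rn
}

# Example (not run here): count_ata_failures /var/log/syslog
```

A device that dominates the list is the one to check first for bad blocks or cabling problems.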

I had to change the kernel in use to 5.19.0-35, but others also work fine, whether mainline or longterm (like 5.15.x). I tried, unsuccessfully, uninstalling 5.19.0-41 and then reinstalling it, but it's no use, and the 5.19.0 series has important fixes for wifi drivers that many users can't do without, myself included. The way out was to go to mainline and install kernel 5.19.2-051902-generic, although I don't like the idea, as I said before. There is always the risk that a kernel coming from the Ubuntu repository won't work well in Mint, as I already explained. But that's only a possibility; with no other choice it doesn't hurt to try, and the only risk is some loss of efficiency, more or less at random. It was the solution available, and it has worked fine so far. Well, that's it. Hope it's helpful.

Gaturinho commented 1 year ago

I just had a thought! I could have included the ignore-errors parameter in fstab for the disk with bad blocks, and that would probably have worked with kernel 5.19.0-41. But I have already uninstalled 5.19.0-41 twice, and I don't intend to reinstall it just to test this, simply because it's not something you can test on a virtual machine; I had to test everything in practice on a production machine. But I thought it was worth mentioning.

Gaturinho commented 1 year ago

It's nasty, very nasty, REALLY nasty! There always seems to be something else to deal with (as if there wasn't enough already!). I had no more boot problems for 3 days, but then I had a cinnamon-session crash after a 6-hour session, and then the reboot failed. The interesting thing is that I finally saw the cause of the session-breaking problem I reported in my first post, where my session broke but nothing appeared in the syslog or in .xsession-errors. It's amazing that I didn't notice it earlier, given how similar the screen looks. This time I was lucky enough to be looking at the screen at the moment the warning flashed and, most amazing of all, at the exact spot on the screen where the notification appeared! It really was luck because, again, nothing appeared in the syslog, but I managed to catch the warning before the interface went black: "Fallback Mode"! Cinnamon suddenly restarted (when it restarts, you barely notice it) and came back in Fallback Mode. That was the "session break"! I don't know why this happens when there is I/O scheduler overload on startup, but it was also happening, for no visible reason, in the middle of sessions. The good thing is that there is "medicine" for this. To avoid fallback mode, just add a parameter to GRUB: "noapic", right after "quiet splash". Here it is:

GRUB_CMDLINE_LINUX_DEFAULT=" quiet splash noapic "

(It works differently if you put it first, but I don't know what the difference is. I found a reference to it on a forum, but there was no description of what happens.) So my exotic and baroque GRUB line looks like this:

GRUB_CMDLINE_LINUX_DEFAULT="plymouth:debug quiet splash zswap.enabled=1 zswap.max_pool_percent=20 zswap.compressor=lz4 quiet noapic vga=current fsck.repair=yes"

Note that in my case "quiet" appears twice. That's because otherwise Plymouth isn't visible on startup (I have the Abstract-Ring-Alt and Brabuntu animated themes on startup, and they're the only ones that still work in Mint 21 and 21.1). I saw this on a Plymouth forum many years ago: it has to be "plymouth:debug splash quiet", or the Plymouth theme doesn't show up. Note that the usual "quiet splash" is the opposite order, "splash quiet", so, in order not to conflict with other parameters, which have to come after "quiet splash", it ended up being "quiet splash XXX XXX quiet", where XXX are parameters like zswap and zram, which have to go after it, while others like "noapic", "vga=current", and "fsck.repair=yes" have to go after quiet, but not after zram or zswap, who knows why... But the combination works; if I change it, problems appear (zswap or the Plymouth animation stops working). What matters is that both boot and mid-session are stable now.
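With this many parameters on one line, it is easy to lose track of what the running kernel actually received. A minimal sketch for checking: the function name has_kernel_param is mine, and the optional second argument exists only so the check can be tried on a saved copy of the command line.

```shell
# Hypothetical helper: succeed if PARAM appears as a whole word on the kernel command line.
# Usage: has_kernel_param PARAM [CMDLINE_FILE]   (defaults to /proc/cmdline)
has_kernel_param() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -Fxq -- "$1"
}

# Example (not run here). On Debian-based systems such as Mint, edits to
# /etc/default/grub only reach the boot menu after "sudo update-grub" and a reboot.
# has_kernel_param noapic && echo "noapic is active"
```

Matching whole words matters here, because "noapic" and "noacpi" are different parameters that are easy to confuse.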

References in:

https://stackoverflow.com/questions/53001737/what-do-boot-option-noapic-and-noacpi-actually-do

https://forums.linuxmint.com/viewtopic.php?t=33760

And, while I'm at it: I had no problems with the USB ports. Even kernel 5.15.0-41 is working again!

Nightwing0815 commented 1 year ago

The parameter noapic doesn't work for me; same procedure at every startup and login: fallback mode... The only thing that works for me is a timeshift restore from 7 days ago; then cinnamon works, as long as I don't restart again after the restart that the timeshift restore requires. Just annoying, but I think it'll go as fast as it came...

ToM

Gaturinho commented 1 year ago

Tom, you use a laptop, so... On a laptop it's to be expected that "noapic" doesn't work, and it may even turn off some laptop-specific features. In older versions of Mint (the R series: Rafaela, Rebecca, and Rosa; Rosa was the best Mint ever) we used the "nomodeset" parameter instead of "noapic". In current versions it might work, but you could go straight to software rendering, that is, video rendering is no longer done by hardware but by software, which ends up requiring an exceptional amount of memory (and CPU) that a laptop certainly DOESN'T have to spare, and the laptop will get very hot too. ("nomodeset" tells the kernel not to load the video driver and to wait for the system to take over, which is only useful in very specific situations, such as a video driver that is missing or defective in the kernel but may be provided by some secondary system. And anyway, "nomodeset" would hardly work on a laptop.)

Fallback mode was one of Linux's worst inventions. Fortunately it will be phased out in the next version of Gnome (3.8) and Mint (which is now based on Gnome 3); it has already been officially announced. Fallback mode was important at the time, but then it became a problem in itself, because of stupid design decisions that the developers insisted on keeping: it is selected automatically, that is, it kicks in on its own when the machine runs out of resources, mostly when memory is low. They say you can choose whether or not to use fallback mode on the logon screen, but to this day I have only heard of one guy who solved it that way. The term we use in my country for things like fallback mode, which are more of a hindrance than a help, is "estrovenga". It means a stumbling stone. But at least your problem has turned into fallback mode. Having a name for your problem, instead of just calling it a "crash", is already a great advance.

Let's see: we would need to know how much memory you have available and what your swappiness is. But it is better to try the parameter "acpi=noirq" instead of "noapic". It could be that the laptop has some default power-saving settings turned on without you knowing it, and "acpi=noirq" will turn them off, so give it a try. Laptop manufacturers tend to include "hidden" and "mandatory" power-saving settings because they fear rapid battery degradation, which can damage the product's reputation and favor a competitor. Heating problems are more of the same. Even on a desktop, the kernel sometimes misidentifies the hardware and thinks it's on a laptop. It happened to mine, which I think is a stupid bug, since the system has every means of knowing whether it's running on a desktop motherboard or a laptop one.

To know the swappiness, try the command

  cat /proc/sys/vm/swappiness

If it's too low and there isn't enough swap, that's enough to send the system into fallback mode. Most users don't even suspect this; it's the kind of thing you learn in practice, over the years.

You can test swappiness changes with the command (change "10" to the value you want):

 sudo sysctl vm.swappiness=10

but the effect is only temporary. You gain speed and agility, but if you don't have a lot of memory available to compensate, it will crash. Trying several different swappiness values is the best way to check that you have at least a decent amount of memory. The default swappiness is 60 (which can make the machine super slow), and the lowest is zero. For a modern laptop, start with 50, then 40 and finally 30. Swappiness 20 is only for a high-end laptop; otherwise it's not even worth testing. Since this is a temporary change that disappears after restarting, it only lets you test how much speed and agility you lose. But to know whether it takes you out of fallback mode you would need to restart, so to make the change permanent and be able to test whether you have left fallback mode, you have to manually edit the sysctl.conf configuration file:

 sudo xed /etc/sysctl.conf

and add the lines at the end of the file:

# Reduce SWAP usage
vm.swappiness=10
# Improve cache management
vm.vfs_cache_pressure=50
# Prevent buffer overflow and improve file transfer
vm.dirty_background_ratio=20
vm.dirty_ratio=50

and change the value in "vm.swappiness=10" to whatever value you want, save, and run

   sudo update-initramfs -u

so that it takes effect from the beginning of the boot (and not only after the shell loads), then restart the machine. (Note: the items "# Improve cache management" and "# Prevent buffer overflow and improve file transfer" are a bonus.)
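After editing the file, it can be reassuring to confirm which value it really contains, since a typo there fails silently. A small sketch, with a made-up helper name (sysctl_conf_value); it reads a sysctl-style file and prints the last value assigned to a key, ignoring comment lines and spaces around "=".

```shell
# Hypothetical helper: print the last value assigned to KEY in a sysctl-style file.
# Usage: sysctl_conf_value KEY [FILE]   (FILE defaults to /etc/sysctl.conf)
sysctl_conf_value() {
    awk -F= -v key="$1" '
        /^[ \t]*[#;]/ { next }      # skip comment lines
        {
            k = $1; v = $2
            gsub(/[ \t]/, "", k)    # tolerate spaces around the key and value
            gsub(/[ \t]/, "", v)
            if (k == key) last = v
        }
        END { if (last != "") print last }' "${2:-/etc/sysctl.conf}"
}

# Example (not run here): compare the file with the live value.
# sysctl_conf_value vm.swappiness; cat /proc/sys/vm/swappiness
# "sudo sysctl -p" re-reads /etc/sysctl.conf immediately, without a reboot.
```

If the two numbers differ after a reboot, the file is either not being read or is being overridden somewhere else.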

It would also be good to check the memory itself. Sometimes the machine seems to have plenty, but a badly seated or defective memory stick can betray you. By installing cpu-x (sudo apt install cpu-x) you can see all of the machine's hardware, or simply use the "lshw" command: right in the first items of its output you can see how much memory the system has access to, which can be quite different from how much memory is physically installed if there are problems in the machine. On the other hand,

  free -h

will show all the memory you actually have available and how it's being used, and

 glxinfo | grep -i memory

will show your video memory more directly. As a last resort, you can try "acpi=off" instead of "acpi=noirq".
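For the memory check itself, the same numbers that free -h shows can be pulled from /proc/meminfo and compared with what is physically installed. A rough sketch; the function name meminfo_mib is hypothetical, and the optional file argument is only there so it can be tried on a saved copy.

```shell
# Hypothetical helper: print a field from /proc/meminfo converted from kB to MiB.
# Usage: meminfo_mib FIELD [FILE]   (FILE defaults to /proc/meminfo)
meminfo_mib() {
    awk -v field="$1:" '$1 == field { printf "%d\n", $2 / 1024 }' "${2:-/proc/meminfo}"
}

# Example (not run here):
# echo "total: $(meminfo_mib MemTotal) MiB, available: $(meminfo_mib MemAvailable) MiB"
```

If MemTotal is far below the size of the installed sticks, that points to the seating or defect problem described above.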

Another thing that works to get rid of fallback mode is to increase the video card's cache memory, a feature that not all machines have because it's a BIOS feature. It's worth trying, but I don't know the ideal values. I would try doubling it, or run 2 tests at 150% and 180% of the default value and pick the first one that works. But this will inevitably reduce the amount of memory available to the system, slowing it down. I think this last possibility (a small video cache) is unlikely on a laptop, but keep in mind that the video drivers for Linux are not as efficient in memory usage as those of "Ruinwindows". There is also a question of collusion with the manufacturers there, and Microsoft is a living legend of unfair competition. (But it's fair to say that Windows driver management is top notch, and it's their big trade secret. According to a hacker friend of mine, a veteran auditor who has devoted part of his life to hacking Windows, Windows driver management is very refined. It doesn't run drivers "dry", right on the stack, like Linux. No. It uses something like an ODBC library to configure drivers before running them. The autorun files are for that: calls to configure the driver library before it hands the driver over to run.)

Soapux commented 1 year ago

So, I believe the crash I was seeing was in mozjs-78, and I noticed cjs master now uses a new mozjs. I updated to cjs master and as expected it didn't crash. My guess is mozjs-78 probably didn't like the newest glibc. So I'm going to close this.