Open antheas opened 2 months ago
Can you get a stack trace here using e.g. env G_DEBUG=fatal-warnings rpm-ostree usroverlay?
Hm, it does not seem to generate one:
bazzite@bazzite:~$ sudo G_DEBUG=fatal-warnings rpm-ostree usroverlay --hotfix
(ostree admin unlock:7580): GLib-CRITICAL **: 00:40:26.705: g_atomic_ref_count_dec: assertion 'old_value > 0' failed
Trace/breakpoint trap
bazzite@bazzite:~$ sudo coredumpctl dump /usr/bin/ostree
PID: 7580 (ostree)
UID: 0 (root)
GID: 0 (root)
Signal: 5 (TRAP)
Timestamp: Sun 2024-09-15 00:40:26 EEST (3min 6s ago)
Command Line: ostree admin unlock --hotfix
Executable: /usr/bin/ostree
Control Group: /user.slice/user-1000.slice/session-4.scope
Unit: session-4.scope
Slice: user-1000.slice
Session: 4
Owner UID: 1000 (bazzite)
Boot ID: 133c87fa3e2649419c6c4e4dbaccc514
Machine ID: 26b347316e3f40818770598196fe5a8f
Hostname: bazzite
Storage: none
Message: Process 7580 (ostree) of user 0 terminated abnormally without generating a coredump.
Coredump entry has no core attached (neither internally in the journal nor externally on disk).
Is this with that alternative hardened malloc? If so, the fact that we're seeing SIGTRAP here makes me think it's somehow hooking things earlier than the g_critical(). Or hmm, is it changing what abort() does?
No, this is the stock version on bazzite. There is a chance we are patching rpm-ostree with the deployment error fix that is in the next release, but I don't think that would affect this.
Malloc should be the same as the one in kinoite.
The trap only appeared after adding the environment variable.
These errors happen only on usroverlay; normal updates are fine.
(gdb) bt 15
#0 0x00007fb49c72ec28 in g_logv () from /lib64/libglib-2.0.so.0
#1 0x00007fb49c72eea3 in g_log () from /lib64/libglib-2.0.so.0
#2 0x00007fb49c73f5da in g_atomic_ref_count_dec () from /lib64/libglib-2.0.so.0
#3 0x00007fb49c774853 in g_variant_unref () from /lib64/libglib-2.0.so.0
#4 0x00007fb49ca39db6 in checkout_tree_at_recurse () from /lib64/libostree-1.so.1
#5 0x00007fb49ca3aa10 in checkout_tree_at_recurse () from /lib64/libostree-1.so.1
#6 0x00007fb49ca3b3e1 in checkout_tree_at () from /lib64/libostree-1.so.1
#7 0x00007fb49ca3b70f in ostree_repo_checkout_at () from /lib64/libostree-1.so.1
#8 0x00007fb49cab1840 in prepare_deployment_etc.isra () from /lib64/libostree-1.so.1
#9 0x00007fb49caa97b7 in sysroot_initialize_deployment.constprop ()
from /lib64/libostree-1.so.1
#10 0x00007fb49ca75465 in ostree_sysroot_deploy_tree_with_options ()
from /lib64/libostree-1.so.1
#11 0x00007fb49ca75532 in ostree_sysroot_deploy_tree () from /lib64/libostree-1.so.1
#12 0x00007fb49ca6c249 in ostree_sysroot_deployment_unlock ()
from /lib64/libostree-1.so.1
#13 0x0000564770576bd2 in ot_admin_builtin_unlock (argc=<optimized out>,
argv=<optimized out>, invocation=<optimized out>, cancellable=0x0,
error=0x7ffd1e77f9d0) at src/ostree/ot-admin-builtin-unlock.c:73
#14 0x00005647705683c8 in ostree_builtin_admin (argc=<optimized out>,
argv=<optimized out>, invocation=0x7ffd1e77f9d8, cancellable=0x0,
error=0x7ffd1e77f9d0) at src/ostree/ot-builtin-admin.c:178
(More stack frames follow...)
Ran gdb on top. Debuginfo does not work, so I cannot see the symbols.
Figured it out. It happens while checking out /etc. The backtrace has 2 checkout_tree_at_recurse frames.
They have these args:
(gdb) info args
self = 0x55f9912ccb20
options = 0x7ffd811fab10
state = 0x7ffd811faa80
destination_parent_fd = 12
destination_name = <optimized out>
dirtree_checksum = 0x7ffd811fa7e0 "3ed2a6f136bf47dbc493a1eb7cacb97ec5b9805d6192abc0d97e0eb65a5ed5cb"
dirmeta_checksum = <optimized out>
cancellable = <optimized out>
error = <optimized out>
(gdb) info args
self = 0x556f67fa3b20
options = 0x7ffd77afeb90
state = 0x7ffd77afeb00
destination_parent_fd = 13
destination_name = 0x7f95af8f5cde "etc"
dirtree_checksum = 0x556f681a29b0 "996d76ae160ea0e6512f4af45cbe76ac85c9594d99547b03356a1826a6b1853d"
dirmeta_checksum = 0x556f681a2a00 "c1fa8905c02e0199c4f6f215923914173d7030e336c0922fb2f0d800bf7a9b40"
cancellable = 0x0
error = 0x7ffd77aff3c0
ostree ls...
d00755 0 0 0 3ed2a6f136bf47dbc493a1eb7cacb97ec5b9805d6192abc0d97e0eb65a5ed5cb ec90a49ea284b4c39846e1f440091f539709d01dc7986e6f61a090afb2a8c6ac /usr/etc/fonts
cannot find 996d76ae160ea0e6512f4af45cbe76ac85c9594d99547b03356a1826a6b1853d
EDIT: Seems like 996d76ae160ea0e6512f4af45cbe76ac85c9594d99547b03356a1826a6b1853d is common to all of them. Could the ref count error be caused by this object being missing?
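To compare checksums across several captured frames without doing it by eye, something like the following could help. It is only a sketch: the helper names (parse_info_args, checksums, common_checksums) are made up for illustration, and it assumes the gdb "info args" output has been pasted into strings in the same "name = value" shape as shown above.

```python
import re

# "name = value" lines as printed by gdb's `info args`
ARG_LINE = re.compile(r'^(\w+) = (.+)$')
# a quoted 64-character hex string, i.e. an ostree object checksum
CHECKSUM = re.compile(r'"([0-9a-f]{64})"')


def parse_info_args(text):
    """Turn one `info args` dump into a {name: raw value} dict."""
    args = {}
    for line in text.splitlines():
        match = ARG_LINE.match(line.strip())
        if match:
            args[match.group(1)] = match.group(2)
    return args


def checksums(args):
    """Keep only the arguments whose value contains a checksum string."""
    found = {}
    for name, value in args.items():
        match = CHECKSUM.search(value)
        if match:
            found[name] = match.group(1)
    return found


def common_checksums(dumps):
    """Return the checksums that appear in every dump."""
    sets = [set(checksums(parse_info_args(d)).values()) for d in dumps]
    return set.intersection(*sets) if sets else set()
```

Feeding it one dump per failing run (rather than two frames of the same run) would show whether a checksum such as 996d76ae... really is shared by all of them.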
Unfortunately there's a ton of g_variant_unref() invocations there and a whole lot of code, so it'd be really helpful to narrow this down further.
@antheas any chance you can run with a build of ostree with debuginfo? Even better, build with CFLAGS="-ggdb -O0".
I tried to do that. But unfortunately compiling ostree on a system without dev packages is not particularly easy...
I have a command you can use to reproduce it though:
> sudo bootc switch ghcr.io/ublue-os/bazzite:unstable-41.20241027
layers already present: 58; layers needed: 14 (1.0 GB)
Fetched layers: 974.69 MiB in 2 minutes (6.86 MiB/s)
(process:9057): GLib-CRITICAL **: 21:46:13.583: g_atomic_ref_count_dec: assertion 'old_value > 0' failed
(process:9057): GLib-CRITICAL **: 21:46:13.945: g_atomic_ref_count_dec: assertion 'old_value > 0' failed
(process:9057): GLib-CRITICAL **: 21:46:13.947: g_atomic_ref_count_dec: assertion 'old_value > 0' failed
(process:9057): GLib-CRITICAL **: 21:46:13.948: g_atomic_ref_count_dec: assertion 'old_value > 0' failed
(process:9057): GLib-CRITICAL **: 21:46:13.952: g_atomic_ref_count_dec: assertion 'old_value > 0' failed
(process:9057): GLib-CRITICAL **: 21:46:13.967: g_atomic_ref_count_dec: assertion 'old_value > 0' failed
(process:9057): GLib-CRITICAL **: 21:46:13.969: g_atomic_ref_count_dec: assertion 'old_value > 0' failed
Pruned images: 0 (layers: 0, objsize: 1.3 GB)
Queued for next boot: ghcr.io/ublue-os/bazzite:unstable-41.20241027
Version: unstable-41.20241027
Digest: sha256:7260cc16cff9624393c751bf25749f7a6f5dc34ca8a56a82166fa95918626a31
I am pretty sure you don't need to boot that image; just pulling it should work.
Bootc looks a lot nicer to work with, really nice in F41.
Probably the cousin of issue https://github.com/secureblue/secureblue/issues/369
Running sudo rpm-ostree usroverlay --hotfix causes an assertion error to be emitted for invalid reference counting, but the command completes successfully. I think there is a chance the rpm-ostree version here is patched to fix the deployment bug that was found recently, but it is not changed other than that.
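For anyone unsure what the assertion is complaining about, here is a toy model of the pattern. This is an illustrative sketch only, not GLib's or ostree's code: RefCounted and FatalCritical are hypothetical names, and the fatal_criticals flag merely stands in for running with G_DEBUG=fatal-warnings. The idea is that an unref on an object whose count has already reached zero logs a critical and returns, which would explain why the command still completes unless criticals are made fatal.

```python
class FatalCritical(Exception):
    """Stands in for the process being killed when criticals are fatal."""


class RefCounted:
    """Toy reference-counted object (hypothetical, not a real GVariant)."""

    def __init__(self, fatal_criticals=False):
        self.count = 1        # a new object starts with one reference
        self.freed = False
        self.criticals = []   # messages that would be logged as CRITICAL
        self.fatal_criticals = fatal_criticals

    def ref(self):
        self.count += 1
        return self

    def unref(self):
        old_value = self.count
        self.count -= 1
        if old_value <= 0:
            # One unref too many: the count was already zero or below.
            msg = "assertion 'old_value > 0' failed"
            self.criticals.append(msg)
            if self.fatal_criticals:
                raise FatalCritical(msg)
            return False      # non-fatal: log the message and carry on
        if old_value == 1:
            self.freed = True  # the last reference was just dropped
        return old_value == 1


def demo():
    """Drop the only reference, then unref once more by mistake."""
    obj = RefCounted()
    obj.unref()
    obj.unref()
    return obj.criticals
```

Calling demo() returns a single critical message and the program keeps running. Constructing the object with fatal_criticals=True makes the second unref raise instead, loosely mirroring the trap seen once the environment variable was added.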