NicoHood opened this issue 7 years ago
'snapper debug' might give you more information about why the config is in use, e.g. some other client is connected to snapperd or some background comparison of snapshots is running.
A colleague once had this problem, and using gdb on snapperd we saw that one thread of snapperd was waiting for a btrfs ioctl (send/receive related). Unfortunately we didn't have the time to debug the ioctl further, but it looked like a kernel issue to us.
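For anyone who wants to check this on their own machine, the per-thread kernel wait channel can be read straight from /proc without attaching gdb. This is a generic sketch; the pgrep pattern for snapperd is the only snapper-specific assumption:

```shell
# Print each thread of a process together with the kernel function it is
# currently blocked in (its "wchan"); a thread stuck in a btrfs ioctl shows
# up here without needing a debugger.
thread_wchans() {
    for t in /proc/"$1"/task/*; do
        printf '%s %s\n' "${t##*/}" "$(cat "$t/wchan" 2>/dev/null)"
    done
}

# Usage against a running snapperd (as root):
#   thread_wchans "$(pgrep -x snapperd)"
```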
[arch@arch trunk]$ snapper delete 2588
Config is in use.
[arch@arch trunk]$ sudo snapper debug
clients:
name:':1.49'
name:':1.238', myself
backgrounds:
meta-snappers:
name:'root', loaded, use count 1
name:'home'
name:'hackallthethings'
compile options:
version 0.4.1
flags btrfs,lvm,ext4,xattrs,rollback,btrfs-quota,no-selinux
[arch@arch trunk]$ ps aux | grep snapper
root 1678 0.0 0.0 64004 3760 ? Ss Dec27 0:00 /usr/lib/snapper/systemd-helper --cleanup
root 1680 0.0 0.0 367536 14528 ? Sl Dec27 0:07 /usr/bin/snapperd
arch 23753 0.0 0.0 10756 2140 pts/10 S+ 15:39 0:00 grep snapper
I can use gdb; just tell me what to do and I am happy to debug the issue. It is on my production system (a laptop), so we still have to be somewhat careful.
Looks as if the systemd-helper is using the 'root' config and thus you cannot delete snapshots.
Check how long systemd-helper has been running and use e.g. strace, ltrace and gdb to see what it is doing.
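Something like the following does that against the PID from the ps output above (1678 here); the field names are standard procps ones:

```shell
# Show elapsed time and kernel wait channel for a PID. Read-only, so it is
# safe to run against a live daemon.
stuck_info() {
    ps -o pid=,etime=,wchan=,args= -p "$1"
}

# stuck_info 1678
#
# Going deeper (also non-destructive):
#   sudo strace -f -p 1678                              # live syscall trace
#   sudo gdb -p 1678 -batch -ex 'thread apply all bt'   # one-shot backtrace
```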
I will edit this post from time to time with more debug info
[arch@arch trunk]$ sudo systemctl status snapper-cleanup.service
[sudo] password for arch:
● snapper-cleanup.service - Daily Cleanup of Snapper Snapshots
Loaded: loaded (/usr/lib/systemd/system/snapper-cleanup.service; static; vendor preset: disabled)
Active: active (running) since Tue 2016-12-27 11:04:19 CET; 2 days ago
Docs: man:snapper(8)
man:snapper-configs(5)
Main PID: 1678 (systemd-helper)
Tasks: 1 (limit: 4915)
CGroup: /system.slice/snapper-cleanup.service
└─1678 /usr/lib/snapper/systemd-helper --cleanup
Dec 27 11:04:19 arch systemd[1]: Started Daily Cleanup of Snapper Snapshots.
[arch@arch trunk]$ sudo systemctl status snapper-timeline.service
● snapper-timeline.service - Timeline of Snapper Snapshots
Loaded: loaded (/usr/lib/systemd/system/snapper-timeline.service; static; vendor preset: disabled)
Active: inactive (dead) since Thu 2016-12-29 22:00:15 CET; 46min ago
Docs: man:snapper(8)
man:snapper-configs(5)
Process: 32552 ExecStart=/usr/lib/snapper/systemd-helper --timeline (code=exited, status=0/SUCCESS)
Main PID: 32552 (code=exited, status=0/SUCCESS)
Dec 29 22:00:14 arch systemd[1]: Started Timeline of Snapper Snapshots.
--> the cleanup algorithm seems to hang as you said.
I tried to recompile with debug options:
[arch@arch trunk]$ gdb /usr/lib/snapper/systemd-helper
GNU gdb (GDB) 7.12
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/lib/snapper/systemd-helper...done.
(gdb) r --cleanup
Starting program: /usr/lib/snapper/systemd-helper --cleanup
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
Config is in use.
[Inferior 1 (process 13419) exited with code 01]
(gdb) bt full
No stack.
(gdb)
--> it does not crash.
[arch@arch trunk]$ snapper delete 2609
Config is in use.
[arch@arch trunk]$ ps aux | grep snapper
root 1680 0.0 0.0 367536 15392 ? Sl Dec27 0:09 /usr/bin/snapperd
arch 13475 0.0 0.0 10756 2036 pts/10 S+ 23:01 0:00 grep snapper
--> snapperd is still running. I do not know how it gets started; it possibly blocks the deletion.
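(For the record, snapperd is not started by a unit file: clients like snapper and systemd-helper talk to it over the system D-Bus, and the bus starts snapperd on demand via an activation file for its well-known name, org.opensuse.Snapper. A small sketch to locate that file; the directory is the standard system-bus service location:)

```shell
# Find the D-Bus activation file whose Exec= line tells the bus how to
# launch a service on demand.
dbus_service_file() {
    name="$1"
    dir="${2:-/usr/share/dbus-1/system-services}"
    grep -l "Name=$name" "$dir"/*.service 2>/dev/null
}

# dbus_service_file org.opensuse.Snapper
# typically prints .../org.opensuse.Snapper.service, which launches
# /usr/bin/snapperd; snapperd exits again once it has been idle for a while.
```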
[arch@arch trunk]$ sudo strace -eopen /usr/lib/snapper/systemd-helper --cleanup
[sudo] password for arch:
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libsnapper.so.4", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libboost_system.so.1.62.0", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libdbus-1.so.3", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libstdc++.so.6", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libgcc_s.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libboost_thread.so.1.62.0", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libxml2.so.2", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libacl.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libz.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libmount.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libbtrfs.so.0", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libm.so.6", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/librt.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libsystemd.so.0", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/liblzma.so.5", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libattr.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libblkid.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libuuid.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libresolv.so.2", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libcap.so.2", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/liblz4.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libgcrypt.so.20", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libgpg-error.so.0", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
open("/etc/localtime", O_RDONLY|O_CLOEXEC) = 4
open("/usr/share/locale/locale.alias", O_RDONLY|O_CLOEXEC) = 4
open("/usr/share/locale/en_US.utf8/LC_MESSAGES/snapper.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en_US/LC_MESSAGES/snapper.mo", O_RDONLY) = 4
open("/usr/share/locale/en.utf8/LC_MESSAGES/snapper.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en/LC_MESSAGES/snapper.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
Config is in use.
+++ exited with 1 +++
It seems that something blocked the cleanup service from terminating. I did not experience this issue again, but I have already had it at least twice. If it happens again, I will try to debug it further.
I have had this issue many times now and I have no idea why it occurs. Any ideas on how to debug it?
$ snapper delete 3012
Config is in use.
$ ps aux | grep snapper
root 3570 0.0 0.0 63940 6360 ? Ss 10:39 0:00 /usr/lib/snapper/systemd-helper --cleanup
root 3577 0.1 0.1 302040 19960 ? Sl 10:39 0:03 /usr/bin/snapperd
arch 7289 0.0 0.0 10756 2212 pts/3 S+ 11:23 0:00 grep snapper
$ sudo systemctl status snapper-timeline.service
[sudo] password for arch:
● snapper-timeline.service - Timeline of Snapper Snapshots
Loaded: loaded (/usr/lib/systemd/system/snapper-timeline.service; static; vendor preset: disabled)
Active: inactive (dead) since Sat 2017-02-11 11:00:40 CET; 24min ago
Docs: man:snapper(8)
man:snapper-configs(5)
Process: 18028 ExecStart=/usr/lib/snapper/systemd-helper --timeline (code=exited, status=0/SUCCESS)
Main PID: 18028 (code=exited, status=0/SUCCESS)
Feb 11 11:00:40 arch systemd[1]: Started Timeline of Snapper Snapshots.
[arch@arch trunk]$ sudo systemctl status snapper-cleanup.service
● snapper-cleanup.service - Daily Cleanup of Snapper Snapshots
Loaded: loaded (/usr/lib/systemd/system/snapper-cleanup.service; static; vendor preset: disabled)
Active: active (running) since Sat 2017-02-11 10:39:01 CET; 46min ago
Docs: man:snapper(8)
man:snapper-configs(5)
Main PID: 3570 (systemd-helper)
Tasks: 1 (limit: 4915)
CGroup: /system.slice/snapper-cleanup.service
└─3570 /usr/lib/snapper/systemd-helper --cleanup
Feb 11 10:39:01 arch systemd[1]: Started Daily Cleanup of Snapper Snapshots.
[arch@arch trunk]$ sudo systemctl status snapper-boot.service
● snapper-boot.service - Take snapper snapshot of root on boot
Loaded: loaded (/etc/systemd/system/snapper-boot.service; static; vendor preset: disabled)
Active: inactive (dead) since Sat 2017-02-11 10:28:56 CET; 56min ago
Process: 613 ExecStart=/usr/bin/snapper -c root create -c number --description boot (code=exited, status=0/SUCCESS)
Main PID: 613 (code=exited, status=0/SUCCESS)
Feb 11 10:28:56 arch systemd[1]: Started Take snapper snapshot of root on boot.
I found out that the cleanup algorithm seems to hang. Is there any way to get further details of the process so we can see where it fails? Maybe my btrfs is somehow a bit broken?
# never ends
sudo /usr/lib/snapper/systemd-helper --cleanup
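Not a fix, but a damage-control sketch: wrapping the helper in coreutils timeout at least releases the config lock after a deadline instead of holding it indefinitely (10m is an arbitrary choice):

```shell
# Kill the cleanup helper if it has not finished within the deadline.
# Exit status 124 means the command was killed by the timeout.
#   sudo timeout 10m /usr/lib/snapper/systemd-helper --cleanup

# The mechanism, demonstrated on a stand-in for the hanging helper:
timeout 1 sleep 5
echo "exit: $?"
```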
Same issue here with quota enabled. Any ideas?
Still appears in 0.5
I've just noticed that the issue with 'sudo /usr/lib/snapper/systemd-helper --cleanup' not ending seems to disappear when I set EMPTY_PRE_POST_CLEANUP to 'no' in my root config.
@pkreussel I've tried that, but it did not help. I still got the issue, combined with snap-sync:
At snapshot snapshot
Deleting old snapshot for root...
Config is in use.
Exited due to error on line 406.
exit status: 1
command: snapper -c "$x" delete "$old_num"
bash line: 0
function name: main
My config:
$ snapper get-config
Key | Value
-----------------------+------
ALLOW_GROUPS | wheel
ALLOW_USERS |
BACKGROUND_COMPARISON | yes
EMPTY_PRE_POST_CLEANUP | no
EMPTY_PRE_POST_MIN_AGE | 1800
FSTYPE | btrfs
NUMBER_CLEANUP | yes
NUMBER_LIMIT | 10
NUMBER_LIMIT_IMPORTANT | 3
NUMBER_MIN_AGE | 1800
QGROUP |
SPACE_LIMIT | 0.5
SUBVOLUME | /
SYNC_ACL | no
TIMELINE_CLEANUP | yes
TIMELINE_CREATE | yes
TIMELINE_LIMIT_DAILY | 10
TIMELINE_LIMIT_HOURLY | 10
TIMELINE_LIMIT_MONTHLY | 0
TIMELINE_LIMIT_WEEKLY | 3
TIMELINE_LIMIT_YEARLY | 0
TIMELINE_MIN_AGE | 1800
'snapper cleanup empty-pre-post' sits there doing nothing and never exits for me with the root config. It works for the home config. 'number' and 'timeline' also work for root, except when systemd-helper is stuck, presumably doing empty-pre-post cleanup, in which case they both report "Config is in use". Interestingly, 'empty-pre-post' does not fail with "Config in use" even when the config is in use; it just sits there.
Setting EMPTY_PRE_POST_CLEANUP="no" does not fix systemd-helper for me; it still hangs, as does 'snapper cleanup empty-pre-post'.
It seems to be stuck in a poll() call:
# strace snapper cleanup empty-pre-post
[...]
poll([{fd=3, events=POLLIN}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\2\1\1\0\0\0\0c\2\0\0'\0\0\0\6\1s\0\6\0\0\0:1.318\0\0"..., iov_len=2048}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 56
recvmsg(3, {msg_namelen=0}, MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\1\0\1\24\0\0\0\r\0\0\0\211\0\0\0\1\1o\0\25\0\0\0/org/ope"..., iov_len=160}, {iov_base="\4\0\0\0root\0\0\0\0\357\0\0\0001\1\0\0", iov_len=20}], msg_iovlen=2, msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 180
poll([{fd=3, events=POLLIN}], 1, -1
After that it just sits there and doesn't even write the closing parenthesis.
I don't really know how to use gdb but I tried:
# gdb snapper
(gdb) run cleanup empty-pre-post
[wait for it to hang]
^C
Program received signal SIGINT, Interrupt.
0x00007ffff5a94884 in poll () from /lib64/libc.so.6
(gdb) bt
#0 0x00007ffff5a94884 in poll () from /lib64/libc.so.6
#1 0x00007ffff6ac1def in socket_do_iteration () from /lib64/libdbus-1.so.3
#2 0x00007ffff6ac0a9d in _dbus_transport_do_iteration () from /lib64/libdbus-1.so.3
#3 0x00007ffff6aa917c in _dbus_connection_do_iteration_unlocked () from /lib64/libdbus-1.so.3
#4 0x00007ffff6aa9b4d in _dbus_connection_block_pending_call () from /lib64/libdbus-1.so.3
#5 0x00007ffff6aaa11e in dbus_connection_send_with_reply_and_block () from /lib64/libdbus-1.so.3
#6 0x0000555555594d27 in DBus::Connection::send_with_reply_and_block (this=this@entry=0x5555557c7068, m=...) at DBusConnection.cc:101
#7 0x000055555557b3cb in command_create_comparison (conn=..., config_name="root", number1=239, number2=number2@entry=305) at commands.cc:357
#8 0x00005555555849b1 in ProxyComparisonDbus::ProxyComparisonDbus (this=0x5555557c83c0, backref=0x5555557c9ce0, lhs=..., rhs=..., mount=<optimized out>) at proxy-dbus.cc:305
#9 0x000055555558525f in ProxySnapperDbus::createComparison (this=0x5555557c9ce0, lhs=..., rhs=..., mount=<optimized out>) at proxy-dbus.cc:233
#10 0x000055555558178c in EmptyPrePostCleaner::calculate_candidates[abi:cxx11](ProxySnapshots&, Range::Value) (this=0x7fffffffd4b0, snapshots=..., value=<optimized out>) at cleanup.cc:644
#11 0x00005555555803a5 in Cleaner::cleanup_quota_unaware (this=this@entry=0x7fffffffd4b0, snapshots=...) at cleanup.cc:233
#12 0x00005555555806b1 in Cleaner::cleanup (this=this@entry=0x7fffffffd4b0) at cleanup.cc:286
#13 0x0000555555580fd1 in do_cleanup_empty_pre_post (snapper=0x5555557c9ce0, verbose=<optimized out>) at cleanup.cc:664
#14 0x000055555556d73a in command_cleanup (snappers=<optimized out>, snapper=0x5555557c9ce0) at snapper.cc:1352
#15 0x0000555555566e1a in main (argc=3, argv=0x7fffffffe238) at snapper.cc:1657
So it seems to be doing a comparison between snapshots 239 and 305, and indeed there is no filelist-239.txt, and running 'snapper status 239..305' also seems to get stuck in a poll().
...Or perhaps it's just taking a long time to find all the differences. There are two days of history between those two snapshots, which come from pam_snapper and a 'sudo -i' session. After 'snapper rm 239', 'cleanup empty-pre-post' completes almost instantly.
As an optimization, perhaps empty-pre-post could be made to simply look for any change, rather than having to find all of them, since it doesn't actually need to do that.
Also, why is it doing all this work even if this cleanup is disabled?
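A rough sketch of that "stop at the first change" idea, using btrfs itself instead of snapper's full file lists: 'btrfs subvolume find-new' prints files changed since a given generation, so a single line of output already proves a pair is non-empty. Using the pre snapshot's generation as the starting point, and the default /.snapshots layout, are assumptions here:

```shell
# Pull the "Generation:" field out of `btrfs subvolume show` output.
gen_of() {
    awk '/Generation:/ { print $2; exit }'
}

# Any single line of output means the 239..305 pair is not empty, so an
# empty-pre-post cleaner could skip it immediately:
#   gen=$(sudo btrfs subvolume show /.snapshots/239/snapshot | gen_of)
#   sudo btrfs subvolume find-new /.snapshots/305/snapshot "$gen" | head -n 1
```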
@dag We had bug reports (e.g. https://bugzilla.suse.com/show_bug.cgi?id=1049574) where apparently the btrfs send ioctl is stuck. That looks like a kernel problem but unfortunately was never debugged.
snapper is split into the server snapperd and the command line client snapper. If you want to analyse the problem you have to debug snapperd, e.g. with gdb. You could also try btrfs send directly to see whether that also blocks.
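A sketch of trying btrfs send directly for the suspected pair (the /.snapshots/N/snapshot paths assume the default snapper layout; 239 and 305 are the stuck pair from above):

```shell
# Build the incremental-send command for a snapper snapshot pair.
send_cmd() {
    printf 'btrfs send -p /.snapshots/%s/snapshot /.snapshots/%s/snapshot' \
        "$1" "$2"
}

# If the send ioctl is the problem, this should hang just like the
# background comparison does:
#   sudo $(send_cmd 239 305) > /dev/null
# If it hangs, the kernel stack of the blocked task is worth capturing:
#   sudo cat /proc/$(pgrep -f 'btrfs send')/stack
```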
Yeah, the poll is probably waiting on snapperd. But it's still clear from the above that it was stuck on computing 'status 239..305', a pre-post pair spanning two days, the only set spanning any significant time. I feel that's probably relevant. I've deleted them now, so I can't test further unless this issue comes back.
It happened again. Out of 22 pre-post snapshot pairs, only two lack the filelist-N.txt files that should be created because I have BACKGROUND_COMPARISON="yes". Those two are also the only pairs that span multiple days. Deleting them makes 'snapper cleanup empty-pre-post' complete normally.
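To find such pairs without checking by hand, a sketch that lists snapshot info directories containing no filelist-*.txt (the /.snapshots layout and the filelist location inside the info directory are assumptions based on the observation above; a missing filelist is also normal for snapshots that are not a comparison's post side, so treat the output as candidates, not proof):

```shell
# List snapshot directories under ROOT that contain no filelist-*.txt.
missing_filelists() {
    root="${1:-/.snapshots}"
    for d in "$root"/[0-9]*/; do
        [ -d "$d" ] || continue
        ls "$d"filelist-*.txt >/dev/null 2>&1 || printf '%s\n' "$d"
    done
}

# missing_filelists    # scan the default /.snapshots
```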
Do you want me to file new issues for these two suggestions?
EMPTY_PRE_POST_CLEANUP="no" doesn't work as a workaround for this bug, because systemd-helper still runs the equivalent of 'snapper cleanup empty-pre-post', and that still hangs on these problematic snapshots. That implies either that this config option doesn't do anything, or that snapper does all the work of computing which empty pre-post pairs to clean up even when it doesn't actually clean them up. It seems to me it would be better to skip all that work for cleanups that are disabled; then you could also work around this bug by disabling cleanup of empty pre-post pairs. Note that manually running 'snapper cleanup number|timeline' doesn't trigger this bug, but systemd-helper does, regardless of EMPTY_PRE_POST_CLEANUP.
...snapper status. But this isn't really necessary, because to determine whether a snapshot pair has no changes between them, all you need to do is find one change. Thus, you could optimize 'cleanup empty-pre-post' by bailing out on pairs where you find any change. I suspect that would also fix this bug in many cases: the bug happens for pre-post pairs that span many days, and finding a single change between them would probably be fast and not usually trigger whatever causes this bug, and then you can skip those pairs because they're clearly not empty.
@aschnell any updates?
I'm seeing this as well on Ubuntu bionic with snapper 0.5.4
Hi guys! I'm seeing this 'config is in use' issue on Debian Testing as well (snapper 0.8.9). Snapper isn't deleting any pre/post snapshots.
I have been experiencing this problem for years, currently on Ubuntu 18.04. I have lost count of the number of times I've had to reboot my systems to work around this problem and then spend the better part of a day cleaning up snapshots and re-balancing.
With 0.4.1 I cannot delete snapshots anymore. It always says "Config is in use":
FYI: The bug was found when using snap-sync: https://github.com/wesbarnett/snap-sync/issues/21 snapper log
I am using Arch Linux with the latest btrfs tools and kernel 4.8-13