openSUSE / snapper

Manage filesystem snapshots and allow undo of system modifications
http://snapper.io/
GNU General Public License v2.0
896 stars 126 forks source link

0.4.1 deletetion of snapshots impossible - config is in use #311

Open NicoHood opened 7 years ago

NicoHood commented 7 years ago

With 0.4.1 I cannot delete snapshots anymore. It always says config in use:

$ snapper delete 2583
Config is in use.

FYI: The bug was found when using snap-sync: https://github.com/wesbarnett/snap-sync/issues/21 snapper log

I am using archlinux with the latest btrfs tools and kernel 4.8-13

aschnell commented 7 years ago

'snapper debug' might give you more information why the config is in use, e.g. some other client is connected to snapperd or some background comparsion os snapshots is running.

Once a colleague had the problem and using gdb on snapperd we saw that one thread of snapperd was waiting for a btrfs ioctl (send/receive related). Unfortunately we didn't have the time to further debug the ioctl - but it looked like a kernel issue to us.

NicoHood commented 7 years ago
[arch@arch trunk]$ snapper delete 2588
Config is in use.
[arch@arch trunk]$ sudo snapper debug 
clients:
    name:':1.49'
    name:':1.238', myself
backgrounds:
meta-snappers:
    name:'root', loaded, use count 1
    name:'home'
    name:'hackallthethings'
compile options:
    version 0.4.1
    flags btrfs,lvm,ext4,xattrs,rollback,btrfs-quota,no-selinux
[arch@arch trunk]$ ps aux | grep snapper
root      1678  0.0  0.0  64004  3760 ?        Ss   Dec27   0:00 /usr/lib/snapper/systemd-helper --cleanup
root      1680  0.0  0.0 367536 14528 ?        Sl   Dec27   0:07 /usr/bin/snapperd
arch     23753  0.0  0.0  10756  2140 pts/10   S+   15:39   0:00 grep snapper

I can use gdb, just tell me what to do and I am happy to debug the issue. It is on my productive system (laptop) so we have to be somewhat careful still.

aschnell commented 7 years ago

Looks as if the systemd-helper is using the 'root' config and thus you cannot delete snapshots.

Check how long systemd-helper has been running and use e.g. strace, ltrace and gdb to see what it is doing.

NicoHood commented 7 years ago

I will edit this post from time to time with more debug info

[arch@arch trunk]$ sudo systemctl status snapper-cleanup.service 
[sudo] password for arch: 
● snapper-cleanup.service - Daily Cleanup of Snapper Snapshots
   Loaded: loaded (/usr/lib/systemd/system/snapper-cleanup.service; static; vendor preset: disabled)
   Active: active (running) since Tue 2016-12-27 11:04:19 CET; 2 days ago
     Docs: man:snapper(8)
           man:snapper-configs(5)
 Main PID: 1678 (systemd-helper)
    Tasks: 1 (limit: 4915)
   CGroup: /system.slice/snapper-cleanup.service
           └─1678 /usr/lib/snapper/systemd-helper --cleanup

Dec 27 11:04:19 arch systemd[1]: Started Daily Cleanup of Snapper Snapshots.
[arch@arch trunk]$ sudo systemctl status snapper-timeline.service 
● snapper-timeline.service - Timeline of Snapper Snapshots
   Loaded: loaded (/usr/lib/systemd/system/snapper-timeline.service; static; vendor preset: disabled)
   Active: inactive (dead) since Thu 2016-12-29 22:00:15 CET; 46min ago
     Docs: man:snapper(8)
           man:snapper-configs(5)
  Process: 32552 ExecStart=/usr/lib/snapper/systemd-helper --timeline (code=exited, status=0/SUCCESS)
 Main PID: 32552 (code=exited, status=0/SUCCESS)

Dec 29 22:00:14 arch systemd[1]: Started Timeline of Snapper Snapshots.

--> the cleanup algorithm seems to hang as you said.

I tried to recompile with debug options:

[arch@arch trunk]$ gdb /usr/lib/snapper/systemd-helper 
GNU gdb (GDB) 7.12
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/lib/snapper/systemd-helper...done.
(gdb) r --cleanup
Starting program: /usr/lib/snapper/systemd-helper --cleanup
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
Config is in use.
[Inferior 1 (process 13419) exited with code 01]
(gdb) bt full
No stack.
(gdb) 

--> it does not crash.

[arch@arch trunk]$ snapper delete 2609
Config is in use.
[arch@arch trunk]$ ps aux | grep snapper
root      1680  0.0  0.0 367536 15392 ?        Sl   Dec27   0:09 /usr/bin/snapperd
arch     13475  0.0  0.0  10756  2036 pts/10   S+   23:01   0:00 grep snapper

--> snapperd is still running. I do not know how this is started. It possibly blocks the deletion

[arch@arch trunk]$ sudo strace -eopen /usr/lib/snapper/systemd-helper --cleanup
[sudo] password for arch: 
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libsnapper.so.4", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libboost_system.so.1.62.0", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libdbus-1.so.3", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libstdc++.so.6", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libgcc_s.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libboost_thread.so.1.62.0", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libxml2.so.2", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libacl.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libz.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libmount.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libbtrfs.so.0", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libm.so.6", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/librt.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libsystemd.so.0", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/liblzma.so.5", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libattr.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libblkid.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libuuid.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libresolv.so.2", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libcap.so.2", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/liblz4.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libgcrypt.so.20", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libgpg-error.so.0", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
open("/etc/localtime", O_RDONLY|O_CLOEXEC) = 4
open("/usr/share/locale/locale.alias", O_RDONLY|O_CLOEXEC) = 4
open("/usr/share/locale/en_US.utf8/LC_MESSAGES/snapper.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en_US/LC_MESSAGES/snapper.mo", O_RDONLY) = 4
open("/usr/share/locale/en.utf8/LC_MESSAGES/snapper.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en/LC_MESSAGES/snapper.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
Config is in use.
+++ exited with 1 +++
NicoHood commented 7 years ago

It seems that something blocked the cleanup service from terminating. I did not experience this issue again, but In already had it at minimum 2 times. If it happens again, I will try to debug it further.

NicoHood commented 7 years ago

I had this issue lots of times now and I have no idea why it occurs. Any idea/help how to debug?

$ snapper delete 3012
Config is in use.
$ ps aux | grep snapper
root      3570  0.0  0.0  63940  6360 ?        Ss   10:39   0:00 /usr/lib/snapper/systemd-helper --cleanup
root      3577  0.1  0.1 302040 19960 ?        Sl   10:39   0:03 /usr/bin/snapperd
arch      7289  0.0  0.0  10756  2212 pts/3    S+   11:23   0:00 grep snapper
$ sudo systemctl status snapper-timeline.service 
[sudo] password for arch: 
● snapper-timeline.service - Timeline of Snapper Snapshots
   Loaded: loaded (/usr/lib/systemd/system/snapper-timeline.service; static; vendor preset: disabled)
   Active: inactive (dead) since Sat 2017-02-11 11:00:40 CET; 24min ago
     Docs: man:snapper(8)
           man:snapper-configs(5)
  Process: 18028 ExecStart=/usr/lib/snapper/systemd-helper --timeline (code=exited, status=0/SUCCESS)
 Main PID: 18028 (code=exited, status=0/SUCCESS)

Feb 11 11:00:40 arch systemd[1]: Started Timeline of Snapper Snapshots.
[arch@arch trunk]$ sudo systemctl status snapper-cleanup.service 
● snapper-cleanup.service - Daily Cleanup of Snapper Snapshots
   Loaded: loaded (/usr/lib/systemd/system/snapper-cleanup.service; static; vendor preset: disabled)
   Active: active (running) since Sat 2017-02-11 10:39:01 CET; 46min ago
     Docs: man:snapper(8)
           man:snapper-configs(5)
 Main PID: 3570 (systemd-helper)
    Tasks: 1 (limit: 4915)
   CGroup: /system.slice/snapper-cleanup.service
           └─3570 /usr/lib/snapper/systemd-helper --cleanup

Feb 11 10:39:01 arch systemd[1]: Started Daily Cleanup of Snapper Snapshots.
[arch@arch trunk]$ sudo systemctl status snapper-boot.service 
● snapper-boot.service - Take snapper snapshot of root on boot
   Loaded: loaded (/etc/systemd/system/snapper-boot.service; static; vendor preset: disabled)
   Active: inactive (dead) since Sat 2017-02-11 10:28:56 CET; 56min ago
  Process: 613 ExecStart=/usr/bin/snapper -c root create -c number --description boot (code=exited, status=0/SUCCESS)
 Main PID: 613 (code=exited, status=0/SUCCESS)

Feb 11 10:28:56 arch systemd[1]: Started Take snapper snapshot of root on boot.
NicoHood commented 7 years ago

I found out that the cleanup algorithm seems to hang. Is there any way to get further details of the process so we can see where it fails? Maybe my btrfs is somehow a bit broken?

# never ends
sudo /usr/lib/snapper/systemd-helper --cleanup
comio commented 7 years ago

same issue here with quota enabled. any idea?

NicoHood commented 7 years ago

Still appears in 0.5

pkreussel commented 7 years ago

I've just noticed that the issue with sudo /usr/lib/snapper/systemd-helper --cleanup not ending seems to disappear when I change when is set EMPTY_PRE_POST_CLEANUP to no in my root config.

NicoHood commented 7 years ago

@pkreussel I've tried that, but it did not help. I still got the issue, combined with snap-sync:

At snapshot snapshot
Deleting old snapshot for root...
Config is in use.
Exited due to error on line 406.
exit status: 1
command: snapper -c "$x" delete "$old_num"
bash line: 0
function name: main

My config:

$ snapper get-config
Key                    | Value
-----------------------+------
ALLOW_GROUPS           | wheel
ALLOW_USERS            |      
BACKGROUND_COMPARISON  | yes  
EMPTY_PRE_POST_CLEANUP | no   
EMPTY_PRE_POST_MIN_AGE | 1800 
FSTYPE                 | btrfs
NUMBER_CLEANUP         | yes  
NUMBER_LIMIT           | 10   
NUMBER_LIMIT_IMPORTANT | 3    
NUMBER_MIN_AGE         | 1800 
QGROUP                 |      
SPACE_LIMIT            | 0.5  
SUBVOLUME              | /    
SYNC_ACL               | no   
TIMELINE_CLEANUP       | yes  
TIMELINE_CREATE        | yes  
TIMELINE_LIMIT_DAILY   | 10   
TIMELINE_LIMIT_HOURLY  | 10   
TIMELINE_LIMIT_MONTHLY | 0    
TIMELINE_LIMIT_WEEKLY  | 3    
TIMELINE_LIMIT_YEARLY  | 0    
TIMELINE_MIN_AGE       | 1800 
dag commented 6 years ago

snapper cleanup empty-pre-post sits there doing nothing and never exits for me with the root config. It works for the home config. number and timeline also work for root, except when systemd-helper is stuck, presumably doing empty-pre-post cleanup, in which case they both report "Config is in use". Interestingly empty-pre-post does not fail with "Config in use" even when it is in use—it just sits there.

Setting EMPTY_PRE_POST_CLEANUP="no" does not fix systemd-helper for me—it still hangs, as does snapper cleanup empty-pre-post.

dag commented 6 years ago

It seems to be stuck in a poll() call:

# strace snapper cleanup empty-pre-post
[...]
poll([{fd=3, events=POLLIN}], 1, -1)    = 1 ([{fd=3, revents=POLLIN}])
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\2\1\1\0\0\0\0c\2\0\0'\0\0\0\6\1s\0\6\0\0\0:1.318\0\0"..., iov_len=2048}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 56
recvmsg(3, {msg_namelen=0}, MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\1\0\1\24\0\0\0\r\0\0\0\211\0\0\0\1\1o\0\25\0\0\0/org/ope"..., iov_len=160}, {iov_base="\4\0\0\0root\0\0\0\0\357\0\0\0001\1\0\0", iov_len=20}], msg_iovlen=2, msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 180
poll([{fd=3, events=POLLIN}], 1, -1

After that it just sits there and doesn't even write the closing parenthesis.

dag commented 6 years ago

I don't really know how to use gdb but I tried:

# gdb snapper
(gdb) run cleanup empty-pre-post
[wait for it to hang]
^C
Program received signal SIGINT, Interrupt.
0x00007ffff5a94884 in poll () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff5a94884 in poll () from /lib64/libc.so.6
#1  0x00007ffff6ac1def in socket_do_iteration () from /lib64/libdbus-1.so.3
#2  0x00007ffff6ac0a9d in _dbus_transport_do_iteration () from /lib64/libdbus-1.so.3
#3  0x00007ffff6aa917c in _dbus_connection_do_iteration_unlocked () from /lib64/libdbus-1.so.3
#4  0x00007ffff6aa9b4d in _dbus_connection_block_pending_call () from /lib64/libdbus-1.so.3
#5  0x00007ffff6aaa11e in dbus_connection_send_with_reply_and_block () from /lib64/libdbus-1.so.3
#6  0x0000555555594d27 in DBus::Connection::send_with_reply_and_block (this=this@entry=0x5555557c7068, m=...) at DBusConnection.cc:101
#7  0x000055555557b3cb in command_create_comparison (conn=..., config_name="root", number1=239, number2=number2@entry=305) at commands.cc:357
#8  0x00005555555849b1 in ProxyComparisonDbus::ProxyComparisonDbus (this=0x5555557c83c0, backref=0x5555557c9ce0, lhs=..., rhs=..., mount=<optimized out>) at proxy-dbus.cc:305
#9  0x000055555558525f in ProxySnapperDbus::createComparison (this=0x5555557c9ce0, lhs=..., rhs=..., mount=<optimized out>) at proxy-dbus.cc:233
#10 0x000055555558178c in EmptyPrePostCleaner::calculate_candidates[abi:cxx11](ProxySnapshots&, Range::Value) (this=0x7fffffffd4b0, snapshots=..., value=<optimized out>) at cleanup.cc:644
#11 0x00005555555803a5 in Cleaner::cleanup_quota_unaware (this=this@entry=0x7fffffffd4b0, snapshots=...) at cleanup.cc:233
#12 0x00005555555806b1 in Cleaner::cleanup (this=this@entry=0x7fffffffd4b0) at cleanup.cc:286
#13 0x0000555555580fd1 in do_cleanup_empty_pre_post (snapper=0x5555557c9ce0, verbose=<optimized out>) at cleanup.cc:664
#14 0x000055555556d73a in command_cleanup (snappers=<optimized out>, snapper=0x5555557c9ce0) at snapper.cc:1352
#15 0x0000555555566e1a in main (argc=3, argv=0x7fffffffe238) at snapper.cc:1657

So it seems to be doing a comparison between snapshot 239 and 305, and in deed there's no filelist-239.txt and running snapper status 239..305 also seems to get stuck in a poll().

...Or perhaps it's just taking a long time to find all the differences. There's two days of history between those two snapshots, which come from pam_snapper and a sudo -i session. After snapper rm 239, cleanup empty-pre-post completes almost instantly.

As an optimization, perhaps empty-pre-post could be made to simply look for any changes, rather than having to find all of them, since it doesn't actually need to do that.

dag commented 6 years ago

Also, why is it doing all this work even if this cleanup is disabled?

aschnell commented 6 years ago

@dag We had bug reports (e.g. https://bugzilla.suse.com/show_bug.cgi?id=1049574) where apparently the btrfs send ioctl is stuck. That looks like a kernel problem but unfortunately was never debugged.

snapper is split into the server snapperd and the command line client snapper. If you want to analyse the problem you have to debug, e.g. with gdb, snapperd. You could also try to use btrfs send to see if that is also blocking.

dag commented 6 years ago

Yeah the poll is probably waiting on snapperd. But it's still clear from the above that it was stuck on computing status 239..305, a pre-post spanning two days, the only set spanning any significant time. I feel that's probably relevant. I've deleted them now so I can't test further unless this issue comes back.

dag commented 6 years ago

It happened again. Out of 22 pre-post snapshot pairs, only two lack the filelist-N.txt files that should be created because I have BACKGROUND_COMPARISON="yes". Those two are also the only pairs that span multiple days. Deleting them makes snapper cleanup empty-pre-post complete normally.

Do you want me to file new issues for these two suggestions?

  1. Even though this bug only affects cleanup of empty-pre-post, EMPTY_PRE_POST_CLEANUP="no" doesn't work as a workaround for this bug, because systemd-helper still runs the equivalent snapper cleanup empty-pre-post, and that still hangs on these problematic snapshots. That implies either that this config option doesn't do anything, or possibly, snapper does all the work for computing which empty pre-post to cleanup even when it doesn't actually clean them up. It seems to me it would be better to skip all that work for cleanups that are disabled, and then you could also work around this bug by disabling cleanup of empty pre-post. Note that manually running snapper cleanup number|timeline doesn't trigger this bug, but systemd-helper does, regardless of EMPTY_PRE_POST_CLEANUP.
  2. To find empty pre-post pairs to clean up, snapper currently computes a list of all changes between the snapshots, because it's using the same code as that for snapper status. But this isn't really necessary, because to determine if a snapshot pair has no changes between them, all you need to do is find one change. Thus, you could optimize cleanup empty-pre-post by bailing on pairs that you find any changes between. I suspect that would also fix this bug in many cases, because the bug happens for pre-post pairs that span many days: finding a single change between them would probably be fast and not usually trigger whatever causes this bug, and then you can skip those pairs because they're clearly not empty.
AtosNicoS commented 6 years ago

@aschnell any updates?

shuhaowu commented 4 years ago

I'm seeing this as well on Ubuntu bionic with snapper 0.5.4

Borromini commented 4 years ago

Hi guys! I'm seeing this 'config is in use' issue on Debian Testing as well (snapper 0.8.9). Snapper isn't deleting any pre/post snapshots.

mcgrathd commented 4 years ago

I have been experiencing this problem for years. Currently in Ubuntu 18.04. I have lost count the number of times I've had to reboot my systems to work around this problem and spend the better part of the day cleaning up snapshots and re-balancing.