kdave / btrfs-progs

Development of userspace BTRFS tools
GNU General Public License v2.0
557 stars 242 forks source link

btrfs replace start failed with segmentation fault , and seems to hang my system #363

Open icebluey opened 3 years ago

icebluey commented 3 years ago

btrfs-progs: v5.11.1 linux kernel: 5.10.31

After segmentation fault, 'systemctl --force --force reboot' can't even reboot system.

Here's what I did:

# umount /mnt ; wipefs -a -f /dev/sd[b-g]
umount: /mnt: not mounted
/dev/sdc: 8 bytes were erased at offset 0x00010040 (btrfs): 5f 42 48 52 66 53 5f 4d
# mkfs.btrfs --checksum xxhash -d raid1 -m raid1 /dev/sd[bc]
btrfs-progs v5.11.1 
See https://btrfs.wiki.kernel.org for more information.

Label:              (null)
UUID:               1bba71d2-b89a-448f-9189-6354be3e06cc
Node size:          16384
Sector size:        4096
Filesystem size:    4.00GiB
Block group profiles:
  Data:             RAID1           204.75MiB
  Metadata:         RAID1           256.00MiB
  System:           RAID1             8.00MiB
SSD detected:       no
Incompat features:  extref, skinny-metadata
Runtime features:   
Checksum:           xxhash64
Number of devices:  2
Devices:
   ID        SIZE  PATH
    1     2.00GiB  /dev/sdb
    2     2.00GiB  /dev/sdc

# mount -v UUID=1bba71d2-b89a-448f-9189-6354be3e06cc /mnt
mount: /dev/sdc mounted on /mnt.

# dd status=progress if=/dev/urandom of=/mnt/delete.dd1 bs=1M count=257 iflag=fullblock oflag=dsync
257+0 records in
257+0 records out
269484032 bytes (269 MB) copied, 9.0711 s, 29.7 MB/s

# dd status=progress if=/dev/urandom of=/mnt/delete.dd2 bs=1M count=257 iflag=fullblock oflag=dsync
257+0 records in
257+0 records out
269484032 bytes (269 MB) copied, 10.3438 s, 26.1 MB/s

# dd status=progress if=/dev/urandom of=/mnt/delete.dd3 bs=1M count=257 iflag=fullblock oflag=dsync
257+0 records in
257+0 records out
269484032 bytes (269 MB) copied, 7.91096 s, 34.1 MB/s

# sleep 60 ; dd status=progress if=/dev/urandom of=/mnt/test.dd bs=1M iflag=fullblock oflag=dsync
dd: error writing ‘/mnt/test.dd’: No space left on device
1012+0 records in
1011+0 records out
1060765696 bytes (1.1 GB) copied, 30.3296 s, 35.0 MB/s

# sleep 60 ; dd status=progress if=/dev/urandom of=/mnt/test.dd1 bs=1K iflag=fullblock oflag=dsync
dd: error writing ‘/mnt/test.dd1’: No space left on device
190+0 records in
189+0 records out
193536 bytes (194 kB) copied, 0.428447 s, 452 kB/s

# umount /mnt

# wipefs -a -f /dev/sdb
/dev/sdb: 8 bytes were erased at offset 0x00010040 (btrfs): 5f 42 48 52 66 53 5f 4d

# mount -v -o degraded UUID=1bba71d2-b89a-448f-9189-6354be3e06cc /mnt 
mount: /dev/sdc mounted on /mnt.
# rm -fr /mnt/delete.dd1
# rm -fr /mnt/delete.dd2

# btrfs replace start -B 1 /dev/sdf /mnt
ERROR: ioctl(DEV_REPLACE_START) failed on "/mnt": No space left on device

# rm -fr /mnt/delete.dd3 

# gdb -q --args btrfs replace start -B 1 /dev/sdf /mnt
Reading symbols from /usr/sbin/btrfs...(no debugging symbols found)...done.
(gdb) run
Starting program: /sbin/btrfs replace start -B 1 /dev/sdf /mnt
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
ERROR: ioctl(DEV_REPLACE_START) failed on "/mnt": No space left on device
[Inferior 1 (process 1622) exited with code 01]
(gdb) run
Starting program: /sbin/btrfs replace start -B 1 /dev/sdf /mnt
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
ERROR: ioctl(DEV_REPLACE_START) failed on "/mnt": Read-only file system
[Inferior 1 (process 1633) exited with code 01]
(gdb) run
Starting program: /sbin/btrfs replace start -B 1 /dev/sdf /mnt
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
ERROR: ioctl(DEV_REPLACE_START) failed on "/mnt": Read-only file system
[Inferior 1 (process 1638) exited with code 01]
(gdb) run
Starting program: /sbin/btrfs replace start -B 1 /dev/sdf /mnt
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
ERROR: ioctl(DEV_REPLACE_START) failed on "/mnt": Read-only file system
[Inferior 1 (process 1643) exited with code 01]
(gdb) quit

# umount -v /mnt
umount: /mnt (/dev/sdc) unmounted

# mount -v -o degraded UUID=1bba71d2-b89a-448f-9189-6354be3e06cc /mnt
mount: /dev/sdc mounted on /mnt.

# gdb -q --args btrfs replace start -B 1 /dev/sdf /mnt
Reading symbols from /usr/sbin/btrfs...(no debugging symbols found)...done.
(gdb) run
Starting program: /sbin/btrfs replace start -B 1 /dev/sdf /mnt
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Program terminated with signal SIGSEGV, Segmentation fault.
The program no longer exists.
(gdb) backtrace
No stack.
(gdb) quit

In /var/log/messages:


kernel: assertion failed: 0, in fs/btrfs/dev-replace.c:499
kernel: ------------[ cut here ]------------
kernel: kernel BUG at fs/btrfs/ctree.h:3291!
kernel: invalid opcode: 0000 [#1] SMP PTI
kernel: CPU: 0 PID: 1679 Comm: btrfs Tainted: G        W         5.10.31-1.el7.x86_64 #1
kernel: RIP: 0010:assertfail.constprop.0+0x18/0x1a
kernel: Code: 00 00 00 74 05 e8 3a 97 04 00 48 83 c4 30 5d 41 5c c3 89 f1 48 c7 c2 14 96 c4 ad 48 89 fe 48 c7 c7 28 fe c3 ad e8 51 69 fe ff <0f> 0b 66 66 66 66 90 3e 80 a7 40 0a 00 00 f7 31 c9 48 81 c7 d8 0b
kernel: RSP: 0018:ffffa45441d33d70 EFLAGS: 00010246
kernel: RAX: 0000000000000032 RBX: ffff994444fc0000 RCX: 0000000000000000
kernel: RDX: 0000000000000000 RSI: ffff99447ce18a80 RDI: ffff99447ce18a80
kernel: RBP: ffff994444c64000 R08: ffff99447ffc7da8 R09: 0000000000027ffb
kernel: R10: 00000000ffff8000 R11: 3fffffffffffffff R12: ffff994446506e00
kernel: R13: ffff99444141d140 R14: ffff994444c64b10 R15: ffff994445962a00
kernel: FS:  00007ffff7ee78c0(0000) GS:ffff99447ce00000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007f49d24cfe58 CR3: 0000000035430000 CR4: 00000000000006f0
kernel: Call Trace:
kernel: btrfs_dev_replace_by_ioctl.cold+0x5e/0x276
kernel: btrfs_ioctl+0x2867/0x3000
kernel: ? do_sigaction+0x1c6/0x240
kernel: ? __x64_sys_ioctl+0x83/0xb0
kernel: __x64_sys_ioctl+0x83/0xb0
kernel: do_syscall_64+0x33/0x80
kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
kernel: RIP: 0033:0x7ffff7280307
kernel: Code: 44 00 00 48 8b 05 69 1b 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 39 1b 2d 00 f7 d8 64 89 01 48
kernel: RSP: 002b:00007fffffffc938 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
kernel: RAX: ffffffffffffffda RBX: 00007fffffffe5fd RCX: 00007ffff7280307
kernel: RDX: 00007fffffffcd70 RSI: 00000000ca289435 RDI: 0000000000000007
kernel: RBP: 0000000000000008 R08: 0000000000000000 R09: 00007fffffffc890
kernel: R10: 0000000000000008 R11: 0000000000000246 R12: 000055555564e050
kernel: R13: 000055555564e050 R14: 0000000000000001 R15: 0000000000000007
kernel: ---[ end trace e150833b1864e129 ]---
kernel: RIP: 0010:assertfail.constprop.0+0x18/0x1a
kernel: Code: 00 00 00 74 05 e8 3a 97 04 00 48 83 c4 30 5d 41 5c c3 89 f1 48 c7 c2 14 96 c4 ad 48 89 fe 48 c7 c7 28 fe c3 ad e8 51 69 fe ff <0f> 0b 66 66 66 66 90 3e 80 a7 40 0a 00 00 f7 31 c9 48 81 c7 d8 0b
kernel: RSP: 0018:ffffa45441d33d70 EFLAGS: 00010246
kernel: RAX: 0000000000000032 RBX: ffff994444fc0000 RCX: 0000000000000000
kernel: RDX: 0000000000000000 RSI: ffff99447ce18a80 RDI: ffff99447ce18a80
kernel: RBP: ffff994444c64000 R08: ffff99447ffc7da8 R09: 0000000000027ffb
kernel: R10: 00000000ffff8000 R11: 3fffffffffffffff R12: ffff994446506e00
kernel: R13: ffff99444141d140 R14: ffff994444c64b10 R15: ffff994445962a00
kernel: FS:  00007ffff7ee78c0(0000) GS:ffff99447ce00000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007f49d24cfe58 CR3: 0000000035430000 CR4: 00000000000006f0
kernel: ------------[ cut here ]------------
kernel: WARNING: CPU: 0 PID: 1673 at fs/btrfs/volumes.c:2973 btrfs_remove_chunk+0x638/0x6a0
kernel: CPU: 0 PID: 1673 Comm: btrfs-cleaner Tainted: G      D W         5.10.31-1.el7.x86_64 #1
kernel: RIP: 0010:btrfs_remove_chunk+0x638/0x6a0
kernel: Code: 8b 55 50 3e 48 0f ba aa 40 0a 00 00 02 8b 04 24 72 1d 83 f8 fb 74 39 83 f8 e2 74 34 89 c6 48 c7 c7 70 05 c4 ad e8 f4 49 79 00 <0f> 0b 8b 04 24 89 c1 ba 9d 0b 00 00 4c 89 ef 89 04 24 48 c7 c6 d0
kernel: RSP: 0018:ffffa45441dc3d90 EFLAGS: 00010286
kernel: RAX: 0000000000000000 RBX: 000000001a000000 RCX: ffff99447ce18a88
kernel: RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff99447ce18a80
kernel: RBP: ffff9944794bd3f0 R08: ffff99447ffc7da8 R09: 0000000000027ffb
kernel: R10: 00000000ffff8000 R11: 3fffffffffffffff R12: ffff994446448000
kernel: R13: ffff994444b8d270 R14: ffff994444c64380 R15: ffff994444c64380
kernel: FS:  0000000000000000(0000) GS:ffff99447ce00000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007f1cbf559000 CR3: 00000000054bc000 CR4: 00000000000006f0
kernel: Call Trace:
kernel: btrfs_delete_unused_bgs+0x6b6/0x770
kernel: cleaner_kthread+0xef/0x120
kernel: ? btree_invalidatepage+0x40/0x40
kernel: kthread+0x11b/0x140
kernel: ? __kthread_bind_mask+0x60/0x60
kernel: ret_from_fork+0x22/0x30
kernel: ---[ end trace e150833b1864e12a ]---
kernel: BTRFS: error (device sdc) in btrfs_remove_chunk:2973: errno=-28 No space left
kernel: BTRFS info (device sdc): forced readonly
kdave commented 3 years ago
assertion failed: 0, in fs/btrfs/dev-replace.c:499

Matching code from 5.10.31:

 491         down_write(&dev_replace->rwsem);                                                                                                                                            
 492         switch (dev_replace->replace_state) {                                                                                                                                       
 493         case BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED:                                                                                                                           
 494         case BTRFS_IOCTL_DEV_REPLACE_STATE_FINISHED:                                                                                                                                
 495         case BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED:                                                                                                                                
 496                 break;                                                                                                                                                              
 497         case BTRFS_IOCTL_DEV_REPLACE_STATE_STARTED:                                                                                                                                 
 498         case BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED:                                                                                                                               
 499                 ASSERT(0);                                                                                                                                                          
 500                 ret = BTRFS_IOCTL_DEV_REPLACE_RESULT_ALREADY_STARTED;                                                                                                               
 501                 up_write(&dev_replace->rwsem);                                                                                                                                      
 502                 goto leave;                                                                                                                                                         
 503         }   

Introduced by commit https://git.kernel.org/linus/5c06147128fbb and it says the assertion is impossible to hit. From the reproducer it seems that when replace fails to start due to ENOSPC it's in a state that the replace state machine does not expect or at least the assertion is based on a wrong assumption.