Hi. Oops...))
It seems that I introduced this bug in e950c024d3a2d52839f79d9f54ae3d38cf223fcb while trying to fix another problem %).
what(): Error while doing local stabilize operation: Device or resource busy
means that an OSD tried to "stabilize" (i.e. commit, the second phase of the two-phase write) an object which wasn't synced yet...
I.e. normally an EC write looks like: write -> sync -> stabilize. But in your case something probably crashed or timed out, and an OSD which probably didn't crash (!) was left with an unsynced write in its RAM. Then another OSD was restarted and requested a listing of current objects from this one. It got this unstable write in the listing, tried to "stabilize" it, and got EBUSY, which in this case means "the write is not synced yet".
Soooooo. The bug won't reproduce if you use SSDs with capacitors and immediate_commit=all (or if you use replication instead of EC).
And for non-immediate-commit modes the proper fix seems to be to forcibly sync such an object when "stabilizing" it, instead of returning an error. I'll take a look into it.
Thanks for the detailed explanation @vitalif.
I will give it a try with EC + immediate_commit=all; we aren't worried about corruption for this particular data.
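For reference, a minimal sketch of turning immediate_commit on globally, assuming the cluster keeps its configuration under the /vitastor/config/global etcd key as described in the Vitastor docs (the etcd endpoint below is a placeholder, and OSD-side cache/fsync settings still have to match the drives):

# placeholder endpoint - point this at one of your own etcd servers
ETCDCTL_API=3 etcdctl --endpoints=http://10.0.0.1:2379 \
  put /vitastor/config/global '{"immediate_commit":"all"}'

The same setting can also go into /etc/vitastor/vitastor.conf on each node instead of etcd; either way the OSDs will likely need a restart to pick it up.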
I got better performance for a bit (9700 iops), but something eventually failed.
This machine locked up and restarted itself after this error:
bmhv9 kernel: watchdog: BUG: soft lockup - CPU#20 stuck for 22s! [vitastor-nbd:2014263]
Seems like it can't recover:
ubuntu@bmhv9:~$ vitastor-cli status
cluster:
etcd: 5 / 6 up, 1 M database size
mon: 5 up, master bmhv6
osd: 5 / 5 up
data:
raw: 8.5 G used, 8.4 T / 8.4 T available
state: 2.1 G clean, 1 M degraded
pools: 0 / 1 active
pgs: 248 active
3 repeering+degraded+has_degraded+has_unclean
5 stopping+degraded+has_degraded+has_unclean
bmhv9 kernel: watchdog: BUG: soft lockup - CPU#20 stuck for 22s! [vitastor-nbd:2014263]
Which kernel do you use? Seems like a kernel bug, may be related to io_uring
(I don't know what it is exactly but I've seen io_uring misbehaviour on various kernels several times %))
OK, this is on Ubuntu 20.04: 5.15.0-67-generic #74~20.04.1-Ubuntu SMP Wed Feb 22 14:52:34 UTC 2023 x86_64.
I can move my lab to Ubuntu 22.04; we are planning on doing that anyway.
I had to build the latest liburing from source. Should I pin a specific version of liburing, or just update to a newer kernel?
git clone https://github.com/axboe/liburing.git
cd liburing
./configure && make && make install
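As a quick sanity check, a sketch (assuming standard install paths and that vitastor-osd is on PATH) for confirming which kernel and which liburing copy are actually in use after the make install:

# kernel currently running
uname -r
# where liburing got installed
ldconfig -p | grep liburing
# which copy the OSD binary links against (it may also be statically linked)
ldd "$(command -v vitastor-osd)" | grep -i uring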
First of all, check the logs on that machine; maybe we can discover the cause... I mostly test everything on Debian and usually I don't get kernel errors. The last problems I got were related to RDMA and on ~5.4... Guys from the chat had some issues several months ago with Hyper-V and Ubuntu 5.15+ kernels though. So I'm not sure... can you look for errors in other logs?
hmm
Ran out of journal space (used_start=01b89000, next_free=01b67000, dirty_start=01b20000)
root@bmhv7:/home/ubuntu# tail -f /var/log/vitastor/osd2.log
[OSD 2] avg latency for subop 4 (sync): 833 us
[OSD 2] avg latency for subop 15 (ping): 269 us
Ran out of journal space (used_start=01d9c000, next_free=01d79000, dirty_start=01d46000)
[OSD 2] avg latency for op 1 (read): 17 us, B/W: 0.00 KB/s
[OSD 2] avg latency for op 3 (write_stable): 2409 us, B/W: 14.07 MB/s
[OSD 2] avg latency for op 4 (sync): 11818 us
[OSD 2] avg latency for op 12 (primary_write): 10995 us, B/W: 12.06 MB/s
[OSD 2] avg latency for subop 3 (write_stable): 758 us
[OSD 2] avg latency for subop 4 (sync): 1161 us
Ran out of journal space (used_start=00a9b000, next_free=00a76000, dirty_start=00a63000)
Ran out of journal space (used_start=00745000, next_free=00723000, dirty_start=006d4000)
[OSD 2] avg latency for op 1 (read): 17 us, B/W: 0.00 KB/s
[OSD 2] avg latency for op 3 (write_stable): 2601 us, B/W: 13.70 MB/s
[OSD 2] avg latency for op 4 (sync): 13053 us
[OSD 2] avg latency for op 12 (primary_write): 11119 us, B/W: 11.97 MB/s
[OSD 2] avg latency for subop 3 (write_stable): 734 us
[OSD 2] avg latency for subop 4 (sync): 935 us
[OSD 2] avg latency for subop 15 (ping): 349 us
Ran out of journal space (used_start=01b89000, next_free=01b67000, dirty_start=01b20000)
Ran out of journal space (used_start=00ea1000, next_free=00e7f000, dirty_start=00dee000)
[OSD 2] avg latency for op 1 (read): 20 us, B/W: 0.00 KB/s
[OSD 2] avg latency for op 3 (write_stable): 926 us, B/W: 29.62 MB/s
[OSD 2] avg latency for op 4 (sync): 3241 us
[OSD 2] avg latency for op 12 (primary_write): 5347 us, B/W: 25.24 MB/s
[OSD 2] avg latency for subop 3 (write_stable): 886 us
[OSD 2] avg latency for subop 4 (sync): 1211 us
[OSD 2] avg latency for subop 15 (ping): 326 us
Message from syslogd@bmhv7 at Mar 21 03:06:19 ...
kernel:[19704.166683] NMI watchdog: Watchdog detected hard LOCKUP on cpu 25
Message from syslogd@bmhv7 at Mar 21 03:06:19 ...
kernel:[19708.518669] watchdog: BUG: soft lockup - CPU#16 stuck for 26s! [kworker/16:6:158772]
Message from syslogd@bmhv7 at Mar 21 03:06:19 ...
kernel:[19708.538669] watchdog: BUG: soft lockup - CPU#23 stuck for 23s! [runc:803276]
Message from syslogd@bmhv7 at Mar 21 03:06:23 ...
kernel:[19712.478649] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [migration/1:22]
Message from syslogd@bmhv7 at Mar 21 03:06:23 ...
kernel:[19712.506649] watchdog: BUG: soft lockup - CPU#11 stuck for 22s! [virt-handler:5247]
Message from syslogd@bmhv7 at Mar 21 03:06:23 ...
kernel:[19712.510649] watchdog: BUG: soft lockup - CPU#12 stuck for 22s! [vitastor-nbd:161338]
Message from syslogd@bmhv7 at Mar 21 03:06:23 ...
kernel:[19712.518649] watchdog: BUG: soft lockup - CPU#15 stuck for 22s! [mongo:803283]
Message from syslogd@bmhv7 at Mar 21 03:06:23 ...
kernel:[19712.530648] watchdog: BUG: soft lockup - CPU#20 stuck for 22s! [kworker/20:6:161420]
"Ran out of journal space" is fine. Nothing more in OSD logs? dmesg? And something hangs?
I collected the logs from all of the nodes. Seems like everything lost connection to OSD2 first. Let me know what you think.
OSD: https://gist.github.com/rayjanoka/141e6617190e3c537a2600c6547b864d
journal: https://gist.github.com/rayjanoka/82a094110e1de07d75b5a961f1ee0d02
I upgraded to Ubuntu 22.04 and built from source the same way I was doing it and it looks like the issue is resolved!
Yeah, OSD 2 hung first. I'm not sure about "has_unclean", "unclean" objects should have been resolved even with 1 failed OSD and it should've continued operating in degraded mode... But I'm glad to hear that the lockup itself is resolved after updating, seems that 5.15.0-67-generic was just buggy :-)
Mar 21 15:47:00 bmhv9 kernel: NMI watchdog: Watchdog detected hard LOCKUP on cpu 26
Mar 21 15:47:00 bmhv9 kernel: Call Trace:
Mar 21 15:47:00 bmhv9 kernel: <IRQ>
Mar 21 15:47:00 bmhv9 kernel: _raw_spin_lock+0x22/0x30
Mar 21 15:47:00 bmhv9 kernel: take_dentry_name_snapshot+0x20/0x80
Mar 21 15:47:00 bmhv9 kernel: __fsnotify_parent+0x1d6/0x310
Mar 21 15:47:00 bmhv9 kernel: ? native_smp_send_reschedule+0x2a/0x50
Mar 21 15:47:00 bmhv9 kernel: ? mempool_free_slab+0x17/0x20
Mar 21 15:47:00 bmhv9 kernel: __io_complete_rw_common+0x132/0x180
Mar 21 15:47:00 bmhv9 kernel: ? __io_complete_rw_common+0x132/0x180
Mar 21 15:47:00 bmhv9 kernel: io_complete_rw+0x17/0x50
Mar 21 15:47:00 bmhv9 kernel: blkdev_bio_end_io+0x8c/0x110
Mar 21 15:47:00 bmhv9 kernel: bio_endio+0x106/0x170
Mar 21 15:47:00 bmhv9 kernel: blk_update_request+0x249/0x4d0
Mar 21 15:47:00 bmhv9 kernel: blk_mq_end_request+0x21/0x140
Mar 21 15:47:00 bmhv9 kernel: nvme_complete_rq+0x80/0x260 [nvme_core]
Mar 21 15:47:00 bmhv9 kernel: nvme_pci_complete_rq+0x82/0xa0 [nvme]
Mar 21 15:47:00 bmhv9 kernel: nvme_irq+0x15f/0x250 [nvme]
Mar 21 15:47:00 bmhv9 kernel: __handle_irq_event_percpu+0x42/0x170
Mar 21 15:47:00 bmhv9 kernel: handle_irq_event_percpu+0x33/0x80
Mar 21 15:47:00 bmhv9 kernel: handle_irq_event+0x3b/0x60
Mar 21 15:47:00 bmhv9 kernel: handle_edge_irq+0x93/0x1c0
Mar 21 15:47:00 bmhv9 kernel: __common_interrupt+0x52/0xe0
Mar 21 15:47:00 bmhv9 kernel: common_interrupt+0x89/0xa0
Mar 21 15:47:00 bmhv9 kernel: </IRQ>
Still having a similar issue... here is a new crash on a 6.0.1 kernel:
Mar 25 04:45:53 bmhv9 kernel: nbd0: detected capacity change from 0 to 20971520
Mar 25 04:45:53 bmhv9 kernel: EXT4-fs (nbd0): mounted filesystem with ordered data mode. Quota mode: none.
Mar 25 04:45:54 bmhv9 systemd-udevd[6086]: Using default interface naming scheme 'v249'.
Mar 25 04:45:54 bmhv9 networkd-dispatcher[1209]: WARNING:Unknown index 15 seen, reloading interface list
Mar 25 04:45:54 bmhv9 systemd-udevd[6131]: Using default interface naming scheme 'v249'.
Mar 25 04:45:54 bmhv9 systemd-networkd[1163]: lxc547e8e7a8b0e: Link UP
Mar 25 04:45:54 bmhv9 kernel: eth0: renamed from tmp49987
Mar 25 04:45:54 bmhv9 systemd-networkd[1163]: lxc547e8e7a8b0e: Gained carrier
Mar 25 04:45:54 bmhv9 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Mar 25 04:45:54 bmhv9 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): lxc547e8e7a8b0e: link becomes ready
Mar 25 04:45:54 bmhv9 networkd-dispatcher[1209]: ERROR:Unknown interface index 15 seen even after reload
Mar 25 04:45:54 bmhv9 networkd-dispatcher[1209]: WARNING:Unknown index 15 seen, reloading interface list
Mar 25 04:45:54 bmhv9 networkd-dispatcher[1209]: ERROR:Unknown interface index 15 seen even after reload
Mar 25 04:45:54 bmhv9 networkd-dispatcher[1209]: WARNING:Unknown index 15 seen, reloading interface list
Mar 25 04:45:54 bmhv9 networkd-dispatcher[1209]: ERROR:Unknown interface index 15 seen even after reload
Mar 25 04:45:54 bmhv9 systemd[1]: Started libcontainer container 499874f395d0fcd0d450a9d8e00e4edd80e2c73e0fee338f7c123381ce45e388.
Mar 25 04:45:55 bmhv9 systemd-networkd[1163]: lxc547e8e7a8b0e: Gained IPv6LL
Mar 25 04:45:56 bmhv9 systemd[1]: Started libcontainer container d64c7126e869c0eef9435a46e28ec7538c6dead8250634ea0572f09e4f1cc9fd.
Mar 25 04:46:11 bmhv9 node[5759]: Moving pool 1 (testpool) PG 88 primary OSD from 3 to 4
Mar 25 04:46:11 bmhv9 node[5759]: Moving pool 1 (testpool) PG 90 primary OSD from 3 to 4
Mar 25 04:46:11 bmhv9 node[5759]: Moving pool 1 (testpool) PG 95 primary OSD from 3 to 2
Mar 25 04:46:11 bmhv9 node[5759]: Moving pool 1 (testpool) PG 96 primary OSD from 3 to 2
Mar 25 04:46:11 bmhv9 node[5759]: Moving pool 1 (testpool) PG 97 primary OSD from 3 to 4
Mar 25 04:46:11 bmhv9 node[5759]: Moving pool 1 (testpool) PG 98 primary OSD from 3 to 4
Mar 25 04:46:11 bmhv9 node[5759]: Moving pool 1 (testpool) PG 100 primary OSD from 3 to 2
Mar 25 04:46:11 bmhv9 node[5759]: Moving pool 1 (testpool) PG 101 primary OSD from 3 to 2
...
Mar 25 04:46:11 bmhv9 node[5759]: Moving pool 1 (testpool) PG 234 primary OSD from 3 to 2
Mar 25 04:46:11 bmhv9 node[5759]: Moving pool 1 (testpool) PG 238 primary OSD from 3 to 2
Mar 25 04:46:11 bmhv9 node[5759]: Moving pool 1 (testpool) PG 239 primary OSD from 3 to 2
Mar 25 04:46:11 bmhv9 node[5759]: Moving pool 1 (testpool) PG 248 primary OSD from 3 to 2
Mar 25 04:46:11 bmhv9 node[5759]: Moving pool 1 (testpool) PG 251 primary OSD from 3 to 2
Mar 25 04:46:11 bmhv9 node[5759]: Moving pool 1 (testpool) PG 252 primary OSD from 3 to 4
Mar 25 04:46:11 bmhv9 node[5759]: Moving pool 1 (testpool) PG 256 primary OSD from 3 to 2
Mar 25 04:46:11 bmhv9 node[5759]: PG configuration successfully changed
Mar 25 04:46:12 bmhv9 etcd[1343]: time="2023-03-25T04:46:12Z" level=warning msg="error reading websocket message: websocket: close 1006 (abnormal closure): unexpected EOF"
Mar 25 04:46:44 bmhv9 kernel: block nbd0: Connection timed out, retrying (1/1 alive)
Mar 25 04:46:44 bmhv9 kernel: block nbd0: Connection timed out, retrying (0/1 alive)
Mar 25 04:46:44 bmhv9 kernel: block nbd0: Receive control failed (result -32)
Mar 25 04:46:44 bmhv9 kernel: block nbd0: Connection timed out, retrying (0/1 alive)
Mar 25 04:46:44 bmhv9 kernel: block nbd0: shutting down sockets
Mar 25 04:46:44 bmhv9 kernel: block nbd0: Connection timed out, retrying (0/1 alive)
Mar 25 04:46:44 bmhv9 kernel: block nbd0: Connection timed out, retrying (0/1 alive)
Mar 25 04:46:44 bmhv9 kernel: I/O error, dev nbd0, sector 196872 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
Mar 25 04:46:44 bmhv9 kernel: block nbd0: Connection timed out, retrying (0/1 alive)
Mar 25 04:46:44 bmhv9 kernel: EXT4-fs warning (device nbd0): ext4_end_bio:343: I/O error 10 writing to inode 12 starting block 24609)
Mar 25 04:46:44 bmhv9 kernel: block nbd0: Connection timed out, retrying (0/1 alive)
Mar 25 04:46:44 bmhv9 kernel: I/O error, dev nbd0, sector 196944 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
Mar 25 04:46:44 bmhv9 kernel: buffer_io_error: 27781 callbacks suppressed
Mar 25 04:46:44 bmhv9 kernel: Buffer I/O error on device nbd0, logical block 24609
Mar 25 04:46:44 bmhv9 kernel: block nbd0: Connection timed out, retrying (0/1 alive)
Mar 25 04:46:44 bmhv9 kernel: EXT4-fs warning (device nbd0): ext4_end_bio:343: I/O error 10 writing to inode 12 starting block 24618)
Mar 25 04:46:44 bmhv9 kernel: block nbd0: Connection timed out, retrying (0/1 alive)
Mar 25 04:46:44 bmhv9 kernel: I/O error, dev nbd0, sector 197144 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
Mar 25 04:46:44 bmhv9 kernel: Buffer I/O error on device nbd0, logical block 24618
Mar 25 04:46:44 bmhv9 kernel: block nbd0: Connection timed out, retrying (0/1 alive)
Mar 25 04:46:44 bmhv9 kernel: EXT4-fs warning (device nbd0): ext4_end_bio:343: I/O error 10 writing to inode 12 starting block 24643)
Mar 25 04:46:44 bmhv9 kernel: I/O error, dev nbd0, sector 41240 op 0x1:(WRITE) flags 0x0 phys_seg 512 prio class 2
Mar 25 04:46:44 bmhv9 kernel: BUG: kernel NULL pointer dereference, address: 00000000000000b8
Mar 25 04:46:44 bmhv9 kernel: #PF: supervisor read access in kernel mode
Mar 25 04:46:45 bmhv9 kernel: #PF: error_code(0x0000) - not-present page
<reboot>
Hm. 1) Can you show me other logs? 2) Can you describe the whole test scenario so I can just reproduce it myself in VMs?
Ok. I fixed this bug, and I also fixed six other EC-related bugs %))). Now my OSD-killing test (tests/test_heal.sh) passes correctly with EC; it was often hanging previously, and it even masked a bug in the write path. I'll build a new release soon; for now you can try to build it yourself from the master branch.
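For anyone who wants to try that, a rough sketch of a master-branch build, assuming the public GitHub mirror and a standard CMake out-of-tree build (the repository README remains the authoritative source for dependencies and exact steps):

# assumes build dependencies (cmake, g++, liburing headers, etc.) are already installed
git clone https://github.com/vitalif/vitastor.git
cd vitastor
mkdir -p build && cd build
cmake .. && make -j"$(nproc)"
sudo make install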
Hey, I have a clean v0.8.6 setup with 5 nodes, but having some issues.
Running this EC pool:
I have some images set up.
I was running fio in a container on one of the PVC images when things stopped working. None of the servers crashed (I still have 5 up), but now it won't recover.
Separate OSDs have this error
OSD2 crash?
OSD3 crash?
Seeing a bit of this around.