koverstreet / bcachefs

bcachefs as root file system is hanging indefinitely, unknown cause #712

Open koshell opened 1 month ago

koshell commented 1 month ago

For context: I'm installing Arch on a tiny little laptop. I have only just gotten the thing booting into the distro (haven't even created a user account yet) and for some reason my root partition (that I formatted with bcachefs) appears to be causing issues.

Specifically, reading files from the device appears to work, but writing anything to it hangs forever, which makes diagnosing the issue difficult. I'm not ruling out a hardware issue (laptop is a pile of garbage on the best of days), but I have confirmed that 'badblocks' doesn't see anything wrong with the underlying block device.

I was able to extract a dmesg log to a USB drive, which I hope helps. It appears to be complaining about 'bch-copygc', but I really don't know how to interpret these errors, so I'll just include the log in its entirety: kernel.log

koshell commented 1 month ago

I'm unsure what other information would be of value; if anything else would help, I can try to collect it.

koverstreet commented 1 month ago

Kernel version would be a starting point.

koshell commented 1 month ago

Linux version: 6.9.7-arch1-1

koshell commented 1 month ago

This is one of the errors I saw in the log:

[  368.345299] INFO: task bch-copygc/dm-0:298 blocked for more than 122 seconds.
[  368.348641]       Not tainted 6.9.7-arch1-1 #1
[  368.352015] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  368.353103] task:bch-copygc/dm-0 state:D stack:0     pid:298   tgid:298   ppid:2      flags:0x00004000
[  368.353115] Call Trace:
[  368.353118]  <TASK>
[  368.353122]  __schedule+0x3c7/0x1510
[  368.353136]  schedule+0x27/0xf0
[  368.353142]  __closure_sync+0x7e/0x140
[  368.353150]  __bch2_write+0x136b/0x1660 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[  368.353316]  ? six_relock_ip+0x38/0x80 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[  368.353459]  ? local_clock_noinstr+0xd/0xd0
[  368.353465]  ? __kmalloc+0x1a7/0x440
[  368.353475]  ? local_clock_noinstr+0xd/0xd0
[  368.353480]  ? local_clock+0x15/0x30
[  368.353487]  ? bch2_moving_ctxt_do_pending_writes+0x11a/0x220 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[  368.353633]  bch2_moving_ctxt_do_pending_writes+0x11a/0x220 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[  368.353780]  ? bch2_btree_path_traverse_one+0x958/0xcf0 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[  368.353908]  bch2_data_update_init+0x68b/0x1420 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[  368.354052]  ? bch2_move_extent+0x3da/0xed0 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[  368.354197]  bch2_move_extent+0x3da/0xed0 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[  368.354347]  ? bch2_evacuate_bucket+0x9d4/0xc00 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[  368.354492]  bch2_evacuate_bucket+0x9d4/0xc00 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[  368.354644]  ? bch2_copygc+0x210/0x880 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[  368.354816]  bch2_copygc+0x210/0x880 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[  368.354972]  bch2_copygc_thread+0x152/0x3d0 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[  368.355105]  ? bch2_copygc_thread+0xcf/0x3d0 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[  368.355238]  ? __pfx_bch2_copygc_thread+0x10/0x10 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[  368.355367]  kthread+0xd2/0x100
[  368.355374]  ? __pfx_kthread+0x10/0x10
[  368.355380]  ret_from_fork+0x34/0x50
[  368.355386]  ? __pfx_kthread+0x10/0x10
[  368.355391]  ret_from_fork_asm+0x1a/0x30
[  368.355398]  </TASK>

I'll see if I can get it to mount the debugfs and get more info.
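For anyone else digging through a long kernel.log for reports like the one above, here is a rough sketch of how the hung-task blocks could be pulled out with a script. It assumes the log uses the same line layout as the excerpt in this comment; the function name `parse_hung_tasks` and the regular expressions are my own illustration, not part of any bcachefs or kernel tooling. Frames prefixed with `?` are skipped, since the kernel marks those as unreliable stack entries.

```python
import re

# Header line of a hung-task report, e.g.
# "INFO: task bch-copygc/dm-0:298 blocked for more than 122 seconds."
HUNG_RE = re.compile(
    r"INFO: task (?P<name>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)

# A call-trace frame after the timestamp, e.g.
# "]  __bch2_write+0x136b/0x1660 [bcachefs ...]"
# A leading "? " marks a frame the kernel considers unreliable.
FRAME_RE = re.compile(
    r"\]\s+(?P<unreliable>\? )?"
    r"(?P<sym>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def parse_hung_tasks(log_text):
    """Return one dict per hung-task report found in log_text.

    Hypothetical helper: assumes the dmesg layout shown in this issue.
    """
    reports = []
    current = None
    for line in log_text.splitlines():
        header = HUNG_RE.search(line)
        if header:
            current = {
                "task": header["name"],
                "pid": int(header["pid"]),
                "seconds": int(header["secs"]),
                "frames": [],
            }
            reports.append(current)
            continue
        if current is None:
            continue
        if "</TASK>" in line:
            # End of this report's call trace.
            current = None
            continue
        frame = FRAME_RE.search(line)
        if frame and not frame["unreliable"]:
            current["frames"].append(frame["sym"])
    return reports
```

Run against the trace above, the reliable frames should read top to bottom from `__schedule` down through `__bch2_write` and `bch2_copygc_thread`, which makes it easier to see at a glance where the task is blocked.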

koverstreet commented 1 month ago

That looks like copygc blocking on the allocator - doh.

Try 6.10; it dumps a bunch of info when the allocator gets stuck.
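As a side note for anyone scripting a check for this: comparing kernel releases as plain strings goes wrong here, because "6.9" sorts after "6.10" lexically. A minimal sketch of a numeric comparison follows; `kernel_at_least` is a hypothetical helper name, and it only looks at the major and minor numbers of the release string.

```python
import re


def kernel_at_least(release, minimum):
    """Return True if a release string such as "6.9.7-arch1-1" is at
    least the (major, minor) tuple given in minimum.

    Hypothetical helper for illustration; patch level and distro
    suffix are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    # Compare as integer tuples so that 6.10 correctly ranks above 6.9.
    return (int(match[1]), int(match[2])) >= minimum


# The version reported earlier in this thread predates 6.10.
print(kernel_at_least("6.9.7-arch1-1", (6, 10)))
```

On a running Linux system, the release string can be taken from `platform.release()` in the standard library instead of being typed in by hand.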