Transferring a large number of files from btrfs to bcachefs causes I/O timeouts and freezes the whole system. This doesn't seem to be related to btrfs, but rather to the heavy I/O on the drive, as it happens even without btrfs mounted. Transferring the files to the HDD first, and then from the HDD to bcachefs on the NVMe drive, sometimes avoids the problem.
The problem only happens on bcachefs. It doesn't happen on the HDD, and sadly I can't test with other NVMe drives.
When the system is frozen it behaves like this: any drive access that isn't served from the RAM cache stalls. Apps that are already loaded in RAM keep working, but they freeze the moment they try to access the drive. This lasts until the drive is reset and the abort status messages appear in dmesg. After that the system is unfrozen for a moment, but if you keep copying files the problem recurs.
This drive is known to have had problems with power management in the past:
https://wiki.archlinux.org/title/Solid_state_drive/NVMe#Troubleshooting
But those problems have since been fixed with kernel workarounds / firmware updates.
This issue may be related: perhaps bcachefs does something differently from the other filesystems, so the workarounds don't apply, which would cause the bug to occur only on it. It may also be a problem in the nvme subsystem, or just some edge case in bcachefs, who knows.
I tried disabling ASPM and setting the latency to 0 as was suggested, but it didn't fix the problem, so I don't know.
If this is indeed specific to that drive, it will be hard to reproduce.
If anyone finds this and has a similar issue, please reply here, because right now I'm not sure if this is hardware related.
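For anyone comparing setups, here is a minimal sketch for checking whether the power-management workarounds mentioned above are active. It assumes that "disable ASPM" and "latency to 0" correspond to the kernel parameters pcie_aspm=off and nvme_core.default_ps_max_latency_us=0; the exact parameters used are not stated in this report. The has_param helper is a hypothetical name introduced only for this example.

```shell
#!/bin/sh
# Sketch: report whether the assumed workaround parameters are on the
# kernel command line. The parameter names below are an assumption.

# has_param CMDLINE PARAM: succeeds if PARAM is a whole word in CMDLINE
has_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

cmdline=$(cat /proc/cmdline 2>/dev/null || true)

for p in pcie_aspm=off nvme_core.default_ps_max_latency_us=0; do
    if has_param "$cmdline" "$p"; then
        echo "$p: present"
    else
        echo "$p: missing"
    fi
done

# Value the nvme_core module is actually using, if sysfs exposes it
f=/sys/module/nvme_core/parameters/default_ps_max_latency_us
if [ -r "$f" ]; then
    echo "default_ps_max_latency_us=$(cat "$f")"
fi
```

If a parameter shows as missing, the workaround was probably not applied at boot, which would be worth ruling out before blaming the filesystem.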
› uname -a
Linux hp-laptop 6.7.0 #1-NixOS SMP PREEMPT_DYNAMIC Sun Jan 7 20:18:38 UTC 2024 x86_64 GNU/Linux
› rg -z -i bcachefs /proc/config.gz
10478:CONFIG_BCACHEFS_FS=m
10479:CONFIG_BCACHEFS_QUOTA=y
10480:# CONFIG_BCACHEFS_ERASURE_CODING is not set
10481:CONFIG_BCACHEFS_POSIX_ACL=y
10482:# CONFIG_BCACHEFS_DEBUG_TRANSACTIONS is not set
10483:# CONFIG_BCACHEFS_DEBUG is not set
10484:# CONFIG_BCACHEFS_TESTS is not set
10485:# CONFIG_BCACHEFS_LOCK_TIME_STATS is not set
10486:# CONFIG_BCACHEFS_NO_LATENCY_ACCT is not set
! nvme list
Node           Generic     Model                   Namespace  Usage                      Format           FW Rev
-------------- ----------- ----------------------- ---------- -------------------------- ---------------- --------
/dev/nvme0n1   /dev/ng0n1  KINGSTON SA2000M8500G   0x1        348.70 GB / 500.11 GB      512 B + 0 B      S5Z42109
Can you report this to the list, and CC linux-block? This is probably an NVME bug, not a bcachefs bug, but since we're triggering it we'll have to help track it down.
Errors:
System info:
This is when it happens on my machine:
Please tell me what other info you need and how I should provide it.