Open yuyichao opened 3 years ago
Unloading pcie-apple-m1-nvme probably shouldn't work at all; it doesn't unregister anything, as far as I can tell.
What I tried was unloading nvme: as far as I can tell, that forces a flush, but modprobe nvme won't work until after the next reboot.
I've just pushed changes to include nvme-cli in both the debootstrap and the initramfs. Can you confirm an explicit nvme -n 1 flush /dev/nvme0n1 writes data to the nvme?
I must confess I do not understand the nvme code very well, but it seems we might have to add a quirk to flush explicitly on sync...
I've experimented more, and the issue does not appear to happen when an ext4 or btrfs filesystem is mounted, but does happen when the only fs to be mounted is vfat or when there is no fs mounted and we write directly to /dev/nvme0n1p*.
The two filesystems differ in how they sync: vfat goes along the following call chain:
__sync_blockdev -> filemap_write_and_wait -> do_writepages -> writepages -> writepage -> block_write_full_page -> buffer_async_write ... -> nvme_queue_rq -> nvme_setup_cmd -> nvme_setup_rw
nvme_setup_rw does not check whether the __REQ_SYNC flag is set, so it always(?) performs an asynchronous write.
But even if it did check, we'd need to watch out for the situation where a page is dirty, we write it to disk asynchronously, a sync request comes in, checks whether any pages are dirty, and does nothing because none are.
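To make the race above concrete, here is a toy model in Python. It is only a sketch: `ToyCache`, `PageState`, `naive_sync` and `waiting_sync` are hypothetical names for illustration, not kernel data structures or functions. It shows why a sync that only looks for dirty pages can return while a write is still in flight.

```python
from enum import Enum


class PageState(Enum):
    CLEAN = "clean"
    DIRTY = "dirty"
    WRITEBACK = "writeback"  # write submitted, completion not yet seen


class ToyCache:
    """Toy page cache (hypothetical, for illustration only)."""

    def __init__(self):
        self.pages = {}       # page index -> PageState
        self.durable = set()  # pages whose data has reached the medium

    def dirty(self, idx):
        self.pages[idx] = PageState.DIRTY
        self.durable.discard(idx)

    def start_async_write(self, idx):
        # Submitting the write clears the dirty state immediately.
        if self.pages.get(idx) is PageState.DIRTY:
            self.pages[idx] = PageState.WRITEBACK

    def complete_write(self, idx):
        if self.pages.get(idx) is PageState.WRITEBACK:
            self.pages[idx] = PageState.CLEAN
            self.durable.add(idx)

    def naive_sync(self):
        # Only considers dirty pages; ignores writes still in flight.
        for idx, state in list(self.pages.items()):
            if state is PageState.DIRTY:
                self.start_async_write(idx)
                self.complete_write(idx)

    def waiting_sync(self):
        # Submits dirty pages, then also waits for in-flight writes.
        for idx, state in list(self.pages.items()):
            if state is PageState.DIRTY:
                self.start_async_write(idx)
        for idx, state in list(self.pages.items()):
            if state is PageState.WRITEBACK:
                self.complete_write(idx)


cache = ToyCache()
cache.dirty(0)
cache.start_async_write(0)  # asynchronous write is now in flight
cache.naive_sync()          # sees no dirty pages, returns immediately
print(0 in cache.durable)   # False: a reboot here would lose the data
cache.waiting_sync()
print(0 in cache.durable)   # True
```

In this model the fix is for sync to wait on pages in the writeback state as well as submitting dirty ones; whether that corresponds to what goes wrong on this hardware is exactly the open question.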
The good news is syncs appear to be fairly quick and painless, so the practical issue on this device is easily avoided.
@svenpeter42 said he could reproduce this issue with the Asahi Linux code, too. Is that still true? If so, can we report it upstream?
TL;DR: if an nvme device is accessed exclusively through /dev/nvme0n1pX, or exclusively used to mount a single fat fs on an nvme partition, sync() doesn't sync in a way that survives a reboot.
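For anyone who wants to check this on other hardware or on a mainline kernel, a reproduction could be scripted roughly as below. This is a minimal sketch, not a tested procedure for this bug: the helper names and the marker contents are made up for illustration, and the target path is whatever scratch partition or file you choose. Writing to a raw partition destroys whatever is stored there.

```python
import os

MARKER = b"sync-test-marker\n"  # arbitrary test pattern (assumption)


def write_marker(path, offset=0, marker=MARKER):
    """Write marker at offset, then fsync the file descriptor.

    fsync() is expected to return only once the data is on stable
    storage, so the marker should survive a reboot.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        written = os.pwrite(fd, marker, offset)
        os.fsync(fd)
    finally:
        os.close(fd)
    return written


def marker_present(path, offset=0, marker=MARKER):
    """Return True if marker is found at offset.

    Only meaningful after a reboot; before that, the read may be
    served from the page cache rather than from the device.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, len(marker), offset) == marker
    finally:
        os.close(fd)
```

The idea is to call `write_marker()` on a scratch partition (for example one of the /dev/nvme0n1pX nodes), reset the machine, and call `marker_present()` on the same path after the next boot. If it returns False even though fsync() succeeded, the write did not survive the reboot.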
Only if someone manages to reproduce it with a mainline kernel, otherwise the report will be ignored because the maintainers cannot know if the issue happens due to additional patches.
I hope you're not saying you're planning to introduce the bug into the mainline kernel only to then "report" it afterwards. We should certainly warn loudly about this in any discussion with the upstream maintainers, and as long as it seems likely the problem is one with the underlying code, they will be interested in resolving this before anything is upstreamed or merged, if no sooner.
I'm not sure what made you even think I said any of this; that's a surprisingly bad-faith interpretation of what I said.
If this is an issue with the currently existing nvme code it should be reproducible on a regular linux machine with a vanilla upstream kernel. Then it can be reported upstream and they will likely fix it.
If this is an issue specific to the Apple NVMe controller I will fix it before submitting my patches upstream.
@svenpeter42 I assure you no bad faith was assumed, but that is the way I read your statement. I'm glad you didn't mean it that way.
You're absolutely correct that it's possible that there is x86 hardware around that doesn't sync writes properly, either. The NVMe maintainers would probably know where to start looking.
This is a summary of my observations.
- Writing a single byte to the block file /dev/nvme0n1p[x] vs touching a few empty files on a fat32 file system seems to have consistent behavior (i.e. depending on the operation done afterwards, both will either be written back or both not).
- sync, writing to /proc/sys/vm/drop_caches, and SysRq-s all have no effect on whether the content is written back.
- Reboot/power off methods that work (data persists after reboot)
  - reboot command
  - poweroff command
- Reboot/power off methods that don't work (a small write never persists after reboot)
- rmmod-ing the pcie-apple-m1-nvme module
  - Without writing anything, potentially with some read
    - Run reboot/poweroff command after rmmod. Always (3 out of 3 times total) get a kernel panic in an interrupt handler. Seems that the rmmod didn't unregister it?
    - [...] rmmod. Nothing significant happens. (3 out of 3 times for each action)
  - reboot/poweroff command after rmmod. Same as when it was without writing. Requires a long press of the power button to reset, and the data is not kept (i.e. data lost).
  - [...] rmmod. SysRq-b didn't work one out of 3 times; otherwise, nothing significant and nothing got written.
- I got the system into a state where loading pcie-apple-m1-nvme always fails once (but booting into 1TR fixed it). Have not been able to reproduce.