pipcet / pearl

GNU/Linux on Apple M1 hardware
GNU General Public License v3.0

nvme content being lost after reboot #4

Open yuyichao opened 3 years ago

yuyichao commented 3 years ago

This is a summary of my observations; a shell sketch of the reproduction steps follows the list.

  1. Writing a single byte directly to the block device /dev/nvme0n1p[x] and touching a few empty files on a FAT32 filesystem behave consistently: depending on the operation performed afterwards, either both writes persist or neither does.

  2. sync, writing to /proc/sys/vm/drop_caches, and SysRq-s all have no effect on whether the content is written back.

  3. Reboot/power-off methods that work (data persists after reboot):

    • Run reboot command
    • Run poweroff command
  4. Reboot/power-off methods that don't work (a small write never persists after reboot):

    • Long press power button
    • SysRq-b
  5. Removing the pcie-apple-m1-nvme module with rmmod:

    • Without writing anything, potentially after some reads:

      • Run the reboot / poweroff command after rmmod. This always (3 out of 3 times) produced a kernel panic in an interrupt handler:

        (screenshot of the kernel panic: 20210613_203615)

        It seems that rmmod didn't unregister the handler?

      • SysRq-b or long press of the power button after rmmod. Nothing significant happens (3 out of 3 times for each action).
    • With a single byte written to the block device directly
      • Run the reboot / poweroff command after rmmod. Same as the no-write case above (kernel panic): a long press of the power button is required to reset, and the data is not kept (i.e. the write is lost).
      • SysRq-b or long press of the power button after rmmod. SysRq-b failed to reset once out of 3 times; otherwise, nothing significant happened and nothing got written.
  6. I once got the system into a state where loading pcie-apple-m1-nvme always failed (booting into 1TR fixed it). I have not been able to reproduce this.
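A minimal sketch of the reproduction steps above (all device and mount paths are examples picked for illustration, and the dd write is destructive):

# WARNING: destructive; overwrites the first byte of the partition (example path).
printf 'x' | dd of=/dev/nvme0n1p2 bs=1 count=1 conv=notrunc

# The equivalent FAT32-level test (mount point is an example):
mount /dev/nvme0n1p1 /mnt/test
touch /mnt/test/testfile

# None of these affect whether the data is written back (observation 2):
sync
echo 3 > /proc/sys/vm/drop_caches
echo s > /proc/sysrq-trigger

# The writes survive a reboot/poweroff command, but not SysRq-b:
echo b > /proc/sysrq-trigger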

pipcet commented 3 years ago

Unloading pcie-apple-m1-nvme probably shouldn't work at all; it doesn't unregister anything, as far as I can tell.

What I tried was unloading nvme: as far as I can tell, that forces a flush, but modprobe nvme won't work until after the next reboot.
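A sketch of that check, reusing a small direct write and sync as in the sketch above (paths are examples):

# Unloading the core nvme module appears to force a flush on this
# hardware; it fails if a filesystem on the device is still mounted.
rmmod nvme

# Reloading then fails until the next reboot (observed, not expected):
modprobe nvme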

pipcet commented 3 years ago

I've just pushed changes to include nvme-cli in both the debootstrap and the initramfs. Can you confirm that an explicit

nvme flush /dev/nvme0n1 -n 1

writes data to the nvme?

I must confess I do not understand the nvme code very well, but it seems we might have to add a quirk to flush explicitly on sync...
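To make the test concrete, something like the following should tell us whether the explicit flush makes a write durable (device and partition paths are examples; SysRq-b is one of the reset methods that loses data above):

# WARNING: destructive; overwrites the first byte of the partition (example path).
printf 'x' | dd of=/dev/nvme0n1p2 bs=1 count=1 conv=notrunc
sync

# Explicit flush of namespace 1 via nvme-cli:
nvme flush /dev/nvme0n1 -n 1

# Hard reset via SysRq-b, which loses the write when no flush is issued:
echo b > /proc/sysrq-trigger

# After reboot, check whether the byte survived:
dd if=/dev/nvme0n1p2 bs=1 count=1 2>/dev/null | xxd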

pipcet commented 3 years ago

I've experimented more, and the issue does not appear to happen when an ext4 or btrfs filesystem is mounted, but does happen when the only fs to be mounted is vfat or when there is no fs mounted and we write directly to /dev/nvme0n1p*.

The filesystems differ in how they sync: vfat goes through the following call chain:

__sync_blockdev -> filemap_write_and_wait -> do_writepages -> writepages -> writepage -> block_write_full_page -> buffer_async_write ... -> nvme_queue_rq -> nvme_setup_cmd -> nvme_setup_rw

nvme_setup_rw does not check whether the __REQ_SYNC flag is set, so it always(?) performs an asynchronous write.

But even if it did check, we'd need to watch out for the situation where a page is dirty, we write it to disk asynchronously, and a sync request then comes in, sees no dirty pages, and does nothing.

The good news is syncs appear to be fairly quick and painless, so the practical issue on this device is easily avoided.
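One way to check what actually reaches the device is to watch the block layer while the sync runs; a sketch, assuming blktrace/blkparse are installed and the paths are the example ones from above:

# Trace requests on the device (run in one shell):
blktrace -d /dev/nvme0n1 -o - | blkparse -i -

# In another shell, dirty a page through the vfat mount and sync:
touch /mnt/test/testfile
sync

# In the blkparse output, the RWBS column carries the request flags:
# 'W' is a plain write, 'WS' a sync write, and a leading 'F' marks a
# flush. If no flush requests ever show up, the sync never turns into
# an NVMe flush command at the device.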

pipcet commented 3 years ago

@svenpeter42 said he could reproduce this issue with the Asahi Linux code, too. Is that still true? If so, can we report it upstream?

TL;DR: if an nvme device is accessed exclusively through /dev/nvme0n1pX, or used only to mount a single FAT filesystem on an nvme partition, sync() doesn't sync in a way that survives a reboot.

svenpeter42 commented 3 years ago

Only if someone manages to reproduce it with a mainline kernel; otherwise the report will be ignored, because the maintainers cannot know whether the issue is caused by additional patches.

pipcet commented 3 years ago

Only if someone manages to reproduce it with a mainline kernel; otherwise the report will be ignored, because the maintainers cannot know whether the issue is caused by additional patches.

I hope you're not saying you're planning to introduce the bug into the mainline kernel only to "report" it afterwards. We should certainly warn loudly about this in any discussion with the upstream maintainers, and as long as the problem seems likely to be in the underlying code, they will be interested in resolving it before anything is upstreamed or merged, if not sooner.

svenpeter42 commented 3 years ago

I'm not sure what made you even think I said any of this; that's a surprisingly bad-faith interpretation of what I said.

If this is an issue with the currently existing nvme code, it should be reproducible on a regular Linux machine with a vanilla upstream kernel. Then it can be reported upstream and they will likely fix it.

If this is an issue specific to the Apple NVMe controller I will fix it before submitting my patches upstream.

pipcet commented 3 years ago

@svenpeter42 I assure you no bad faith was assumed, but that is the way I read your statement. I'm glad you didn't mean it that way.

You're absolutely right that there may be x86 hardware around that doesn't sync writes properly either. The NVMe maintainers would probably know where to start looking.