openzfsonwindows / openzfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Crash in zil_open #305

Open andrewc12 opened 10 months ago

andrewc12 commented 10 months ago

Describe the problem you're observing

SYSTEM_THREAD_EXCEPTION_NOT_HANDLED (7e) when trying to import pool

Describe how to reproduce the problem

previously

... ... zpool add tank_ntfs //./C:/zfs/ntfs_03.zfs

install zfs again

zpool import -N -d C:\zfs tank_ntfs
zpool remove tank_ntfs //./C:/zfs/ntfs_03.zfs
zpool remove -w tank_ntfs //./C:/zfs/ntfs_03.zfs
zpool export tank_ntfs

crash 1

info.txt cbuf.txt

zpool import -N -d C:\zfs tank_ntfs

crash 2

info.txt cbuf.txt

EchterAgo commented 10 months ago

Why is it trying to write to a zvol? In both cases zvol_os_attach was called just before the end of buffer marker, then it goes through a write path that should only happen on the first write to the zvol (comment: Open a ZIL if this is the first time we have written to this zvol.).

andrewc12 commented 10 months ago

Unfortunately I'm not sure if this pool had zvols on it

EchterAgo commented 10 months ago

@andrewc12 could you try importing that pool again with zil_replay_disable set?

EchterAgo commented 10 months ago

I suspect in zvol_os_create_minor zil_replay failed, which caused zil_close to not be called, but zv->zv_zilog was still set to NULL. Now on first write, it sees that zv_zilog is NULL and wants to zil_open, but it is already open.
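The suspected sequence can be modelled in a few lines of C. This is only a sketch of the hypothesis above, not the real OpenZFS code: every `mock_*` name is a hypothetical stand-in, and the real `zil_open()` would trip an ASSERT where the mock returns `false`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real OpenZFS structures. */
typedef struct mock_objset {
	bool zil_is_open;		/* models "this ZIL is already open" */
} mock_objset_t;

typedef struct mock_zvol {
	mock_objset_t *zv_objset;
	mock_objset_t *zv_zilog;	/* NULL means "ZIL not opened yet" */
} mock_zvol_t;

/* Returns false where the real zil_open() would hit its ASSERT. */
static bool
mock_zil_open(mock_objset_t *os)
{
	if (os == NULL || os->zil_is_open)
		return (false);
	os->zil_is_open = true;
	return (true);
}

static void
mock_zil_close(mock_objset_t *os)
{
	os->zil_is_open = false;
}

/* Suspected minor creation: handle cleared even if the ZIL stays open. */
static void
mock_create_minor(mock_zvol_t *zv, bool replay_failed)
{
	(void) mock_zil_open(zv->zv_objset);
	if (!replay_failed)
		mock_zil_close(zv->zv_objset);
	zv->zv_zilog = NULL;
}

/* First-write path: open the ZIL lazily if the zvol has no handle. */
static bool
mock_first_write(mock_zvol_t *zv)
{
	if (zv->zv_zilog == NULL) {
		if (!mock_zil_open(zv->zv_objset))
			return (false);
		zv->zv_zilog = zv->zv_objset;
	}
	return (true);
}
```

When `replay_failed` is false the first write opens the ZIL cleanly. When it is true, the handle is NULL but the ZIL is still open, so the first write attempts a second open and fails.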

andrewc12 commented 10 months ago

@EchterAgo I was being very silly, this pool pretty much only has a zvol on it: a sparse 2 TB NTFS zvol.

set Computer\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\OpenZFS\zfs_zil zil_replay_disable to 1
waited a minute
zpool import -N -d C:\zfs tank_ntfs

crash 3

info.txt cbuf.txt

at next boot checked Computer\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\OpenZFS\zfs_zil zil_replay_disable is set to 1
zpool import -N -d C:\zfs tank_ntfs

crash 4

info.txt cbuf.txt

lundman commented 10 months ago

Lesse, the very first crash, we die in

zilog_t *zilog = dmu_objset_zil(os);

"most likely" because os is NULL, which is the same as calling zil_open() with zv->zv_objset == NULL

as you are exporting it, we are probably freeing zv->zv_objset and zv, and another write came in at the perfect time. Because zil appeared not open (already closed) it attempts to open it (again) with a NULL.

So similar to unmount, we need to have a think about the teardown steps here.

lundman commented 10 months ago

I am unsure why the import "info.txt" indicates the exact same location, that is weird.

andrewc12 commented 10 months ago

260f628d650dcdfcf1323f877a66ac14b74536f5
zpool import -N -d C:\zfs tank_ntfs

crash 5

info.txt cbuf.txt

andrewc12 commented 10 months ago

just noticed that was not related to this

lundman commented 10 months ago
 OpenZFS!zil_open(struct objset * os = 0xffffde86`217af680, <function> * get_data = 0xfffff801`7e9acab0, struct zil_sums * zil_sums = 0x00000000`00000000)+0x8d [D:\a\openzfs\openzfs\module\zfs\zil.c @ 3800] 

ok so not that

lundman commented 10 months ago

Why does it say "breakpoint" tho, it isn't the ASSERTs.

EchterAgo commented 10 months ago

> Why does it say "breakpoint" tho, it isn't the ASSERTs.

zil.c:3800 is ASSERT3P(zilog->zl_get_data, ==, NULL);, isn't it?

I still think zil_open was called on an already open zil

lundman commented 10 months ago

I classified it in my head as similar to the unmount problem: it needs to be a bit better protected. We go and close the zil and dmu, which is set to NULL, then another read/write comes in, so it attempts to open the zil again, and dmu is now NULL. So the fix would be in a similar style: use something like zfs_enter() in the call-in to read/write. It got pushed down the priorities while we are dealing with the other thing.
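A rough illustration of that kind of enter/exit protection, assuming a simplified guard built on C11 atomics. None of the `guarded_*` names exist in OpenZFS, and real kernel code would use a sleeping lock rather than the spin shown here. I/O that arrives after teardown has started is rejected with EIO instead of reopening the zil against a NULL objset.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical guarded zvol; these names are for illustration only. */
typedef struct guarded_zvol {
	atomic_int	gz_active;	/* I/O calls currently inside the guard */
	atomic_bool	gz_closing;	/* set once teardown has started */
	void		*gz_objset;	/* becomes NULL after teardown */
} guarded_zvol_t;

/* Register as in-flight first, then check whether teardown began. */
static int
guarded_enter(guarded_zvol_t *gz)
{
	atomic_fetch_add(&gz->gz_active, 1);
	if (atomic_load(&gz->gz_closing)) {
		atomic_fetch_sub(&gz->gz_active, 1);
		return (EIO);
	}
	return (0);
}

static void
guarded_exit(guarded_zvol_t *gz)
{
	atomic_fetch_sub(&gz->gz_active, 1);
}

/* Block new I/O, wait for in-flight I/O to drain, then free state. */
static void
guarded_teardown(guarded_zvol_t *gz)
{
	atomic_store(&gz->gz_closing, true);
	while (atomic_load(&gz->gz_active) != 0)
		;	/* spin; a real implementation would sleep */
	gz->gz_objset = NULL;
}

/* Write entry point: only touches gz_objset while inside the guard. */
static int
guarded_write(guarded_zvol_t *gz)
{
	int err = guarded_enter(gz);
	if (err != 0)
		return (err);
	assert(gz->gz_objset != NULL);
	guarded_exit(gz);
	return (0);
}
```

The ordering matters: the writer increments the counter before checking the flag, and teardown sets the flag before reading the counter, so a writer either backs out or is waited for.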

lundman commented 10 months ago

bc14612

Only addresses the immediate crash