Closed jawbroken closed 1 year ago
I believe we need to double the timeout on this case, and make it into a tunable, for the cases where more is needed. I can produce a build for testing in a bit.
Thanks, I really appreciate it. A curious datapoint: this started happening for me fairly regularly when I had to have my ZFS fileserver on wifi for a while (often when doing large file transfers, I think). I would also sometimes (randomly) end up with empty, zero-byte files when files were moved over SMB and have to redo the copy. Now that I've put it back on wired ethernet I haven't seen the same kind of issues yet. It doesn't make much sense to me, but I just thought I would note it.
I experienced something similar where Spotlight indexing, anything that's scanning disks (DaisyDisk), or heavy IO load (Xcode) seems to freeze the system and eventually trigger a panic.
Please let me know if there's anything I can do to help debug this!
Yes, I've disabled Spotlight on the ZFS drives for this reason. I had a previous bug on this but I couldn't work out if OpenZFS was actually at fault there so I closed it. The kernel panics there were different, and the backtraces didn't mention OpenZFS in the "Kernel Extensions in backtrace" section, but I suspected it was involved somehow.
Though I should note that my ZFS drives don't stay in the Spotlight Privacy part of System Preferences for some reason, so I disabled it by making a .metadata_never_index file at root level.
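For anyone wanting to try the marker-file approach, here is a minimal sketch. The function name never_index and the mount point /Volumes/tank are hypothetical examples, not from this thread; substitute your own pool's mount point, and prefix with sudo if the volume root isn't writable by your user.

```shell
# Create the Spotlight opt-out marker file at the root of a volume.
# Usage: never_index /Volumes/tank   (hypothetical mount point)
never_index() {
    vol="$1"
    if [ ! -d "$vol" ]; then
        echo "not a directory: $vol" >&2
        return 1
    fi
    # An empty file with this exact name is the marker.
    touch "$vol/.metadata_never_index"
}

# Afterwards, `mdutil -s /Volumes/tank` reports the indexing status for the volume.
```

Whether Spotlight honors the marker may vary between macOS releases, so it is worth checking the status after creating it.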
@jawbroken thanks for the tip! I've disabled spotlight indexing entirely by unloading the daemon:
sudo launchctl unload -w /System/Library/LaunchDaemons/com.apple.metadata.mds.plist
Adding a file to the directory seems a better idea!
Hi @lundman thanks for giving an update to this issue! Wondering will the pkg work for M1? Many thanks again!
The installer installed it successfully, but after rebooting, running zfs version showed the kext was not loaded. When I load it manually it gives me an error that the kext is not built for arm64e.
I can make you an arm64 build
amazing @lundman! Thanks so much will test it right away.
@lundman thanks for the arm64 build, but once I installed it, the kernel panicked and I couldn't manually load it either.
Not that it's surprising given the common element of "scanning drives or folders", but I've found that Backblaze's backup scanning may also trigger panics like these on my new M1 Ultra; I was having panics approximately every hour (whether I was sitting at the machine or not) until disabling Backblaze for now, and the machine has happily stayed running overnight.
After a long period of stability after putting the server back on ethernet and disabling Spotlight indexing, I'm now getting this kernel panic repeatedly while trying to move a large amount of data from one pool to another.
Now that I can reproduce this fairly easily (though it takes a random amount of time) by just copying a large directory from one pool to another in the Finder, I was able to catch one happening live and have some observations. I came back to the computer and the transfer had stopped making any progress, and I could see in Activity Monitor and zpool iostat that there was no activity on either pool. This persisted for about 5 minutes, at which point I tried opening a folder on one of the pools and it finally locked up and restarted. So I'm not sure that this is just a matter of increasing a timeout somewhere, because it seemed genuinely deadlocked somewhere (and perhaps the child task timeout was more of a symptom?). I'm going to try some of the tuning for stack pages mentioned in the other thread next, I guess.
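In case it helps others catch the stall as it happens, here is a rough sketch of watching for it from a terminal rather than by eye. It assumes the scripted-mode column order of zpool iostat -Hp (name, alloc, free, read ops, write ops, read bandwidth, write bandwidth); flag_idle and the pool names tank and backup are hypothetical.

```shell
# Print a line for every sample in which a pool shows no reads or writes.
# Expects `zpool iostat -Hp` style input:
#   name  alloc  free  read_ops  write_ops  read_bw  write_bw
flag_idle() {
    awk '($4 + $5 + $6 + $7) == 0 { print "idle: " $1 }'
}

# Example: sample both pools every 5 seconds, skipping the since-boot summary (-y).
# zpool iostat -Hpy tank backup 5 | flag_idle
```

A single idle sample is normal between bursts; it is a long unbroken run of idle lines during a copy that would match the stall described above.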
Just an update. I've used VirtualBuddy to set up a Ventura VM, and VB lets me disable SIP and enable 3rd party KEXTs. I believe that last good installer still works.
However, new compiles of the KEXT just don't load: it starts to boot (after Allow), then restarts the progress bar, boots, and points out something went wrong, and the KEXT needs Approving again.
Alas, there are no log entries or outputs to hint at what could be wrong. There is no -v boot option that I can find.
There is no indication that there is anything wrong with the kext before Approval, codesign ok, kextlibs ok etc. Just needs Approve.
This is rather frustrating, especially since this is the 3rd major OS release with exceptionally poor support in this area.
Possibly the next step would be to start from an empty sample kext, and add things to it one by one.
Sorry for the frustration, I appreciate the help. Let me know if there's anything I can do to help test or narrow down issues. I'm trying a copy with cp -pvR instead of the Finder now and it seems okay so far, with similar read/write bandwidth from the two pools.
I can definitely get further with cp versus copying in the Finder, but eventually it stops making progress (and then will kernel panic if I try to do something like open up the directory in Finder). When cp stops making progress, I can sample the process and see it is stuck inside a write call in libsystem_kernel.dylib. The cp process is still using 8% CPU and making a few thousand context switches a second. From the troubleshooting steps in the other thread, kstat.spl.misc.spl_misc.active_threads is stable at 514, its usual value. I just tried to launch Instruments to see if I could work anything else out and the computer kernel panicked.
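For reference, the two checks above could be gathered in one step with something like the sketch below. snapshot_stall and the /tmp output path are hypothetical; it assumes the macOS sample tool and that the OpenZFS kstats are exposed through sysctl under the name quoted above.

```shell
# Capture a user-space stack sample and the SPL thread count for a stuck process.
# Usage: snapshot_stall <pid>
snapshot_stall() {
    pid="$1"
    # macOS `sample`: sample the process for 5 seconds and write the report to a file.
    if command -v sample >/dev/null 2>&1; then
        sample "$pid" 5 -file "/tmp/stall-$pid.txt"
    fi
    # Print the SPL active thread count, or a note if the kstat isn't present.
    sysctl kstat.spl.misc.spl_misc.active_threads 2>/dev/null \
        || echo "kstat not available"
}
```

Writing the report to the boot volume rather than to a pool avoids touching the filesystem that appears to be stuck.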
I haven't tried increasing the stack page size yet because it's more difficult than I thought (I need to boot into the recovery partition, which you can't do remotely, and I don't have a monitor so I need to move the headless server and plug it into a TV or something).
Managed to complete the copy of 32 TB with rsync, which must have a different pattern of system calls.
lundman just released a new beta version for Monterey on arm64, perhaps you can give it a shot? https://github.com/openzfsonosx/zfs/issues/798
Previously reported on the forums here and here, but I couldn't find a bug tracking it. I've recently been seeing fairly regular kernel panics with similar backtraces (generally the result of Plex Media Scanner or Plex Transcoder).
Apologies again for lack of symbols, I haven't had any luck getting that to work following the varying instructions across the internet (but you can see one with symbols at the first link above).
Happy to help debug in any way, thanks.