openzfsonosx / zfs

OpenZFS on OS X
https://openzfsonosx.org/

panic on boot (1.7.2 & 1.8.2) #694

Open spamwax opened 5 years ago

spamwax commented 5 years ago

I was using 1.7.2 and everything was working just fine. Then I used App Store to apply Apple's security update (2019-01) and the panic started to happen.

On the first restart, I got the black screen message saying that the computer had to be rebooted due to an error, after which I can get to the login greeting. However, after a short period of time (15-45s) the reboot happens and I get the black screen message again.

So I decided to upgrade to 1.8.1 and still the same issue! The issue persisted even when I turned off all HDDs in the pools.

I used my TimeMachine backup to restore the system to the state before Apple's security update hoping it will fix the issue, which it didn't :( So at this point I'm stuck!

The kernel panic log as a gist is here.

Not sure if this is relevant, but I installed 1.8.2, yet the log above shows 1.8.1 versions for net.lundman.spl and net.lundman.zfs.

System info:
Model Name: iMac
Model Identifier: iMac14,2
Processor Speed: 3.70 GHz
Number of Processors: 1
Total Number of Cores: 6
L2 Cache (per Core): 256 KB
L3 Cache: 12 MB
Memory: 32 GB

System Version: macOS 10.13.6 (17G65)
Kernel Version: Darwin 17.7.0
Boot Volume: macOS
Boot Mode: Normal
Secure Virtual Memory: Enabled
System Integrity Protection: Enabled
Time since boot: 19 minutes

lundman commented 5 years ago

Can you please turn on keepsyms so we can see what is going on. If you want to boot without ZFS, just use https://openzfsonosx.org/wiki/Boot_loop

1.8.2 will announce as 1.8.1.

spamwax commented 5 years ago

I'm not familiar with how to set keepsyms! Do I need to build zfs from source, or is this something related to macOS? If there is a guide on how to do it, can you please link it?

I boot into safe/single user mode and use scripts provided by installer to fully remove zfs so I can boot into my Mac.

lundman commented 5 years ago

Ah sorry, run sudo nvram boot-args to see the current value, then add keepsyms=1 to it, like

sudo nvram boot-args="keepsyms=1"

It is described more clearly at https://openzfsonosx.org/wiki/Install#Initial_installation_from_source but you don't need "-v" unless you like it printing text while booting (like real hackers!) Just check what it is set to first, so you don't lose any setting you may already have (although clean Macs will have no boot-args set).

You do not need to compile, nor disable SIP. keepsyms just means it puts the function names in the panic report, rather than just the addresses like your report has now.
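The "check what it is set to first" advice can be sketched as a small helper that appends keepsyms=1 to whatever is already there. The function name `merge_boot_arg` is hypothetical, and it only builds the new string; it never calls nvram itself.

```shell
# merge_boot_arg CURRENT NEW
# Prints CURRENT with NEW appended, unless NEW is already present.
# Hypothetical helper for illustration; it does not touch NVRAM.
merge_boot_arg() {
    current=$1
    new=$2
    case " $current " in
        *" $new "*) printf '%s\n' "$current" ;;       # already set, keep as-is
        "  ")       printf '%s\n' "$new" ;;            # nothing was set before
        *)          printf '%s\n' "$current $new" ;;   # keep old args, add new one
    esac
}

# Possible use on the Mac. Assumption: `nvram boot-args` prints the name and
# the value separated by a tab, so `cut -f2-` keeps only the value.
#   current=$(nvram boot-args 2>/dev/null | cut -f2-)
#   sudo nvram boot-args="$(merge_boot_arg "$current" keepsyms=1)"
```

On a machine that boots through Clover, the same merged string would presumably go into Clover's boot arguments instead.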

spamwax commented 5 years ago

ok, I'll try this and report. I am using Clover for multibooting, so I am guessing I can add the boot arguments in Clover's starting screen.

side question: Since I am running a hackintosh, do you know if running sudo nvram boot-args="keepsyms=1" will have a side effect on the system?

lundman commented 5 years ago

Ah hmm, I have no idea about hackintosh. But I think you can add keepsyms=1 to the clover boot arguments

spamwax commented 5 years ago

@lundman Ok, I finally managed to get the symbols in crash reports: reboot 1 & reboot 2

Again, these happened after the system successfully boots and I can log in, but after a few seconds the reboot happens. At the time of these panics only 1 pool was connected.

Do you want me to try and run the system without any pool connected?

lundman commented 5 years ago

Looks like rottegift had a similar problem at one point: https://github.com/openzfsonosx/zfs/issues/521

although I do not think you are trying to remove the log device.

spamwax commented 5 years ago

hmmm... given the timeline of that issue I am guessing a fix is not available yet :)

If there is anything I can do to help with replication/logs please let me know. Otherwise I will go ahead and do a fresh install of entire macOS & see if that fixes my issue as I really need to have access to that pool on my Mac

lundman commented 5 years ago

The 6922 commit was reverted, so it is most likely not what the issue is here. It is not happy with the pool though. Have you tried the usual troubleshooting?

"import -N" to stop it from mounting, then mount datasets one by one.

"import -o readonly=on" to see if you can use it readonly.

"import -T txg-1": use zdb to find the last txg, then try to import the pool one txg earlier.

Fresh macOS will not help with the issue of the pool.
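The three import attempts suggested above could be written out roughly as follows. This is a dry-run sketch: the function name `import_attempts` is hypothetical, the pool name and txg are supplied by the caller, and the commands are only printed so they can be reviewed and run one at a time.

```shell
# import_attempts POOL LAST_TXG
# Prints (does not run) the three attempts in the order suggested above.
import_attempts() {
    pool=$1
    last_txg=$2
    # 1. Import without mounting; datasets can then be mounted one by one
    #    with `zfs mount POOL/DATASET`.
    echo "zpool import -N $pool"
    # 2. Import read-only.
    echo "zpool import -o readonly=on $pool"
    # 3. Import one txg earlier than the last txg reported by zdb.
    echo "zpool import -T $((last_txg - 1)) $pool"
}

# Example: import_attempts tank 100
```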

spamwax commented 5 years ago

ok, let me try these.

The reason I thought fresh install will help is because everything was just working and I hadn't run a scrub or anything like that on pools. Just installed that security update from Apple and everything went south!

Do I need to run import -N after installing o3x or I try to run that as soon as I log in before the reboot happens?

lundman commented 5 years ago

I did have a problem with that security update myself, and possibly it crashed and your pool is now in a bad state.

After installing ZFS, you should disable the automatic import on boot, which is a launchctl script that runs /usr/local/libexec/zfs/launchd.d/zpool-import-all.sh. So either unload the launchctl plist, or rename the zpool-import-all.sh script out of the way; that stops it from automatically importing the pool.

Then you can try various things - and yes, before reboot is ok after install.
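The "rename the script out of the way" option could look like the sketch below. The script path is the one quoted above; the `.disabled` suffix and the function names are arbitrary choices for illustration. The launchctl alternative is left out because the plist path is not given in this thread.

```shell
# Path quoted in the comment above.
IMPORT_SCRIPT=/usr/local/libexec/zfs/launchd.d/zpool-import-all.sh

# Print the command that moves the import script aside.
# The ".disabled" suffix is an arbitrary choice for this sketch.
disable_auto_import_cmd() {
    echo "sudo mv $IMPORT_SCRIPT $IMPORT_SCRIPT.disabled"
}

# Print the command that puts the script back.
enable_auto_import_cmd() {
    echo "sudo mv $IMPORT_SCRIPT.disabled $IMPORT_SCRIPT"
}
```

Both functions only print the command, so nothing changes until the printed line is run by hand.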

spamwax commented 5 years ago

I disabled the launchctl script and tried all of the suggested troubleshooting steps.

both import -N & import -o readonly=on caused a panic after about 30 seconds while the command (sudo zpool import -N pool) hadn't finished & returned.

I then got txg number by running sudo zdb -l /dev/rdisk0s1 and used it in import -T, and same thing happened.
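Pulling the txg out of the label dump can be scripted along these lines. This is a minimal sketch that assumes the label text contains lines of the form `txg: N`; the function name `latest_txg` is hypothetical.

```shell
# latest_txg: read `zdb -l` style label text on stdin and print the highest
# value found on lines whose first field is exactly "txg:".
# Lines such as "create_txg: 4" are ignored because the first field differs.
latest_txg() {
    awk '$1 == "txg:" && $2 + 0 > max { max = $2 + 0 }
         END { if (max) print max }'
}

# Possible use, with the device from the comment above:
#   sudo zdb -l /dev/rdisk0s1 | latest_txg
```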

Should I try to import with lower txg numbers, say, txg-2?

How did you manage the issue with that security update?

lundman commented 5 years ago

Yes, you can try txg-2, and maybe go as far back as -10. But use it with readonly so you don't make changes to the pool while trying to roll back.
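Stepping back through earlier txgs in read-only mode could be sketched like this. The function name `rollback_attempts` is hypothetical, and the commands are printed rather than executed, since each attempt may panic the machine and should be tried one at a time.

```shell
# rollback_attempts POOL LAST_TXG [DEPTH]
# Prints one read-only import command per earlier txg, from LAST_TXG-1 down
# to LAST_TXG-DEPTH. DEPTH defaults to 10, the limit suggested above.
rollback_attempts() {
    pool=$1
    last=$2
    depth=${3:-10}
    i=1
    while [ "$i" -le "$depth" ]; do
        echo "sudo zpool import -o readonly=on -T $((last - i)) $pool"
        i=$((i + 1))
    done
}

# Example: rollback_attempts tank 5987652
```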

With the security update, I had to use an APFS snapshot to roll back to before the update, then do the update again. The pool was just a test pool, so I had no issues destroying it.

spamwax commented 5 years ago

going down to -10 didn't work.

I can attach my pool to a Linux machine. Is there anything I should know in order to try & fix the pool on the new machine?

lundman commented 5 years ago

No? Don't run "zpool upgrade" on Linux or you can't import on OSX again.

spamwax commented 5 years ago

@lundman So I could successfully mount the pool on Linux in readonly mode.

However when I do zdb -l /dev/sdb, I get failed to unpack label 0 through label 3

Do I need to do anything in Linux before I try to attach the pool back to macOS?

Linux is Manjaro with zfs package version 0.7.13-1. zpool status shows no error and I can browse the pool in readonly mode.

lundman commented 5 years ago

that is good news - so at the very least you can get data off it in an emergency. It could be that if you import it fully and export, it will write the labels properly, and work again on osx?

spamwax commented 5 years ago

ok, so I just did the export/import in non-readonly mode on Linux.

zdb -c didn't return any glaring issue.

However, the txg numbers I am getting in Linux are different from the Mac's (5987721 vs 5987652). I am guessing it has to do with importing/exporting the pool in Linux, maybe?

Should I apply the Apple's security patch before trying to use the pool in Mac?

lundman commented 5 years ago

The txg will tick upwards when it's imported. A txg sync happens when there is enough data, or changes, or sufficient seconds since the last txg; then it rolls one more.

Security fix before or not is up to you, but I would export all pools before starting it.

spamwax commented 5 years ago

ok, thanks for all your help. I'll try the mac & report back.

Just curious why macOS was unable to import the pool! Is it OS related, or a zfs difference between those platforms?

lundman commented 5 years ago

we are struggling to catch up with ZOL's high turnover of commits; perhaps there is something in vdev coming up

spamwax commented 5 years ago

hmmm, I wish I was versed enough in magics of zfs to help :)

spamwax commented 5 years ago

@lundman Unfortunately the same panic happens after I moved the pool from Linux.

While the problematic pool is attached to macOS, quickly creating a pool on a USB & trying to import/export that specific pool takes a really long time!

Since I can attach this to Linux, what's the best way of cloning the pool before I destroy it?

Should I just do the typical zfs send -> zpool destroy -> zpool create -> zfs receive?

JMoVS commented 5 years ago

making a recursive snapshot (-r) and then using zfs send with the -R option (I would suggest putting -ceL in the mix as well to speed up the process) would be a good way, yes.

"Should I just do the typical zfs send -> zpool destroy -> zpool create -> zfs receive?"

This one I didn't understand. I'd create the new pool, zfs send | zfs recv from the old pool into the new pool, and then only after it successfully sent the pool's data would I destroy the old pool.
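The order described above (snapshot, replicate into the new pool, destroy the old pool only afterwards) might look like the following dry-run sketch. The function name `clone_pool_cmds` and the pool and snapshot names are placeholders, and the `-F` on the receiving side is an assumption for receiving into the root of a freshly created pool.

```shell
# clone_pool_cmds OLD NEW SNAP
# Prints the commands for a recursive snapshot of OLD followed by a
# replicated send into the already-created pool NEW. Nothing is executed.
clone_pool_cmds() {
    old=$1
    new=$2
    snap=$3
    echo "zfs snapshot -r $old@$snap"
    # -R sends all descendant datasets; -ceL is the speed-up suggested above.
    echo "zfs send -R -ceL $old@$snap | zfs recv -F $new"
    echo "# destroy $old only after the receive has finished successfully"
}

# Example: clone_pool_cmds oldpool newpool migrate
```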

spamwax commented 5 years ago

Thanks for hints, I didn't know about -ceL

Since I didn't have access to a large enough pool, I had to dump the zfs send's stream to a file, destroy/re-create the original one and then zfs receive from the dumped file.
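The file-based variant described here could be sketched as below. The function name `dump_and_restore_cmds`, the snapshot name, and the file path are placeholders, and the commands are printed rather than run because the dump file is the only copy of the data once the pool is destroyed.

```shell
# dump_and_restore_cmds POOL SNAP FILE
# Prints the send-to-file and receive-from-file commands, with reminders
# for the destroy/create step in between. Nothing is executed.
dump_and_restore_cmds() {
    pool=$1
    snap=$2
    file=$3
    echo "zfs send -R -ceL $pool@$snap > $file"
    echo "# check that $file is complete before destroying $pool"
    echo "# destroy and re-create $pool here (vdev layout is up to you)"
    echo "zfs recv -F $pool < $file"
}

# Example: dump_and_restore_cmds tank migrate /mnt/backup/tank.zsend
```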

spamwax commented 5 years ago

I can now import the pool in macOS but only in readonly mode:

This pool uses the following feature(s) not supported by this system:
    org.zfsonlinux:userobj_accounting
All unsupported features are only required for writing to the pool.
The pool can be imported using '-o readonly=on'.
cannot import 'Virtuals': unsupported version or feature

I am guessing this happened since the pool was created on Linux. I am gonna try creating the pool on macOS and then doing a zfs recv over network from the dumped file.

lundman commented 5 years ago

Yeah, you can also create on linux with "-g" "zpool create -g" which disables all features, then add features back that work on both platforms (so everything except userobj_accounting)

spamwax commented 5 years ago

I couldn't find -g option in man pages for zpool, did you mean -d?

How do I get a comprehensive list of features? I don't want to copy/paste names from man zpool-features :)

lundman commented 5 years ago

Hah yes, I did mean "-d". You can create a pool, then run zpool upgrade -v to list which features are supported.
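Creating with -d and then enabling only the features both platforms support could be scripted roughly as follows. This is a sketch: the function name `portable_feature_flags` is hypothetical, and the feature names have to come from your own systems (for example from the supported-features list that `zpool upgrade -v` prints on the Mac).

```shell
# portable_feature_flags: read feature names (one per line) on stdin and
# print "-o feature@NAME=enabled" for each, skipping userobj_accounting,
# the feature that blocked the read-write import in this thread.
portable_feature_flags() {
    flags=
    while IFS= read -r name; do
        case $name in
            ''|userobj_accounting) continue ;;   # skip blanks and the Linux-only feature
        esac
        flags="${flags:+$flags }-o feature@$name=enabled"
    done
    printf '%s\n' "$flags"
}

# Possible use on Linux (pool and device names are placeholders):
#   zpool create -d $(portable_feature_flags < features.txt) newpool /dev/sdX
```

Keeping the feature list in a file makes it easy to compare against what the macOS side reports before the pool is created.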

spamwax commented 5 years ago

ok, thanks. So I only need to re-enable the new features from the output of the zpool upgrade -v command & not the (longer) list of "legacy versions"?

lundman commented 5 years ago

The numbers? No, the version is 5000 now and there is no reason to create an ancient pool.

spamwax commented 5 years ago

Closing the issue since I could import the Linux-created pool back in macOS using above suggestions. Thanks again.

spamwax commented 5 years ago

@lundman Unfortunately, enabling automatic import of zpool at startup caused the same corruption & kernel panic. I should mention that before enabling automatic import, I could reliably import/export my pools.

lundman commented 5 years ago

Can I have a panic dump with symbols, in case there are some new hints?

spamwax commented 5 years ago

here it is.

lundman commented 5 years ago

That is a little different, but most likely related. So you created a new pool with ZOL, and it was working, but will now panic on import?

Can you share "diskutil list" output? I wonder if it is related to another issue where USB stick is identified as FAT and Apple pukes on it.

spamwax commented 5 years ago

This is the output of diskutil

After the original panics, I created a zpool in Linux and used the backup I had to restore the data. Then I moved the pool to macOS where auto import of pools on startup was disabled. I could use the pool by manually importing/exporting it.

Then I enabled the auto import & restarted the machine, and the panic happened.