openzfsonosx / zfs

OpenZFS on OS X
https://openzfsonosx.org/

Desktop freezes while using zfs volume as home directory. #57

Closed brendonhumphrey closed 10 years ago

brendonhumphrey commented 11 years ago

I suspect this is going to be an ongoing discussion for a bit.

Background: testing a zfs filesystem as a user's home directory.

Configuration: MacBook Pro 2011, 10.8.5, 8 GB RAM, Samsung EVO 256 SSD as main drive, with a ~50 GB zfs partition as home directory (utilisation <10 GB). Also an iMac 2009, 10.8.4, 2 TB internal HDD, 12 GB RAM.

Experience: in normal usage the laptop is fine for quite a while. After a period of use, certain applications (games, iPhoto, iTunes and "top") start to misbehave: they usually terminate with an error, and top reports a libtop error when refreshing its output. Typically, if one persists with using the machine, one gets the spinning beachball of death and has to do a hard restart.

Repeatability: easy, multiple occurrences in a day.

I have been trying to isolate a cause: Experimenting with high memory and disk loads.

Have discovered that it is possible to trigger early failure conditions in iPhoto by repeatedly opening, closing and navigating around a small photo collection. Typically iPhoto will either crash on start or refuse to open a photo (and crash) after a while. Cores typically point somewhere in the video driver or in CoreImage/some other framework. Not reproducible on the same machine with a clone of the iPhoto database on an HFS+ partition. Typically once an app fails it misbehaves permanently. Sometimes waiting a bit restores functionality, but only temporarily. Once one app fails, it's all downhill from there.

My initial theory was a memory leak, based on massive wired memory consumption. I have since convinced myself that this is not the case, as I now understand the arc stats sysctl output a bit better.

My last move was to restrict the arc to 25% of the machine's memory rather than the code default of 50%, just to see if the amount of memory allocated was an issue, as I had noticed that disk I/O tends to become slow once the arc is full.
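To put numbers on those two caps for the machines listed above, here is a minimal sketch. The helper name arc_cap_bytes is made up for illustration, and the 50% and 25% fractions are simply the figures quoted in this comment, not values read from the ZFS source.

```python
GIB = 1024 ** 3

def arc_cap_bytes(ram_gib, fraction):
    """Return an ARC size cap in bytes for a machine with ram_gib GiB of RAM."""
    return int(ram_gib * GIB * fraction)

# RAM sizes from the configuration above: 8 GB laptop, 12 GB iMac.
for name, ram in (("macbook pro", 8), ("imac", 12)):
    default_cap = arc_cap_bytes(ram, 0.50)  # code default quoted above
    reduced_cap = arc_cap_bytes(ram, 0.25)  # restricted cap used in the test
    print(f"{name}: default {default_cap / GIB:.1f} GiB, "
          f"reduced {reduced_cap / GIB:.1f} GiB")
```

So the reduced cap would leave roughly 2 GiB of arc on the laptop and 3 GiB on the iMac, assuming the cap is taken against total physical memory.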

On two test runs, untarring the same tar file as in my other report, the machine started showing the aberrant behaviour described above. I was running tar -xvf and could see that the untar stopped making progress after being slow for a while. Before the machine froze I was able to verify with iostat 1 that there was no fs activity. The machine hung, requiring a cold start.

Next test was to set primarycache and secondarycache on both machines to "none". Obviously filesystem I/O became a lot slower, but there have been no stalls leading to beachball and hard restart. Instead I have experienced the panic from the other report plus a panic in the lz4 code. The laptop is being used for normal desktop use and has not panicked at all.

I think there may be an issue somewhere in the arc, and will continue efforts to try and isolate this, using the kerneldebugtools if you would like.
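For anyone repeating this test, the two properties are set per dataset with zfs set. The sketch below only builds and prints the commands; the dataset name tank/home is a placeholder, not the one used on these machines, and the helper name is made up.

```python
import shlex

def cache_off_commands(dataset):
    """Build the zfs commands that set primarycache and secondarycache to none."""
    return [
        ["zfs", "set", f"{prop}=none", dataset]
        for prop in ("primarycache", "secondarycache")
    ]

# "tank/home" is a placeholder dataset name (assumption, not from this report).
for argv in cache_off_commands("tank/home"):
    print(shlex.join(argv))
```

To apply the settings rather than print them, each argv list could be passed to subprocess.run(argv, check=True) from a privileged account. Setting the properties back to "all" restores the default caching.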

brendonhumphrey commented 11 years ago

Bad news (I guess). The laptop has just experienced a SPOD. It was sitting idle, and sometime in the last couple of hours it failed. So, I think the cache is a red herring. So much for that.

I am going to install the kernel debug kit on the iMac, as it's on 10.8.4 and I can get that one.

brendonhumphrey commented 11 years ago

Stack dumps taken after symptoms started: iPhoto crashing, unable to FUS or exit to login. Machine had to be shut down.

https://www.dropbox.com/s/qm06lwq94q8rsy6/desktop_hang.zip

brendonhumphrey commented 11 years ago

This one occurred after heavy I/O (rm of several massive directories), idle overnight, and then a FUS from the admin account to my zfs user's login. I don't have the automount .fs thing installed on this machine so can't avoid the FUS.

Note the top output: the machine has 25% CPU in kernel and nothing in userspace.

Sorry about the lack of zfs symbols. spl worked ok, but no go on zpl this time.

https://www.dropbox.com/s/72l8hs057g157nx/spod_on_fus.rtf

brendonhumphrey commented 10 years ago

Next bunch of logs.

Observations:

Freeze while copying ~380 GB from point a to point b within the zfs home directory tree while logged in as the user.

Started the copy using Finder (Command-C, Command-V).

Progress was slow from the outset: 60 GB copied after ~2 hours. Desktop was non-responsive at that point. Walked away, hoping to get cores in the morning. It made a further 30 GB of progress overnight. In the morning I thought the machine had completely locked up as it was not responsive to ssh. However, it proved to be just really slow.

Observed high CPU load: 25% attributed to system, 0% to user.

Ran spindump a few times. Responded ok.

Walked away for an hour and ran spindump again. It took 30 minutes to complete, and I thought it had locked up entirely.

Used dtrace to panic the machine (20 mins to respond).

Got thread and stack dumps from remote dbx.

All attached via dropbox.

https://www.dropbox.com/s/yk3vs72by9kerld/desktop_hang2.zip
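For scale, the 60 GB in ~2 hours quoted above can be turned into an average rate. This is just arithmetic on the figures in this comment, using decimal units; the overnight figure is left out because the elapsed time was not recorded.

```python
def mean_rate_mb_s(gigabytes, hours):
    """Average transfer rate in MB/s (decimal units: 1 GB = 1000 MB)."""
    return gigabytes * 1000 / (hours * 3600)

# 60 GB copied in roughly 2 hours, as reported above.
print(f"{mean_rate_mb_s(60, 2):.1f} MB/s")  # prints 8.3 MB/s
```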

grahamperrin commented 10 years ago

Discussion: OpenZFS ZFS-OSX: reduction of bug(s) that affect some uses of home directories (although the attachments offered in comments above may be far more useful than my own suggestions).

brendonhumphrey commented 10 years ago

Next hang incident. MacBook Pro. Zfs home dir on external Thunderbolt drive. User logged into that home account. Running Safari and X-Chat. Started a 380 GB copy from a server via AFP.

Safari locked up quickly with the spinning beachball of death. The Dock may have been non-responsive (not 100% sure). I was still able to interact with the terminal program, including creating new windows, so not a total desktop freeze.

Logs, spindumps, vnop logging hung and unhung included.

https://www.dropbox.com/s/8dov8ar2ozo1wnp/safari_hang_maybe_dock_vnop_logs.zip

brendonhumphrey commented 10 years ago

Another incident:

Copying a large amount of data from point a to point b on the zfs homedir. Was running zpool iostat 1 in a console window. The copy was proceeding (40+ GB completed), then all zfs throughput dropped to zero read and zero write as indicated by iostat. Desktop had the spinning beachball of death. Could not ssh in remotely (waited 5 minutes). Was going to wait for an extended time to see if the machine responded eventually, but we had a power cut overnight. So much for that!
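The "throughput dropped to zero" symptom could be flagged automatically from logged per-second samples. The following is a rough sketch only: it assumes the read and write bandwidth figures have already been extracted into numeric pairs (parsing the zpool iostat output is not shown), and the function name and sample values are made up.

```python
def first_stall(samples, min_run=5):
    """Return the index where the first run of min_run idle samples starts, else None.

    samples is a sequence of (read_bytes_per_s, write_bytes_per_s) pairs,
    one per reporting interval.
    """
    run_start = None
    for i, (read_bw, write_bw) in enumerate(samples):
        if read_bw == 0 and write_bw == 0:
            if run_start is None:
                run_start = i          # first idle sample of a new run
            if i - run_start + 1 >= min_run:
                return run_start       # run is long enough to call a stall
        else:
            run_start = None           # any activity resets the run
    return None

# Made-up samples: a copy in progress, then throughput drops to zero.
samples = [(0, 40_000_000), (1_000_000, 35_000_000), (0, 0), (0, 0), (0, 0)]
print(first_stall(samples, min_run=3))  # prints 2
```

The min_run threshold matters because a single idle interval is normal between transaction group writes; only a sustained run of zeros during an active copy matches what was seen here.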

brendonhumphrey commented 10 years ago

The issue has either not been seen or is much reduced since the memory manager changes and some locking fixes were implemented. I will close this now and report any further specific instances if they occur.