ylluminate closed this issue 8 years ago.
I stepped back to the 1.5.2 release and am still seeing the same kind of hangs, although it hasn't been long enough to see whether the initial message above is reported on reboot.
Those are harmless messages.
Also, did you pick up the SPL change that was made ~24 hrs ago?
No, I've not grabbed those. Let me take a look. I went back to the older driver in hopes things would improve, but to no avail. What concerned me was the fact that it wasn't releasing after long periods of time and some degree of inactivity. I probably should make some kind of a shell script that forces it to feel the pressure and release...
I see. “An hour since last reap” is behaving normally; that is not changing for now. It's just idle chitchat in the logs.
Brendon
If you're keen, try the code in the SPL branch 'vmem_experiments'; we would like some feedback as to how that works for you.
Absolutely. Can you tell me what to expect and what the goal is? From the look of the commits it seems fairly stable, so I don't think it's likely to produce panics, is it?
You should expect :rainbow: :bouquet: :sunny:
/me switches to :godmode:
HERE WE GO!
I've been running it for a month without a reboot and it seems OK. It just alters some allocator details and maybe works better; I don't want to cloud your judgement by making claims. Do whatever you did to create your performance stalls, or whatever they were.
Did you try it?
I've actually got a heavy copy process going on it right now that should have prompted the issue I was having earlier. I've cranked it up and have 6 rsyncs going full bore, and while it has maxed out my memory usage, my system is still semi-usable. I have two different volumes copying via these multiple threads onto the raid-z. So yeah, so far so good... I wish I could cut down the memory consumption further, though. Like I said, it's consuming everything and using swap at the moment on a 64GB system (htop is showing kernel_task at 103 GB virt).
OK, sounds promising. If you want to limit memory use, why don't you set the ARC max?
Well, that's what is confusing, @brendonhumphrey: I do have kstat.zfs.darwin.tunable.zfs_arc_max=8589934592 set in /etc/zfs/zsysctl.conf, yet that still happened.
That is a similar setting to what I have, and the machine is obediently keeping memory use in check. You haven't got any other settings in zsysctl.conf, have you? I accidentally inherited something from ilovezfs' example script that blocked zfs_arc_max from working. I'm not near the machine, so I can't quote the exact setting.
You could also do sysctl -a | grep spl. One of the sysctls (there's only a few, so it should be obvious which one) tells you how much memory we have allocated in total. You can use that to check that it is O3X that is occupying the memory.
Hmm, that's the only line in zsysctl.conf, and yes, I do use @ilovezfs's script, so I must have also inherited that setting somewhere... Whatever may be doing that is not readily popping out at me from the settings, so I'll hang tight until you can pull it up...
OK, you really don't want this setting (assuming you pulled it in from the .example file):
# 10 Mar 2015; ilovezfs
# Raise zfs_arc_meta_limit to 3/4 (instead of 1/4) of zfs_arc_max.
# 3/4 * (11 * 2^30) = 8,858,370,048
# But let's use hexadecimal this time.
# 8,858,370,048 = 0x210000000
kstat.zfs.darwin.tunable.zfs_arc_meta_limit=0x210000000
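As a sanity check (my arithmetic, not part of the thread), the hex constant in that .example comment does work out to 3/4 of an 11 GiB zfs_arc_max:

```shell
#!/bin/bash
# Decimal value of the hex constant from the .example file:
echo $((0x210000000))          # prints 8858370048
# 3/4 of an 11 GiB arc max (11 * 2^30 = 11811160064 bytes):
echo $((3 * 11811160064 / 4))  # prints 8858370048
```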
Also, kstat.zfs.misc.arcstats.c_max is the actual ARC max, i.e. check it after setting the tunable.
Interesting, so essentially I should use:
kstat.zfs.darwin.tunable.zfs_arc_meta_limit=0x210000000
instead of:
kstat.zfs.darwin.tunable.zfs_arc_max=8589934592
Or should BOTH be in there? What's the purpose of this new setting compared with the old, with respect to this "meta" limit vs. the actual limit?
Ah, no. Sorry if I wasn't clear: if you have kstat.zfs.darwin.tunable.zfs_arc_meta_limit=0x210000000, then kstat.zfs.darwin.tunable.zfs_arc_max=8589934592 won't work.
You can verify what arc max is after setting the tunable by checking kstat.zfs.misc.arcstats.c_max.
Also, the total memory in use is kstat.spl.misc.spl_misc.os_mem_alloc.
Alright, so at present I only have kstat.zfs.darwin.tunable.zfs_arc_max=8589934592, so I am removing that, inserting kstat.zfs.darwin.tunable.zfs_arc_meta_limit=0x210000000, and will monitor the levels. This will be the only line in zsysctl.conf. Will keep you posted.
I guess that last setting must have been the wrong move, since the kernel_task memory levels are already going well past the threshold very rapidly: it's up to ~20GB after running rsync for just a few minutes.
Try this script.
DON'T adjust to taste, don't mess around. Tell us the results of running EXACTLY this as superuser as soon as you can after boot time, assuming your recent problematic workload eventually causes problems again.
#!/bin/bash
export PATH=/usr/bin:/bin:/sbin:/usr/sbin
sysctl -w kstat.zfs.darwin.tunable.zfs_arc_min=$((512*1024**2+1*1024**3))
sysctl -w kstat.zfs.darwin.tunable.zfs_arc_meta_min=$((256*1024**2+1*1024**3))
sysctl -w kstat.zfs.darwin.tunable.zfs_arc_meta_limit=$((5*1000**3))
sysctl -w kstat.zfs.darwin.tunable.zfs_arc_max=$((11*1024**3))
sysctl -w kstat.zfs.darwin.tunable.zfs_dirty_data_max=$((512*1024**2))
IF you see the same problem after running this as superuser near boot-time, then paste in the exact output of
sysctl kstat.spl kstat.vmem.vmem.kmem_va.mem_inuse kstat.vmem.vmem.kmem_default.mem_inuse kstat.vmem.vmem.kmem_oversize.mem_inuse kstat.vmem.vmem.heap.mem_inuse; sysctl vm | egrep '^vm.page_(free_(wanted|count)|speculative|purgable_count)'
(and surround the result in a pair of lines containing three backticks for the sake of readability).
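For reference (my arithmetic, not part of the original script), the shell arithmetic expansions in the script above evaluate to these byte counts:

```shell
#!/bin/bash
# Byte values the script above actually writes to each tunable:
echo $((512*1024**2 + 1*1024**3))  # zfs_arc_min        = 1610612736  (1.5 GiB)
echo $((256*1024**2 + 1*1024**3))  # zfs_arc_meta_min   = 1342177280  (1.25 GiB)
echo $((5*1000**3))                # zfs_arc_meta_limit = 5000000000  (5 GB, decimal)
echo $((11*1024**3))               # zfs_arc_max        = 11811160064 (11 GiB)
echo $((512*1024**2))              # zfs_dirty_data_max = 536870912   (512 MiB)
```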
So, FYI, I've done the following:
Created /etc/zfs/zsysctl.sh (made executable with sudo chmod +x) with these contents:
#!/bin/bash
export PATH=/usr/bin:/bin:/sbin:/usr/sbin
sysctl -w kstat.zfs.darwin.tunable.zfs_arc_min=$((512*1024**2+1*1024**3))
sysctl -w kstat.zfs.darwin.tunable.zfs_arc_meta_min=$((256*1024**2+1*1024**3))
sysctl -w kstat.zfs.darwin.tunable.zfs_arc_meta_limit=$((5*1000**3))
sysctl -w kstat.zfs.darwin.tunable.zfs_arc_max=$((11*1024**3))
sysctl -w kstat.zfs.darwin.tunable.zfs_dirty_data_max=$((512*1024**2))
And created /Library/LaunchDaemons/com.openzfsonosx.zsysctlupdate.plist with these contents:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>com.openzfsonosx.zsysctlupdate</string>
<key>ProgramArguments</key>
<array>
<string>/etc/zfs/zsysctl.sh</string>
<string>/etc/zfs</string>
</array>
<key>RunAtLoad</key>
<true/>
<key>KeepAlive</key>
<false/>
</dict>
</plist>
Well, you could just have run it by hand (sudo scriptname) when you get a login prompt, but it should (should!) work when executed early by launchd.
Make sure it does run via launchd; you can use the StandardErrorPath and StandardOutPath keys in your plist, and also grep SPL /var/log/system.log and check the output of "sysctl kstat.zfs.darwin.tunable" to make sure the values reflect the script.
Unfortunately if you're using launchd you may have a race with org.openzfsonosx.zconfigd.
(It's perfectly fine to run it by hand before there are more than 2GB in your ARC, and certainly before you start running your problem-reproducing workload. Running it by hand gives you confidence that the tunable values have been updated by the script.)
Yes, understood. So far it's staying a lot more sane...
If it's making a difference this soon with a workload like "6 rsyncs going full bore" -- and on the assumption that each rsync is writing to a different dataset or zvol -- then the issue is probably that you're generating an enormous amount of dirty data (limited by the smallest of 80% of system memory, ~10% maximum per active target dataset, or 4 GiB per target dataset or zvol), which takes time to drain, during which your reads into the ARC will stall and the ARC cannot shrink back in response to userland pressure.
The "smallarc" script I gave you leads to a maximum of 512 MiB of dirty data in the ARC per target dataset or zvol, summing to a maximum of about 10 GiB total. You can even consider dialling down kstat.zfs.darwin.tunable.zfs_dirty_data_max further (e.g. to 128 MiB) if your write loads will always be throughput-sensitive rather than latency-sensitive.
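If you go that route, a sketch like this computes the 128 MiB value (my example; the tunable name is the one used throughout this thread, and applying it requires root on a machine with O3X loaded):

```shell
#!/bin/bash
# 128 MiB expressed in bytes, the value you would hand to zfs_dirty_data_max:
DIRTY_MAX=$((128*1024**2))
echo "$DIRTY_MAX"   # prints 134217728
# To apply it (run as root on the O3X machine):
#   sysctl -w kstat.zfs.darwin.tunable.zfs_dirty_data_max=$DIRTY_MAX
```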
(However, if you have only one target zvol (or dataset) for all six rsyncs, it could be something else is going on in addition to huge amounts of dirty data.)
I have been noticing that my ZFS volume has become somewhat unresponsive, and upon rebooting I received the following messages:
I'm a little concerned because even after rebooting I am still running into some hangs in the middle of copying some files. I'm using the most recent source from master at present.