@jimmyH Just getting close to the quota limit shouldn't cause ZFS to slow down, and I'd only expect fragmentation to become a major issue if the pool (not the dataset) was running low on free blocks. Were you able to characterize what kind of IO was being issued to the pool during this, small reads or small writes?
@behlendorf The pool was only 25% full and most of the IO was writes - I did not check what size they were.
I ran some tests on an empty dataset with a variety of blocksizes, and in all cases the write speed plummets once you get close to the quota. For example:
zfs create zfs2/test
zfs set compression=off zfs2/test
cd /zfs2/test
# async writes without a quota
dd if=/dev/zero of=zero bs=$((128*1024)) count=400; rm zero
52428800 bytes (52 MB) copied, 0.0364067 s, 1.4 GB/s
# sync writes without a quota
zfs set sync=always zfs2/test
dd if=/dev/zero of=zero bs=$((128*1024)) count=400 ; rm zero
52428800 bytes (52 MB) copied, 5.02718 s, 10.4 MB/s
zfs set sync=disabled zfs2/test
# async writes with a quota
zfs set refquota=10G zfs2/test
dd if=/dev/zero of=zero bs=$((128*1024)) count=400 ; rm zero
52428800 bytes (52 MB) copied, 0.0392953 s, 1.3 GB/s
# async writes with a smaller quota
zfs set refquota=1G zfs2/test
dd if=/dev/zero of=zero bs=$((128*1024)) count=400 ; rm zero
52428800 bytes (52 MB) copied, 2.43307 s, 21.5 MB/s
# async writes with an even smaller quota
zfs set refquota=100M zfs2/test
dd if=/dev/zero of=zero bs=$((128*1024)) count=400; rm zero
52428800 bytes (52 MB) copied, 16.8641 s, 3.1 MB/s
What really surprised me is that writing 50MB to an empty dataset with a 1GB quota is so slow.
Digging through the code, it looks like ZFS calculates the worst-case disk usage as 24x the written size (spa_get_asize()), so it estimates that 50MB will take 1200MB on disk. I have not looked further in the code to see what it does when this estimate exceeds the quota.
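For illustration, the arithmetic behind that estimate (assuming the fixed 24x inflation factor described above; nothing here is measured):

write_mb=50
inflation=24        # assumed worst-case inflation factor from spa_get_asize()
echo "estimated worst-case usage: $((write_mb * inflation)) MB"   # 1200 MB, well over a 1G refquota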
@jimmyH What's happening is that the transaction group is being committed earlier. In my tests, the 50MB still fits into a single txg, but it's being flushed during program exit in order to enforce the reservation. The effect of this behaviour would seem to be that writes slow down as a filesystem fills, due to more frequent txg flushes.
Clearly the example of un-tarring lots of small files would be almost a worst-case scenario.
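If anyone wants to confirm the extra txg syncs on ZoL, something along these lines should work (assuming your build exposes the txg history kstat; the paths are the usual /sys/module/zfs and /proc/spl locations):

echo 100 > /sys/module/zfs/parameters/zfs_txg_history   # keep a history of the last 100 txgs
cat /proc/spl/kstat/zfs/zfs2/txgs                       # per-txg timing and write sizes for pool "zfs2"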
We are experiencing the same issue and are forced to work around it with userquotas. We also see that if people run out of quota while accessing the filesystem over NFS, the load on the fileserver shoots up, and it can only be cleared by extending the quota. Is there a good solution to this?
This is still a serious issue for me - it pretty much locks up the fileserver when it occurs.
Currently my only workaround is to set a groupquota smaller than the filesystem size. All access is via NFS or SMB, so I can ensure that all files are owned by the same group. However, even this causes issues: when the group quota is exceeded the error returned is EDQUOT, which some apps handle incorrectly. Most annoying is unzip, which on receipt of EDQUOT tells the user that the archive extraction was successful.
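For reference, the workaround looks roughly like this (dataset and group names are placeholders):

zfs set refquota=10G pool/fs            # the real cap on the filesystem
zfs set groupquota@users=9G pool/fs     # group cap slightly below it, so writers hit EDQUOT before the refquota throttling kicks in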
I'll note that this behavior is tied into the write throttle, and the relevant code has therefore changed quite a bit as part of the write-throttle rework currently being evaluated for ZoL in nedbass/zfs@3b0110e. I've been meaning to run some tests with this code but haven't yet had a chance. It might be worth trying to see whether the behavior of reverting to very small txgs as the quota is approached has changed.
I'll also note that the inflation factor has been changed to a global variable (default still 24), which is settable with mdb under Illumos, but in the ZoL patch it has not (yet) been converted to a tunable module parameter. Presumably this was done to allow it to be reduced in situations where the default is known to be too large.
I agree, we should verify this is still an issue with the latest code and the updated write throttle.
It seems there are still issues in this area in 0.6.4.2; I can reproduce the write performance problems with bonnie++ when approaching quota limits.
More info in this thread: http://list.zfsonlinux.org/pipermail/zfs-discuss/2015-July/022730.html
Today I reproduced this simply with:
zfs create pool/dataset
zfs set compression=off pool/dataset #(just in case)
zfs set refquota=1M pool/dataset
cd /pool/dataset
n=1
while true; do time touch blah$n; let n++; done
Each run of the loop starts off at about 0.1s (which is at least 100x longer than it should take), and after 30 or so runs it goes up to an average of about 0.4s per loop, with spikes up to 0.6s. Remove the quota or set it to 1G, and it's 0.000s - 0.001s (with rare spikes up to 0.006s).
And during the test, the disks look very overloaded (it's a 4 x 9 disk raidz2, and it's only 50% full):
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 5.00 120.00 0.07 0.54 9.92 0.10 0.77 1.60 0.73 0.77 9.60
sdc 0.00 0.00 4.00 116.00 0.05 0.61 11.20 0.14 1.13 0.00 1.17 1.10 13.20
sdf 0.00 0.00 3.00 113.00 0.04 0.56 10.48 0.13 1.14 0.00 1.17 1.14 13.20
sdi 0.00 0.00 3.00 122.00 0.04 0.60 10.43 0.12 0.96 0.00 0.98 0.93 11.60
sdl 0.00 0.00 1.00 194.00 0.01 0.97 10.34 0.21 1.07 0.00 1.07 1.07 20.80
sdn 0.00 0.00 1.00 194.00 0.01 1.02 10.87 0.24 1.21 0.00 1.22 1.13 22.00
sdp 0.00 0.00 1.00 192.00 0.01 0.94 10.07 0.22 1.14 40.00 0.94 1.10 21.20
sdq 0.00 0.00 0.00 200.00 0.00 1.03 10.56 0.17 0.86 0.00 0.86 0.80 16.00
sdt 0.00 0.00 0.00 198.00 0.00 0.99 10.26 0.16 0.89 0.00 0.89 0.83 16.40
sds 0.00 0.00 0.00 181.00 0.00 0.91 10.34 0.19 1.04 0.00 1.04 1.04 18.80
sdz 0.00 0.00 0.00 206.00 0.00 1.02 10.14 0.24 1.17 0.00 1.17 1.01 20.80
sdx 0.00 0.00 0.00 189.00 0.00 0.96 10.46 0.20 1.14 0.00 1.14 1.06 20.00
sdaj 0.00 0.00 0.00 99.00 0.00 0.49 10.18 0.16 1.21 0.00 1.21 1.49 14.80
sdag 0.00 0.00 0.00 98.00 0.00 0.48 9.96 0.15 1.51 0.00 1.51 1.43 14.00
sdab 0.00 0.00 0.00 98.00 0.00 0.54 11.18 0.09 0.94 0.00 0.94 0.94 9.20
sdae 0.00 0.00 0.00 105.00 0.00 0.53 10.29 0.13 1.26 0.00 1.26 1.26 13.20
sdg 0.00 0.00 3.00 106.00 0.04 0.52 10.42 0.27 2.57 1.33 2.60 2.42 26.40
sde 0.00 0.00 5.00 104.00 0.06 0.54 11.23 0.24 2.17 0.00 2.27 2.06 22.40
sdr 0.00 0.00 0.00 184.00 0.00 0.99 11.00 0.30 1.74 0.00 1.74 1.61 29.60
sda 0.00 0.00 3.00 118.00 0.04 0.57 10.25 0.30 2.64 8.00 2.51 2.51 30.40
sdaf 0.00 0.00 0.00 93.00 0.00 0.49 10.84 0.26 2.37 0.00 2.37 2.67 24.80
sdj 0.00 0.00 1.00 179.00 0.02 0.93 10.80 0.36 2.07 0.00 2.08 1.93 34.80
sdaa 0.00 0.00 0.00 188.00 0.00 0.98 10.72 0.43 2.38 0.00 2.38 2.19 41.20
sdy 0.00 0.00 0.00 184.00 0.00 0.93 10.39 0.47 2.65 0.00 2.65 2.39 44.00
sdah 0.00 0.00 0.00 99.00 0.00 0.54 11.07 0.30 2.67 0.00 2.67 2.75 27.20
sdai 0.00 0.00 0.00 101.00 0.00 0.52 10.61 0.31 2.69 0.00 2.69 2.81 28.40
sdk 0.00 0.00 1.00 196.00 0.01 1.04 10.92 0.31 1.58 0.00 1.59 1.54 30.40
sdu 0.00 0.00 0.00 170.00 0.00 0.92 11.11 0.40 2.16 0.00 2.16 2.31 39.20
sdh 0.00 0.00 5.00 115.00 0.07 0.50 9.73 0.27 2.40 4.00 2.33 2.23 26.80
sdd 0.00 0.00 4.00 121.00 0.05 0.57 10.11 0.21 1.70 0.00 1.75 1.60 20.00
sdo 0.00 0.00 1.00 205.00 0.01 0.98 9.83 0.38 1.86 0.00 1.87 1.81 37.20
sdad 0.00 0.00 0.00 101.00 0.00 0.48 9.66 0.29 2.53 0.00 2.53 2.89 29.20
sdm 0.00 0.00 1.00 197.00 0.01 0.92 9.62 0.26 1.33 0.00 1.34 1.29 25.60
sdv 0.00 0.00 0.00 192.00 0.00 0.88 9.42 0.37 1.94 0.00 1.94 1.94 37.20
sdw 0.00 0.00 0.00 217.00 0.00 1.02 9.59 0.42 1.95 0.00 1.95 1.94 42.00
sdac 0.00 0.00 0.00 107.00 0.00 0.51 9.79 0.25 2.36 0.00 2.36 2.36 25.20
Ubuntu 16.04 kernel 4.4.0-92-generic zfsonlinux 0.6.5.6-0ubuntu16
Try reducing the value of spa_asize_inflation.
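For example, on ZoL it is exposed as a module parameter (the modprobe.d path below is the usual convention; adjust for your distribution):

cat /sys/module/zfs/parameters/spa_asize_inflation               # current value, default 24
echo 6 > /sys/module/zfs/parameters/spa_asize_inflation          # change it at runtime
echo "options zfs spa_asize_inflation=6" >> /etc/modprobe.d/zfs.conf   # make it persistent across reboots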
Poor write performance very close to the quota is a design choice made in
OpenZFS 7793 - ztest fails assertion in dmu_tx_willuse_space (3ec3bc2167352df525c10c99cf24cb24952c2786).
The choice is described mainly in the large comment block above dmu_tx_try_assign() in module/zfs/dmu_tx.c, but also in the comment blocks in module/zfs/spa_misc.c before "int spa_asize_inflation = 24" and before the definition of spa_get_worst_case_asize(), and of course in the long commit message:
Relevant part from the dmu_tx.c comment:
* Note that due to this algorithm, it is possible to exceed the allowed
* usage by one transaction. Also, as we approach the allowed usage,
* we will allow a very limited amount of changes into each TXG, thus
* decreasing performance.
FWIW, we chose (in https://github.com/openzfsonosx/zfs/commit/7a8a2ead5c218ba2239b746f886abfd68c59c638 ) to reduce the default value to 6 in order to make the test suite happy and for a couple of other not-quite-so-empirical reasons.
I do not quite understand why this issue has been closed. I understand the "poor write performance very close to the quota is a design choice" part, but this makes ZFS almost unusable. In a new project I'm currently running, a single user filling his quota grinds the entire filesystem to a halt for all the other users, making it extremely slow.
Is there a working solution?
@iio7 and whoever else is interested,
Here's an uglier, automated version of my reproduction test above... it shows that spa_asize_inflation=6 does nothing, and neither does 1... but 0 is great.
log_debug() {
    if [ "$debug" = 1 ]; then
        echo "DEBUG: $@"
    fi
}
debug=0
ds=tank/testquota
zfs create "$ds"
zfs set compression=off "$ds" #(just in case)
refquotas=(none 1M)
quotas=(none)
inflations=(24 6 2 1 0)
for spa_asize_inflation in "${inflations[@]}"; do
    echo "$spa_asize_inflation" > /sys/module/zfs/parameters/spa_asize_inflation
    for refquota in "${refquotas[@]}"; do
        for quota in "${quotas[@]}"; do
            log_debug "setting refquota none"
            zfs set refquota=none "$ds"
            log_debug "setting quota none"
            zfs set quota=none "$ds"
            cd /"$ds"
            log_debug "cleaning up"
            rm -f -- blah*   # -f: don't complain on the first pass when no blah* files exist yet
            log_debug "setting refquota $refquota"
            zfs set refquota="$refquota" "$ds"
            log_debug "setting quota $quota"
            zfs set quota="$quota" "$ds"
            n=1
            sum=0
            count=0
            log_debug "running test"
            t1=$(date +%s)
            while true; do
                line=$( ( time touch "blah$n" ) 2>&1 | grep real )
                seconds=$(awk -F'm|s' '{print $2}' <<< "$line")
                sum=$(echo "scale=5; $sum + $seconds" | bc)
                let count++
                let n++
                t2=$(date +%s)
                td=$((t2-t1))
                if [ "$td" -gt 3 ]; then
                    break
                fi
            done
            log_debug "calculating"
            avg=$(echo "scale=5; ${sum} / ${count}" | bc)
            printf "spa_asize_inflation = %2d, quota = %4s, refquota = %4s, count = %5d, avg = %8.5f s\n" "$spa_asize_inflation" "$quota" "$refquota" "$count" "$avg"
            log_debug "waiting"
            sleep 1
        done
    done
done
(
uname -r
dpkg -l | awk '$2 == "zfsutils-linux" {print $3}'
lsb_release -sd
)
An idle test machine:
spa_asize_inflation = 24, quota = none, refquota = none, count = 281, avg = 0.00370 s
spa_asize_inflation = 24, quota = none, refquota = 1M, count = 8, avg = 0.51000 s
spa_asize_inflation = 6, quota = none, refquota = none, count = 258, avg = 0.00283 s
spa_asize_inflation = 6, quota = none, refquota = 1M, count = 9, avg = 0.41077 s
spa_asize_inflation = 2, quota = none, refquota = none, count = 267, avg = 0.00195 s
spa_asize_inflation = 2, quota = none, refquota = 1M, count = 8, avg = 0.49950 s
spa_asize_inflation = 1, quota = none, refquota = none, count = 221, avg = 0.00676 s
spa_asize_inflation = 1, quota = none, refquota = 1M, count = 2, avg = 2.18950 s
spa_asize_inflation = 0, quota = none, refquota = none, count = 221, avg = 0.00626 s
spa_asize_inflation = 0, quota = none, refquota = 1M, count = 207, avg = 0.00568 s
4.15.0-48-generic
0.6.5.6-0ubuntu27
Ubuntu 16.04.6 LTS
The machine that fails constantly in production because of this quota issue (basically the same results):
spa_asize_inflation = 24, quota = none, refquota = none, count = 318, avg = 0.00183 s
spa_asize_inflation = 24, quota = none, refquota = 1M, count = 13, avg = 0.25961 s
spa_asize_inflation = 6, quota = none, refquota = none, count = 276, avg = 0.00171 s
spa_asize_inflation = 6, quota = none, refquota = 1M, count = 13, avg = 0.24915 s
spa_asize_inflation = 2, quota = none, refquota = none, count = 244, avg = 0.00179 s
spa_asize_inflation = 2, quota = none, refquota = 1M, count = 13, avg = 0.26746 s
spa_asize_inflation = 1, quota = none, refquota = none, count = 274, avg = 0.00172 s
spa_asize_inflation = 1, quota = none, refquota = 1M, count = 13, avg = 0.25453 s
spa_asize_inflation = 0, quota = none, refquota = none, count = 284, avg = 0.00176 s
spa_asize_inflation = 0, quota = none, refquota = 1M, count = 284, avg = 0.00173 s
4.15.0-99-generic
0.6.5.6-0ubuntu27
Ubuntu 16.04.2 LTS
Currently I'm treating ZFS quotas as effectively non-existent! They just don't work like this!
I'm considering using automated scripts to monitor users' "quotas" and then manually dealing with problems when they arise. My machines run on spinning disks, and as soon as a single user fills his space the machine starts sounding like a jackhammer because the drives work so extremely hard trying to cram every single bit into the remaining space, yet at the same time every other user finds their workspace completely unresponsive!
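In case it helps anyone else, a minimal sketch of that kind of monitoring (dataset name and threshold are placeholders; it only relies on zfs userspace output):

ds=pool/home
threshold=90     # warn when a user is at 90% of their userquota
zfs userspace -Hp -o name,used,quota "$ds" | while read -r name used quota; do
    case "$quota" in -|none|0) continue ;; esac    # skip users with no userquota set
    pct=$((used * 100 / quota))
    if [ "$pct" -ge "$threshold" ]; then
        echo "WARNING: $name is at ${pct}% of userquota on $ds"
    fi
done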
Ubuntu 16 is so incredibly ancient, upgrade that system. 0.6.5 is unsupported.
Ubuntu 16.04 is the old stable release, not ancient, but the ZFS version that matters is the kernel's, so I should have posted the kernel's ZFS version:
$ cat /sys/module/zfs/version
0.7.5-1ubuntu16.8
I also would like predictable and reasonable write() performance all the way up to EDQUOT. What is the downside of setting spa_asize_inflation=0, which seems to provide that? Does that just mean I might go over the quota by an extra transaction?
Agreed,
I would much rather have the possibility of quotas being slightly exceeded than have write performance diminished.
Regardless of the implementation, quotas are advertised as a way to set simple storage caps on datasets/users. It feels like a gotcha that they have such huge performance implications.
As you get close to a filesystem quota (or refquota), writes become extremely slow. In the worst case they drop to a few kB/s and cause lots of random IO.
In one example, I had an almost full filesystem with a 10G refquota and a user was trying to extract a tarball into the directory. This resulted in 100% disk utilisation for 4 hours until we killed the process.
I assume that the excessive random IO is due to fragmentation of the existing data in the filesystem.
In most cases I can work around this by setting a lower userquota in each filesystem (see the example below). This would not work in filesystems shared by multiple users (in different groups).
Perhaps being able to set a quota which covers all groups (e.g. groupquota@all=size) would be a workaround?
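For clarity, the userquota workaround mentioned above looks roughly like this (dataset and user names are placeholders):

zfs set refquota=10G pool/fs          # filesystem cap
zfs set userquota@alice=9G pool/fs    # per-user cap below it, so an individual user hits EDQUOT before the refquota slowdown starts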