openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

ZFS runs out of memory on boot after destroy snapshot #2581

Closed: Ashod closed this issue 10 years ago

Ashod commented 10 years ago

I have managed to break my 0.6.3 ZFS server... it fails on boot.

I have a dedup-ed dataset with ~550GiB of data.[1] I created a snapshot (the only one) to send to another dataset, which was successful. After I was done, I wanted to destroy the dataset, but knowing that I don't have ample RAM, I thought I should do it piecemeal (to avoid locking up the system). I only remembered that I still had the snapshot after I had rm-ed ~35GiB worth of files, which ran surprisingly fast (that gave me the clue that something was wrong; I really meant to destroy the snapshot, which had only been created to send the dataset, before rm-ing). I then proceeded to destroy the snapshot so that the files would be permanently removed. However, the system locked up very, very quickly, even though the ARC was far from arc_max. I hard rebooted and was able to log in normally. I repeated the destroy snapshot command, the system locked up again in a matter of seconds, and it now dies on boot with an 'out of memory, killing mountall' message.

At the time of both lockups, arcsz was 1.4GB with room to grow to 3GB. It locked up after writing ~200MB of data in ~4 seconds (I had 'zpool iostat 1' and 'arcstats.py 1' running). The system has been running with essentially the same data and config for a good year.

My setup: 6x3TB raidz-6, 11TB of the zpool full, 5.2TB free, some datasets with dedup=verify. According to zdb -DD, the 'in core' DDT is ~5.5GiB and ~33.2GiB on disk.[2] 8GB RAM, zfs_arc_max=3GiB, l2arc=58GiB (64GB usb flash drive). ZFS 0.6.3 on 3.15.8 64-bit kernel (Ubuntu).
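For reference, a cap like that is normally applied as a zfs module option; a minimal sketch, assuming it is set at module load time (the exact file name under /etc/modprobe.d is arbitrary):

# /etc/modprobe.d/zfs.conf (hypothetical): limit the ARC to 3 GiB (value in bytes)
options zfs zfs_arc_max=3221225472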

Questions: 1) I've rebooted a dozen or so times, and it runs out of memory at slightly different points in time (typically after ~840 seconds, but as early as 650 seconds or as late as 1000 seconds after boot). Does this mean that ZFS is able to progress through the ZIL replay? Or is it NOT progressing and will never boot without intervention?

2) What should I do if it doesn't/can't boot? How can I ask ZFS to ignore that destroy command that is killing the system? (I can double the RAM, but how much RAM would it need to recover?)

3) It's not helpful to me now, but shouldn't one expect destroying a snapshot that covers a few rm-ed files to be a quick operation, and certainly not one that runs out of memory? I know ZFS needs to access the DDT, but the snapshot itself had a small changeset and it was only data removed. In addition, it had 3GB of ARC (of which it had used half before locking up) and 58GB of l2arc. (What is it trying to do anyway? How does the snapshot destroy work?)

4) The engineer/coder in me thinks ZFS should give up gracefully in low-memory situations and report that it can't complete the request, never kill the system. So, is this something that I can help improve? I have looked at the code and built ZFS before, and am willing to do some more if someone is willing to guide me. My goal is to at least detect similar situations and fail the operation, but ideally it should try to complete the operation at a slower pace (by thrashing the ARC, for example).

Thanks!

[1] I know, ZFS is memory hungry, should do the math, etc. etc. When I built this system I was willing to sacrifice performance, which I knew was inevitable with a DDT and low memory. But I was never willing to sacrifice availability or tolerate high maintenance overhead (I had many lockups before I tuned the system: avoiding large writes to dedup datasets, no longer using rsync, and adding an l2arc).

[2] Right before running destroy snapshot...

zdb -DD tank

DDT-sha256-zap-duplicate: 2576696 entries, size 1387 on disk, 224 in core
DDT-sha256-zap-unique: 16231058 entries, size 1976 on disk, 319 in core

DDT histogram (aggregated over all DDTs):

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    15.5M   1.88T   1.78T   1.79T    15.5M   1.88T   1.78T   1.79T
     2    1.90M    205G    174G    179G    4.20M    455G    389G    399G
     4     455K   44.5G   23.1G   25.4G    2.20M    227G    110G    122G
     8    55.8K   4.30G   2.58G   2.85G     584K   41.0G   23.5G   26.5G
    16    13.2K   1.36G    496M    568M     297K   30.7G   10.6G   12.2G
    32    30.9K   3.81G   1.10G   1.28G    1.26M    160G   52.2G   59.3G
    64    20.2K   2.51G   1.01G   1.10G    1.30M    165G   66.4G   72.5G
   128      528   57.7M   21.4M   24.1M    86.2K   9.32G   3.41G   3.85G
   256       68   5.90M   2.80M   3.15M    23.2K   1.95G    967M   1.07G
   512       14   1.00M    192K    280K    10.2K    740M    138M    202M
    1K        6      3K      3K   48.0K    8.27K   4.13M   4.13M   66.1M
    2K        3   1.50K   1.50K   24.0K    8.35K   4.18M   4.18M   66.8M
    4K        3   1.50K   1.50K   24.0K    18.0K   8.99M   8.99M    144M
    8K        1     512     512   7.99K    8.86K   4.43M   4.43M   70.8M
 Total    17.9M   2.13T   1.98T   2.00T    25.5M   2.94T   2.42T   2.47T

dedup = 1.24, compress = 1.21, copies = 1.02, dedup * compress / copies = 1.47

Ashod commented 10 years ago

Update

After countless more cycles of failed boots, I removed /etc/zfs/zpool.cache and booted again (ZFS is not the rootfs). I could 'import -o readonly=on' and could see that the snapshot destruction had progressed (it was no longer listed and the dataset reported a different referenced size).
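For anyone landing here in the same state, a rough sketch of the recovery path described above, assuming the pool is named 'tank' as in the earlier zdb output (back up zpool.cache rather than deleting it outright):

# keep the boot scripts from auto-importing the pool (and OOMing) on the next boot
mv /etc/zfs/zpool.cache /etc/zfs/zpool.cache.bak
# after rebooting, import read-only to inspect progress without resuming the destroy
zpool import -o readonly=on tank
zfs list -t snapshot -r tank
# when ready, export and retry a normal read-write import
zpool export tank
zpool import tank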

I tried importing without readonly and observed memory usage. For over 9 minutes memory increased at a rate of ~1MB/s, at which point it jumped from ~1800MB to ~3600MB to ~5600MB, and the next second the system was out of memory and froze.

Rebooting again successfully mounted ZFS and everything is back to normal with the snapshot removed.

So this particular case has been resolved. However, I still think (and I hope others agree with me) that ZFS should treat memory as a scarce resource. It seems the assumption is that RAM is in infinite supply (something that SUN, let us just say, would have loved to pretend is the case, at the expense of the customer of course). There is absolutely no reason why the DDT and other metadata couldn't be updated within 8GB of RAM and 64GB of l2arc.

behlendorf commented 10 years ago

@Ashod Thanks for the detailed bug report. I completely agree, there's no reason that ZFS should have exhausted memory on your system. I consider this a bug and it's something we'll want to get addressed.

Ashod commented 10 years ago

Thanks Brian. My offer stands, if I can help with anything, just let me know.

There is clearly something triggering excessive locked page allocations that is starving the kernel to death. If it's any help, I had a bunch of files each at 1GB. I know rm-ing large dedup-ed files does tax the ARC and RAM quite a bit, and I suspect they must have been the cause. But again, I've rm-ed even larger files on the DDT in the past and didn't exceed arc_max. Clearly the fact that this was a snapshot destroy also had something to do with it.

Thanks again for a great job on ZOL. It's highly appreciated.

behlendorf commented 10 years ago

@Ashod it would be great if you have some time to help debug this.

What we need to determine first is what's actually consuming all of that memory. I suspect it's not normal ARC buffers because there's no reason for them to be pinned in memory. You've also mentioned that arcsz was only 1.4G at the time of the lockup.

There are a couple of other significant memory consumers we should take a look at. One such consumer is l2arc headers. These must be kept in memory so blocks which have been written to the l2arc device can be referenced. There's also currently no mechanism to limit how many are cached (see #1420), so they might cause an issue like this. It would be helpful to see the contents of the /proc/spl/kstat/zfs/arcstats file when it's in this state. That should give us a decent accounting of what memory is being consumed by the ARC.

Another place which might end up consuming a significant amount of space is the dde entries themselves. We should be able to see how much space they're consuming by looking at the SPL's slab via the /proc/spl/kmem/slab file.

Finally, it's possible another type of cache object is accumulating on the system. These would be visible in the /proc/slabinfo file.

If you're able to reproduce the issue again, gathering the contents of these 3 proc files should help explain where the memory is going.
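A minimal capture loop for the three files mentioned above (plus a timestamp), assuming it can be started from a shell before the OOM hits; the log path and 10-second interval are arbitrary:

while true; do
    date >> /root/zfs-oom-debug.log
    cat /proc/spl/kstat/zfs/arcstats /proc/spl/kmem/slab /proc/slabinfo >> /root/zfs-oom-debug.log
    sleep 10
done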

fling- commented 10 years ago

Same issue here: 16G RAM, raidz2 of 6x 2TB disks, 1.05 dedup ratio (only enabled on a few datasets)

  1. Removed some huge deduped files.
  2. Destroyed some old snapshots.
  3. Freeing started its work but got stuck at 367G or something.
  4. Got a deadlock while freeing was in progress.
  5. sysrq-rebooted, but the pool is not importing because of a memory leak; the OOM killer is killing everything, including the 'zpool' command.
  6. Tried to import with 0.6.3 and freebsd-10; got ENOMEM, kills, and deadlocks with both.
fling- commented 10 years ago

[rescue] localhost ~ # zpool import
   pool: studio
     id: 11938738359150139976
  state: ONLINE
 status: Some supported features are not enabled on the pool.
 action: The pool can be imported using its name or numeric identifier, though
         some features will not be available without an explicit 'zpool upgrade'.
 config:

    studio           ONLINE
      raidz2-0       ONLINE
        raidz_leg_a  ONLINE
        raidz_leg_b  ONLINE
        raidz_leg_c  ONLINE
        raidz_leg_d  ONLINE
        raidz_leg_e  ONLINE
        raidz_leg_f  ONLINE
behlendorf commented 10 years ago

@fling- If you're still able to reproduce this could you gather some of the debugging requested several comments above. In particular, /proc/spl/kmem/slab and /proc/slabinfo.

Tried to import with 0.6.3 and freebsd-10; got ENOMEM, kills, and deadlocks with both.

This is a useful tidbit. So the problem likely isn't something Linux-specific.

fling- commented 10 years ago

@behlendorf Logs of a HEAD-version import: syslog messages, captured with the following loop (warning: 9000+ lines):

while true; sleep 10; do date | cat - /proc/spl/kstat/zfs/arcstats /proc/spl/kmem/slab >> oomkiller.log; done

behlendorf commented 10 years ago

@fling- The logs you posted show that ZFS has queued a huge number of outstanding IOs (~700,000) on the system, which are consuming a significant amount of memory. It appears the IOs are mainly pulling in 4k blocks. This is the deduplication table leaf block size, so I suspect what's happening is that all the newly freeable blocks are causing it to prefetch the DDT overly aggressively.

Can you try the following and see if it helps.

1) Pull the latest spl and zfs code from master. There were some recent changes which should reduce the memory footprint.
2) Set zfs_dedup_prefetch=0 to disable the deduplication prefetch.
3) Grab the /proc/meminfo stats from the system while this is running.
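A sketch of items 2 and 3, assuming zfs_dedup_prefetch is set persistently via modprobe.d and meminfo is sampled with a simple loop (the log path and interval are arbitrary):

# disable dedup prefetch on the next module load / reboot
echo "options zfs zfs_dedup_prefetch=0" >> /etc/modprobe.d/zfs.conf
# sample memory statistics while the import/destroy is running
while true; do { date; cat /proc/meminfo; } >> /root/meminfo.log; sleep 5; done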

fling- commented 10 years ago

@Ashod zfs_dedup_prefetch=0 fixed my issue, try this.

Ashod commented 10 years ago

At least on 0.6.3 disabling dedup prefetching doesn't help.

I have reproduced the issue and am currently trying to recover. The good news is that I have captured all the relevant stats and will post them once I get the system up and running.

This time the issue was reproduced in a different way. Last time it had locked up rather quickly (within seconds) after the 'destroy snapshot' command. This time, I had a dataset with a few dozen snapshots (each with 5-9GB unique and 10-20GB referenced), consisting mostly of small files (unlike the previous dataset, which had about a dozen 1GB files with most files at 30-100MB, this dataset is a system backup holding tens of thousands of small files). I destroyed half a dozen snapshots rather quickly (under a minute in most cases), so I proceeded to destroy the complete dataset (thinking I couldn't reproduce the issue by destroying snapshots one at a time). The complete dataset sat at 330GB with a high dedup/compression ratio; the logical size was over 720GB. Destroying it recursively also finished swiftly (under a minute or two).

The system seemed normal, and I ran zdb -DD and got stats back twice. I used the system for a good half an hour before noticing slowdowns. I noticed heavy I/O and tried zdb -DD again (I suspected the DDT was getting cleaned up and wanted to see the progress). (It was here that I ran the stats-capturing script again, sensing an imminent lockup.) But that didn't run for long (it didn't finish at all). The system went down (out of memory). Restarts brought the system back twice (logged in with smbd and all other daemons up and running), but both times it again ran out of RAM within 10-15 minutes. I've since hard-booted a few dozen times and set zfs_dedup_prefetch=0, but it still runs out of RAM on mount. Whether that option helped at all is hard to tell, but the system is still unable to mount (with the DDT cleanup still in progress).

Let me know if there is any data I can collect while in this condition, but consider that without being able to boot I have limited resources.

behlendorf commented 10 years ago

@Ashod The contents of /proc/spl/kstat/zfs/arcstats, /proc/spl/kmem/slab, /proc/slabinfo, and /proc/meminfo prior to the OOM event would be the most useful. That should give us a good idea where the memory is being used and perhaps why it can't be freed.

Ashod commented 10 years ago

@behlendorf Here you go. This is the last ~3 seconds @ 100ms interval and the syslog at the same time. (I have the full capture for the whole 3 minute run of import, it's 1.1MB gzipped. Let me know if you want it.)

One interesting thing is the stack traces in the syslog roughly 33 seconds before death. Also, the ARC and ARC meta sizes were way below their limits (or so they were reported). So you may be right about outstanding IOs being the culprit.

I'm interested in the code that is responsible for this activity (presumably some sort of DDT cleanup of the removed references). If you can point me to the relevant code/docs I'd appreciate it. Thanks again.

prakashsurya commented 10 years ago

This line looks troubling (from the last dump in the gist):

zio_buf_4096                          0x08040 5940183040 2876067840   262144     4096  22660 22660 22661  702460 702165 702491      0     0     0

So, nearly 6GB is being "consumed" by the SPL 4K slab, but only 2.8GB is actually being "used".

It also shows:

VmallocUsed:     7216336 kB

So, that lines up. My guess is that if I were to add up all of the SPL slab memory, it would come to around 7.2GB. This doesn't look good, considering the system only has 7.3GB total.

The weird part is, none of this is accounted for in the arcstats dump:

size                            4    552630752

Only 550MB is being reported as the size of the ARC. So, what is consuming all of that 2.8GB of the 4K slab? My guess is it's the DDT leaf blocks (which are 4K in size), but I'd expect those to be accounted for in the ARC. Can somebody comment on whether these blocks are backed by the ARC or not?

So, who is using all of the 4K slab memory (?), and why isn't this being properly regulated by the shrinkers?

As a temporary workaround, I would suggest trying out this module parameter:

$ cat /etc/modprobe.d/zfs.conf 
options spl spl_kmem_cache_slab_limit=16384

It's not a "fix", but it might help relieve some of your memory pressure, as the Linux slabs don't have nearly as much overhead as the SPL slabs.
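If it helps, once the modules are reloaded the active value should be visible under /sys/module; a quick check (assuming the standard Linux module-parameter sysfs layout):

cat /sys/module/spl/parameters/spl_kmem_cache_slab_limit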

Ashod commented 10 years ago

@prakashsurya Good points. I did hint at the same discrepancy but didn't trust my reading of the numbers.

@behlendorf et al: I've added both the SPL slab limit and disabled dedup prefetching. Countless failed mounts later, I installed new DIMMs to double the RAM to 16GB. The improvement that the 16GB brought was noticeable. Right off the bat the time-to-death doubled, which meant more work got done with fewer hard boots. It took maybe half a dozen reboots before I got a login prompt, a good sight indeed. Even then, ZFS continued heavy IO and did OOM, even though it had clearly mounted. This time the culprit wasn't mount.zfs but rather the (in)famous txg_sync, which tells me the hardest part was probably over.

I've since reverted to 8GB and noticed that ZFS didn't automount; the zpool.cache was missing. I mounted manually, which took over 10 minutes but did succeed, and the system seems to be back online and stable. But ZFS is still doing heavy IO (20 minutes in). iotop shows tens of threads reading:

Actual DISK READ: 1669.60 K/s | Actual DISK WRITE: 0.00 B/s

  TID  PRIO  USER   DISK READ   DISK WRITE  SWAPIN      IO>    COMMAND
 2063  be/7  root   36.27 K/s    0.00 B/s   0.00 %   99.99 %   [z_fr_iss_4/0]
 2068  be/7  root   73.53 K/s    0.00 B/s   0.00 %   99.99 %   [z_fr_iss_5/1]
 2066  be/7  root   49.02 K/s    0.00 B/s   0.00 %   99.99 %   [z_fr_iss_4/3]
 2055  be/7  root   24.51 K/s    0.00 B/s   0.00 %   99.99 %   [z_fr_iss_2/0]
 2051  be/7  root   43.14 K/s    0.00 B/s   0.00 %   99.99 %   [z_fr_iss_1/0]
 2054  be/7  root   25.49 K/s    0.00 B/s   0.00 %   99.99 %   [z_fr_iss_1/3]
 2069  be/7  root   25.49 K/s    0.00 B/s   0.00 %   99.99 %   [z_fr_iss_5/2]
 2073  be/7  root   20.59 K/s    0.00 B/s   0.00 %   99.99 %   [z_fr_iss_6/2]
 2047  be/7  root   34.31 K/s    0.00 B/s   0.00 %   99.99 %   [z_fr_iss_0/0]
 2075  be/7  root   29.41 K/s    0.00 B/s   0.00 %   99.99 %   [z_fr_iss_7/0]
 2071  be/7  root   39.22 K/s    0.00 B/s   0.00 %   99.99 %   [z_fr_iss_6/0]
 2060  be/7  root   34.31 K/s    0.00 B/s   0.00 %   99.99 %   [z_fr_iss_3/1]
 2076  be/7  root   27.45 K/s    0.00 B/s   0.00 %   99.99 %   [z_fr_iss_7/1]
 2067  be/7  root   41.18 K/s    0.00 B/s   0.00 %   99.99 %   [z_fr_iss_5/0]
 2078  be/7  root   53.92 K/s    0.00 B/s   0.00 %   99.99 %   [z_fr_iss_7/3]
 2061  be/7  root   21.57 K/s    0.00 B/s   0.00 %   99.99 %   [z_fr_iss_3/2]
 2058  be/7  root   25.49 K/s    0.00 B/s   0.00 %   99.99 %   [z_fr_iss_2/3]
 2074  be/7  root   27.45 K/s    0.00 B/s   0.00 %   99.41 %   [z_fr_iss_6/3]
 2062  be/7  root   42.16 K/s    0.00 B/s   0.00 %   98.75 %   [z_fr_iss_3/3]
 2070  be/7  root   19.61 K/s    0.00 B/s   0.00 %   98.56 %   [z_fr_iss_5/3]
 2065  be/7  root   27.45 K/s    0.00 B/s   0.00 %   98.38 %   [z_fr_iss_4/2]
 2072  be/7  root   12.75 K/s    0.00 B/s   0.00 %   98.12 %   [z_fr_iss_6/1]
 2064  be/7  root   25.49 K/s    0.00 B/s   0.00 %   97.06 %   [z_fr_iss_4/1]
 2052  be/7  root   29.41 K/s    0.00 B/s   0.00 %   96.55 %   [z_fr_iss_1/1]
 2050  be/7  root   33.33 K/s    0.00 B/s   0.00 %   96.50 %   [z_fr_iss_0/3]
 2059  be/7  root   10.78 K/s    0.00 B/s   0.00 %   95.12 %   [z_fr_iss_3/0]
 2057  be/7  root    7.84 K/s    0.00 B/s   0.00 %   93.74 %   [z_fr_iss_2/2]
 2049  be/7  root   26.47 K/s    0.00 B/s   0.00 %   91.90 %   [z_fr_iss_0/2]
 2077  be/7  root   63.73 K/s    0.00 B/s   0.00 %   91.79 %   [z_fr_iss_7/2]
 2056  be/7  root   21.57 K/s    0.00 B/s   0.00 %   89.20 %   [z_fr_iss_2/1]

These are just the top 30 or so; there were many more! On a quad-core, 6-device, 1-vdev pool (leaving the RAM aside), I'm not sure what the point of having so many reader threads is. The combined read rate is typically <2MB/s.

I really would like to help resolve this situation. By removing dedup datasets one expects the load to decrease; instead, this is an ordeal. It could be argued that I don't have enough RAM, but my dataset was trivial by any measure (according to zdb, it should fit in RAM with some spillover due to overhead). Even then, doubling the RAM to 16GB didn't help much (it only reduced the number of reboots, although I had already done dozens by then). Clearly there are some accounting issues with RAM usage, and shrinking isn't effective. Could this be a memory fragmentation issue (at the slab level)? Another issue is the amount of IO required to remove a dedup dataset. Is it really impossible to be more efficient? (I do recognize the random nature of the DDT leaves and the thrashing involved, but my DDT isn't that large to begin with, at 10.8M blocks before the dataset destroy.)

Let me know your thoughts and how I can help. And thanks again for the help!

behlendorf commented 10 years ago

So, who is using all of the 4K slab memory (?), and why isn't this being properly regulated by the shrinkers?

@prakashsurya @Ashod I can partly explain what's going on here. According to the cache statistics there are 702,491 outstanding 4k ARC buffers. This fact by itself isn't too interesting because we would expect the ZAP leaf blocks to be cached for the DDT.

zio_cache    0x00040 866582528 827233560 32768 1080 26446 26446 26446 766934 765957   0 0 0
zio_buf_4096 0x08040 5940183040 2876067840   262144     4096  22660 22660 22661  702460 702165 702491      0     0     0

However, there is also a huge number of in-flight zios (765,957). Each of these structures is responsible for managing an IO which ZFS has issued to the disk. I would expect the vast majority of these zios to be reads and have an empty 4k ARC buffer attached, waiting to be filled in. Until each zio is handled by the disk this memory can't be freed, and the attached buffers won't appear in the ARC stats. This explains the strange accounting.

Based on code inspection I was hoping that setting zfs_dedup_prefetch=0 would prevent ZFS from issuing all of this IO. Clearly that didn't work. The next easy thing to try would be disabling all ZFS prefetch by setting zfs_prefetch_disable=1. That will at least let us rule out if it's some overly aggressive prefetch code causing this.

After that we'll want to run down exactly what call path is issuing all this IO. Getting a sampling of the read history might help narrow down the search. @Ashod you can enable the read history by setting the zfs_read_history=1000 module option. With this enabled a log of the last 1000 IOs can be read via /proc/spl/kstat/zfs/<pool>/reads.
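A sketch of both suggestions, assuming the pool is named 'tank' as in the earlier output and that the parameters are adjusted at runtime through sysfs (setting them in /etc/modprobe.d works as well):

# disable all ZFS prefetch and keep a log of the last 1000 reads
echo 1    > /sys/module/zfs/parameters/zfs_prefetch_disable
echo 1000 > /sys/module/zfs/parameters/zfs_read_history
# after letting the workload run for a while, capture the read history
cat /proc/spl/kstat/zfs/tank/reads > /root/zfs-read-history.log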

Depending on what we discover we may want to investigate adding some kind of limit or throttle on the number of outstanding IOs.

Ashod commented 10 years ago

Depending on what we discover we may want to investigate adding some kind of limit or throttle on the number of outstanding IOs.

@behlendorf I see your approach, but couldn't we agree that the number of outstanding IOs is already high for the amount of RAM available, and that limiting it is necessary? My thinking is that prefetching should only ever be best-effort, so it must auto-throttle based on available resources (much like cache/buffers, it should never hog the system). Prefetching should behave well regardless of this particular scenario; otherwise it will end up harming overall system performance more than helping it.

This leaves us with the active operation and/or the IO scheduler. I think it's fair to ask for more evidence before adding code to throttle it, but if it has no limit on the number of IOs it queues, I have no reason to think it wouldn't starve the system of memory. In other words, I expect it, too, to have some sort of control over the number of outstanding IOs it issues; otherwise on spinning disks (such as mine) it will almost always end up queuing more than the disks can handle. And that is what's happening with a destroy over a large amount of data (as I understand it).

I'll add the suggested settings now, but I'm not too keen on bringing the server down for another couple of days right away. (I'm fairly confident I can reproduce this by destroying a snapshot/dataset with at least 30-50GB of dedup'ed data.) But I will be ready to collect the read history next time.

Meanwhile, I'll browse the code and get familiar with the algorithm (it would be helpful if someone could point me at the relevant code that runs as a result of the destroy).

behlendorf commented 10 years ago

but couldn't we agree that the overall outstanding IOs is high for the amount of RAM available and limiting it is necessary?

Absolutely! I think we're in complete agreement. The code should never allow this many outstanding zios in the system at any one time. However, there's clearly some case where it does, so we need to determine exactly where the IOs are originating from and why they aren't being throttled.

Meanwhile, I'll browse the code and get familiar with the algorithm (it would be helpful if someone could point me at the relevant code that runs as a result of the destroy).

Start with dsl_destroy_head(); this is where the sync task that destroys the dataset gets set up. You might also take a look at vdev_queue.c, which explains and implements the ZFS I/O scheduler.
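For orientation, both appear to live under module/zfs/ in the ZoL source tree (a rough sketch; the exact file paths are an assumption based on the tree layout of that era):

# the sync task setup for a full dataset destroy
grep -n "dsl_destroy_head" module/zfs/dsl_destroy.c
# the ZFS I/O scheduler and its explanatory comment
less module/zfs/vdev_queue.c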

Ashod commented 10 years ago

Thanks Brian.

behlendorf commented 10 years ago

@Ashod I just saw that the patch in #2675 may address your issue. You'll probably want to set zfs_free_max_blocks to something around 100,000. Go smaller if you still run out of memory.
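A sketch of applying that once a build containing the patch is installed, assuming the tunable is exposed like the other zfs module parameters:

# apply immediately on a running system
echo 100000 > /sys/module/zfs/parameters/zfs_free_max_blocks
# or persist it across reboots
echo "options zfs zfs_free_max_blocks=100000" >> /etc/modprobe.d/zfs.conf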

dweeezil commented 10 years ago

During some completely unrelated work recently, I implemented a tunable to completely disable the async freeing and/or to halt it on the fly. I wonder if this might be a useful option, combined with the per-txg freeing limiter we just got, to help with pool recovery/importing in these cases.

behlendorf commented 10 years ago

@dweeezil That could be handy as a last resort. But how does it differ from just setting zfs_free_max_blocks=0?

dweeezil commented 10 years ago

@behlendorf Not a whole lot of difference, now that I think about it. I had added the tunable to allow me to create pools with arbitrary sets of deferred destroys for further analysis. It did short-circuit a bit of processing that would still occur with zfs_free_max_blocks=0 but, as I said, in retrospect, it's likely not worth it once zfs_free_max_blocks is available.

Ashod commented 10 years ago

Thanks. This is good news. Can't wait to get a stable build to try it.

dweeezil commented 10 years ago

I wonder whether this condition was exacerbated by 7ef5e54 and even more so in Illumos since they went to ZTI_P(12, 8) as opposed to our ZTI_P(4, 8).

behlendorf commented 10 years ago

@dweeezil That commit should only impact the level of concurrency; the number of zios queued should remain the same.

dweeezil commented 10 years ago

Right, I was trying to rationalize why this issue hadn't come up in the past (that I know of) and thought maybe the greater concurrency would allow for a lot more queued zio's to be created.

ryao commented 10 years ago

@Ashod I suspect that 0dfc732416922e1dd59028fa900d38d04dceaa12 might have fixed this problem. Try setting zfs_dedup_prefetch=0 on the module before the import. If it resolves this, then the problem is caused by doing prefetching on many random locations in parallel.

Ashod commented 10 years ago

@ryao I had tried that, but it didn't help. I tried spl_kmem_cache_slab_limit=16384 and zfs_dedup_prefetch=0; even with 16GB RAM it took multiple OOMs and hard reboots to recover.

The two things that are relevant to my case are the high number of outstanding ZIOs (over 700K in my case) and the unnecessarily high number of z_fr_iss threads (anything beyond 2x the CPU count I'd expect to be harmful).

Both of these (not unexpectedly) flood the spinning disks (6 in my case) far faster than they can handle. What is happening is that the overhead of the queued IOs sooner rather than later catches up with the available RAM and kills the system. The spinning disks not only can't keep up with the demand, but the highly random IO makes it even harder for them to do so.

With sequential IO my WD Reds can serve 500+ IOPS each (combined IOPS reach 3000), but during heavy DDT activity (as is the case when destroying deduped snapshots/datasets) that figure averages as low as a tenth of its peak, which means the scheduler can queue even more ZIOs in a given period.

behlendorf commented 10 years ago

The patch to limit the number of blocks freed per TXG has been merged to master. The default value has been set to 100,000 which should prevent this issue in the future. We can fine tune that value as appropriate.

Ashod commented 10 years ago

@behlendorf

We can fine tune that value as appropriate.

Curious as to why not make it a percentage of available cores, and tune that, rather than using a fixed value?

What is the assumed RAM and CPU availability, say, per 100K IOs? If that is known (or estimated) we can scale based on it. Although it seems the issue here is RAM-bound rather than CPU-bound, the latter might be a reasonable indicator of hardware capability (rarely would a 32-core machine run on 8GB or less, or a quad-core on 64GB or more).

behlendorf commented 10 years ago

@Ashod we could definitely scale it based on RAM or CPUs, but I don't think those are ultimately the most important constraints. We need the TXGs to keep rolling along to ensure everything in the system keeps making forward progress. So we need to limit the number of blocks freed per TXG such that:

1) Individual TXG sync times remain reasonable (a smallish number of seconds).
2) Enough blocks are freed per TXG that the overall free completes as quickly as possible.
3) The worst-case memory requirements are reasonable.

For this I swagged 100K as a number which I think strikes a reasonable balance. It roughly limits the memory usage to 100M, since each zio is around 1K, and it should put a reasonable bound on TXG times. That said, I've got no data, so it's just a best guess; if we can do something better I'm all for it.

Ashod commented 10 years ago

@behlendorf Sounds good. I'm interested in numbers as well.