commit zfs-0.6.4-93-gca0bf58 causes excessive server load

openzfs / zfs

OpenZFS on Linux and FreeBSD

https://openzfs.github.io/openzfs-docs

Other

10.69k stars 1.76k forks source link

commit zfs-0.6.4-93-gca0bf58 causes excessive server load #4210

Closed jxiong closed 8 years ago

jxiong commented 8 years ago

Recently we're experiencing a timeout issue for Lustre workload on all ZFS revision after commit ca0bf58: 'Illumos 5497 - lock contention on arcs_mtx'. The symptom is that the server load became really high and then after a while timeout happened. The timeout thread is always with the following backtrace.

LNet: Service thread pid 28002 was inactive for 90.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 28002, comm: ll_ost01_014

Call Trace:
 [<ffffffffa06919af>] ? vdev_mirror_io_start+0xaf/0x1a0 [zfs]
 [<ffffffffa0690d60>] ? vdev_mirror_child_done+0x0/0x30 [zfs]
 [<ffffffff815395d3>] io_schedule+0x73/0xc0
 [<ffffffffa05b961f>] cv_wait_common+0xaf/0x130 [spl]
 [<ffffffff810a1460>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa05b96b8>] __cv_wait_io+0x18/0x20 [spl]
 [<ffffffffa06d67fb>] zio_wait+0xdb/0x160 [zfs]
 [<ffffffffa061bcde>] dbuf_read+0x2be/0x920 [zfs]
 [<ffffffffa0637094>] ? dnode_rele_and_unlock+0x64/0xb0 [zfs]
 [<ffffffffa0624158>] dmu_buf_hold+0x68/0x90 [zfs]
 [<ffffffffa0696225>] zap_get_leaf_byblk+0x85/0x290 [zfs]
 [<ffffffffa06964a1>] zap_deref_leaf+0x71/0x90 [zfs]
 [<ffffffffa0698baa>] fzap_add_cd+0x4a/0x160 [zfs]
 [<ffffffff811790cd>] ? __kmalloc_node+0x4d/0x60
 [<ffffffffa0698d13>] fzap_add+0x53/0x70 [zfs]
 [<ffffffffa069d1b5>] zap_add+0x125/0x1d0 [zfs]
 [<ffffffffa1061d29>] osd_object_create+0x339/0x860 [osd_zfs]
 [<ffffffffa11a1260>] ofd_precreate_objects+0xd20/0x19d0 [ofd]
 [<ffffffffa119083e>] ofd_create_hdl+0x56e/0x2640 [ofd]
 [<ffffffffa0c0cfab>] ? lustre_pack_reply_v2+0x1eb/0x280 [ptlrpc]
 [<ffffffffa0c774cc>] tgt_request_handle+0x8ec/0x1470 [ptlrpc]
 [<ffffffffa0c1eb41>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
 [<ffffffffa0c1dd00>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
 [<ffffffff810a0fce>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff810a0f30>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20

I have collected a crashdump when this issue was happening at: http://goo.gl/L52nf5, please take a look.

This is a snapshot of server load when this issue is going to happen:

# pdsh -w iws[2-9] uptime
iws6:  16:53:07 up 40 min,  0 users,  load average: 191.02, 108.51, 72.68
iws9:  16:53:07 up 39 min,  0 users,  load average: 123.03, 52.62, 34.41
iws8:  16:53:07 up 39 min,  0 users,  load average: 219.14, 82.30, 40.87
iws5:  16:53:07 up 40 min,  0 users,  load average: 224.34, 93.45, 96.52
iws7:  16:53:07 up 39 min,  0 users,  load average: 5.75, 11.86, 21.90
iws3:  16:53:07 up 40 min,  0 users,  load average: 0.77, 1.51, 7.33
iws4:  16:53:07 up 40 min,  0 users,  load average: 149.31, 72.71, 42.53
iws2:  16:53:07 up 41 min,  0 users,  load average: 0.20, 27.54, 77.79

There are many servers and the issue is likely seen on busy servers.

And the output of vmstat on one of the busy servers:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3 242      0 24306064   6020 532824    0    0  5291  8527  174  645  0  4 82 13  0
 3 262      0 24305912   6020 532828    0    0 634693     0 6412 25480  0  6  0 94  0   
 6 278      0 24304936   6020 532828    0    0 654416     0 6370 24295  0  6  0 94  0   
 2 278      0 24304404   6020 532828    0    0 659252     0 6323 24095  0  6  0 94  0   
 1 288      0 24303212   6020 532828    0    0 660254   260 6488 24572  0  6  0 94  0   
 1 290      0 24303072   6020 532828    0    0 641945     0 6218 22338  0  6  0 94  0   
 3 290      0 24303332   6020 532828    0    0 671117     0 6389 23223  0  6  0 94  0   
 1 279      0 24303040   6020 532828    0    0 667915   232 6508 22861  0  6  0 94  0   
 4 282      0 24302912   6020 532828    0    0 663835     0 6531 22738  0  6  0 94  0   
 1 275      0 24303204   6020 532828    0    0 655927     0 6259 21764  0  6  0 94  0   
 2 270      0 24303912   6020 532828    0    0 659160     0 6545 23536  0  6  0 94  0   
 0 254      0 24303992   6020 532828    0    0 642260     0 6405 22885  0  6  0 94  0   
 3 249      0 24304924   6020 532828    0    0 618660     0 6251 22535  0  6  0 94  0   
 1 235      0 24313268   6020 532828    0    0 653712     0 6486 23174  0  6  0 94  0   
10 263      0 24305344   6020 532828    0    0 647617     0 6457 23212  0  6  0 94  0

As you can see, 94% percent of CPU are waiting for I/O. Not sure if this is normal as we're running an I/O intensive test case.

This is not a deadlock issue as we have seen on the other tickets. When this issue occurs, the corresponding clients, where the workload is produced, are evicted and then the server can go back to normal again.

From the crashdump I mentioned above, there are lots of ZIO_PRIORITY_SYNC_READ ZIO queued in vdev. For example, this is one of the child ZIO the server thread is waiting for:

crash> zio 0xffff8803fa46fad0
struct zio {
  io_bookmark = {
    zb_objset = 45, 
    zb_object = 300, 
    zb_level = 0, 
    zb_blkid = 1
  }, 
  io_prop = {
    zp_checksum = ZIO_CHECKSUM_INHERIT, 
    zp_compress = ZIO_COMPRESS_INHERIT, 
    zp_type = DMU_OT_NONE, 
    zp_level = 0 '\000', 
    zp_copies = 0 '\000', 
    zp_dedup = B_FALSE, 
    zp_dedup_verify = B_FALSE, 
    zp_nopwrite = B_FALSE
  }, 
  io_type = ZIO_TYPE_READ, 
  io_child_type = ZIO_CHILD_VDEV, 
  io_cmd = 0, 
  io_priority = ZIO_PRIORITY_SYNC_READ, 
  io_reexecute = 0 '\000', 
  io_state = "\001", 
  io_txg = 418, 
  io_spa = 0xffff8807e23cc000, 
  io_bp = 0xffff8803fa46fb40, 
  io_bp_override = 0x0,
  ...

And the status of the corresponding vdev:

crash> vdev.vdev_queue 0xffff8807e23ef000
  vdev_queue = {
    vq_vdev = 0xffff8807e23ef000, 
    vq_class = {{
        vqc_active = 10, 
        vqc_queued_tree = {
          avl_root = 0xffff8804a7e74418, 
          avl_compar = 0xffffffffa0691d20 <vdev_queue_timestamp_compare>, 
          avl_offset = 584, 
          avl_numnodes = 507, 
          avl_size = 1088
        }
      }, {
        vqc_active = 0, 
        vqc_queued_tree = {
          avl_root = 0x0, 
          avl_compar = 0xffffffffa0691d20 <vdev_queue_timestamp_compare>, 
          avl_offset = 584, 
          avl_numnodes = 0, 
          avl_size = 1088
        }
      }, {
        vqc_active = 0, 
        vqc_queued_tree = {
          avl_root = 0x0, 
          avl_compar = 0xffffffffa0691ce0 <vdev_queue_offset_compare>, 
          avl_offset = 584, 
          avl_numnodes = 0, 
          avl_size = 1088
        }
      }, {
        vqc_active = 0, 
        vqc_queued_tree = {
          avl_root = 0x0, 
          avl_compar = 0xffffffffa0691ce0 <vdev_queue_offset_compare>, 
          avl_offset = 584, 
          avl_numnodes = 0, 
          avl_size = 1088
        }
      }, {
        vqc_active = 0, 
        vqc_queued_tree = {
          avl_root = 0x0, 
          avl_compar = 0xffffffffa0691ce0 <vdev_queue_offset_compare>, 
          avl_offset = 584, 
          avl_numnodes = 0, 
          avl_size = 1088
        }
      }}, 
    vq_active_tree = {
      avl_root = 0xffff880ba38e6318, 
      avl_compar = 0xffffffffa0691ce0 <vdev_queue_offset_compare>, 
      avl_offset = 584, 
      avl_numnodes = 10, 
      avl_size = 1088
    }, 
    vq_last_offset = 16288101888, 
    vq_io_complete_ts = 1732281291603, 
    vq_io_delta_ts = 1026073230,

There were 507 ZIO pending in the queue. I have seen more on the other crashdump.

I have checked that the last finished I/O was the first I/O in the active tree. It took more than 1 second to finish this I/O.

We didn't see this issue on zfs-0.6.4-93-gca0bf58~1. Please let me know if you want any further information, thanks.

jxiong commented 8 years ago

The backtrace of all processes I extracted from the crashdump file.

bt.txt

kernelOfTruth commented 8 years ago

@jxiong any reason you can't use the 0.6.5* branch (which should include fixes against that commit) ?

jxiong commented 8 years ago

@kernelOfTruth we tried the latest revision 0.6.5.4 coming out last Friday and it hit the same issue.

behlendorf commented 8 years ago

This is not a deadlock issue as we have seen on the other tickets

Agreed this looks different.

That said the 0.6.5.4 tag definitely does contain fixes for some deadlocks introduced by that big Illumos commit you referenced which reworked the ARC. I'd suggest using that or an 0.6.4 variant which predate the change. We're locally running 0.6.5.3 on our Lustre systems and will be rolling to 0.6.5.4 with the next update. To my knowledge we haven't hit this specific issue but we may just not be pushing the system hard enough.

A few things caught my eye in the stacks you posted.

1) The txg_sync and arc_reclaim threads don't appear to be deadlocked. That's consistent with what you described with Lustre being able to move on after evicting the client.

2) ll_ost01_014 is blocked waiting for a single read to be serviced. Based on the stacks you posted my guess is that z_rd_int_5 is servicing that completion for that read (all other zrd* threads are idle).

3) It's odd (but not impossible) to have caught this thread, and only thread this thread, in zio_checksum_error() validating the checksum. It almost feels like somehow a invalid size was passed to fletcher_4_native() and it's spinning in there. Can you check if there is this same signature on other systems which have hit the problem?

PID: 23018  TASK: ffff8810323eeab0  CPU: 5   COMMAND: "z_rd_int_5"
 #0 [ffff8800486a6e90] crash_nmi_callback at ffffffff81033cf6
 #1 [ffff8800486a6ea0] notifier_call_chain at ffffffff8153f4a5
 #2 [ffff8800486a6ee0] atomic_notifier_call_chain at ffffffff8153f50a
 #3 [ffff8800486a6ef0] notify_die at ffffffff810a783e
 #4 [ffff8800486a6f20] do_nmi at ffffffff8153d163
 #5 [ffff8800486a6f50] nmi at ffffffff8153ca20
    [exception RIP: fletcher_4_native+42]
    RIP: ffffffffa05fc6ea  RSP: ffff88102af8fcb0  RFLAGS: 00000202
    RAX: 07492e19b73998bc  RBX: ffff88102af8fd80  RCX: 02d2bf56d2495fb8
    RDX: ffff88102af8fd10  RSI: ffffc9057c47c000  RDI: ffffc9057c47b5a4
    RBP: ffff88102af8fcb0   R8: 00000b86d2c8bc26   R9: 9de92f08e35685b8
    R10: 00000000568f1812  R11: 0000000000000000  R12: ffff880d327bbb10
    R13: 0000000000000000  R14: 0000000000020000  R15: ffffc9057c45c000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #6 [ffff88102af8fcb0] fletcher_4_native at ffffffffa05fc6ea [zcommon]
 #7 [ffff88102af8fcb8] zio_checksum_error at ffffffffa06d8e33 [zfs]
 #8 [ffff88102af8fd68] zio_checksum_verify at ffffffffa06d2a4e [zfs]
 #9 [ffff88102af8fde8] zio_execute at ffffffffa06d66bc [zfs]
#10 [ffff88102af8fe18] taskq_thread at ffffffffa05b5707 [spl]
#11 [ffff88102af8fee8] kthread at ffffffff810a0fce
#12 [ffff88102af8ff48] kernel_thread at ffffffff8100c28a

jxiong commented 8 years ago

@behlendorf I didn't see this pattern on the other crashdumps. But for this specific call, I can see the size passing to fletcher_4_native is correct. The r14 is the size and it matches zio:io_size:

crash> zio.io_size ffff880d327bbb10
  io_size = 131072

kdump interrupted fletcher_4_native() when it's 2652L bytes of computation left.

jxiong commented 8 years ago

@prakashsurya @behlendorf

I would like to look into this under your help. I will appreciate if you could tell me where/how to start the investigation.

behlendorf commented 8 years ago

@jxiong thanks for the quick follow up. Since you have a crash dump I'd start be carefully inspecting the zio (including its children) which the ll_ost01_014 thread is blocked on. You should be able to determine what stage of the pipeline it's in and why it's waiting. One of the pipeline threads should be completing the IO, is it z_rd_int_5?

Have you ruled out hardware problems on this system? Real drives can misbehave in the strangest ways. You might be able to get some insight in by looking at the zpool events log. Asking smartctl to dump the error counters from the drive isn't a bad idea either. Obviously check the console logs too. Maybe things are just slow due to hardware.

I noticed the pool in question has a low txg number which indicates it's a new pool. Does this mean your able to easily reproduce this issue? If so there's some debugging you could enable which might help shed some light on the issue. The latest SPL code has added a a new /proc/spl/task-all kstat which gives you an easy way to look at everything in the taskqs. That may provide some insight.

I know you've identified ca0bf58 as the cause but how sure of that are you? This looks like some kind of issue in the IO pipeline unrelated to those changes. At least I didn't see anything in the backtraces to implicate it aside from how it might have changed the workload.

jxiong commented 8 years ago

When this crash dump was collected the OST ll_ost01_014 has already timedout, but I found another OST thread ll_ost00_008 that was doing the similar thing. The ZIO was in ZIO_STAGE_VDEV_IO_STARTstate, as follows:

crash> zio 0xffff8803fa46fad0
  ...
  io_flags = (ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_CACHE | ZIO_FLAG_DONT_QUEUE | ZIO_FLAG_DONT_PROPAGATE), 
  io_stage = ZIO_STAGE_VDEV_IO_START, 
  io_pipeline = (ZIO_STAGE_VDEV_IO_START | ZIO_STAGE_VDEV_IO_DONE | ZIO_STAGE_VDEV_IO_ASSESS | ZIO_STAGE_CHECKSUM_VERIFY | ZIO_STAGE_DONE), 
  io_orig_flags = (ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE), 
  io_orig_stage = ZIO_STAGE_READY, 
  io_orig_pipeline = (ZIO_STAGE_VDEV_IO_START | ZIO_STAGE_VDEV_IO_DONE | ZIO_STAGE_VDEV_IO_ASSESS | ZIO_STAGE_CHECKSUM_VERIFY | ZIO_STAGE_DONE), 
  ...

I believe the disk I/O is not yet finished, because the executor of this ZIO is the OST thread:

crash> zio.io_executor 0xffff8803fa46fad0
  io_executor = 0xffff880ffa927520
crash> kthread_t 0xffff880ffa927520 |grep comm
  comm = "ll_ost00_008\000\000\000",

Then I investigated the ZIO in the head of active queue:

crash> zio.io_stage 0xffff880ffb77a050
  io_stage = ZIO_STAGE_VDEV_IO_START
crash> zio.io_executor 0xffff880ffb77a050
  io_executor = 0xffff881031deaab0
crash> kthread_t.comm 0xffff881031deaab0
  comm = "z_rd_int_6\000\000\000\000\000"

The disk I/O of this ZIO must have been finished. However, the stage of this is still on ZIO_STAGE_VDEV_IO_START but it had ever been scheduled by zio_interrupt() and zio_execute() must have executed because io_executor is assigned in it. I would expect the stage should be moved to ZIO_STAGE_VDEV_IO_DONE but for unknown reason it didn't.

I'm pretty sure the hardware should be okay because we ran zfs-0.6.4-92-gb9541d6 for many hours and never hit the same issue. Usually we can hit this issue within half an hour when running a problematic revision.

I didn't have a copy of git bisect log, but I do have a note by myself:

0.6.4-54_gf1512ee - good

0.6.4-65_g5dc8b73 - good

0.6.4-76-g87abfcb - good

0.6.4-87-g7807028 - good

0.6.4-92-gb9541d6 - good

zfs-0.6.4-93-gca0bf58 <- the bad guy

0.6.4-95-gc5528b9 - bad with new issues - dmu_tx_wait() stuck

0.6.4-98_g06358ea - bad

0.6.4-143_g411bf20 - bad

0.6.5 - bad

A lot of labor work, sigh.

kernelOfTruth commented 8 years ago

I wonder if the issue would be mitigated by the changes in

Illumos 5960, 5925 https://github.com/zfsonlinux/zfs/commit/fcff0f35bd522076bdda7491c88a91cc0aa531a3 & Illumos 5987 - zfs prefetch code needs work https://github.com/zfsonlinux/zfs/commit/7f60329a261bd48558a498fb10e9b29638eab33b

there were several modifications related to ZIO_PRIORITYSYNC*

behlendorf commented 8 years ago

@jxiong I've definitely been down this exact rabbit hole before and we've found a fixed a few rare issues. One such problem was addressed by commit aa159afb568fff01bb9f427c16b406c5d29664c7 but that fix is part of 0.6.5.3 and latter so if you reproduced the issue on 0.6.5.4 that's not it.

I'm just speculating here without too much information. But it sounds like either zio_interrupt() never ran for that zio in which case you should be able to still find it in the vq_active_tree tree. If you're able to easily recreate the issue you could hot wire spa_deadman() to more aggressively look for slow outstanding IOs. Say have it poll every 1s or even 100ms.

Or possibly zio_execute() is mistakenly re-dispatching a zio repeatedly and never making forward progress on it? That shouldn't be possible but maybe something was overlooked. If you were able to watch the read taskqs something like this might be clear. You could easily cherry pick that taskq kstat, zfsonlinux/spl@200366f23f1a16874d78a07936c5a33f2d488022, to get a little more visibility.

jxiong commented 8 years ago

@kernelOfTruth Those patch are not landed to zfs-0.6.5-release but they are in master. I will check them and give them a try. Thanks.

kernelOfTruth commented 8 years ago

@jxiong if you test out these commits on a testing box, make sure to also include

https://github.com/zfsonlinux/zfs/commit/23de906c72946da56a54f75238693e186d44d6e2 Illumos 5745 - zfs set allows only one dataset property to be set at a time

and

https://github.com/zfsonlinux/zfs/commit/37f8a8835a88da6122e2526d6aaeeef75556a7bd Illumos 5746 - more checksumming in zfs send

otherwise the previously mentioned commits (related to prefetch) won't apply cleanly

alternative would be to simply test master in a dedicated testing box

behlendorf commented 8 years ago

@jxiong I definitely want to get to the bottom on this, do you have a test setup where your able to reproduce the issue fairly easily? I wouldn't mind picking through the crash dumps myself if they're still available.

jxiong commented 8 years ago

I would like to update the status of this problem - we tried the ZFS master revision with the fixes @kernelOfTruth mentioned above, but unfortunately it didn't help.

@behlendorf Yes, this problem can be reproduced on Hyperion really easily. Most of the time we can start seeing this problem in several minutes after test is launched. Sure, I will upload a crash dump to Dropbox soon and update this ticket again. I have crash dumps for both 'good' and 'bad' revision.

jxiong commented 8 years ago

@behlendorf Please check the crashdump at: https://www.dropbox.com/s/23my4xnpahiibgk/zfs-reproduced.tar?dl=0, thanks.

behlendorf commented 8 years ago

@jxiong thanks for posting the crash dump I've spent the morning digging through it. It contains some really interesting insights in to Lustre's behavior. I suspect you were on the right path investigating the vdev queue.

What I found confirms your observation above that the last serviced I/O for that vdev took a little over one second to complete. On the surface that sounds pretty bad, but as you noted there were 507 pending zio's in the sync read class (all other classes were empty). Since sync reads are treated in a strictly FIFO fashion that implies the disk was running at about ~500 read IOPs. Not too shabby.

This means that any on-demand reads which are issued are going to have to pay a 1 second latency. Based on the other stack traces it looks like the vast majority of those sync reads are coming from the ll_ost_io threads. I count roughly 224 of them actively waiting on a read in the crash dump.

As you pointed out this definitely isn't a deadlock. It seems to be more a matter of ofd_precreate_objects getting throttled by all the serial on-demand reads it's issuing. Take a look at the ll_ost00_008 stack and you'll see it's waiting on a read to bring in a zap block. Once that finishes it will move on to the next fid and likely perform another on-demand read, and another, and another, before completing the rpc. That's almost certainly why you're exceeding the higher level Lustre timeout and evicting the client.

PID: 27994  TASK: ffff880ffa927520  CPU: 2   COMMAND: "ll_ost00_008"
 #0 [ffff880ffb057610] schedule at ffffffff81538d90
 #1 [ffff880ffb0576e8] io_schedule at ffffffff815395d3
 #2 [ffff880ffb057708] cv_wait_common at ffffffffa05b961f [spl]
 #3 [ffff880ffb057788] __cv_wait_io at ffffffffa05b96b8 [spl]
 #4 [ffff880ffb057798] zio_wait at ffffffffa06d67fb [zfs]
 #5 [ffff880ffb0577c8] dbuf_read at ffffffffa061bcde [zfs]
 #6 [ffff880ffb057888] dmu_buf_hold at ffffffffa0624158 [zfs]
 #7 [ffff880ffb0578c8] zap_get_leaf_byblk at ffffffffa0696225 [zfs]
 #8 [ffff880ffb057938] zap_deref_leaf at ffffffffa06964a1 [zfs]
 #9 [ffff880ffb057978] fzap_add_cd at ffffffffa0698baa [zfs]
#10 [ffff880ffb057a28] fzap_add at ffffffffa0698d13 [zfs]
#11 [ffff880ffb057a68] zap_add at ffffffffa069d1b5 [zfs]
#12 [ffff880ffb057ad8] osd_object_create at ffffffffa1061d29 [osd_zfs]
#13 [ffff880ffb057b78] ofd_precreate_objects at ffffffffa11a1260 [ofd]
#14 [ffff880ffb057c58] ofd_create_hdl at ffffffffa119083e [ofd]
#15 [ffff880ffb057d48] tgt_request_handle at ffffffffa0c774cc [ptlrpc]
#16 [ffff880ffb057da8] ptlrpc_main at ffffffffa0c1eb41 [ptlrpc]
#17 [ffff880ffb057ee8] kthread at ffffffff810a0fce
#18 [ffff880ffb057f48] kernel_thread at ffffffff8100c28a

This behavior has always existed in Lustre but my guess is that ca0bf58 perturbed things to make this case worse for Lustre. Maybe by reducing contention in the ARC all those Lustre IO threads are now able to drive the disk more so ofd_prefetch_object() sees higher individual latency. The ARC change might also have caused your cache hit rate to drop, maybe you're going to disk more now.

The good news is that we might be able to fix this fairly easily. In our ideal world what we want is for all those zap_add calls to be cache hits. Now since we know in advance what fids are going to be added to the zap in osd_declare_object_create() I suggest we use it as an opportunity to asynchronously fetch the needed zap blocks. This way by the time osd_object_create() gets to a specific fid there's a good chance it's already in the ARC. It also means all the needed zap IO for fids gets started at the very start of the RPC and runs concurrently rather than sequentially as needed.

Here's an untested patch which adds this prefetch logic to Lustre. Can you try applying it and see if it resolves the issue. I would think this might also help improve the maximum create rate on an idle system.

diff --git a/lustre/osd-zfs/osd_object.c b/lustre/osd-zfs/osd_object.c
index 9884333..5f1e85c 100644
--- a/lustre/osd-zfs/osd_object.c
+++ b/lustre/osd-zfs/osd_object.c
@@ -1086,6 +1086,10 @@ static int osd_declare_object_create(const struct lu_env *env,

        /* and we'll add it to some mapping */
        zapid = osd_get_name_n_idx(env, osd, fid, buf);
+
+       /* prefetch any needed zap blocks for this key */
+       zap_prefetch(osd->od_os, zapid, buf);
+
        dmu_tx_hold_bonus(oh->ot_tx, zapid);
        dmu_tx_hold_zap(oh->ot_tx, zapid, TRUE, buf);

An additional thing you could try is increasing the zfs_vdev_sync_read_max_active module option from 10 to say 100. We might be able to get more sync read IOPs out of the storage on Hyperion by increasing how many can be in flight for a vdev.

Finally, I noticed in that crash dump that the threads blocked in dmu_tx_wait were waiting on the dn_notxholds condition variable. This will happen when a tx contains an tx_hold for a dnode which is currently part of the syncing txg. It will need to block until the txg completes and it's signaled it may continue. This seems to be happen a lot for Lustre compared to the ZPL and I suspect it's due to updates to the Lustre quota objects. We're going to look in to it since it would be a drag on performance.

jxiong commented 8 years ago

@behlendorf thanks a lot for looking into this issue.

Since sync reads are treated in a strictly FIFO fashion that implies the disk was running at about ~500 read IOPs. Not too shabby.

Yes, I noticed this too. The IOPS was pretty good for a hard drive. This is why I suspected that this problem may boil down to the fact that ZFS/Lustre needs too many metadata reads before reaching the actual data.

I will apply your patch and see how it goes.

Finally, I noticed in that crash dump that the threads blocked in dmu_tx_wait were waiting on the dn_notxholds condition variable...

Actually this is because txg_sync() was being blocked by a slow dmu_read(), and I guess this will be okay after the read problem is addressed.

This problem could be mitigated significantly after @don-brady 's isolated md allocation class is in place, because we can store metadata on low latency drives.

thanks again.

jxiong commented 8 years ago

@behlendorf

An additional thing you could try is increasing the zfs_vdev_sync_read_max_active module option from 10 to say 100.

Yes, I tried this but it didn't help(I didn't remember the detail though). Also I tried to reduce the number of OSS threads but it didn't help either.

prakashsurya commented 8 years ago

The ARC change might also have caused your cache hit rate to drop, maybe you're going to disk more now.

This wouldn't be a bad thing to try and rule out. I did make a decent amount of changes to the eviction code, and while I tried to keep it roughly the same as before, there's always a chance that I botched it for a specific workload (Lustre does appear to drive ZFS very differently than any ZPL workload I'm aware of). If there's some evidence that I broke the eviction algorithm, I'm more than happy to revisit that code with this specific workload in mind.

jxiong commented 8 years ago

@prakashsurya @behlendorf

Are there any stats you want to collect to identify the ARC behavior in the next run?

jxiong commented 8 years ago

@behlendorf We tried the proposed patch today and it didn't help. @prakashsurya We collected and compared the ARC stats between 'good' and 'bad' revisions, and it turned out that the 'Cache Hit Ratio' are of great difference.

ARC stats without multilisted ARC:

ARC Total accesses:                                     863.95k
        Cache Hit Ratio:                99.68%  861.22k
        Cache Miss Ratio:               0.32%   2.73k
        Actual Hit Ratio:               99.68%  861.22k

        Data Demand Efficiency:         100.00% 369.92k

        CACHE HITS BY CACHE LIST:
          Most Recently Used:           30.96%  266.66k
          Most Frequently Used:         69.04%  594.56k
          Most Recently Used Ghost:     0.00%   0
          Most Frequently Used Ghost:   0.00%   0

        CACHE HITS BY DATA TYPE:
          Demand Data:                  42.95%  369.92k
          Prefetch Data:                0.00%   0
          Demand Metadata:              57.05%  491.30k
          Prefetch Metadata:            0.00%   0

        CACHE MISSES BY DATA TYPE:
          Demand Data:                  0.00%   0
          Prefetch Data:                0.00%   0
          Demand Metadata:              100.00% 2.73k
          Prefetch Metadata:            0.00%   0

With multilisted ARC:

ARC Total accesses:                                     967.83k
        Cache Hit Ratio:                18.54%  179.46k
        Cache Miss Ratio:               81.46%  788.37k
        Actual Hit Ratio:               18.54%  179.46k

        Data Demand Efficiency:         10.47%  868.25k

        CACHE HITS BY CACHE LIST:
          Most Recently Used:           35.70%  64.07k
          Most Frequently Used:         64.30%  115.39k
          Most Recently Used Ghost:     3.77%   6.77k
          Most Frequently Used Ghost:   6.30%   11.30k

        CACHE HITS BY DATA TYPE:
          Demand Data:                  50.64%  90.89k
          Prefetch Data:                0.00%   0
          Demand Metadata:              49.36%  88.58k
          Prefetch Metadata:            0.00%   0

        CACHE MISSES BY DATA TYPE:
          Demand Data:                  98.60%  777.36k
          Prefetch Data:                0.00%   0
          Demand Metadata:              1.40%   11.01k
          Prefetch Metadata:            0.00%   0

Only 20% cache hit ratio in the latter case, while it was ~100% before the patch is introduced.

Is there any further information we need to collect to identify the issue?

prakashsurya commented 8 years ago

That's concerning. I need to think about this some more. Is the workload described in the comments above in any detail? If not, can you explain the workload to help me try to understand what might be going wrong? Is this a Lustre OSS or MDS? Do you think this behavior could be exhibited outside of Lustre; is there a ZPL workload that we could apply to reproduce this?

At a glance, the demand data misses went from 0 to over 700K?! yikes!

I think it'd be great to get historical log of the arcstats information from "before" and "after"; e.g.

while true;
    cat /proc/sys/spl/kstat/zfs/arcstats > arcstats-`date +%s`.log
    sleep 60
done

It should be possible to take those dumps and write a script to generate graphs for the various fields to see how the ARC is behaving over time. I've used that method to track down a mis-behaving ARC in the past, although it can be time consuming and I don't have those graph generation scripts anymore (sigh, I knew I should have pushed to get those merged into the contrib directory).

It might also be interesting to see which objects/blocks we're missing on, and which blocks are being evicted from the caches; although that might be more difficult to gather. I vaguely remember there being a ZOL specific kstat that was introduced when I was at LLNL to gather info about the blocks that were being evicted, but I can't recall what that was called off the top of my head (and there might be a better way to get that information now, using eBPF or tracepoints).

There's also the DTRACE_PROBE1(arc__delete, arc_buf_hdr_t *, hdr); and DTRACE_PROBE1(arc__evict, arc_buf_hdr_t *, hdr); probes that were intended to be used to get information about the headers being evicted, but I'm not sure how to leverage those on Linux (i.e. without dtrace).

dweeezil commented 8 years ago

You can only use the dtrace probes if you've enabled debugfs and tracing in the kernel and have set license to GPL in META. If you have then... [EDIT: oops, should be ">>" in the second one]

# echo zfs_arc__evict > /sys/kernel/debug/tracing/set_event
# echo zfs_arc__delete >> /sys/kernel/debug/tracing/set_event
# cat /sys/kernel/debug/tracing/trace
     arc_reclaim-225   [001] .... 88477.973235: zfs_arc__evict: hdr { dva 0x10000003a:0x534b3375 birth 20040026 flags 0x80160 datacnt 0 type 4294967295 size 29696 spa 8942987173442330154 state_type 4 access 4297671799 mru_hits 1 mru_ghost_hits 0 mfu_hits 5 mfu_ghost_hits 0 l2_hits 0 refcount 0 }
     arc_reclaim-225   [001] .... 88477.973237: zfs_arc__evict: hdr { dva 0x10000003f:0x5348dfd2 birth 20039918 flags 0x80160 datacnt 0 type 2165127836 size 32256 spa 8942987173442330154 state_type 4 access 4297677674 mru_hits 1 mru_ghost_hits 0 mfu_hits 3 mfu_ghost_hits 0 l2_hits 0 refcount 0 }
     arc_reclaim-225   [001] .... 88477.973239: zfs_arc__evict: hdr { dva 0x10000004a:0x5348e107 birth 20039918 flags 0x80160 datacnt 0 type 4294967295 size 37888 spa 8942987173442330154 state_type 4 access 4297677872 mru_hits 1 mru_ghost_hits 0 mfu_hits 9 mfu_ghost_hits 0 l2_hits 0 refcount 0 }
     arc_reclaim-225   [001] .... 88477.973241: zfs_arc__evict: hdr { dva 0x100000100:0x534527a6 birth 20039833 flags 0x80120 datacnt 0 type 8358 size 131072 spa 8942987173442330154 state_type 4 access 4297819640 mru_hits 1 mru_ghost_hits 0 mfu_hits 0 mfu_ghost_hits 0 l2_hits 0 refcount 0 }
...

As to the actual problem being reported in this issue, I've got no idea at the moment and will have to read this entire thread in its entirety.

@jxiong It might be worth running perf top -ag while the problem is happening. You mention "server load" in the original posting. Is this load as reported by "top" or "w"? If so, under Linux that is often simply the count of the dormant processes ("D" state in ps). It would be interesting to see whether there's lots of time being spent in spinlocks and/or mutexes.

I've posted this without having read the entire thread so there's a chance I've missed something. I mainly wanted to point out how to use the tracepoints.

prakashsurya commented 8 years ago

Ah, right! I almost forgot that we're converting the static dtraces probes to linux tracepoints. I also need to read through all the comments in this thread in detail, too; I'll try and do that tomorrow (my brain needs some rest before I can think clear enough to tackle this one).

jxiong commented 8 years ago

@prakashsurya This is SWL workload on Hyperion. The nodes in question were OSS nodes. We've never tried to reproduce it through ZPL. I will try to collect the historic information as you mentioned and draw a graph, but it would take time. We collected arcstats for good and bad run, please take a look:

iws20.arcstats.bad.txt iws20.arcstats.good.txt

For the other linux specific on ZoL, let's wait for @behlendorf 's comment.

@dweeezil The server load was measured by uptime. In this context, the server load high would mean lots of threads are waiting for disk I/O to finish. I had run `perf top' but it seemed normal. There is a crashdump that was collected at the time when this problem was happening at: https://www.dropbox.com/s/23my4xnpahiibgk/zfs-reproduced.tar?dl=0, please take a look if you're interested.

jxiong commented 8 years ago

BTW, does 'CACHE MISSES BY Demand Data' mean the data was not in ARC when it was requested? and the corresponding 'By Prefetch Data' means the data was not in cache when it was prefetched?

dswartz commented 8 years ago

I believe so.

jxiong notifications@github.com wrote:

BTW, does 'CACHE MISSES BY Demand Data' mean the data was not in ARC when it was requested? and the corresponding 'By Prefetch Data' means the data was not in cache when it was prefetched?

— Reply to this email directly or view it on GitHub.

prakashsurya commented 8 years ago

BTW, does 'CACHE MISSES BY Demand Data' mean the data was not in ARC when it was requested? and the corresponding 'By Prefetch Data' means the data was not in cache when it was prefetched?

Those values correspond to the counters that are incremented in arc_read using the following snippet:

3960                 /*                                                              
3961                  * At this point, we have a level 1 cache miss.  Try again in   
3962                  * L2ARC if possible.                                           
3963                  */                                                             
3964                 ASSERT3U(hdr->b_size, ==, size);                                
3965                 DTRACE_PROBE4(arc__miss, arc_buf_hdr_t *, hdr, blkptr_t *, bp,  
3966                     uint64_t, size, zbookmark_phys_t *, zb);                    
3967                 ARCSTAT_BUMP(arcstat_misses);                                   
3968                 ARCSTAT_CONDSTAT(!HDR_PREFETCH(hdr),                            
3969                     demand, prefetch, !HDR_ISTYPE_METADATA(hdr),                
3970                     data, metadata, misses);

As you can see, we have a DTRACE_PROBE here if we really want to dig into exactly which blocks we're missing on.

I'm looking over the arcstat files that you posted, and I'm still a little uncertain what to make out of them, but I think there's a lot of good information here (just need to think a little bit more). So far, here's what strikes me as interesting.....

Let's look at a few particular fields that (I think) are helping me build a picture about what might be happening:

good: c_max                           4    33524654080                          
bad:  c_max                           4    33524654080                          

good: c                               4    33524654080                          
bad:  c                               4    33524654080                          

good: size                            4    114578192                            
bad:  size                            4    599604416                            

good: p                               4    16762327040                          
bad:  p                               4    22557525504                          

good: mru_size                        4    49120256                             
good: mru_evict_data                  4    23855104                             
good: mru_evict_metadata              4    15823872                             
good: mru_ghost_size                  4    0                                    
good: mru_ghost_evict_data            4    0                                    
good: mru_ghost_evict_metadata        4    0                                    
good: mfu_size                        4    167936                               
good: mfu_evict_data                  4    0                                    
good: mfu_evict_metadata              4    167936                               
good: mfu_ghost_size                  4    0                                    
good: mfu_ghost_evict_data            4    0                                    
good: mfu_ghost_evict_metadata        4    0                                    

bad:  mru_size                        4    582610944                            
bad:  mru_evictable_data              4    571080704                            
bad:  mru_evictable_metadata          4    6627328                              
bad:  mru_ghost_size                  4    34395136                             
bad:  mru_ghost_evictable_data        4    0                                    
bad:  mru_ghost_evictable_metadata    4    34395136                             
bad:  mfu_size                        4    32768                                
bad:  mfu_evictable_data              4    0                                    
bad:  mfu_evictable_metadata          4    32768                                
bad:  mfu_ghost_size                  4    3902410752                           
bad:  mfu_ghost_evictable_data        4    3902275584                           
bad:  mfu_ghost_evictable_metadata    4    135168

OK, so if we look at this, c in both cases appears to be unmodified from the original value that gets set at boot (set to c_max). That tells me there was no memory pressure on the system that would have cause that to drop (in either runs); which is good.

Additionally, size is considerably less than c, so there should be no evictions from the ARC in either runs. We can cross check this with the sizes reported for the mru_ghost and mfu_ghost lists (i.e. mru_ghost_size and mfu_ghost_size should be 0).

If we look at the "good" case, mru_ghost_size and mfu_ghost_size are both 0; that's good and consistent with my expectations. Unfortunately, in the "bad" case, the values are non-zero. There's roughly 34MB on the mru_ghost list, and 3902MB on the mfu_ghost list. These lists are only populated when we perform eviction from the ARC, and we should not be evicting buffers when size is lower than the target ARC size of c. So, my question now is, why are we evicting headers in the "bad" case?

If we also look at p, it appears it hasn't been modified in the "good" run which is expected since that value will only be updated when we have "ghost hits" (i.e. hits on either of the mru_ghost or mfu_ghost lists), and since those lists are empty in the "good" run, we can't get any hits on those. In the "bad" case, though, the ghost lists are non-empty, so we must be getting hits on them. We can double check this with a few more stats from those logs:

good: mru_hits                        4    263330                               
good: mru_ghost_hits                  4    0                                    
good: mfu_hits                        4    582023                               
good: mfu_ghost_hits                  4    0                                    

bad:  mru_hits                        4    672114                               
bad:  mru_ghost_hits                  4    98101                                
bad:  mfu_hits                        4    547677                               
bad:  mfu_ghost_hits                  4    134042

Just as expected, zero ghost list hits in the "good" case, and non-zero ghost list hits in the "bad" case.

Additionally, we can see from the deleted field:

good: deleted                         4    0                                    
bad:  deleted                         4    1331794

that headers are being evicted from the ghost lists in the "bad" case (that field is incremented as headers roll off of the ghost lists).

I think the question we need to answer, is, with the size nowhere near the value of c, why are we evicting headers from the ARC in the "bad" case? I have a feeling that is what is leading to all of the disk activity; with the timeouts merely being a symptom of the root cause, the terrible cache hit ratio.

What do folks think? Does that sounds reasonable? Am I overlooking anything?

behlendorf commented 8 years ago

@prakashsurya I think you've definitely put you're finger on the problem. Why are we evicting at all from the ARC in the "bad" case?

We need to identify the call path where this is happening. I spent some time carefully reviewing the patch and the ARC code and I wasn't able to identify a probable cause. To understand why I think we may need a clue from either an ftrace call graph or perhaps some additional debug kstats.

The good news is that it looks like we can initially rule out memory pressure as the cause. Not only for the reasons @prakashsurya mentioned but because all of the other memory pressure related kstats don't indicate any issue. That helps exclude some ZoL specific changes in places like arc_available_memory().

There are two other notable related differences between illumos and Linux which we should rule out. Neither one looks obviously like the cause to me but they're in the roughly the right area and it may help us narrow down the issue or result in some new insight. Plus they're easy to test.

arc_adjust_meta() - An alternate version of this function was implemented for Linux to guarantee forward progress when evicting meta data. See the comment above arc_adjust_meta_balanced for all the details. However, the original version was kept and can be used by setting the zfs_arc_meta_strategy=0 module option. It would be helpful to see if changing this modified the behavior.
arc_c_min - Is set to a much lower value (32M). This uncovered at least one corner case which was fixed, 1b8951b319b7bcddfe0cc2408b7717f9ce3d758b, and we should really upstream. Again I don't see any real problems here for Lustre but it's in the right general area and would be easy to increase and verify this isn't related.

@jxiong I'm still hopeful that little Lustre prefetch patch might help after we resolve this much bigger problem!

jxiong commented 8 years ago

@prakashsurya we collected the trace log as you mentioned, please check it here: https://www.dropbox.com/sh/nrdhrwoylk2c9c2/AACid-KDadAdi4gRf2R8Bfjwa?dl=0 Thanks.

@behlendorf We tried to set the module option zfs_arc_meta_strategy=0 but it didn't help. Yes, I will land your patch later. Thanks.

thegreatgazoo commented 8 years ago

Haven't read all the comments, but could it be the case that "size" got high enough at one point to trigger eviction and then it fell down afterwords?

ahrens commented 8 years ago

iws20.arcstats.bad.txg shows:

memory_direct_count             4    0
memory_indirect_count           4    0
arc_no_grow                     4    0

This should mean that we never experienced memory pressure and therefore never evicted anything. However, we also see:

deleted                         4    1331794
mru_ghost_evictable_metadata    4    34395136
mfu_ghost_evictable_data        4    3902275584
mfu_ghost_evictable_metadata    4    135168

Which should mean that we did evict some buffers due to memory pressure.

So something here is not as it appears to be.

behlendorf commented 8 years ago

@thegreatgazoo after getting on Hyperion and running the SWL I wasn't able to reproduce the poor ARC hit rate problem described above. With zfs-0.6.5.5 I was observing a hit rate of ~99% which is more consistent with what we've seen historically. However, I did observe some other issues which could help explain what's going on.

iws29@behlendo:arc_summary.py -p2

------------------------------------------------------------------------
ZFS Subsystem Report                Mon Apr 11 15:51:00 2016
ARC Total accesses:                 488.19m
    Cache Hit Ratio:        99.63%  486.39m
    Cache Miss Ratio:       0.37%   1.80m
    Actual Hit Ratio:       99.63%  486.39m

    Data Demand Efficiency:     99.74%  370.42m

The first thing I looked at was /proc/spl/kstat/zfs/dmu_tx on the OSTs, this kstat was added to give us visibility in to how TXs were being assigned. According to the counters we're hitting the dmu_tx_dirty_delay and dmu_tx_dirty_over_max limits frequently causing us to throttle TX assignments (correctly) due to dirty data in the ARC.

iws3@behlendo:cat /proc/spl/kstat/zfs/dmu_tx 
5 1 0x01 11 528 79180825711 635150490201115
name                            type data
dmu_tx_assigned                 4    36091633
dmu_tx_delay                    4    0
dmu_tx_error                    4    0
dmu_tx_suspended                4    0
dmu_tx_group                    4    1998
dmu_tx_memory_reserve           4    0
dmu_tx_memory_reclaim           4    0
dmu_tx_dirty_throttle           4    0
dmu_tx_dirty_delay              4    20897360
dmu_tx_dirty_over_max           4    7035939
dmu_tx_quota                    4    0

Digging a little deeper we can look at the /proc/spl/kstat/zfs/unstab-ost2/dmu_tx_assign histogram from the pool and see that we're delaying lots of TXs by over 2 seconds. That suggests we've got a ton of dirty data in the ARC and it isn't getting flushed quickly to disk.

iws3@behlendo:cat /proc/spl/kstat/zfs/unstab-ost2/dmu_tx_assign 
11 1 0x01 32 1536 474228994344 635617604649871
name                            type data
1 ns                            4    0
2 ns                            4    0
4 ns                            4    0
8 ns                            4    0
16 ns                           4    0
32 ns                           4    0
64 ns                           4    0
128 ns                          4    86230
256 ns                          4    19421
512 ns                          4    6838
1024 ns                         4    192
2048 ns                         4    38
4096 ns                         4    14
8192 ns                         4    8
16384 ns                        4    20
32768 ns                        4    5
65536 ns                        4    11
131072 ns                       4    546
262144 ns                       4    1516
524288 ns                       4    1993
1048576 ns                      4    2057
2097152 ns                      4    2173
4194304 ns                      4    2205
8388608 ns                      4    3048
16777216 ns                     4    4638
33554432 ns                     4    4229
67108864 ns                     4    7208
134217728 ns                    4    12391
268435456 ns                    4    19573
536870912 ns                    4    26709
1073741824 ns                   4    19798
2147483648 ns                   4    48327

The TXG history in /proc/spl/kstat/zfs/unstab-ost2/txgs confirms this, there are TXG sync times of over 120 seconds while the SWL was running. The TXGs had roughly 7GB of dirty data in them and due to the relatively slow disk it takes a long time to flush. Under Linux the maximum amount of dirty data in the ARC is limited to 25% (not strictly 4G).

I manually decreased zfs_dirty_data_max to 1G which cut the TXG sync times down to roughly 30s. That's still slow but seems to be fast enough that we no longer appear to be hitting the Lustre timeout. We could definitely go lower and I don't think it'll hurt performance since the disk was saturated. I'd also suggest setting the block size to 1M, particularly for this configuration where it's running on top of an existing RAID setup. That's really limiting the total IOPs.

My impression was that the ARC dirty size was pretty badly mismatched with the actual disk performance. It seems plausible that we were getting away with this for the most part up until the ARC multilist improvements were made. The increased concurrency and large number of Lustre threads may have pushed us over a tipping point with respect to the Lustre timeouts.

I've left the SWL running with a 1G dirty size to see if we're able to reproduce the issue. If possible it would be great to let this run for a while.

iws3@behlendo:cat /sys/module/zfs/parameters/zfs_dirty_data_max 
1073741824

We most likely haven't seen this at Livermore for two reasons. 1) our storage hardware can deliver many more IOPs so even large TXGs are written quickly, and 2) we've updated Lustre to not block on a txg sync in osd_trans_stop(), see the LU-4009 patch.

jxiong commented 8 years ago

@behlendorf just to make it clear, you couldn't see this problem after you set zfs_dirty_data_max to 1GB, correct?

Probably the I/O was throttled by the ARC lock contention in previous versions. I would suggest to investigate the performance difference between 0.6.4 and 0.6.5.5 with the new zfs_dirty_data_max setting.

This is a piece of wonderful work - I tried to reduce the OSS workload by reducing the number of OSS threads but it didn't help.

behlendorf commented 8 years ago

@jxiong correct, after I reduced the zfs_dirty_data_max to 1GB I wasn't able to reproduce the evictions. Some of the service times we still a longer than I'd like, so you might consider going even lower to ensure the TXGs keep rolling along. Any sync callers of osd_trans_stop() will end up blocking on the TXG until we have ZIL support so we need to keep the TGX time small.

It would be great if you could independently verify this fixes the issue.

jxiong commented 8 years ago

We verified that the problem can't be reproduced after setting zfs_dirty_data_max to 1GB. Thanks Brian.

behlendorf commented 8 years ago

@jxiong awesome. I think we still want to completely explain why the delays injected by dmu_tx_assign() were not enough to throttle back Lustre but we can do that in a different issue. If you're happy we've gotten to the root cause here please go ahead and close out the issue.

jxiong commented 8 years ago

@behlendorf Probably the delay inject worked and it caused Lustre I/O took really long time to complete, and at last it caused I/O timeout.

jxiong commented 8 years ago

Just a second thought, let's keep this issue open for at least 24 hours in case people want to comment from a different time zone.

prakashsurya commented 8 years ago

My impression was that the ARC dirty size was pretty badly mismatched with the actual disk performance.

If we had a mechanism in the code to dynamically tune the maximum amount of dirty data, would that be the ideal way to solve this problem? And if so, how would it be tuned?

e.g. Maybe if the TXG takes longer than X seconds to sync, the max amount allowed is decreased. If it takes less than X seconds to sync, the max amount is increased. At some point, the max amount should reach some equilibrium point and maintain roughly the same value, right? (this is a drastic oversimplification, but I think it gets the idea across)

If I'm understanding correctly, the "problem" is TXG sync times sometimes takes "too long"? and by reducing the amount of dirty data that can accumulate, that ensures the TXG sync times are "sufficiently small enough" to prevent these timeouts?

This sort of also feels like Lustre is to blame, in that it doesn't apply sufficient back-pressure to the clients. i.e. Instead of telling the clients to slow down by some percentage based on how fast the server can write out the data (e.g. perhaps injecting small artificial delays into the clients' requests), the server allows clients to stream as much data as they can (as fast as they can) up until a time out occurs. Simply telling ZFS to write less data every TXG feels like an ugly workaround to a problem that is more fundamental to the client/server architecture of Lustre.

Obviously I don't have a stake in this, so feel free to ignore me as it pertains to fixing the timeouts; I'm just genuinely curious to learn more and/or more clearly understand the root cause, as this feels more like a systems architecture issue vs. the more mundane defects that usually crop up.

behlendorf commented 8 years ago

If we had a mechanism in the code to dynamically tune the maximum amount of dirty data, would that be the ideal way to solve this problem? And if so, how would it be tuned?

I was thinking along the same lines. This is something ZFS should be able to handle more gracefully. It's not at all clear to me that Lustre is behaving particularly badly. Having a large number of threads all writing may simply be able to overwhelm the default throttle settings. It would be interesting to see the behavior when 1000 threads are all writing through the ZPL.

Just to throw an idea out there as possible solution. We could start dynamically ratcheting up the zfs_delay_scale when txg_sync times start exceeding the zfs_txg_timeout. This is an indicator that dmu_tx_assign() isn't being aggressive enough to keep the system in equilibrium. In an ideal world we should be able to keep the TXG sync times close to zfs_txg_timeout.

Other ideas welcome, this was just the first thing which occurred to me and I haven't done a careful analysis. This feels to me like a systems architecture issue as well. The throttle shouldn't allow this to happen.

ahrens commented 8 years ago

@behlendorf I agree that zfs_delay_scale is the right place to start. I'm pretty sure the default is not great for systems with high performance and high number of threads calling zfs_write (or equivalent funcs that create dmu transactions). You want to avoid the dirty data ever getting close to 100%.

Of course, I have a bunch of dtrace scripts for monitoring this, which doesn't help you on Linux :-(

behlendorf commented 8 years ago

Of course, I have a bunch of dtrace scripts for monitoring this, which doesn't help you on Linux :-(

We might be able to adapt them to systemtap or add something as a kstat. Specifically what are you monitoring with the dtrace scripts?

ahrens commented 8 years ago

@behlendorf Easiest to explain by showing. See https://gist.github.com/ahrens/ca0b0c1009e873c80d53467749085ec2 and https://gist.github.com/ahrens/7e110344c9732d22e143b48e8050923a