openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs
Other
10.65k stars 1.75k forks source link

msync occasionally takes several seconds (up to ~15 seconds) #12662

Closed shaan1337 closed 2 years ago

shaan1337 commented 3 years ago

System information

Type Version/Name
Distribution Name Ubuntu
Distribution Version 20.04
Kernel Version 5.11.0-1018-gcp (on a Google Cloud VM)
Architecture x86_64
OpenZFS Version zfs-0.8.3-1ubuntu12.12, zfs-kmod-2.0.2-1ubuntu5.1

Describe the problem you're observing

When running EventStoreDB (https://github.com/EventStore/EventStore) on a Google Cloud VM, we occasionally see writes taking several seconds in the logs (it happens 1-2 times per day with a constant write load):

{"@t":"2021-10-19T10:51:11.5248104Z","@mt":"SLOW QUEUE MSG [{queue}]: {message} - {elapsed}ms. Q: {prevQueueCount}/{curQueueCount}.","@l":"Debug","queue":"StorageWriterQueue","message":"WritePrepares","elapsed":1574,"prevQueueCount":0,"curQueueCount":0,"SourceContext":"EventStore.Core.Bus.QueuedHandlerMRES","ProcessId":1100082,"ThreadId":10}

In the above case, it took ~1.5 seconds to complete a write but sometimes it can be up to ~15 seconds.

After doing an strace, I've come to the conclusion that msync is the culprit. We have a small memory-mapped file (8 bytes large) which is updated and flushed very regularly (every ~2 milliseconds).

After doing an ftrace on msync (__x64_sys_msync) I've got the following output which shows that the delay is incurred in wait_on_page_writeback:

 StorageWriterQu-1100154 [000] 1634640669.215961: funcgraph_entry:                   |  __x64_sys_msync() {
 StorageWriterQu-1100154 [000] 1634640669.215962: funcgraph_entry:                   |    down_read() {
 StorageWriterQu-1100154 [000] 1634640669.215963: funcgraph_entry:                   |      _cond_resched() {
 StorageWriterQu-1100154 [000] 1634640669.215963: funcgraph_entry:        0.355 us   |        rcu_all_qs();
 StorageWriterQu-1100154 [000] 1634640669.215963: funcgraph_exit:         0.775 us   |      }
 StorageWriterQu-1100154 [000] 1634640669.215964: funcgraph_exit:         1.409 us   |    }
 StorageWriterQu-1100154 [000] 1634640669.215964: funcgraph_entry:                   |    find_vma() {
 StorageWriterQu-1100154 [000] 1634640669.215964: funcgraph_entry:        0.186 us   |      vmacache_find();
 StorageWriterQu-1100154 [000] 1634640669.215965: funcgraph_exit:         0.636 us   |    }
 StorageWriterQu-1100154 [000] 1634640669.215965: funcgraph_entry:        0.173 us   |    up_read();
 StorageWriterQu-1100154 [000] 1634640669.215965: funcgraph_entry:                   |    vfs_fsync_range() {
 StorageWriterQu-1100154 [000] 1634640669.215966: funcgraph_entry:                   |      zpl_fsync() {
 StorageWriterQu-1100154 [000] 1634640669.215966: funcgraph_entry:                   |        filemap_write_and_wait_range() {
 StorageWriterQu-1100154 [000] 1634640669.215967: funcgraph_entry:                   |          __filemap_fdatawrite_range() {
 StorageWriterQu-1100154 [000] 1634640669.215967: funcgraph_entry:        0.306 us   |            _raw_spin_lock();
 StorageWriterQu-1100154 [000] 1634640669.215967: funcgraph_entry:        0.199 us   |            wbc_attach_and_unlock_inode();
 StorageWriterQu-1100154 [000] 1634640669.215968: funcgraph_entry:                   |            do_writepages() {
 StorageWriterQu-1100154 [000] 1634640669.215968: funcgraph_entry:      + 78.604 us  |              zpl_writepages();
 StorageWriterQu-1100154 [000] 1634640669.216047: funcgraph_exit:       + 79.316 us  |            }
 StorageWriterQu-1100154 [000] 1634640669.216047: funcgraph_entry:        0.235 us   |            wbc_detach_inode();
 StorageWriterQu-1100154 [000] 1634640669.216048: funcgraph_exit:       + 81.505 us  |          }
 StorageWriterQu-1100154 [000] 1634640669.216048: funcgraph_entry:                   |          __filemap_fdatawait_range() {
 StorageWriterQu-1100154 [000] 1634640669.216048: funcgraph_entry:                   |            pagevec_lookup_range_tag() {
 StorageWriterQu-1100154 [000] 1634640669.216048: funcgraph_entry:        0.430 us   |              find_get_pages_range_tag();
 StorageWriterQu-1100154 [000] 1634640669.216049: funcgraph_exit:         0.783 us   |            }
 StorageWriterQu-1100154 [000] 1634640669.216049: funcgraph_entry:                   |            wait_on_page_writeback() {
 StorageWriterQu-1100154 [000] 1634640669.216049: funcgraph_entry:        0.265 us   |              page_mapping();
 StorageWriterQu-1100154 [000] 1634640669.216050: funcgraph_entry:      $ 1571904 us |              wait_on_page_bit();
 StorageWriterQu-1100154 [001] 1634640670.787956: funcgraph_exit:       $ 1571907 us |            }
 StorageWriterQu-1100154 [001] 1634640670.787957: funcgraph_entry:                   |            __pagevec_release() {
 StorageWriterQu-1100154 [001] 1634640670.787957: funcgraph_entry:      + 11.553 us  |              lru_add_drain_cpu();
 StorageWriterQu-1100154 [001] 1634640670.787969: funcgraph_entry:        0.405 us   |              release_pages();
 StorageWriterQu-1100154 [001] 1634640670.787970: funcgraph_exit:       + 13.001 us  |            }
 StorageWriterQu-1100154 [001] 1634640670.787970: funcgraph_entry:                   |            _cond_resched() {
 StorageWriterQu-1100154 [001] 1634640670.787970: funcgraph_entry:        0.178 us   |              rcu_all_qs();
 StorageWriterQu-1100154 [001] 1634640670.787971: funcgraph_exit:         0.652 us   |            }
 StorageWriterQu-1100154 [001] 1634640670.787971: funcgraph_exit:       $ 1571923 us |          }
 StorageWriterQu-1100154 [001] 1634640670.787971: funcgraph_entry:        0.350 us   |          filemap_check_errors();
 StorageWriterQu-1100154 [001] 1634640670.787972: funcgraph_exit:       $ 1572005 us |        }
 StorageWriterQu-1100154 [001] 1634640670.787972: funcgraph_entry:        0.399 us   |        crhold();
 StorageWriterQu-1100154 [001] 1634640670.787973: funcgraph_entry:                   |        zfs_fsync() {
 StorageWriterQu-1100154 [001] 1634640670.787974: funcgraph_entry:                   |          tsd_set() {
 StorageWriterQu-1100154 [001] 1634640670.787974: funcgraph_entry:                   |            tsd_hash_search() {
 StorageWriterQu-1100154 [001] 1634640670.787974: funcgraph_entry:        0.279 us   |              _raw_spin_lock();
 StorageWriterQu-1100154 [001] 1634640670.787975: funcgraph_exit:         0.749 us   |            }
 StorageWriterQu-1100154 [001] 1634640670.787975: funcgraph_entry:                   |            tsd_hash_search() {
 StorageWriterQu-1100154 [001] 1634640670.787975: funcgraph_entry:        0.255 us   |              _raw_spin_lock();
 StorageWriterQu-1100154 [001] 1634640670.787976: funcgraph_exit:         0.622 us   |            }
 StorageWriterQu-1100154 [001] 1634640670.787976: funcgraph_entry:                   |            tsd_hash_add_pid() {
 StorageWriterQu-1100154 [001] 1634640670.787976: funcgraph_entry:        1.214 us   |              spl_kmem_alloc();
 StorageWriterQu-1100154 [001] 1634640670.787977: funcgraph_entry:        0.186 us   |              _raw_spin_lock();
 StorageWriterQu-1100154 [001] 1634640670.787978: funcgraph_entry:        0.238 us   |              _raw_spin_lock();
 StorageWriterQu-1100154 [001] 1634640670.787978: funcgraph_exit:         2.413 us   |            }
 StorageWriterQu-1100154 [001] 1634640670.787979: funcgraph_entry:                   |            tsd_hash_add() {
 StorageWriterQu-1100154 [001] 1634640670.787979: funcgraph_entry:        0.578 us   |              spl_kmem_alloc();
 StorageWriterQu-1100154 [001] 1634640670.787980: funcgraph_entry:        0.253 us   |              _raw_spin_lock();
 StorageWriterQu-1100154 [001] 1634640670.787980: funcgraph_entry:        1.398 us   |              tsd_hash_search();
 StorageWriterQu-1100154 [001] 1634640670.787982: funcgraph_entry:        0.350 us   |              tsd_hash_search();
 StorageWriterQu-1100154 [001] 1634640670.787982: funcgraph_entry:        0.250 us   |              _raw_spin_lock();
 StorageWriterQu-1100154 [001] 1634640670.787983: funcgraph_exit:         4.591 us   |            }
 StorageWriterQu-1100154 [001] 1634640670.787983: funcgraph_exit:         9.417 us   |          }
 StorageWriterQu-1100154 [001] 1634640670.787984: funcgraph_entry:                   |          rrm_enter_read() {
 StorageWriterQu-1100154 [001] 1634640670.787984: funcgraph_entry:                   |            rrw_enter_read() {
 StorageWriterQu-1100154 [001] 1634640670.787984: funcgraph_entry:        0.850 us   |              rrw_enter_read_impl();
 StorageWriterQu-1100154 [001] 1634640670.787985: funcgraph_exit:         1.398 us   |            }
 StorageWriterQu-1100154 [001] 1634640670.787985: funcgraph_exit:         1.791 us   |          }
 StorageWriterQu-1100154 [001] 1634640670.787986: funcgraph_entry:                   |          zil_commit() {
 StorageWriterQu-1100154 [001] 1634640670.787986: funcgraph_entry:        0.171 us   |            spa_writeable();
 StorageWriterQu-1100154 [001] 1634640670.787987: funcgraph_entry:                   |            zil_commit_impl() {
 StorageWriterQu-1100154 [001] 1634640670.787987: funcgraph_entry:        1.771 us   |              zil_async_to_sync();
 StorageWriterQu-1100154 [001] 1634640670.787989: funcgraph_entry:        1.622 us   |              zil_alloc_commit_waiter();
 StorageWriterQu-1100154 [001] 1634640670.787991: funcgraph_entry:        9.633 us   |              zil_commit_itx_assign();
 StorageWriterQu-1100154 [001] 1634640670.788001: funcgraph_entry:        4.155 us   |              zil_commit_writer();
 StorageWriterQu-1100154 [001] 1634640670.788005: funcgraph_entry:        0.784 us   |              zil_commit_waiter();
 StorageWriterQu-1100154 [001] 1634640670.788006: funcgraph_entry:        0.605 us   |              zil_free_commit_waiter();
 StorageWriterQu-1100154 [001] 1634640670.788007: funcgraph_exit:       + 20.187 us  |            }
 StorageWriterQu-1100154 [001] 1634640670.788007: funcgraph_exit:       + 21.044 us  |          }
 StorageWriterQu-1100154 [001] 1634640670.788007: funcgraph_entry:                   |          rrm_exit() {
 StorageWriterQu-1100154 [001] 1634640670.788008: funcgraph_entry:                   |            rrw_exit() {
 StorageWriterQu-1100154 [001] 1634640670.788008: funcgraph_entry:        0.325 us   |              mutex_lock();
 StorageWriterQu-1100154 [001] 1634640670.788008: funcgraph_entry:        0.263 us   |              __cv_broadcast();
 StorageWriterQu-1100154 [001] 1634640670.788009: funcgraph_entry:        0.279 us   |              _raw_spin_lock();
 StorageWriterQu-1100154 [001] 1634640670.788009: funcgraph_entry:        0.259 us   |              mutex_unlock();
 StorageWriterQu-1100154 [001] 1634640670.788010: funcgraph_exit:         2.051 us   |            }
 StorageWriterQu-1100154 [001] 1634640670.788010: funcgraph_exit:         2.497 us   |          }
 StorageWriterQu-1100154 [001] 1634640670.788010: funcgraph_entry:                   |          tsd_set() {
 StorageWriterQu-1100154 [001] 1634640670.788010: funcgraph_entry:                   |            tsd_hash_search() {
 StorageWriterQu-1100154 [001] 1634640670.788011: funcgraph_entry:        0.184 us   |              _raw_spin_lock();
 StorageWriterQu-1100154 [001] 1634640670.788011: funcgraph_exit:         0.708 us   |            }
 StorageWriterQu-1100154 [001] 1634640670.788011: funcgraph_entry:                   |            tsd_remove_entry() {
 StorageWriterQu-1100154 [001] 1634640670.788012: funcgraph_entry:        0.174 us   |              _raw_spin_lock();
 StorageWriterQu-1100154 [001] 1634640670.788012: funcgraph_entry:        0.175 us   |              _raw_spin_lock();
 StorageWriterQu-1100154 [001] 1634640670.788013: funcgraph_entry:        0.246 us   |              tsd_hash_del.constprop.0();
 StorageWriterQu-1100154 [001] 1634640670.788013: funcgraph_entry:        0.175 us   |              _raw_spin_lock();
 StorageWriterQu-1100154 [001] 1634640670.788013: funcgraph_entry:        0.191 us   |              tsd_hash_del.constprop.0();
 StorageWriterQu-1100154 [001] 1634640670.788014: funcgraph_entry:        1.373 us   |              tsd_hash_dtor();
 StorageWriterQu-1100154 [001] 1634640670.788015: funcgraph_exit:         4.004 us   |            }
 StorageWriterQu-1100154 [001] 1634640670.788016: funcgraph_exit:         5.289 us   |          }
 StorageWriterQu-1100154 [001] 1634640670.788016: funcgraph_exit:       + 42.299 us  |        }
 StorageWriterQu-1100154 [001] 1634640670.788016: funcgraph_entry:        0.204 us   |        crfree();
 StorageWriterQu-1100154 [001] 1634640670.788016: funcgraph_exit:       $ 1572050 us |      }
 StorageWriterQu-1100154 [001] 1634640670.788017: funcgraph_exit:       $ 1572051 us |    }
 StorageWriterQu-1100154 [001] 1634640670.788017: funcgraph_entry:                   |    fput() {
 StorageWriterQu-1100154 [001] 1634640670.788017: funcgraph_entry:        0.437 us   |      fput_many();
 StorageWriterQu-1100154 [001] 1634640670.788018: funcgraph_exit:         0.874 us   |    }
 StorageWriterQu-1100154 [001] 1634640670.788018: funcgraph_exit:       $ 1572058 us |  }

I've done an additional ftrace, but this time only on wait_on_page_bit() and I've got this output:

 StorageWriterQu-1100154 [006] 1634719033.236598: funcgraph_entry:                   |  wait_on_page_bit() {
 StorageWriterQu-1100154 [006] 1634719033.236601: funcgraph_entry:        0.519 us   |    _raw_spin_lock_irq();
 StorageWriterQu-1100154 [006] 1634719033.236602: funcgraph_entry:                   |    io_schedule() {
 StorageWriterQu-1100154 [006] 1634719033.236602: funcgraph_entry:        0.316 us   |      io_schedule_prepare();
 StorageWriterQu-1100154 [006] 1634719033.236603: funcgraph_entry:                   |      schedule() {
 StorageWriterQu-1100154 [006] 1634719033.236603: funcgraph_entry:                   |        rcu_note_context_switch() {
 StorageWriterQu-1100154 [006] 1634719033.236603: funcgraph_entry:        0.379 us   |          rcu_qs();
 StorageWriterQu-1100154 [006] 1634719033.236604: funcgraph_exit:         0.950 us   |        }
 StorageWriterQu-1100154 [006] 1634719033.236604: funcgraph_entry:        0.358 us   |        _raw_spin_lock();
 StorageWriterQu-1100154 [006] 1634719033.236605: funcgraph_entry:        0.419 us   |        update_rq_clock();
 StorageWriterQu-1100154 [006] 1634719033.236606: funcgraph_entry:                   |        psi_task_change() {
 StorageWriterQu-1100154 [006] 1634719033.236606: funcgraph_entry:        0.300 us   |          psi_flags_change();
 StorageWriterQu-1100154 [006] 1634719033.236607: funcgraph_entry:                   |          psi_group_change() {
 StorageWriterQu-1100154 [006] 1634719033.236607: funcgraph_entry:        0.321 us   |            record_times();
 StorageWriterQu-1100154 [006] 1634719033.236608: funcgraph_exit:         1.029 us   |          }
 StorageWriterQu-1100154 [006] 1634719033.236608: funcgraph_entry:                   |          psi_group_change() {
 StorageWriterQu-1100154 [006] 1634719033.236608: funcgraph_entry:        0.288 us   |            record_times();
 StorageWriterQu-1100154 [006] 1634719033.236609: funcgraph_exit:         0.897 us   |          }
 StorageWriterQu-1100154 [006] 1634719033.236609: funcgraph_entry:                   |          psi_group_change() {
 StorageWriterQu-1100154 [006] 1634719033.236610: funcgraph_entry:        0.305 us   |            record_times();
 StorageWriterQu-1100154 [006] 1634719033.236610: funcgraph_exit:         0.883 us   |          }
 StorageWriterQu-1100154 [006] 1634719033.236610: funcgraph_exit:         4.667 us   |        }
 StorageWriterQu-1100154 [006] 1634719033.236611: funcgraph_entry:                   |        dequeue_task_fair() {
 StorageWriterQu-1100154 [006] 1634719033.236611: funcgraph_entry:                   |          dequeue_entity() {
 StorageWriterQu-1100154 [006] 1634719033.236611: funcgraph_entry:                   |            update_curr() {
 StorageWriterQu-1100154 [006] 1634719033.236612: funcgraph_entry:        0.350 us   |              update_min_vruntime();
 StorageWriterQu-1100154 [006] 1634719033.236612: funcgraph_entry:                   |              cpuacct_charge() {
 StorageWriterQu-1100154 [006] 1634719033.236613: funcgraph_entry:        0.299 us   |                rcu_read_unlock_strict();
 StorageWriterQu-1100154 [006] 1634719033.236613: funcgraph_exit:         0.888 us   |              }
 StorageWriterQu-1100154 [006] 1634719033.236614: funcgraph_entry:                   |              __cgroup_account_cputime() {
 StorageWriterQu-1100154 [006] 1634719033.236614: funcgraph_entry:        0.285 us   |                cgroup_rstat_updated();
 StorageWriterQu-1100154 [006] 1634719033.236614: funcgraph_exit:         0.854 us   |              }
 StorageWriterQu-1100154 [006] 1634719033.236615: funcgraph_entry:        0.296 us   |              rcu_read_unlock_strict();
 StorageWriterQu-1100154 [006] 1634719033.236615: funcgraph_exit:         3.909 us   |            }
 StorageWriterQu-1100154 [006] 1634719033.236616: funcgraph_entry:        0.268 us   |            __update_load_avg_se();
 StorageWriterQu-1100154 [006] 1634719033.236616: funcgraph_entry:        0.304 us   |            __update_load_avg_cfs_rq();
 StorageWriterQu-1100154 [006] 1634719033.236617: funcgraph_entry:        0.293 us   |            clear_buddies();
 StorageWriterQu-1100154 [006] 1634719033.236617: funcgraph_entry:        0.284 us   |            update_cfs_group();
 StorageWriterQu-1100154 [006] 1634719033.236618: funcgraph_entry:        0.301 us   |            update_min_vruntime();
 StorageWriterQu-1100154 [006] 1634719033.236618: funcgraph_exit:         7.319 us   |          }
 StorageWriterQu-1100154 [006] 1634719033.236619: funcgraph_entry:                   |          dequeue_entity() {
 StorageWriterQu-1100154 [006] 1634719033.236619: funcgraph_entry:                   |            update_curr() {
 StorageWriterQu-1100154 [006] 1634719033.236619: funcgraph_entry:        0.328 us   |              __calc_delta();
 StorageWriterQu-1100154 [006] 1634719033.236620: funcgraph_entry:        0.300 us   |              update_min_vruntime();
 StorageWriterQu-1100154 [006] 1634719033.236620: funcgraph_exit:         1.487 us   |            }
 StorageWriterQu-1100154 [006] 1634719033.236621: funcgraph_entry:        0.305 us   |            __update_load_avg_se();
 StorageWriterQu-1100154 [006] 1634719033.236621: funcgraph_entry:        0.296 us   |            __update_load_avg_cfs_rq();
 StorageWriterQu-1100154 [006] 1634719033.236622: funcgraph_entry:        0.279 us   |            clear_buddies();
 StorageWriterQu-1100154 [006] 1634719033.236622: funcgraph_entry:                   |            update_cfs_group() {
 StorageWriterQu-1100154 [006] 1634719033.236623: funcgraph_entry:        0.328 us   |              reweight_entity();
 StorageWriterQu-1100154 [006] 1634719033.236623: funcgraph_exit:         0.939 us   |            }
 StorageWriterQu-1100154 [006] 1634719033.236624: funcgraph_entry:        0.287 us   |            update_min_vruntime();
 StorageWriterQu-1100154 [006] 1634719033.236624: funcgraph_exit:         5.516 us   |          }
 StorageWriterQu-1100154 [006] 1634719033.236624: funcgraph_entry:        0.239 us   |          hrtick_update();
 StorageWriterQu-1100154 [006] 1634719033.236625: funcgraph_exit:       + 14.232 us  |        }
 StorageWriterQu-1100154 [006] 1634719033.236625: funcgraph_entry:                   |        __delayacct_blkio_start() {
 StorageWriterQu-1100154 [006] 1634719033.236626: funcgraph_entry:        0.366 us   |          ktime_get();
 StorageWriterQu-1100154 [006] 1634719033.236626: funcgraph_exit:         0.921 us   |        }
 StorageWriterQu-1100154 [006] 1634719033.236627: funcgraph_entry:                   |        pick_next_task_fair() {
 StorageWriterQu-1100154 [006] 1634719033.236627: funcgraph_entry:                   |          newidle_balance() {
 StorageWriterQu-1100154 [006] 1634719033.236627: funcgraph_entry:        0.285 us   |            __msecs_to_jiffies();
 StorageWriterQu-1100154 [006] 1634719033.236628: funcgraph_entry:        0.245 us   |            rcu_read_unlock_strict();
 StorageWriterQu-1100154 [006] 1634719033.236628: funcgraph_exit:         1.419 us   |          }
 StorageWriterQu-1100154 [006] 1634719033.236628: funcgraph_exit:         1.972 us   |        }
 StorageWriterQu-1100154 [006] 1634719033.236629: funcgraph_entry:                   |        put_prev_task_fair() {
 StorageWriterQu-1100154 [006] 1634719033.236629: funcgraph_entry:                   |          put_prev_entity() {
 StorageWriterQu-1100154 [006] 1634719033.236629: funcgraph_entry:        0.278 us   |            check_cfs_rq_runtime();
 StorageWriterQu-1100154 [006] 1634719033.236630: funcgraph_exit:         0.840 us   |          }
 StorageWriterQu-1100154 [006] 1634719033.236630: funcgraph_entry:                   |          put_prev_entity() {
 StorageWriterQu-1100154 [006] 1634719033.236630: funcgraph_entry:        0.281 us   |            check_cfs_rq_runtime();
 StorageWriterQu-1100154 [006] 1634719033.236631: funcgraph_exit:         0.829 us   |          }
 StorageWriterQu-1100154 [006] 1634719033.236631: funcgraph_exit:         2.501 us   |        }
 StorageWriterQu-1100154 [006] 1634719033.236632: funcgraph_entry:                   |        pick_next_task_idle() {
 StorageWriterQu-1100154 [006] 1634719033.236632: funcgraph_entry:                   |          set_next_task_idle() {
 StorageWriterQu-1100154 [006] 1634719033.236632: funcgraph_entry:                   |            __update_idle_core() {
 StorageWriterQu-1100154 [006] 1634719033.236632: funcgraph_entry:        0.287 us   |              rcu_read_unlock_strict();
 StorageWriterQu-1100154 [006] 1634719033.236633: funcgraph_exit:         0.835 us   |            }
 StorageWriterQu-1100154 [006] 1634719033.236633: funcgraph_exit:         1.372 us   |          }
 StorageWriterQu-1100154 [006] 1634719033.236633: funcgraph_exit:         1.934 us   |        }
 StorageWriterQu-1100154 [006] 1634719033.236634: funcgraph_entry:        0.225 us   |        psi_task_switch();
 StorageWriterQu-1100154 [006] 1634719033.236634: funcgraph_entry:        1.438 us   |        __traceiter_sched_switch();
 StorageWriterQu-1100154 [006] 1634719038.397231: funcgraph_entry:        2.358 us   |        finish_task_switch();
 StorageWriterQu-1100154 [006] 1634719038.397233: funcgraph_exit:       $ 5160630 us |      }
 StorageWriterQu-1100154 [006] 1634719038.397233: funcgraph_exit:       $ 5160631 us |    }
 StorageWriterQu-1100154 [006] 1634719038.397234: funcgraph_entry:        0.206 us   |    finish_wait();
 StorageWriterQu-1100154 [006] 1634719038.397237: funcgraph_entry:        0.280 us   |    irq_enter_rcu();
 StorageWriterQu-1100154 [006] 1634719038.397237: funcgraph_entry:                   |    __sysvec_irq_work() {
 StorageWriterQu-1100154 [006] 1634719038.397238: funcgraph_entry:                   |      __wake_up() {
 StorageWriterQu-1100154 [006] 1634719038.397238: funcgraph_entry:                   |        __wake_up_common_lock() {
 StorageWriterQu-1100154 [006] 1634719038.397238: funcgraph_entry:        0.181 us   |          _raw_spin_lock_irqsave();
 StorageWriterQu-1100154 [006] 1634719038.397239: funcgraph_entry:        0.199 us   |          __wake_up_common();
 StorageWriterQu-1100154 [006] 1634719038.397239: funcgraph_entry:        0.197 us   |          __lock_text_start();
 StorageWriterQu-1100154 [006] 1634719038.397240: funcgraph_exit:         1.614 us   |        }
 StorageWriterQu-1100154 [006] 1634719038.397240: funcgraph_exit:         1.981 us   |      }
 StorageWriterQu-1100154 [006] 1634719038.397240: funcgraph_entry:                   |      __wake_up() {
 StorageWriterQu-1100154 [006] 1634719038.397240: funcgraph_entry:                   |        __wake_up_common_lock() {
 StorageWriterQu-1100154 [006] 1634719038.397240: funcgraph_entry:        0.178 us   |          _raw_spin_lock_irqsave();
 StorageWriterQu-1100154 [006] 1634719038.397241: funcgraph_entry:                   |          __wake_up_common() {
 StorageWriterQu-1100154 [006] 1634719038.397242: funcgraph_entry:                   |            autoremove_wake_function() {
 StorageWriterQu-1100154 [006] 1634719038.397242: funcgraph_entry:                   |              default_wake_function() {
 StorageWriterQu-1100154 [006] 1634719038.397243: funcgraph_entry:                   |                try_to_wake_up() {
 StorageWriterQu-1100154 [006] 1634719038.397243: funcgraph_entry:        0.427 us   |                  _raw_spin_lock_irqsave();
 StorageWriterQu-1100154 [006] 1634719038.397243: funcgraph_entry:        0.300 us   |                  __traceiter_sched_waking();
 StorageWriterQu-1100154 [006] 1634719038.397244: funcgraph_entry:                   |                  select_task_rq_fair() {
 StorageWriterQu-1100154 [006] 1634719038.397244: funcgraph_entry:                   |                    select_idle_sibling() {
 StorageWriterQu-1100154 [006] 1634719038.397244: funcgraph_entry:        0.252 us   |                      available_idle_cpu();
 StorageWriterQu-1100154 [006] 1634719038.397245: funcgraph_entry:        0.180 us   |                      cpus_share_cache();
 StorageWriterQu-1100154 [006] 1634719038.397245: funcgraph_entry:        0.227 us   |                      available_idle_cpu();
 StorageWriterQu-1100154 [006] 1634719038.397246: funcgraph_entry:        0.210 us   |                      available_idle_cpu();
 StorageWriterQu-1100154 [006] 1634719038.397246: funcgraph_entry:        0.175 us   |                      available_idle_cpu();
 StorageWriterQu-1100154 [006] 1634719038.397247: funcgraph_entry:        0.195 us   |                      available_idle_cpu();
 StorageWriterQu-1100154 [006] 1634719038.397247: funcgraph_entry:        0.212 us   |                      available_idle_cpu();
 StorageWriterQu-1100154 [006] 1634719038.397247: funcgraph_exit:         3.202 us   |                    }
 StorageWriterQu-1100154 [006] 1634719038.397248: funcgraph_entry:        0.182 us   |                    rcu_read_unlock_strict();
 StorageWriterQu-1100154 [006] 1634719038.397248: funcgraph_exit:         4.042 us   |                  }
 StorageWriterQu-1100154 [006] 1634719038.397248: funcgraph_entry:                   |                  set_task_cpu() {
 StorageWriterQu-1100154 [006] 1634719038.397248: funcgraph_entry:                   |                    migrate_task_rq_fair() {
 StorageWriterQu-1100154 [006] 1634719038.397249: funcgraph_entry:                   |                      remove_entity_load_avg() {
 StorageWriterQu-1100154 [006] 1634719038.397249: funcgraph_entry:        0.376 us   |                        __update_load_avg_blocked_se();
 StorageWriterQu-1100154 [006] 1634719038.397250: funcgraph_entry:        0.261 us   |                        _raw_spin_lock_irqsave();
 StorageWriterQu-1100154 [006] 1634719038.397250: funcgraph_entry:        0.193 us   |                        __lock_text_start();
 StorageWriterQu-1100154 [006] 1634719038.397251: funcgraph_exit:         1.612 us   |                      }
 StorageWriterQu-1100154 [006] 1634719038.397251: funcgraph_exit:         2.327 us   |                    }
 StorageWriterQu-1100154 [006] 1634719038.397251: funcgraph_entry:        0.174 us   |                    set_task_rq_fair();
 StorageWriterQu-1100154 [006] 1634719038.397252: funcgraph_exit:         3.537 us   |                  }
 StorageWriterQu-1100154 [006] 1634719038.397252: funcgraph_entry:        0.217 us   |                  ttwu_queue_wakelist();
 StorageWriterQu-1100154 [006] 1634719038.397252: funcgraph_entry:        0.216 us   |                  _raw_spin_lock();
 StorageWriterQu-1100154 [006] 1634719038.397253: funcgraph_entry:        0.289 us   |                  update_rq_clock();
 StorageWriterQu-1100154 [006] 1634719038.397253: funcgraph_entry:                   |                  ttwu_do_activate() {
 StorageWriterQu-1100154 [006] 1634719038.397253: funcgraph_entry:                   |                    psi_task_change() {
 StorageWriterQu-1100154 [006] 1634719038.397254: funcgraph_entry:        0.236 us   |                      psi_flags_change();
 StorageWriterQu-1100154 [006] 1634719038.397255: funcgraph_entry:                   |                      psi_group_change() {
 StorageWriterQu-1100154 [006] 1634719038.397256: funcgraph_entry:        0.215 us   |                        record_times();
 StorageWriterQu-1100154 [006] 1634719038.397256: funcgraph_entry:                   |                        queue_delayed_work_on() {
 StorageWriterQu-1100154 [006] 1634719038.397257: funcgraph_entry:                   |                          __queue_delayed_work() {
 StorageWriterQu-1100154 [006] 1634719038.397257: funcgraph_entry:                   |                            add_timer() {
 StorageWriterQu-1100154 [006] 1634719038.397257: funcgraph_entry:        0.359 us   |                              lock_timer_base();
 StorageWriterQu-1100154 [006] 1634719038.397258: funcgraph_entry:        0.181 us   |                              detach_if_pending();
 StorageWriterQu-1100154 [006] 1634719038.397258: funcgraph_entry:        0.176 us   |                              get_nohz_timer_target();
 StorageWriterQu-1100154 [006] 1634719038.397258: funcgraph_entry:        0.219 us   |                              _raw_spin_lock();
 StorageWriterQu-1100154 [006] 1634719038.397259: funcgraph_entry:        0.476 us   |                              internal_add_timer();
 StorageWriterQu-1100154 [006] 1634719038.397259: funcgraph_entry:        0.205 us   |                              __lock_text_start();
 StorageWriterQu-1100154 [006] 1634719038.397260: funcgraph_exit:         3.069 us   |                            }
 StorageWriterQu-1100154 [006] 1634719038.397260: funcgraph_exit:         3.513 us   |                          }
 StorageWriterQu-1100154 [006] 1634719038.397260: funcgraph_exit:         3.904 us   |                        }
 StorageWriterQu-1100154 [006] 1634719038.397260: funcgraph_exit:         5.155 us   |                      }
 StorageWriterQu-1100154 [006] 1634719038.397261: funcgraph_entry:                   |                      psi_group_change() {
 StorageWriterQu-1100154 [006] 1634719038.397261: funcgraph_entry:        0.196 us   |                        record_times();
 StorageWriterQu-1100154 [006] 1634719038.397262: funcgraph_exit:         0.838 us   |                      }
 StorageWriterQu-1100154 [006] 1634719038.397262: funcgraph_entry:                   |                      psi_group_change() {
 StorageWriterQu-1100154 [006] 1634719038.397263: funcgraph_entry:        0.192 us   |                        record_times();
 StorageWriterQu-1100154 [006] 1634719038.397263: funcgraph_exit:         0.831 us   |                      }
 StorageWriterQu-1100154 [006] 1634719038.397263: funcgraph_entry:                   |                      psi_group_change() {
 StorageWriterQu-1100154 [006] 1634719038.397263: funcgraph_entry:        0.226 us   |                        record_times();
 StorageWriterQu-1100154 [006] 1634719038.397264: funcgraph_exit:         0.649 us   |                      }
 StorageWriterQu-1100154 [006] 1634719038.397264: funcgraph_exit:       + 10.517 us  |                    }
 StorageWriterQu-1100154 [006] 1634719038.397264: funcgraph_entry:                   |                    enqueue_task_fair() {
 StorageWriterQu-1100154 [006] 1634719038.397264: funcgraph_entry:                   |                      enqueue_entity() {
 StorageWriterQu-1100154 [006] 1634719038.397265: funcgraph_entry:        0.234 us   |                        update_curr();
 StorageWriterQu-1100154 [006] 1634719038.397265: funcgraph_entry:        0.321 us   |                        __update_load_avg_cfs_rq();
 StorageWriterQu-1100154 [006] 1634719038.397266: funcgraph_entry:        0.198 us   |                        attach_entity_load_avg();
 StorageWriterQu-1100154 [006] 1634719038.397266: funcgraph_entry:        0.180 us   |                        update_cfs_group();
 StorageWriterQu-1100154 [006] 1634719038.397267: funcgraph_entry:        0.196 us   |                        __enqueue_entity();
 StorageWriterQu-1100154 [006] 1634719038.397267: funcgraph_exit:         2.875 us   |                      }
 StorageWriterQu-1100154 [006] 1634719038.397268: funcgraph_entry:                   |                      enqueue_entity() {
 StorageWriterQu-1100154 [006] 1634719038.397268: funcgraph_entry:        0.176 us   |                        update_curr();
 StorageWriterQu-1100154 [006] 1634719038.397268: funcgraph_entry:        0.240 us   |                        __update_load_avg_se();
 StorageWriterQu-1100154 [006] 1634719038.397269: funcgraph_entry:        0.276 us   |                        __update_load_avg_cfs_rq();
 StorageWriterQu-1100154 [006] 1634719038.397270: funcgraph_entry:                   |                        update_cfs_group() {
 StorageWriterQu-1100154 [006] 1634719038.397270: funcgraph_entry:        0.194 us   |                          reweight_entity();
 StorageWriterQu-1100154 [006] 1634719038.397270: funcgraph_exit:         0.698 us   |                        }
 StorageWriterQu-1100154 [006] 1634719038.397271: funcgraph_entry:        0.174 us   |                        __enqueue_entity();
 StorageWriterQu-1100154 [006] 1634719038.397271: funcgraph_exit:         3.732 us   |                      }
 StorageWriterQu-1100154 [006] 1634719038.397272: funcgraph_entry:        0.249 us   |                      hrtick_update();
 StorageWriterQu-1100154 [006] 1634719038.397272: funcgraph_exit:         7.731 us   |                    }
 StorageWriterQu-1100154 [006] 1634719038.397272: funcgraph_entry:                   |                    ttwu_do_wakeup() {
 StorageWriterQu-1100154 [006] 1634719038.397272: funcgraph_entry:                   |                      check_preempt_curr() {
 StorageWriterQu-1100154 [006] 1634719038.397273: funcgraph_entry:                   |                        resched_curr() {
 StorageWriterQu-1100154 [006] 1634719038.397273: funcgraph_entry:                   |                          native_smp_send_reschedule() {
 StorageWriterQu-1100154 [006] 1634719038.397273: funcgraph_entry:        3.417 us   |                            x2apic_send_IPI();
 StorageWriterQu-1100154 [006] 1634719038.397277: funcgraph_exit:         3.830 us   |                          }
 StorageWriterQu-1100154 [006] 1634719038.397277: funcgraph_exit:         4.254 us   |                        }
 StorageWriterQu-1100154 [006] 1634719038.397277: funcgraph_exit:         4.756 us   |                      }
 StorageWriterQu-1100154 [006] 1634719038.397277: funcgraph_entry:        0.941 us   |                      __traceiter_sched_wakeup();
 StorageWriterQu-1100154 [006] 1634719038.397279: funcgraph_exit:         6.337 us   |                    }
 StorageWriterQu-1100154 [006] 1634719038.397279: funcgraph_exit:       + 25.545 us  |                  }
 StorageWriterQu-1100154 [006] 1634719038.397279: funcgraph_entry:        0.207 us   |                  __lock_text_start();
 StorageWriterQu-1100154 [006] 1634719038.397279: funcgraph_exit:       + 36.927 us  |                }
 StorageWriterQu-1100154 [006] 1634719038.397280: funcgraph_exit:       + 37.294 us  |              }
 StorageWriterQu-1100154 [006] 1634719038.397280: funcgraph_exit:       + 38.393 us  |            }
 StorageWriterQu-1100154 [006] 1634719038.397280: funcgraph_exit:       + 39.452 us  |          }
 StorageWriterQu-1100154 [006] 1634719038.397280: funcgraph_entry:        0.197 us   |          __lock_text_start();
 StorageWriterQu-1100154 [006] 1634719038.397281: funcgraph_exit:       + 40.621 us  |        }
 StorageWriterQu-1100154 [006] 1634719038.397281: funcgraph_exit:       + 40.974 us  |      }
 StorageWriterQu-1100154 [006] 1634719038.397281: funcgraph_exit:       + 44.034 us  |    }
 StorageWriterQu-1100154 [006] 1634719038.397281: funcgraph_entry:                   |    irq_exit_rcu() {
 StorageWriterQu-1100154 [006] 1634719038.397282: funcgraph_entry:        0.194 us   |      idle_cpu();
 StorageWriterQu-1100154 [006] 1634719038.397282: funcgraph_exit:         0.670 us   |    }
 StorageWriterQu-1100154 [006] 1634719038.397283: funcgraph_exit:       $ 5160753 us |  }

It looks like the kernel is waiting for ZFS to complete the page write to stable storage.

Describe how to reproduce the problem

  1. Start EventStoreDB (20.10.4)
  2. Write data constantly to a stream in the database
  3. Check for SLOW QUEUE messages after 1 day in the EventStoreDB logs

I'm attempting a minimal repro with a C application which repeatedly flushes an 8-byte memory-mapped file and I've seen the issue happen once so it confirms that the issue is not with EventStoreDB (i'm trying to get the repro to work consistently).

Include any warning/errors/backtraces from the system logs

Nothing in the syslogs. The machine is quite powerful as well and has plenty of free memory.

shaan1337 commented 3 years ago

After some more investigation, I've made a few changes to the initial issue. it definitely looks related to zfs since:

rincebrain commented 3 years ago

Could try #12284 and see if it makes life any better.

Could also try twiddling WBT off - there have been issues with huge amounts of IO contention eventually deadlocking with that, and I don't know whether the fix for that made it into stable.

shaan1337 commented 3 years ago

@rincebrain thanks!

we're not using an SLOG on these machines, so it doesn't appear that #12284 would apply in our case. we'll try to turn off write-back throttling to see if this fixes it.

shaan1337 commented 3 years ago

We've turned off WBT (with echo 0 > /sys/block/sdb/queue/wbt_lat_usec) and we can see that the issue still occurs.

shaan1337 commented 2 years ago

i've finally got a consistent repro: repro.working.zip

It requires changing the kernel's dirty page writeback frequency to make it much more reproducible:

sudo sysctl -w "vm.dirty_expire_centisecs=1"
sudo sysctl -w "vm.dirty_background_bytes=1"
sudo sysctl -w "vm.dirty_writeback_centisecs=1"

It appears that the delay occurs when both an application and the kernel are writing back the same dirty page. the relevant part of the code is: https://github.com/openzfs/zfs/blob/269b5dadcfd1d5732cf763dddcd46009a332eae4/module/os/linux/zfs/zfs_vnops_os.c#L3528

~/repro
[2021-11-18 10:31:31] slow msync: 2303.270000 ms
[2021-11-18 10:31:47] slow msync: 4050.328000 ms
[2021-11-18 10:31:57] slow msync: 2838.579000 ms
[2021-11-18 10:32:02] slow msync: 4380.869000 ms
[2021-11-18 10:32:22] slow msync: 4681.770000 ms
[2021-11-18 10:32:28] slow msync: 2799.192000 ms
[2021-11-18 10:32:33] slow msync: 3692.178000 ms
[2021-11-18 10:32:38] slow msync: 5001.400000 ms
[2021-11-18 10:32:43] slow msync: 3201.233000 ms
...

The page stays stuck in the write back state until the transaction group containing that write closes. As a workaround, the zfs transaction group timeout can be reduced but it will cause more frequent writes to stable storage (set to 1 second below):

echo 1 > /sys/module/zfs/parameters/zfs_txg_timeout