citrus-it closed this 4 months ago
I booted a gimlet using an image generated with these changes and confirmed that the file ended up in the image and that the NVMe admin command timeout is set as expected in the running kernel:
gimlet-sn06 # cat /etc/system.d/.self-assembly
*
* The default NVMe admin command timeout is one second. We have observed that,
* from the driver's perspective, the SN840 drives in a gimlet occasionally take
* over two seconds to respond to a log page request. When the request times out
* (after the first second), an abort is sent which also times out resulting in
* the device being marked as dead by the driver.
* Pending upstream driver work to improve things, we bump the timeout.
*
set nvme:nvme_admin_cmd_timeout = 0xa
*
* Normally, the dbuf cache is 1/32nd of RAM and the dbuf metadata cache is
* 1/64th of RAM. On a 1TiB system, these are way, way too big for us --
* especially with 800GiB already spoken for. Moreover, the primary advantage
* of the dbuf cache -- namely, eliminate the cost of uncompression on a dbuf
* cache hit -- is negated by the non-compressability for Crucible data (which
* is encrypted). We therefore tune these numbers down quite a bit, knowing
* that any eviction from the dbuf cache can still be in the ARC.
*
set zfs:dbuf_cache_max_bytes = 0x40000000
set zfs:dbuf_metadata_cache_max_bytes = 0x40000000
gimlet-sn06 # mdb -ke nvme_admin_cmd_timeout/X
nvme_admin_cmd_timeout:
nvme_admin_cmd_timeout: a
This is another mitigation option for https://github.com/oxidecomputer/stlouis/issues/562
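
For reviewers who want to sanity-check the hex values in the file above, here is a minimal sketch that decodes them and compares the dbuf limits with the defaults described in the comment. It assumes the timeout tunable is expressed in seconds (the comment gives the default as one second) and uses the 1TiB RAM size the comment mentions; the variable names are only illustrative.

```python
# Decode the tunables from /etc/system.d/.self-assembly and compare the
# dbuf cache limits against the defaults described in the file's comment.

GIB = 1 << 30
TIB = 1 << 40

# Assumption: RAM size taken from the "1TiB system" mentioned in the comment.
ram_bytes = 1 * TIB

# Values copied verbatim from the "set" lines in the file.
tunables = {
    "nvme:nvme_admin_cmd_timeout": 0xa,
    "zfs:dbuf_cache_max_bytes": 0x40000000,
    "zfs:dbuf_metadata_cache_max_bytes": 0x40000000,
}

# Defaults per the comment: 1/32nd and 1/64th of RAM.
default_dbuf_bytes = ram_bytes // 32
default_dbuf_metadata_bytes = ram_bytes // 64

# Assumption: the timeout is in seconds, matching the stated one-second default.
timeout = tunables["nvme:nvme_admin_cmd_timeout"]
print(f"admin command timeout: {timeout} s")

dbuf = tunables["zfs:dbuf_cache_max_bytes"]
print(f"dbuf cache: {dbuf // GIB} GiB (default {default_dbuf_bytes // GIB} GiB)")

dbuf_md = tunables["zfs:dbuf_metadata_cache_max_bytes"]
print(
    f"dbuf metadata cache: {dbuf_md // GIB} GiB "
    f"(default {default_dbuf_metadata_bytes // GIB} GiB)"
)
```

Under those assumptions, the timeout goes from one second to ten, and both dbuf limits are capped at 1GiB instead of 32GiB and 16GiB respectively.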