spdk / spdk.github.io

SPDK organization web pages

NVMe/TCP requests pool per thread #22

Closed jkalwas closed 2 weeks ago

jkalwas commented 3 weeks ago

Currently, each NVMe queue pair allocates its own requests and uses a queue (TAILQ) for get/put operations. This means that for a large number of queue pairs (N), there are N such TAILQs. It would be more cache-efficient to have a single per-thread queue so that requests can be reused sooner, keeping cache lines hot. Such a change of ownership may introduce some design traps that need to be discussed.
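To make the idea concrete, here is a minimal sketch of a per-thread request pool built on a single TAILQ, with LIFO get/put so the most-recently-freed request (whose cache lines are still hot) is handed out first. All names (`req`, `req_pool`, etc.) are hypothetical stand-ins, not SPDK APIs:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for the transport-specific nvmf request object. */
struct req {
	TAILQ_ENTRY(req) link;
};

/* One pool per thread instead of one per qpair. */
struct req_pool {
	TAILQ_HEAD(, req) free_reqs;
};

static void req_pool_init(struct req_pool *pool, struct req *reqs, size_t n)
{
	TAILQ_INIT(&pool->free_reqs);
	for (size_t i = 0; i < n; i++) {
		TAILQ_INSERT_HEAD(&pool->free_reqs, &reqs[i], link);
	}
}

static struct req *req_pool_get(struct req_pool *pool)
{
	struct req *r = TAILQ_FIRST(&pool->free_reqs);
	if (r != NULL) {
		TAILQ_REMOVE(&pool->free_reqs, r, link);
	}
	return r;
}

static void req_pool_put(struct req_pool *pool, struct req *r)
{
	/* Insert at the head: the next get() returns the request that was
	 * freed most recently, maximizing cache-line reuse. */
	TAILQ_INSERT_HEAD(&pool->free_reqs, r, link);
}
```

The LIFO discipline (insert and remove at the head) is the whole point here; a FIFO would cycle through every request in the pool before reusing one, touching cold cache lines each time.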

tomzawadzki commented 2 weeks ago

[20min]

jimharris commented 2 weeks ago

- requests currently per qpair; too much memory, most requests never used at high scaling
- requests per poll group instead, LIFO to maximize cache usage
- one pool for TCP, one for RDMA; don't try to pool across transports, since each transport has its own enveloping nvmf request structure
- each qpair needs a structure embedded in it that can be put on a per-poll-group "waiter list"
- when requests are available, a qpair can fetch up to "batch size" requests (batch size configurable)
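The notes above can be sketched as follows: a per-poll-group LIFO pool, a waiter entry embedded in each qpair (so parking on the waiter list needs no allocation), and a configurable batch fetch. All names are hypothetical, not SPDK APIs:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

struct req {
	TAILQ_ENTRY(req) link;
};

struct qpair {
	/* Embedded in the qpair so it can sit on the poll group's
	 * waiter list without any allocation. */
	TAILQ_ENTRY(qpair) waiter_link;
	int waiting;
	TAILQ_HEAD(, req) reqs;	/* requests fetched for this qpair */
};

struct poll_group {
	TAILQ_HEAD(, req) free_reqs;	/* shared LIFO request pool */
	TAILQ_HEAD(, qpair) waiters;	/* qpairs waiting for requests */
	unsigned batch_size;		/* configurable fetch limit */
};

/* Fetch up to batch_size requests; if none are free, park the qpair
 * on the waiter list. Returns the number of requests fetched. */
static unsigned qpair_fetch_reqs(struct poll_group *pg, struct qpair *qp)
{
	unsigned n = 0;

	while (n < pg->batch_size && !TAILQ_EMPTY(&pg->free_reqs)) {
		struct req *r = TAILQ_FIRST(&pg->free_reqs);
		TAILQ_REMOVE(&pg->free_reqs, r, link);
		TAILQ_INSERT_TAIL(&qp->reqs, r, link);
		n++;
	}
	if (n == 0 && !qp->waiting) {
		TAILQ_INSERT_TAIL(&pg->waiters, qp, waiter_link);
		qp->waiting = 1;
	}
	return n;
}

/* Return a request to the pool; if a qpair is waiting, wake it and
 * let it fetch a batch. */
static void poll_group_put_req(struct poll_group *pg, struct req *r)
{
	TAILQ_INSERT_HEAD(&pg->free_reqs, r, link);	/* LIFO */

	if (!TAILQ_EMPTY(&pg->waiters)) {
		struct qpair *qp = TAILQ_FIRST(&pg->waiters);
		TAILQ_REMOVE(&pg->waiters, qp, waiter_link);
		qp->waiting = 0;
		qpair_fetch_reqs(pg, qp);
	}
}
```

Keeping one such pool per transport (one for TCP, one for RDMA) avoids the sketch having to know about each transport's enveloping request structure.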

iov pool per thread - not needed if there is a request pool, since the iov is inside the generic nvmf request object

- nvmf => bdev => nvme: for a write command, the bottom end (nvme/tcp driver) may get the ack for the data transmit long before the nvme completion
- one idea was to invalidate the data (enabling reuse of the iov array) after the data ack is received
- this doesn't work in cases where a retry is needed at a lower level - i.e. nvme multipath - because we cannot rebuild the data iov if the nvme completion fails after the data was invalidated