In most SPDK data flows, there's a "front end" hardware queue (a socket, RDMA QP, etc.) and a "back end" hardware queue (a socket, RDMA QP, or NVMe queue pair). The software then generally ping-pongs between the two queues, pulling batches and running to completion.
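The two-queue ping-pong can be sketched as follows. This is a minimal illustrative model, not real SPDK code; all names are hypothetical:

```c
#include <assert.h>

/* Hypothetical model of the two-queue ping-pong pattern. */
typedef struct {
    int pending; /* completions waiting to be reaped from this queue */
} hw_queue_t;

static hw_queue_t front_q, back_q;

/*
 * Poll one queue and run each completion to completion, which in this
 * pattern means submitting follow-up work to the peer queue. Returns
 * the size of the batch reaped.
 */
static int poll_and_forward(hw_queue_t *src, hw_queue_t *dst)
{
    int batch = src->pending;

    src->pending = 0;
    dst->pending += batch; /* run to completion -> submit to the peer */
    return batch;
}
```

The event loop then alternates `poll_and_forward(&front_q, &back_q)` and `poll_and_forward(&back_q, &front_q)`, which is why two queues stay busy: each poll both reaps one side and feeds the other.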
However, if there are more than two queues like this, we begin to see performance problems. For example, imagine an intermediate stage that uses an accelerator, such as a crypto engine. To make things more complex, imagine that some I/O uses the crypto engine on submission and some uses it on completion. The flow then goes like this:
1. Poll the front-end queue for incoming I/O and submit each request to either the accelerator or the backend queue. This step does not "ring the doorbells" or "flush" - those operations happen in the other pollers.
2. Run the accelerator poller, which submits any queued commands and polls for completions. For any completions it finds, it submits them to the backend queue.
3. Run the backend poller, which submits any queued commands and polls for completions. For any completions it finds, it submits them to the front-end or accelerator queues.
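The key property of this flow is that work queued by one poller is only flushed to hardware when the owning poller next runs. The sketch below models that - it is a hypothetical illustration, not real SPDK code, and all names are invented:

```c
#include <assert.h>

/*
 * Hypothetical model of the three-poller flow. "queued" counts
 * commands written to a ring whose doorbell has not been rung;
 * "inflight" counts commands the hardware is working on.
 */
typedef struct {
    int queued;
    int inflight;
} queue_t;

static queue_t accel_q, backend_q;

/* Flush: ring the doorbell for everything queued so far. */
static void flush(queue_t *q)
{
    q->inflight += q->queued;
    q->queued = 0;
}

/* Poller 1: pull new I/O from the front end, queue to accel or backend.
 * Note: no flush here - that happens in the other pollers. */
static void frontend_poll(int ios_needing_crypto, int ios_plain)
{
    accel_q.queued += ios_needing_crypto;
    backend_q.queued += ios_plain;
}

/* Poller 2: flush queued accel commands, reap completions, pass them on. */
static void accel_poll(void)
{
    int done;

    flush(&accel_q);
    done = accel_q.inflight; /* pretend everything completes instantly */
    accel_q.inflight = 0;
    backend_q.queued += done;
}

/* Poller 3: flush queued backend commands and reap completions. */
static void backend_poll(void)
{
    flush(&backend_q);
    backend_q.inflight = 0; /* completions go back to pollers 1 and 2 */
}
```

Because `frontend_poll()` only increments `queued`, those commands sit idle until `accel_poll()` or `backend_poll()` gets its next turn - which is exactly the delay described below.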
The steps can run in a different order - 3 can come before 2, for example. We currently have no control over the order, but controlling it would not fix the problem anyway.
The issue is that each step can take a significant amount of time relative to the hardware operations themselves, which delays submission and leaves the hardware idle. For example, on a completion path that uses an accelerator: poller 3 runs and queues up operations to the accelerator, but then poller 1 runs, which may take 40us. Only after that does poller 2 run and submit the queued operations to the accelerator. Had the operations been submitted immediately, they would already have completed during that 40us window.
Our current thinking on solving this is to add a new kind of poller, or a flag on poller registration, that registers a special poller which runs in between all of the standard pollers. This special poller is very performance sensitive, so its only responsibility is to "flush" after each standard poller exits.
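A rough sketch of what this could look like follows. None of this is a real SPDK API - `register_flush_poller()` and the scheduler shape are assumptions about the proposal, purely for illustration:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical reactor loop with "flush pollers": lightweight
 * callbacks that the reactor invokes after every standard poller
 * returns, so queued hardware commands are submitted immediately
 * rather than waiting for the owning poller's next turn.
 */
#define MAX_POLLERS 8

typedef void (*poll_fn)(void);

static poll_fn pollers[MAX_POLLERS];
static size_t num_pollers;
static poll_fn flushers[MAX_POLLERS];
static size_t num_flushers;

static void register_poller(poll_fn fn)
{
    pollers[num_pollers++] = fn;
}

static void register_flush_poller(poll_fn fn)
{
    flushers[num_flushers++] = fn;
}

/* One reactor iteration: after each standard poller, run every flusher. */
static void reactor_run_once(void)
{
    for (size_t i = 0; i < num_pollers; i++) {
        pollers[i]();
        for (size_t j = 0; j < num_flushers; j++) {
            flushers[j]();
        }
    }
}

/* Counters used only to demonstrate the interleaving. */
static int poll_calls, flush_calls;
static void demo_poller(void) { poll_calls++; }
static void demo_flusher(void) { flush_calls++; }
```

With two standard pollers and one flush poller registered, a single `reactor_run_once()` invokes the flusher twice - once after each standard poller - which is the "flush after each poller exits" behavior described above. The flush callbacks must be extremely cheap, since they run once per standard poller per reactor iteration.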