Closed (dkruces closed 3 weeks ago)
There is a conflict with latest master branch. Please rebase and resubmit. Thanks!
I am not an expert on NVMe. But maybe people using or working on NVMe or I/O performance can comment on why this tool is useful and how it can help troubleshoot production issues.
Thanks @yonghong-song for the review. Just to add a bit more information on why this is important for LBS (I may add it to the commit message after fixing the conflict): Instrumenting the NVMe layer with nvmecmd allows us to validate the min order [1] concept and make sure the NVMe commands are actually sent with the correct order and alignment. This would ultimately have to match the Namespace Preferred Write Granularity (NPWG), so ideally this tool can be extended with support to read that value from the device and warn about the cases (commands) that don't honor it.
[1] min order:
This extension (to read from the device) is something we haven't explored yet, but we think adding xNVMe as a dependency would help move towards that feature. For now, however, the tool is simpler and just reports the alignment values. Would such an extension be acceptable to upstream in the IO Visor BCC project (in case nvmecmd gets accepted)? We tried to keep the tool simple, as indicated by the contributing guidelines [2], since our interpretation of that document suggests potential for conflict with such an extension of the tool. We also have a version of nvmecmd that can record NVMe commands into a database for later inspection (tcpdump-like) [3], which we skipped in the PR for the same reason. Could you or someone else from the IO Visor Project clarify this topic?
[2] https://github.com/iovisor/bcc/blob/master/CONTRIBUTING-SCRIPTS.md#tools
[3] https://github.com/dagmcr/bcc/tree/nvmecmd-dump
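To make the check described above concrete, here is a minimal sketch of how a command's alignment could be tested against a preferred write granularity. The function names are hypothetical and not part of nvmecmd. The sketch assumes the granularity has already been obtained somehow and converted to a plain count of logical blocks; reading NPWG from the device is not shown.

```python
def alignment_order(value: int) -> int:
    """Return the largest n such that 2**n divides value.

    Hypothetical helper: value is an LBA or a length in logical blocks
    and must be greater than zero.
    """
    if value <= 0:
        raise ValueError("value must be positive")
    # value & -value isolates the lowest set bit of value.
    return (value & -value).bit_length() - 1


def honors_granularity(slba: int, nblocks: int, granularity: int) -> bool:
    """Return True if both the start LBA and the length are multiples
    of the granularity.

    Assumption: nblocks is the real number of logical blocks (not a
    zero-based field) and granularity is in logical blocks as well.
    """
    if granularity <= 0:
        raise ValueError("granularity must be positive")
    return slba % granularity == 0 and nblocks % granularity == 0
```

For example, with a granularity of 4 blocks, a write starting at LBA 8 with 4 blocks would pass, while one starting at LBA 6 would be a candidate for a warning.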
Hi again,
I've updated the PR. Changes are:
- Changed the --ops argument type from int to str, to allow the user to indicate the read/write operation to instrument.

Note: RFC is kept in the PR until getting more feedback (https://github.com/iovisor/bcc/pull/4813#issuecomment-1838079061, https://github.com/iovisor/bcc/pull/4813#pullrequestreview-1761584446).
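As an illustration of a string-typed --ops option, a minimal argparse sketch could look like the following. The accepted values ("read", "write"), the default, and the help text are assumptions for illustration; the actual option handling in the PR may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a string-valued --ops option (hypothetical)."""
    parser = argparse.ArgumentParser(
        description="sketch of string-based operation filtering")
    parser.add_argument(
        "--ops",
        type=str,
        choices=["read", "write"],  # assumed set of accepted values
        default=None,               # assumed: None means "all operations"
        help="operation to instrument (read or write)")
    return parser
```

Calling `build_parser().parse_args(["--ops", "write"])` yields a namespace whose `ops` attribute is the string `"write"`; an unknown value is rejected by argparse itself.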
TODO:
- Replace local_strncmp with bpf_strncmp: https://github.com/iovisor/bcc/pull/4813#discussion_r1413361414

@dagmcr Thanks for the detailed explanation of the current state and future vision of the nvmecmd tool. I can run it on my company production machine with NVMe drives and it works as expected. I think extending the tool to validate the commands would also be useful for production debugging, assuming reading from the disk won't disrupt the system.
cc @brendangregg
A new PR update:
After our latest LBS community meeting feedback and presenting the tool, a general suggestion was to make it more generic, and not NVMe specific, so any block device can benefit from it.
This suggestion makes the following 'big' changes from the previous PR version:
- Instrumented function: nvme_setup_cmd -> blk_mq_start_request.
- Use bpf.attach_kprobe instead of the kprobe__ prefix used before with the NVMe function.
- Rename nvmecmd to blkalgn. After all, we are recording block commands issued from the block layer to the block device for alignment purposes (LBA and size). Note: I'm very much open to suggestions on the naming of the tool. Let me know if you have better name ideas.
- Since lba_shift is given by the block device and we are one layer above, lba_shift is now derived directly from logical_block_size (limits struct) using the logarithmic BPF helper (bpf_log). Note: bpf_log2l adds an additional order that would require a subtraction. That's the reason to pick bpf_log instead of bpf_log2l (https://github.com/iovisor/bcc/issues/1649).
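The relationship between logical_block_size and lba_shift in the last item can be sketched in plain Python as follows. This is a user-space illustration only, not the BPF code from the PR; the function names are hypothetical, and the 512-byte sector unit is the usual block-layer convention.

```python
SECTOR_SHIFT = 9  # block-layer sectors are 512 bytes


def lba_shift_from_block_size(logical_block_size: int) -> int:
    """Return log2(logical_block_size), e.g. 9 for 512 and 12 for 4096."""
    if logical_block_size <= 0 or logical_block_size & (logical_block_size - 1):
        raise ValueError("logical block size must be a power of two")
    # For a power of two, bit_length() - 1 is the exact base-2 logarithm,
    # so no extra order needs to be subtracted afterwards.
    return logical_block_size.bit_length() - 1


def sector_to_lba(sector: int, lba_shift: int) -> int:
    """Convert a 512-byte sector number into a logical block address."""
    return (sector << SECTOR_SHIFT) >> lba_shift


def bytes_to_blocks(nbytes: int, lba_shift: int) -> int:
    """Convert a request length in bytes into a count of logical blocks."""
    return nbytes >> lba_shift
```

With a 4096-byte logical block size the shift is 12, so sector 8 maps to LBA 1 and a 16 KiB request covers 4 logical blocks.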
Functionality of the tool remains the same. A tcpdump-like version can be found here: https://github.com/dagmcr/bcc/tree/blkalgn-dump. @yonghong-song, you mentioned the extension of the tool might be valuable. Would that version of the tool, with recording capabilities for later parsing, be suitable for iovisor/bcc?
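For readers unfamiliar with the two attach styles mentioned in the change list: in BCC, a C function named kprobe__<kernel function> is attached automatically, while any other function has to be attached explicitly with attach_kprobe. The sketch below shows the explicit form for blk_mq_start_request. The BPF program text, the trace_start name, and the attach helper are illustrative assumptions, not the code from the PR, and the included kernel header may need adjusting for a given kernel version.

```python
# Hypothetical BPF program: the handler is NOT named kprobe__*, so BCC
# will not attach it automatically.
BPF_TEXT = r"""
#include <uapi/linux/ptrace.h>
#include <linux/blk-mq.h>

int trace_start(struct pt_regs *ctx, struct request *rq)
{
    bpf_trace_printk("len=%u\n", rq->__data_len);
    return 0;
}
"""


def attach(bpf_cls):
    """Compile BPF_TEXT with the given class and attach the handler.

    bpf_cls is expected to be bcc.BPF; it is passed in so this module
    can be imported on machines without BCC installed.
    """
    b = bpf_cls(text=BPF_TEXT)
    b.attach_kprobe(event="blk_mq_start_request", fn_name="trace_start")
    return b


def main():
    # Requires root privileges and an installed BCC.
    from bcc import BPF
    attach(BPF).trace_print()
```

Running `main()` as root would print one trace line per started block request; passing the class into `attach` is only a design choice to keep the sketch importable without BCC.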
Tool has been ported to libbpf. New PR here: https://github.com/iovisor/bcc/pull/5128
The tool observes NVMe commands and checks for LBA and block size alignment.
The tool is used as part of the Large block size (LBS) effort [1] in the kernel to validate part of the work.
[1] https://kernelnewbies.org/KernelProjects/large-block-size