iovisor / bcc

BCC - Tools for BPF-based Linux IO analysis, networking, monitoring, and more
Apache License 2.0
20.36k stars 3.86k forks source link

RFC: blkalgn: add block command alignment observability tool #4813

Open dkruces opened 10 months ago

dkruces commented 10 months ago

The tool observes NVMe commands and checks for LBA and block size alignment.

The tool is used as part of the Large block size (LBS) effort [1] in the kernel to validate part of the work.

[1] https://kernelnewbies.org/KernelProjects/large-block-size

yonghong-song commented 10 months ago

There is a conflict with latest master branch. Please rebase and resubmit. Thanks!

dkruces commented 10 months ago

I am not an expert on nvme. But maybe people using/working on nvme or io performance can comment on why this tool is useful and how it can help trouble shoot the production issue.

Thanks @yonghong-song for the review. Just to add a bit more information why this is important for LBS (I may add it to the commit message after fixing the conflict): Instrumenting the NVMe layer with nvmecmd allows us to validate the min order [1] concept and make sure the NVMe commands are actually sent with the correct order and alignment. This would ultimately have to match with the Namespace Preferred Write Granularity (NPWG), so ideally this tool can be extended with support to read that from the device and warn with the cases (commands) that don't honor the value.

[1] min order:

This extension (to read from the device) is something we haven't yet explored but we think adding xNVMe as dependency would help moving towards that feature. However, the tool is more simpler and just reports the alignment values. Would this be something acceptable to upstream to IO Visor BCC project (in case nvmecmd gets accepted)? We tried to keep it simple as indicated by the contributing guidelines [2] as our interpretation of that document suggests potential for conflict with such extension of the tool. We have also a version of the nvmecmd where we can record NVMe commands into a database for later inspection (tcpdump-like) [3] that we skipped in the PR because of the same reason. Could you or someone else from the IO Visor Project clarify on this topic?

[2] https://github.com/iovisor/bcc/blob/master/CONTRIBUTING-SCRIPTS.md#tools [3] https://github.com/dagmcr/bcc/tree/nvmecmd-dump

dkruces commented 10 months ago

Hi again,

I've updated the PR. Changes are:

Note: RFC is kept in the PR until getting more feedback (https://github.com/iovisor/bcc/pull/4813#issuecomment-1838079061, https://github.com/iovisor/bcc/pull/4813#pullrequestreview-1761584446).

TODO:

yonghong-song commented 10 months ago

@dagmcr Thanks for detailed explanation for the current state and future vision of nvmecmd tool. I can run on my company production machine with nvme drives and it works as expected. I think the extension of the tool to validate the command seems useful too for production debugging, assuming reading from the disk won't disrupt the system.

yonghong-song commented 9 months ago

cc @brendangregg

dkruces commented 9 months ago

A new PR update:

After our latest LBS community meeting feedback and presenting the tool, a general suggestion was to make it more generic, and not NVMe specific, so any block device can benefit from it.

This suggestion makes the following 'big' changes from the previous PR version:

Note: I'm very much open to suggestions with naming of the tool. Let me know if you have better name ideas.

Note: the bpf_log2l is adding an additional order that would required subtraction. That's the reason to pick bpf_log instead of the bpf_log2l (https://github.com/iovisor/bcc/issues/1649).

Functionality of the tool remains the same. A tcpdump-like version can be found here: https://github.com/dagmcr/bcc/tree/blkalgn-dump. @yonghong-song, you mentioned the extension of the tool might be valuable. Would that version of the tool with recording capabilities for later parsing suitable for iovisor/bcc?