Contributing a more advanced memory analyzer tool

Hello BCC community!

I work on a large, complex application, and at times I've needed to analyze its memory behavior to fix memory-related bugs or make optimizations. I tried a variant of the BCC memleak tool, but it wasn't useful from a dev-oriented perspective, where I'm trying to understand where memory allocations come from while the application performs a specific task. Long story short, I eventually developed a more in-depth memory analysis tool that helped greatly with my development work, and I was wondering whether the community has any interest in me contributing it to this repository.

The reason I needed a different tool is as follows. My understanding is that memleak is best used to test whether a memory leak exists at all in a continuously running, long-lived application, and that it can also save some basic metrics about when and where the memory was leaked. In such a scenario, the exact timing of when the memory stats are tracked and reported is not too critical, and the tool can focus on just the largest individual allocations.

When I try to study a specific action in an application, I end up having to use a stopwatch or a clock to time the periodic reporting so that I get information about the action I initiate. If I'm interested in reducing the peak memory usage of an application that has no leaks, I need the reporting to happen at some point in the middle of the workload. It always feels like a race against time because of the periodic-dump style of user interface.

If there are many small allocated chunks from many stacks, the tool takes a long time to aggregate the results unless I enable eBPF-level stats aggregation (through the misleadingly named --combined-only option). When I do use this option, the tool gives inaccurate results for a multi-threaded application, which sends me hunting for phantom memory stacks. The "top-N allocations summary" output format also hasn't been very helpful in judging which functions in the application have the biggest memory impact. A function with many small allocations is ranked very low even if the majority of memory utilization comes from that function, and the format emphasizes low-level functions near the top of the call stack rather than higher-level functions closer to the bottom. In other words, a lot of information is lost in the output format even if I ask for the top 1000 stacks.
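To illustrate the ranking problem, here is a minimal Python sketch that compares a per-stack top-N view with a top-down view that sums bytes over every caller prefix. The sample stacks, function names, and byte counts are made up for illustration, and the helper names (`make_samples`, `inclusive_totals`, `format_top_down`) are hypothetical; this is not code from memleak or from the tool described here.

```python
from collections import defaultdict


def make_samples():
    """Hypothetical data: (stack from outermost caller to allocator, outstanding bytes)."""
    samples = [(("main", "load_config", "malloc"), 4096)]
    # 200 distinct small-allocation stacks that all pass through build_index.
    for i in range(200):
        samples.append((("main", "build_index", f"add_entry_{i}", "malloc"), 64))
    return samples


def inclusive_totals(samples):
    """Credit each stack's bytes to every caller prefix of that stack."""
    totals = defaultdict(int)
    for stack, nbytes in samples:
        for depth in range(1, len(stack) + 1):
            totals[stack[:depth]] += nbytes
    return totals


def format_top_down(totals, max_depth=2):
    """Render prefixes as an indented tree, largest child first."""
    children = defaultdict(list)
    for prefix in totals:
        children[prefix[:-1]].append(prefix)

    lines = []

    def visit(prefix):
        indent = "  " * (len(prefix) - 1)
        lines.append(f"{indent}{prefix[-1]}  {totals[prefix]} bytes")
        if len(prefix) < max_depth:
            for child in sorted(children[prefix], key=lambda p: -totals[p]):
                visit(child)

    for root in sorted(children[()], key=lambda p: -totals[p]):
        visit(root)
    return lines


samples = make_samples()

# Per-stack ranking: the single 4096-byte stack wins, and each 64-byte stack ranks far below it.
largest_stack, largest_bytes = max(samples, key=lambda s: s[1])
print("top-1 stack:", ";".join(largest_stack), largest_bytes)

# Top-down ranking: build_index accounts for 200 * 64 = 12800 bytes in total.
for line in format_top_down(inclusive_totals(samples)):
    print(line)
```

In this made-up data set the per-stack view puts load_config first, while the top-down view shows that build_index holds about three times as much memory.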

The version of the tool I made uses a terminal interface to let the user manually start and end profiling (something like the interface of gdb). It produces a top-down summary of allocations found during the profiling period, in a format similar to perf report. It also supports outputting to a format compatible with flamegraph. It uses the faster eBPF-level aggregation without the inaccuracies from the memleak tool, and can detect if the maps were saturated (which causes the stats to be corrupted). My main concern is that the existing tools all seem to use the periodic dumping style and top-N style output formats, so this one would have a different style. Would it still fit in as a contribution to the community?
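For readers unfamiliar with the flamegraph input format, the sketch below shows one way to emit "folded" stacks: one line per unique stack, frames joined by semicolons from outermost caller to innermost, followed by a space and a numeric value. The sample data and the `to_folded` helper are hypothetical and only illustrate the format; they are not taken from the tool discussed in this thread.

```python
from collections import Counter


def to_folded(samples):
    """Merge identical stacks and return folded-stack lines ("a;b;c value")."""
    folded = Counter()
    for stack, nbytes in samples:
        folded[";".join(stack)] += nbytes
    return [f"{frames} {value}" for frames, value in sorted(folded.items())]


# Hypothetical outstanding allocations: (stack from outermost caller inward, bytes).
samples = [
    (("main", "build_index", "malloc"), 64),
    (("main", "build_index", "malloc"), 64),
    (("main", "load_config", "malloc"), 4096),
]

for line in to_folded(samples):
    print(line)
```

Output in this shape can be piped to flamegraph.pl; using bytes rather than sample counts as the value makes frame widths proportional to outstanding memory.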

Some points:

1. If you're aware of problems with memleak, please file tickets so we can fix them.
2. How many lines of code is your tool? The largest in bcc is trace.py at 898 lines, which does a lot. The average size is 260 lines. I'm guessing your tool is way larger (and may involve multiple files), at which point it's really its own thing that should belong in its own repo. If someone wrote a BPF version of gdb, say (since you mentioned gdb), we wouldn't want it in BCC as it would likely be an enormous sprawling project: we'd want them to create their own repo called "BPF gdb" or something. Makes sense?
3. People don't want complex tools if it can be avoided. I get my colleagues to run BCC tools when debugging issues, and ideally I can say "run biosnoop" and they run it and the default output solves their problem without unnecessary clutter. No arguments, no learning its own debugging syntax or language. Just literally "run biosnoop" and nothing more. (Even better is if it's a button in the Spinnaker UI they can click, and get a report.) So I'd think carefully about whether you can simplify the interface and make it driven through just options at least, without an interactive component. I just can't imagine telling my software engineering colleagues -- who are busy with their own deadlines and work -- to learn how to use a gdb-like memory debugger when it's 5:30pm and they are anxious to solve an issue fast and go home. I never lose sight that a lot of people don't enjoy running these tools: they just want to solve issues ASAP.
4. malloc() tracing is way expensive (it should be mentioned in the memleak(8) man page). That's why we haven't done much with it other than memleak(8) and some one-liners. The overhead is typically prohibitive. Once we fix the overhead (separate discussion), expect to see a lot more work in this space and interest in this type of tracing and in your tool.

Thank you for the response!

1. I will file some tickets about the problems in the memleak tool.
2. My tool is about 1000 lines in one file, so it is larger than trace.py but not quite the size of an individual repo. I understand where you're coming from if your repo has a philosophy of simplicity and wants all tools to be under 1k lines, though.
3. I see, that would explain the user interface of many of the existing tools. The default use case for my tool is not quite as complicated as you describe, though. Usually you start the memory analyzer tool with no flags other than -p pid, run the application to be analyzed, then type dump into the tool's interface after finishing the desired actions in the application. The printed output is a top-down view of the remaining memory allocations, as one might expect.
4. This is true, the overhead is high if many small allocations are made. The interrupt instructions added by the USDT tracepoints cause many kernel-user context switches, which are much more expensive than a call to malloc even though this type of context switch is relatively cheap overall. This makes memleak risky as a non-invasive probing solution on a production system, but for a dev / debugging use case my opinion is that the reduced performance is not as big a problem, since the tool will be used on a debug-only deployment. BCC is still very useful because of its flexibility, since applications don't need to be recompiled to get the probing information. My main pain point in these scenarios was instead that memleak doesn't provide the information needed to study memory-related bugs as a dev.
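As a rough picture of the workflow described in this reply (attach, exercise the application, then type dump), here is a minimal sketch built on Python's standard cmd module. Everything in it is an assumption made for illustration: the `FakeCollector` class stands in for the BPF side and returns made-up numbers, and the prompt text and the `start` and `quit` commands are hypothetical. Only the `dump` command name comes from the description above.

```python
import cmd


class FakeCollector:
    """Hypothetical stand-in for the BPF side; a real tool would read BPF maps."""

    def __init__(self):
        self.active = False

    def start(self):
        self.active = True

    def snapshot(self):
        # Made-up outstanding allocations: folded stack -> bytes.
        return {
            "main;load_config;malloc": 4096,
            "main;build_index;malloc": 12800,
        }


class AnalyzerShell(cmd.Cmd):
    """Tiny command loop: start recording, dump outstanding allocations, quit."""

    prompt = "(memtool) "  # hypothetical prompt text

    def __init__(self, collector, stdout=None):
        super().__init__(stdout=stdout)
        self.collector = collector

    def do_start(self, arg):
        "Begin recording allocations."
        self.collector.start()
        self.stdout.write("profiling started\n")

    def do_dump(self, arg):
        "Print outstanding allocations, largest first."
        rows = sorted(self.collector.snapshot().items(), key=lambda kv: -kv[1])
        for stack, nbytes in rows:
            self.stdout.write(f"{nbytes:>8} {stack}\n")

    def do_quit(self, arg):
        "Exit the shell."
        return True
```

Calling `AnalyzerShell(FakeCollector()).cmdloop()` starts the prompt; the same two commands could also be exposed as plain command-line options if an interactive component is not wanted.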

iovisor/bcc issue #3331