Proposal: Snapshot shm folder as "tech support dump" procedure

lukego commented 7 years ago

Here is a modest proposal...

Problem: How do you capture the state of a Snabb process for performing diagnostics? This is important in many different contexts: interactive testing on your laptop, auditing CI test results for processes that have already terminated, troubleshooting a running production system with escalation down a support chain involving many people in different places at different times, and everything in between. It would be handy to have a standard way to comprehensively capture the state of a process (running or dead) for "offline" reference at every step along the way. Something like a coredump file in which you can find all of the answers if you only look hard enough.

Suggestion: How about if we make a tarball of the shm folder from a Snabb process the standard artifact for performing diagnostics? Test environments could run to completion and preserve this data. Production environments could passively make a copy while the process is running (tar czf techsupportdump.tar.gz /var/run/snabb/[0-9]*). If users always capture this information then developers can focus on including all the relevant information there and we can create a positive feedback loop of ever-better diagnostic data.
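For environments where a script is more convenient than a one-liner, the same capture can be sketched in Python with only the standard library. This is a rough equivalent of the tar command above, not an existing Snabb tool: the function name `snapshot_shm` is made up for illustration, it keeps only directories whose name starts with a digit (the shell glob would also match plain files), and it stores paths relative to the shm root rather than the full /var/run/snabb path.

```python
import re
import tarfile
from pathlib import Path


def snapshot_shm(root="/var/run/snabb", out="techsupportdump.tar.gz"):
    """Archive every per-process shm directory under `root` into `out`.

    Hypothetical helper, roughly equivalent to:
        tar czf techsupportdump.tar.gz /var/run/snabb/[0-9]*
    """
    root = Path(root)
    # Same selection idea as the glob [0-9]*: names starting with a digit.
    # Unlike the glob, plain files are skipped and only directories are kept.
    dirs = sorted(
        p for p in root.iterdir()
        if p.is_dir() and re.match(r"[0-9]", p.name)
    )
    with tarfile.open(out, "w:gz") as tar:
        for d in dirs:
            # Store paths relative to root so the dump unpacks anywhere.
            tar.add(d, arcname=d.name)
    # Return the directory names that were archived.
    return [d.name for d in dirs]
```

Called with no arguments it writes techsupportdump.tar.gz into the current directory and returns the names of the directories it archived. Copying from a live process is not atomic, so objects that change during the copy may be captured in slightly different states.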

Benefit: Help everybody involved in diagnostics be productive and independent. User captures the data and everybody else refers to that using the tools and skills available. Minimize dependencies like rerunning in debug mode, connecting support engineer to production system, asking a long series of follow-on questions deep in the support chain, etc.

Details

Here is what you can find in the shm folder today:

Counter values.
Histogram object tracking the distribution of engine latency.
Definition of the app network links.
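To get a first overview of what a given dump actually contains, whoever receives it could tally the archived files before unpacking anything. The sketch below makes no assumption about how Snabb names its shm objects: it simply counts regular files per top-level directory and file suffix. The name `summarize_dump` is hypothetical.

```python
import tarfile
from collections import Counter
from pathlib import PurePosixPath


def summarize_dump(path):
    """Count regular files in a dump per (top-level directory, file suffix)."""
    counts = Counter()
    # "r:*" lets tarfile detect the compression (gzip, bzip2, none, ...).
    with tarfile.open(path, "r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, symlinks, and so on
            p = PurePosixPath(member.name)
            # Files at the archive root are grouped under ".".
            top = p.parts[0] if len(p.parts) > 1 else "."
            counts[(top, p.suffix or "(none)")] += 1
    return counts
```

For example, `for key, n in sorted(summarize_dump("techsupportdump.tar.gz").items()): print(key, n)` prints one line per directory and suffix, which is a quick way to see whether a dump already includes a given kind of object.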

Here is what is coming down the pipeline now:

Timeline log with 1M entries to extract performance metrics from.

Here are ideas for adding in the future:

Configuration of the application.
JIT dump/log in binary format containing full details.
Profiler data for the whole LuaJIT runtime system (luajit/luajit#224).
Information about the machine (CPU model / NUMA / hyperthreading / kernel / etc).
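As an illustration of the last idea, here is a minimal sketch of recording basic machine information into a directory so that it travels with the dump. Everything specific here is an assumption: the function name, the `machine-info.json` file name, and the choice of JSON are invented for the example, and Snabb itself would more likely write this from Lua. It covers only kernel, architecture, logical CPU count, and (on Linux) the CPU model; NUMA and hyperthreading topology are left out.

```python
import json
import os
import platform
from pathlib import Path


def write_machine_info(shm_dir, filename="machine-info.json"):
    """Write basic host details to `shm_dir/filename` and return them.

    Hypothetical helper; the file name and format are illustrative only.
    """
    info = {
        "system": platform.system(),      # e.g. "Linux"
        "kernel": platform.release(),     # kernel release string
        "machine": platform.machine(),    # e.g. "x86_64"
        "logical_cpus": os.cpu_count(),   # may be None if undetermined
    }
    # On Linux, /proc/cpuinfo usually carries a "model name" line per CPU.
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        for line in cpuinfo.read_text().splitlines():
            if line.lower().startswith("model name"):
                info["cpu_model"] = line.split(":", 1)[1].strip()
                break
    out = Path(shm_dir) / filename
    out.write_text(json.dumps(info, indent=2, sort_keys=True))
    return info
```

Writing the file next to the other shm objects means the tarball picks it up without any change to the capture step.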

If this proposal makes sense then I think a first step is to update Hydra so that the shm folders are archived for each test. Then we can begin the loop of "for next time I will also log foo, bar, and baz to shm objects..."

mwiget commented 7 years ago

Makes a lot of sense. In fact, I created a rudimentary version of collecting support data here:

https://github.com/Juniper/vmx-docker-lwaftr/blob/master/SUPPORT-INFO.md

Shell script: https://github.com/Juniper/vmx-docker-lwaftr/blob/master/tests/collect-support-infos.sh, where I collect all the relevant configuration files from within the container I run snabb in and save them into a tar file.

While I didn't collect shm data, I do add the actual snabb binary plus config files and a shell script to run that binary standalone. This allows the developer to rerun the very same application, thanks to how snabb binaries are built.

lukego commented 7 years ago

@mwiget Nice procedure! On reflection my suggestion is only that we include the shm directory in support dumps and not that it be the only thing we capture.

I really like the idea of picking up the Snabb binary. I am actually meaning to write some code to copy that into the shm directory too, mostly to make the core.worker process robust to software updates while running, but also with the advantage you mention if this means it is included in support info.

(I have half wondered if it would even make sense to compile the source code into the Snabb binary instead of the bytecode as we do now. This could also be very useful for "deep support." Could have it as a build option for people who prefer to obfuscate that e.g. in a proprietary product.)

snabbco / snabb

Proposal: Snapshot shm folder as "tech support dump" procedure #1105
