Examples of dangerous bugs and diagnostics catching them

springmeyer commented 7 years ago

Context

Currently the scope of skel is best practices through working examples.

However in https://github.com/mapbox/node-cpp-skel/pull/29 we are experimenting with writing sample code that exhibits "common performance problems" like thread contention along with code that demonstrates idealized cpu usage for async code. The motivation is that:

- the scenarios (good perf and bad) have more value and are easier to explain in direct comparison
- a problem like thread contention is quietly sinister: you can't see it, and there is no obvious crash or error. Just a slow program.

So, we plan to document the tell-tale signs of things like thread contention when seen via a profiler.
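To make the comparison concrete, here is a minimal sketch of the two scenarios. It is not the code from the pull request: the function names `sum_contended` and `sum_independent` are hypothetical, and the workload (counting) is only a placeholder for real work.

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical illustration, not code from node-cpp-skel.
// Contended version: every thread takes the same lock for every increment,
// so the threads spend much of their time waiting on each other.
std::uint64_t sum_contended(std::size_t num_threads, std::uint64_t iterations) {
    std::uint64_t total = 0;
    std::mutex mutex;
    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < num_threads; ++i) {
        threads.emplace_back([&total, &mutex, iterations]() {
            for (std::uint64_t n = 0; n < iterations; ++n) {
                std::lock_guard<std::mutex> lock(mutex);
                ++total;
            }
        });
    }
    for (auto& t : threads) t.join();
    return total;
}

// Independent version: each thread counts into a local variable and writes
// its own slot once at the end, so no lock is shared during the work.
std::uint64_t sum_independent(std::size_t num_threads, std::uint64_t iterations) {
    std::vector<std::uint64_t> partial(num_threads, 0);
    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < num_threads; ++i) {
        threads.emplace_back([&partial, i, iterations]() {
            std::uint64_t local = 0;
            for (std::uint64_t n = 0; n < iterations; ++n) ++local;
            partial[i] = local;
        });
    }
    for (auto& t : threads) t.join();
    std::uint64_t total = 0;
    for (auto v : partial) total += v;
    return total;
}
```

Both functions return the same result, which is the point: nothing in the output hints at a problem. The difference should only become visible in a profiler, where the contended version would be expected to spend a large share of its time in locking and waiting rather than in the loop body.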

Demonstrating diagnostics to catch hard to see bugs

Like thread contention, which can be difficult to diagnose without specialized methods such as function-level profiling, there is a class of dangerous bugs that cause silent memory corruption and are rarely detectable without advanced methods.

Therefore, we could consider adding example code (perhaps first in an advanced branch) that contains intentional dangerous bugs that corrupt memory. Then write documentation for how to detect them.

Examples/ideas:

- Attempt to convert empty v8 objects
  - Interacting with V8 objects in invalid ways can silently succeed in release (probably corrupting memory) but will crash in DEBUG mode.
  - Write docs about how to install node_g (debug node build) and leverage the advanced checks in V8 that catch programming errors and abort: https://github.com/nodejs/node/blob/73ae2d1895c2c0d1d4eeaddd284c58d776d2be87/deps/v8/src/base/logging.h
  - Refs https://github.com/mapbox/carmen-cache/issues/85 for an example of this bug
- Intentionally corrupt the v8 heap
  - Write docs that describe how to use the node --verify-heap flag to catch the problem. This could be tricky, but we can learn from how the v8 engineers try to do this: https://bugs.chromium.org/p/v8/issues/detail?id=2120

springmeyer commented 7 years ago

Further idea: write a function that uses v8 objects without a HandleScope and therefore leaks. Ensure that the leak checker run on Travis catches the leak.

springmeyer commented 7 years ago

This ticket is blue sky and most valuable for thinking about what is possible. However @mapsam over in https://github.com/mapbox/hpp-skel/pull/17#issuecomment-319199473 just did a quick test in hpp-skel to add a bug in a throwaway branch to confirm manually that the sanitizers are working. We should do the same for node-cpp-skel here, just to quickly confirm they are working. @mapsam want to pair with @GretaCB on this in a free moment?

- [x] Create a throwaway branch
- [x] Add a bug to the code (either a memory leak or undefined behavior or both), ideally inside the threadpool function
- [x] Ensure the correct sanitizer catches it.

GretaCB commented 7 years ago

@springmeyer I created a throwaway branch and removed the HandleScopes to trigger undefined behaviour within the threadpool. Looks like Travis successfully failed with the undefined sanitizer 🎉

springmeyer commented 7 years ago

> Looks like Travis successfully failed with the undefined sanitizer 🎉

👍 Per chat, this proves that things are working, which is great. So, going forward we can be confident that if we make a coding mistake in node-cpp-skel (or a project based on it) we should have help from the address and undefined sanitizers.

Note: The error that was thrown in your branch is actually one that is suppressed in the master branch. It comes from the vptr part of the undefined behavior sanitizer and arose when you added -fsanitize=undefined, since that overrode the -fno-sanitize=vptr,function at https://github.com/mapbox/node-cpp-skel/blob/72092b408dae95f12d067395e42176eb601ed8fb/scripts/setup.sh#L94. So, this still proves things are working, but it did not actually detect a problem caused by the code changes (only by your .travis.yml change: https://github.com/mapbox/node-cpp-skel/compare/sanitize#diff-354f30a63fb0907d4ad57269548329e3R79).

springmeyer commented 7 years ago

@GretaCB - noticed your last commit definitely triggered the use-after-free error, as we hoped! 🎉

The sanitizer output is verbose, but the key things to note are the type of error, heap-use-after-free, and the location where it was encountered: hello_async.cpp:130:17. That means line 130, column 17, which points exactly to the bug x[5]; at https://github.com/mapbox/node-cpp-skel/compare/663df40bb5ee...4df31d7a9ec5#diff-032fe0b90870cc3a55a6326685d9adbcR130

```
=================================================================
==4355==ERROR: AddressSanitizer: heap-use-after-free on address 0x60700000a825 at pc 0x7fb32b9d0926 bp 0x7fb329d38b50 sp 0x7fb329d38b48
READ of size 1 at 0x60700000a825 thread T8
    #0 0x7fb32b9d0925 in object_async::do_expensive_work(bool, std::string const&) /home/travis/build/mapbox/node-cpp-skel/build/../src/object_async/hello_async.cpp:130:17
```

at https://travis-ci.org/mapbox/node-cpp-skel/jobs/262424521#L664
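For readers who have not seen this class of bug, here is a minimal sketch of what a read like `x[5]` after a free can look like. It is a hypothetical stand-in, not the code from the branch: the function name `read_sixth_char` and the `TRIGGER_USE_AFTER_FREE` macro are made up for illustration. The buggy path is compiled only when the macro is defined, so the default build stays well-defined.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the intentional bug; not the code from the branch.
// Build with -DTRIGGER_USE_AFTER_FREE -fsanitize=address to exercise the bug.
char read_sixth_char(const std::string& input) {
    if (input.size() < 6) {
        return '\0'; // nothing at index 5 to read
    }
    // Copy the input into a heap buffer, as code passing data to a worker might.
    char* x = new char[input.size() + 1];
    input.copy(x, input.size());
    x[input.size()] = '\0';
#ifdef TRIGGER_USE_AFTER_FREE
    delete[] x;
    return x[5];        // BUG: one-byte read from memory that was already freed
#else
    char result = x[5]; // read while the buffer is still alive
    delete[] x;
    return result;
#endif
}
```

Without a sanitizer the buggy path will often appear to work, because the freed memory usually still holds the old bytes. That is what makes this kind of bug easy to miss and why the one-byte `READ` in the report above is so useful.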

GretaCB commented 7 years ago

Gave a couple more bugs a try in the sanitizer:

double free: attempting to free (deallocate) memory that has already been freed. This threw errors in the tests both with and without the sanitizer, but the sanitizer caught it as:
```
AddressSanitizer: attempting free on address which was not malloc()-ed: 0x60700000a821 in thread T8
```
memory leak: not explicitly deallocating memory, so that it "leaks": it remains reserved and can no longer be used by the program. This was caught by the sanitizer as:

```
LeakSanitizer: detected memory leaks
```
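As a rough sketch of these two bug classes (not the code that was actually tested), the function below has three build variants. The name `fill_buffer` and the macros `TRIGGER_DOUBLE_FREE` and `TRIGGER_LEAK` are hypothetical. With neither macro defined the function is correct.

```cpp
#include <cstddef>

// Hypothetical sketch; not the code that was run on Travis.
// Define TRIGGER_DOUBLE_FREE or TRIGGER_LEAK at build time, together with
// -fsanitize=address, to exercise one of the buggy paths.
std::size_t fill_buffer(std::size_t size) {
    char* buffer = new char[size];
    std::size_t written = 0;
    for (; written < size; ++written) {
        buffer[written] = 'a';
    }
#if defined(TRIGGER_LEAK)
    // BUG: the buffer is never released, so the allocation leaks.
#elif defined(TRIGGER_DOUBLE_FREE)
    delete[] buffer;
    delete[] buffer; // BUG: the same allocation is released twice
#else
    delete[] buffer; // correct: released exactly once
#endif
    return written;
}
```

One caveat on the wording of the reports: as far as I know, AddressSanitizer labels a repeated free of the same pointer as `attempting double-free`, while the `not malloc()-ed` wording quoted above is what it prints when the pointer handed to free is not the start of a live allocation. So the exact message depends on how the bug is written.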

springmeyer commented 7 years ago

Per chat - this ^^ is great news. I now propose pausing on this quest. There are ideas in the description of even more types of problems we could try to trigger (to see if the sanitizers catch the errors). But for the sake of time, it is probably best to use node-cpp-skel as a place to build back checks when we hit real-world problems, and not to put more effort now into trying to simulate errors. I'll therefore close this and open new issues if/when we:

- need to add more sanitizers
- hit a gnarly bug in production, find a way to detect it/prevent it, and want to build that learning back here.

mapbox / node-cpp-skel

Examples of dangerous bugs and diagnostics catching them #40
