mosn / holmes

self-aware Golang profile dumper
Apache License 2.0

Holmes

Chinese version

Self-aware Golang profile dumper.

Our online system often crashes at midnight (usually killed by the OS due to OOM). As lazy developers, we don't want to be woken up at midnight and then wait for the online error to recur.

Holmes comes to the rescue.

Design

Holmes collects stats of the current process at every collect interval. The dump rules described in this document are based on goroutine number, CPU usage, and RSS.

In addition, if you enable the GC heap dump, Holmes collects RSS based on the GC cycle.

After the warm-up phase (the first 10 collects after the application starts), Holmes compares the current stats with the average of the previously collected stats (10 cycles). If a dump rule is matched, Holmes dumps the related profile to the log (text mode) or to a binary file (binary mode).
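The comparison described above can be sketched with the standard library alone. This is only an illustration under stated assumptions, not Holmes' actual implementation: the `rule`, `window`, and `shouldDump` names are hypothetical, the threshold fields borrow the min/diff/abs names used elsewhere in this README, and the exact inequality Holmes applies may differ.

```go
package main

import "fmt"

// warmup is the number of samples kept, mirroring the 10 warm-up collects.
const warmup = 10

// rule holds hypothetical thresholds (assumed semantics, not Holmes' API).
type rule struct {
	min  int // values at or below this floor never trigger a dump
	diff int // percent increase over the recent average that triggers a dump
	abs  int // absolute value that triggers a dump regardless of the average
}

// window is a fixed-size ring of the most recent samples.
type window struct {
	data []int
	next int
}

// push records a sample, overwriting the oldest one once the ring is full.
func (w *window) push(v int) {
	if len(w.data) < warmup {
		w.data = append(w.data, v)
		return
	}
	w.data[w.next] = v
	w.next = (w.next + 1) % warmup
}

// warm reports whether the warm-up phase has finished.
func (w *window) warm() bool { return len(w.data) == warmup }

// avg returns the integer average of the stored samples.
// It must only be called on a non-empty window.
func (w *window) avg() int {
	sum := 0
	for _, v := range w.data {
		sum += v
	}
	return sum / len(w.data)
}

// shouldDump applies the assumed rule: after warm-up, dump when the current
// value is above the floor and either reaches the absolute threshold or
// exceeds the recent average by at least diff percent.
func shouldDump(r rule, w *window, cur int) bool {
	if !w.warm() || cur <= r.min {
		return false
	}
	return cur >= r.abs || cur*100 >= w.avg()*(100+r.diff)
}

func main() {
	r := rule{min: 10, diff: 25, abs: 2000}
	w := &window{}
	for i := 0; i < warmup; i++ {
		w.push(100) // steady state: 100 goroutines per sample
	}
	for _, cur := range []int{110, 130, 2500} {
		fmt.Printf("current=%d dump=%v\n", cur, shouldDump(r, w, cur))
	}
}
```

Keeping only the last 10 samples means the baseline follows slow, legitimate growth, while a sudden jump relative to that baseline still matches the rule.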

When your own monitoring system sends a warning (for example, memory usage exceeds 80%, the process is OOM killed, CPU usage exceeds 80%, or the goroutine number exceeds 100k), the profile has already been dumped to your dump path. You can simply fetch the profile and see what actually happened, without pressure.

How to use

    go get mosn.io/holmes

Dump goroutine when goroutine number spikes

    h, _ := holmes.New(
        holmes.WithCollectInterval("5s"),
        holmes.WithDumpPath("/tmp"),
        holmes.WithTextDump(),
        holmes.WithDumpToLogger(true),
        holmes.WithGoroutineDump(10, 25, 2000, 10*1000, time.Minute),
    )
    h.EnableGoroutineDump()

    // start the metrics collect and dump loop
    h.Start()

    // stop the dumper
    h.Stop()

Dump CPU profile when CPU load spikes

    h, _ := holmes.New(
        holmes.WithCollectInterval("5s"),
        holmes.WithDumpPath("/tmp"),
        holmes.WithCPUDump(20, 25, 80, time.Minute),
        holmes.WithCPUMax(90),
    )
    h.EnableCPUDump()

    // start the metrics collect and dump loop
    h.Start()

    // stop the dumper
    h.Stop()

Dump heap profile when RSS spikes

    h, _ := holmes.New(
        holmes.WithCollectInterval("5s"),
        holmes.WithDumpPath("/tmp"),
        holmes.WithTextDump(),
        holmes.WithMemDump(30, 25, 80, time.Minute),
    )

    h.EnableMemDump()

    // start the metrics collect and dump loop
    h.Start()

    // stop the dumper
    h.Stop()

Dump heap profile when RSS spikes, based on GC cycle

In some situations we cannot get useful information, for example when the application allocates heap memory and that memory is garbage-collected within a single CollectInterval. So we designed a new heap memory monitoring rule, based on the GC cycle, to control when holmes dumps. When RSS spikes, it dumps two heap profiles in a row, so developers can compare them with the pprof -base option.

    h, _ := holmes.New(
        holmes.WithDumpPath("/tmp"),
        holmes.WithLogger(holmes.NewFileLog("/tmp/holmes.log", mlog.INFO)),
        holmes.WithBinaryDump(),
        holmes.WithMemoryLimit(100*1024*1024), // 100MB
        holmes.WithGCHeapDump(10, 20, 40, time.Minute),
        // holmes.WithProfileReporter(reporter),
    )
    h.EnableGCHeapDump().Start()
    time.Sleep(time.Hour)

Set holmes configurations on the fly

You can use the Set method to modify holmes' configuration while the application is running.

    h.Set(
        holmes.WithCollectInterval("2s"),
        holmes.WithGoroutineDump(min, diff, abs, 90, time.Minute))

Report dump events

You can implement a Reporter to receive every dump event together with the profile data:

    // ReporterImpl is an implementation of the holmes.ProfileReporter interface.
    type ReporterImpl struct{}

    func (r *ReporterImpl) Report(pType string, filename string, reason holmes.ReasonType, eventID string, sampleTime time.Time, pprofBytes []byte, scene holmes.Scene) error {
        // do something
        return nil
    }
    ......
    r := &ReporterImpl{}
    h, _ := holmes.New(
        holmes.WithProfileReporter(r),
        holmes.WithDumpPath("/tmp"),
        holmes.WithLogger(holmes.NewFileLog("/tmp/holmes.log", mlog.INFO)),
        holmes.WithBinaryDump(),
        holmes.WithMemoryLimit(100*1024*1024), // 100MB
        holmes.WithGCHeapDump(10, 20, 40, time.Minute),
    )

Enable holmes as pyroscope client

Holmes can upload your profiles to a pyroscope server. For more details, click here.

Note: do NOT set TextDump when you enable holmes as a pyroscope client.

Enable them all!

It's easy.

    h, _ := holmes.New(
        holmes.WithCollectInterval("5s"),
        holmes.WithDumpPath("/tmp"),
        holmes.WithTextDump(),

        holmes.WithCPUDump(10, 25, 80, time.Minute),
        holmes.WithMemDump(30, 25, 80, time.Minute),
        holmes.WithGCHeapDump(10, 20, 40, time.Minute),
        holmes.WithGoroutineDump(500, 25, 20000, 0, time.Minute),
    )

    h.EnableCPUDump().
        EnableGoroutineDump().
        EnableMemDump().
        EnableGCHeapDump().Start()

Running in docker or another cgroup-limited environment

    h, _ := holmes.New(
        holmes.WithCollectInterval("5s"),
        holmes.WithDumpPath("/tmp"),
        holmes.WithTextDump(),

        holmes.WithCPUDump(10, 25, 80, time.Minute),
        holmes.WithCGroup(true), // set cgroup to true
    )

Known risks

If your Go version is lower than 1.19, collecting the goroutine profile itself may cause a latency spike because of a long STW (stop-the-world) pause. Go 1.19 optimized this by collecting the profile concurrently; see this CL.

Show cases

Click here

Contributing

See our contributor guide.

Community

Scan the QR code below with DingTalk(钉钉) to join the Holmes user group.

dingtalk