kobolog / gorb

IPVS on steroids – REST API, heartbeats, service discovery and more
GNU Lesser General Public License v3.0
809 stars 83 forks source link

Add prometheus monitoring #27

Closed crisidev closed 7 years ago

crisidev commented 8 years ago

This is how I instructed Gorb to produce metrics to be consumed by my Prometheus server.

Every time a service or a backend is created, a new time-serie is exposed. The related goroutine updates health metrics from the Pulse. Once the backed or the service is removed, the related time-serie is deleted to allow Prometheus knows about the change.

Gorb exposes Prometheus consumable metrics on http://:4672/metrics

Available timeseries

curl http://gorb:4672/metrics |grep gorb_lbs
...
gorb_lbs_health{host="",lb_name="nginx-443",method="wlc",persistent="true",port="443",protocol="tcp"} 0.5
gorb_lbs_health{host="",lb_name="nginx-80",method="wlc",persistent="true",port="443",protocol="tcp"} 1
...
gorb_lbs_backends_number{host="",lb_name="nginx-443",method="wlc",port="443"} 2
gorb_lbs_backends_number{host="",lb_name="nginx-80",method="wlc",port="443"} 2
...
gorb_lbs_backends_health{backend_name="10.0.0.4-443",host="10.0.0.4",lb_name="nginx-443",method="nat",port="443"} 1
gorb_lbs_backends_health{backend_name="10.0.128.2-443",host="10.0.128.2",lb_name="nginx-443",method="nat",port="443"} 0
gorb_lbs_backends_health{backend_name="10.0.160.2-80",host="10.0.160.2",lb_name="nginx-80",method="nat",port="80"} 1
gorb_lbs_backends_health{backend_name="10.0.32.2-80",host="10.0.32.2",lb_name="nginx-80",method="nat",port="80"} 1
...
gorb_lbs_backends_status{backend_name="10.0.0.4-443",host="10.0.0.4",lb_name="nginx-443",method="nat",port="443"} 0
gorb_lbs_backends_status{backend_name="10.0.128.2-443",host="10.0.128.2",lb_name="nginx-443",method="nat",port="443"} 1
gorb_lbs_backends_status{backend_name="10.0.160.2-80",host="10.0.160.2",lb_name="nginx-80",method="nat",port="80"} 0
gorb_lbs_backends_status{backend_name="10.0.32.2-80",host="10.0.32.2",lb_name="nginx-80",method="nat",port="80"} 0
...
gorb_lbs_backends_weight{backend_name="10.0.0.4-443",host="10.0.0.4",lb_name="nginx.crisidev.org-443",method="nat",port="443"} 100
gorb_lbs_backends_weight{backend_name="10.0.128.2-443",host="10.0.128.2",lb_name="nginx.crisidev.org-443",method="nat",port="443"} 0
gorb_lbs_backends_weight{backend_name="10.0.160.2-80",host="10.0.160.2",lb_name="nginx.crisidev.org-80",method="nat",port="80"} 50
gorb_lbs_backends_weight{backend_name="10.0.32.2-80",host="10.0.32.2",lb_name="nginx.crisidev.org-80",method="nat",port="80"} 50

Dashboard

This allows you to create a dashboard like this:

Prometheus Dashboard

crisidev commented 8 years ago

Hum, interesting.. I added prometheus client in travis file. Maybe I don't understand how the build and coverage check is performed. Am I missing something? :smile:

crisidev commented 8 years ago

Go tip (https://travis-ci.org/kobolog/gorb/jobs/130436526) fails during testing:

--- FAIL: TestMustMarshalPanic (0.00s)
    Error Trace:    util_test.go:69
    Error:      func (assert.PanicTestFunc)(0x472300) should panic
    Panic value:    <nil>

Go 1.5 compiles fine.

noxiouz commented 8 years ago

@crisidev there is no panice in tip, becasue tip version supports json.Encoding of maps with non-string keys. So this test is not valid for tip. https://github.com/golang/go/commit/f05c3aa24d815cd3869153750c9875e35fc48a6e

At first glance you modify two global defined maps from different goroutines. Looks like data race

crisidev commented 8 years ago

Interesting, I didn't figure out. Ok, I'll check and fix this.

kobolog commented 8 years ago

Hi Matteo!

This is an amazing PR, thank you so much! Overall, I have a couple minor and one major comment: you seem not take any locks when you work with 1) notification channel maps in your code 2) accessing backends and such in the Context. It's a race since multiple clients can come at the same time via HTTP handlers.

crisidev commented 8 years ago

Thanks for bringing gorb to life 😉

I am not taking any lock as I did not figure out I was going into race accessing Context.

I will add a locking mechanism in my goroutine calls and when I access Context.

Do you have any preferred way to do it?

kobolog commented 8 years ago

Hi Matteo,

Yeah, the race is introduced due to the fact that Go's built-in HTTP server essentially spawns a goroutine on every new HTTP request and these goroutines might execute concurrently, accessing the same map for reads and writes at the same time.

I don't have any preference for locking: the most straightforward way is to hold an RLock for reads and a normal Lock for writes and that's it.

One sidenote: in the high-performance code it's usually better to call Lock() and Unlock() manually instead of defer mutex.Unlock() since defer introduces a runtime penalty, but here it's not the case since GORB doesn't have any high-performance code paths.

crisidev commented 8 years ago

Hey Andrey, thanks for the help. I have done the change to my control channels and added locking when I read from shared map like ctx.backends and ctx.services.

I am not taking locks when I call GetService() or GetBackend since the lock is taken inside the called function. Do you think it is enough?

Regarding the error with travis build on go tip, I honestly don't know how to fix that test on go 1.5 and tip...

codecov-io commented 8 years ago

Current coverage is 98.45%

Merging #27 into master will increase coverage by 0.27%

  1. 4 files (not in diff) in util were modified. more
    • Misses -1
    • Hits -6
  2. 4 files (not in diff) in pulse were modified. more
    • Hits -7
  3. 2 files (not in diff) in disco were modified. more
    • Hits -5
  4. 3 files in pulse were modified. more
    • Hits -5
@@           master        #27   diff @@
========================================
  Files          13         13          
  Lines         218        194    -24   
  Methods         0          0          
  Branches       34          0    -34   
========================================
- Hits          214        191    -23   
+ Misses          3          2     -1   
  Partials        1          1          

Powered by Codecov. Last updated by 2d8df45...3c3a284

noxiouz commented 8 years ago

I honestly don't know how to fix that test on go 1.5 and tip...

I think this test could be dropped because it's not valid anymore. Or you can use conditional build for the test function.

kobolog commented 8 years ago

Hey, so how's it going with testing? I'm excited about merging this PR!

crisidev commented 8 years ago

Very busy right now and there is a problem on locking on deletion. I will have time to work on that and fix it over the weekend. I'll be back ASAP! :) On 25 May 2016 8:32 p.m., "Andrey Sibiryov" notifications@github.com wrote:

Hey, so how's it going with testing? I'm excited about merging this PR!

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/kobolog/gorb/pull/27#issuecomment-221681936

kobolog commented 8 years ago

Hey @crisidev, do you think you'll be able to get this PR going?

kobolog commented 7 years ago

Hey @crisidev, do you think you'll get some time to get this PR ready to roll?

crisidev commented 7 years ago

@kobolog I'll try to get that done during my Xmas holidays.. I don't have the test infrastructure anymore, hence I need to rebuild that first. I am sorry for keeping this so long..

crisidev commented 7 years ago

Folk, I cannot reproduce the settings I had to fully test and fix this. I am sorry. I think it is better to close this PR.

I'll try to make another one sooner or later. Otherwise you can start from my idea here. Sorry about this.