osrg / gobgp

BGP implemented in the Go Programming Language
https://osrg.github.io/gobgp/
Apache License 2.0

prometheus support #1507

Open fujita opened 6 years ago

fujita commented 6 years ago

Is anyone interested in this?

CoreTex commented 6 years ago

Yes, but Zabbix support would be more interesting.

benerudolph commented 6 years ago

Maybe there is a generic way to handle monitoring via Prometheus as well as other monitoring systems?

nstgt commented 6 years ago

These metrics would be helpful for us.

kilo-bit commented 6 years ago

Prometheus support would be greatly appreciated.

imcom commented 6 years ago

+1 for Prometheus

tonidas commented 6 years ago

+1 for Prometheus!

xiaopeng163 commented 6 years ago

+1 for Prometheus!

greenpau commented 6 years ago

Is anyone interested in this?

@fujita, I am interested 😃 There are at least two ways to do that.

  1. A separate project, gobgp_exporter, which queries the gobgp API.
  2. A separate goroutine inside gobgp.

Any preference?

greenpau commented 6 years ago

@fujita, I looked at the gobgp code and I personally like option 1. Could you create a separate project, say gobgp_exporter, for the exporter under osrg? I could put up a PR for the exporter code pretty fast. The client API for gobgp is pretty straightforward, e.g. the showRibInfo() function.
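
For illustration, such a standalone exporter would be little more than the skeleton below, built on the Prometheus Go client library. Here getPeerPrefixCounts is just a placeholder for the real GoBGP client call (for example a neighbor listing over gRPC), and the metric name and port are made up for the sketch.

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// gobgpCollector queries the GoBGP API on every scrape (option 1).
type gobgpCollector struct {
	peerPrefixes *prometheus.Desc
}

func newGobgpCollector() *gobgpCollector {
	return &gobgpCollector{
		peerPrefixes: prometheus.NewDesc(
			"gobgp_peer_received_prefixes",
			"Number of prefixes received from a peer.",
			[]string{"peer"}, nil),
	}
}

func (c *gobgpCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.peerPrefixes
}

// Collect runs on each scrape and turns the API response into metrics.
func (c *gobgpCollector) Collect(ch chan<- prometheus.Metric) {
	for peer, count := range getPeerPrefixCounts() {
		ch <- prometheus.MustNewConstMetric(
			c.peerPrefixes, prometheus.GaugeValue, float64(count), peer)
	}
}

// getPeerPrefixCounts is a placeholder for the real GoBGP client call.
func getPeerPrefixCounts() map[string]int {
	return map[string]int{"192.0.2.1": 0}
}

func main() {
	prometheus.MustRegister(newGobgpCollector())
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9474", nil))
}

Each scrape of /metrics then triggers one round of API calls against the running gobgpd; the trade-offs of that sync model come up later in this thread.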

fujita commented 6 years ago

Sorry about the delay, I've just started to investigate implementation options.

@greenpau why can't we pick the best of options 1 and 2? Having a separate project (exporter) forces users to manage another daemon. Can't we run a goroutine that executes the gRPC API and handles HTTP requests from Prometheus?

I've just created a branch that only exports the received count for each peer: https://github.com/fujita/gobgp/commit/090cca2967b3251d4656b06e13864de4c1f8ac4b
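
Stripped to its essentials, that approach (option 2) is a goroutine inside the gobgpd process serving /metrics, roughly as in the sketch below. This is not the code in the branch; the metric name, the address, and the dummy update loop are all placeholders.

package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// receivedRoutes tracks the number of routes received from each peer.
var receivedRoutes = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "bgp_received_routes",
		Help: "Number of routes received from each BGP peer.",
	},
	[]string{"peer"},
)

func main() {
	prometheus.MustRegister(receivedRoutes)

	// The metrics endpoint runs in its own goroutine, alongside whatever
	// else the process (here, hypothetically, the BGP server) is doing.
	go func() {
		http.Handle("/metrics", promhttp.Handler())
		log.Fatal(http.ListenAndServe(":8080", nil))
	}()

	// In gobgpd this update would be driven by peer events; a dummy loop
	// stands in for the BGP server's main work.
	for {
		receivedRoutes.WithLabelValues("192.0.2.1").Set(42)
		time.Sleep(10 * time.Second)
	}
}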

greenpau commented 6 years ago

@fujita, it depends on the circumstances.

  1. Sync vs async polling. Imagine you want to monitor a routing table containing thousands of routes. Do you want the polling to be sync or async?
  2. Containerizing daemons. Once you containerize gobgp with Prometheus support built in, the different goroutines compete for the same CPU, because a container gets one CPU.
  3. Reloading configs. How do you reload the Prometheus exporter config?
  4. etc.

Prometheus exporters are designed on a per-app basis. Compare node_exporter and blackbox_exporter and you will get the gist.

fujita commented 6 years ago

  1. Sync vs async polling.

I don't understand. I don't see how this is related to the choice between an exporter and built-in exposition.

  2. Containerizing daemons.

One CPU per container? Why? You can configure whatever you like.

  3. Reloading configs. How do you reload the Prometheus exporter config?

I had a quick look at the official exporters (mysql, consul). It looks like they don't support config reloading.

greenpau commented 6 years ago

@fujita, see below.

One CPU per container? Why? You can configure whatever you like.

Not in containers. With Docker, for example, you can configure cpusets. However, that setting just allows you to switch between CPUs; it does not allow running on multiple CPUs concurrently.

I don't understand. I don't see how this is related to the choice between an exporter and built-in exposition.

Polling = scraping.

Sync scraping means you make a call to the exporter's URL and data collection starts in response. Once the data is collected, it is returned to your scraper.

Async scraping means that data collection runs independently of the scrape calls. In that mode, the scraper gets the data from a collection buffer.

Why does that matter? Suppose your exporter has hundreds of metrics and data collection takes 1-3 seconds. That is the time during which your GoBGP server goroutine competes with the collector goroutines for a single CPU's time. It obviously does not matter when you only collect the status of BGP peers, because that probably takes 0.00001 s.
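
In code, the async mode amounts to a background goroutine refreshing a buffered snapshot while Collect() only reads from it, roughly as sketched below (pollGoBGP is a stand-in for the real data collection, and the 15-second interval is arbitrary).

package collector

import (
	"sync"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// asyncCollector keeps a buffered snapshot that a background goroutine
// refreshes, so a scrape never waits on a slow GoBGP query.
type asyncCollector struct {
	mu     sync.RWMutex
	routes float64 // last observed route count
	desc   *prometheus.Desc
}

func newAsyncCollector() *asyncCollector {
	c := &asyncCollector{
		desc: prometheus.NewDesc("gobgp_rib_routes",
			"Routes in the RIB, as of the last background poll.", nil, nil),
	}
	go c.poll() // data collection runs independently of scrape calls
	return c
}

// poll refreshes the buffer on its own schedule.
func (c *asyncCollector) poll() {
	for {
		v := pollGoBGP() // stand-in for the real data collection
		c.mu.Lock()
		c.routes = v
		c.mu.Unlock()
		time.Sleep(15 * time.Second)
	}
}

// Collect serves whatever is in the buffer; it does not trigger a poll.
func (c *asyncCollector) Collect(ch chan<- prometheus.Metric) {
	c.mu.RLock()
	v := c.routes
	c.mu.RUnlock()
	ch <- prometheus.MustNewConstMetric(c.desc, prometheus.GaugeValue, v)
}

func (c *asyncCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.desc
}

// pollGoBGP stands in for querying the GoBGP API.
func pollGoBGP() float64 { return 0 }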

It looks like they don't support config reloading.

Right. Suppose you want to change the configuration of the exporter; would you restart GoBGP and drop the neighbor sessions with your peers?

fujita commented 6 years ago

Not in containers. With Docker, for example, you can configure cpusets. However, that setting just allows you to switch between CPUs; it does not allow running on multiple CPUs concurrently.

Hmm, really? Inside a single container, can you try to run two processes, each executing an endless loop? Then look at the output of the top command on the host OS.

greenpau commented 6 years ago

Hmm, really?

Your GoBGP and Prometheus goroutines are not separate processes. They are goroutines. Hence, they will be on the same CPU.

greenpau commented 6 years ago

can you try to run two processes, each executing an endless loop?

@fujita, that would work :-) However, those would NOT be parent and child processes like in a container. They would be two separate parents. In a container, you have a single parent (akin to init) and a bunch of child processes.

greenpau commented 6 years ago

By default, processes can run on any CPU. However, processes can be instructed to run on a predetermined selection of CPUs by changing the affinity of the process. Child processes inherit the CPU affinities of their parents. https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_MRG/1.3/html/Realtime_Reference_Guide/chap-Realtime_Reference_Guide-Affinity.html

@fujita, I refreshed my memory on the subject. Here is a snippet to run the test you proposed.

As the sample runs below demonstrate, the amount of real and CPU time consumed when running the program will depend on intra-core caching effects and whether the processes are using the same CPU.

fujita commented 6 years ago

can you try to run two processes, each executing an endless loop?

@fujita, that would work :-) However, those would NOT be parent and child processes like in a container. They would be two separate parents. In a container, you have a single parent (akin to init) and a bunch of child processes.

Please run a process that creates two child processes, each executing an endless loop, and then waits forever. Look at the output of top on the host.

greenpau commented 6 years ago

Please run a process that creates two child processes, each executing an endless loop, and then waits forever. Look at the output of top on the host.

@fujita , yes, two processes running in a container can be on separate CPUs. My mistake.

The test script:

import os

def child():
   print('\nA new child ',  os.getpid())
   os._exit(0)

def parent():
   while True:
      newpid = os.fork()
      if newpid == 0:
         child()
      else:
         pids = (os.getpid(), newpid)
         print("parent: %d, child: %d\n" % pids)
      reply = input("q for quit / c for new fork")
      if reply == 'c':
          continue
      else:
          break

parent()

Running the test:

# python test.py
parent: 29, child: 30

q for quit / c for new fork('\nA new child ', 30)

The two processes are 1847 (parent) and 1848 (child):

$ ps -ef | grep python
root      1847  1817  0 20:05 pts/3    00:00:00 python test.py
root      1848  1847  0 20:05 pts/3    00:00:00 [python] <defunct>

CPU Sets:

# cat /proc/1847/status  | grep Cpus
Cpus_allowed:   f
Cpus_allowed_list:  0-3

# cat /proc/1848/status  | grep Cpus
Cpus_allowed:   f
Cpus_allowed_list:  0-3

The CPU each process last ran on (field 39 of /proc/<pid>/stat):

#  cat /proc/1848/stat | cut -d' ' -f39
1

#  cat /proc/1847/stat | cut -d' ' -f39
0
fujita commented 6 years ago

Good, now you understand that goroutines DON'T compete for a single CPU inside a container. A container is just namespaces and processes from the perspective of the Linux kernel, so it can use as many CPUs as you have.

As I see it, some official exporters don't support reloading because it isn't very useful, so I'll go with the current design. If someone needs reloading, it's pretty easy for them to implement their own gobgp exporter by using gobgp/exporter as a package; they just need a main function.

greenpau commented 6 years ago

now you understand that goroutines DON'T compete for a single CPU inside a container.

@fujita, overconfidence 😆 They DO compete when there is only one CPU to compete over 😄 They also compete when you have more goroutines than CPUs. I get that an application, over time, might use all CPUs. How do you know that goroutines are being load-balanced across all CPUs?
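
One way to check empirically is to start several busy goroutines and watch the per-CPU usage in top (press 1 for the per-core view); with GOMAXPROCS at its default of runtime.NumCPU() (the default since Go 1.5), they spread across cores. A minimal sketch:

package main

import (
	"fmt"
	"runtime"
)

func main() {
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0), "NumCPU:", runtime.NumCPU())

	// Start one busy goroutine per CPU; with the default GOMAXPROCS they
	// end up on separate OS threads and therefore separate CPUs.
	for i := 0; i < runtime.NumCPU(); i++ {
		go func() {
			for {
			}
		}()
	}
	select {} // block forever; watch per-CPU usage in top (press 1)
}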

fujita commented 6 years ago

There is no difference between the exporter design and the current design in that respect.

greenpau commented 6 years ago

FYI, check this article for CPU and memory profiling with Go.

jeffbean commented 6 years ago

Tally allows you to define and emit metrics in different ways. It's an option that lets you configure multiple ways of collecting/emitting metrics. Just a suggestion.

greenpau commented 5 years ago

@fujita, I started with an external exporter here: https://github.com/greenpau/gobgp_exporter. I would love to get your feedback on the code. There are probably some issues with concurrency; I'm still getting up to speed with Go. I spent my weekend on this 😄

fujita commented 5 years ago

@greenpau nice, but why did you modify the pb.go file?

greenpau commented 5 years ago

why did you modify the pb.go file?

@fujita, I guess the primary reason is that I want to control the timeout on the client.

Collect() will be called on each scrape. If the daemon does not respond, the scrape holds the session and does not time out right away, which results in hanging scrapes. The wait is not long, but it should time out within a second or two. In my last commit, I added logic so that the GoBGP polling and the scraping happen asynchronously. There is also some concurrency logic.

Additionally (somewhat related, but not quite): since the API changes (hiding code behind /internal), the only way to get to some functionality (e.g. things related to ptype, any, etc.) is by adding extra functions, the way I did with the extended client. Not ideal. I would prefer to have that code upstream, but I don't know whether you want it there.

greenpau commented 5 years ago

I think I am close to getting some of the things mentioned here done.

fujita commented 5 years ago

Why not use context.WithTimeout?

greenpau commented 5 years ago

@fujita, because I don't know how context works. I will take time to read up on it.

greenpau commented 5 years ago

Why not use context.WithTimeout?

@fujita, I have been reading up on Go's context package. Say I implement the context with a timeout, and the timeout gets triggered. How do I close the connection? (Ideally, in case of a timeout, I would prefer not to keep half-closed connections.) How do I assess whether the connection is still alive?

nemith commented 3 years ago

@greenpau you do a select on ctx.Done() at some point in your connection handling. This is automatic with the net package, I believe.
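
For reference, the usual pattern with a gRPC client looks roughly like the sketch below (not GoBGP-specific; the address and timeouts are arbitrary). context.WithTimeout bounds the dial and each call, the connection is closed explicitly with conn.Close(), and conn.GetState() gives a rough liveness check.

package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
)

func main() {
	// Bound the dial with a timeout so a dead daemon cannot hang the scrape.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	conn, err := grpc.DialContext(ctx, "127.0.0.1:50051",
		grpc.WithInsecure(), // assumed plaintext endpoint
		grpc.WithBlock())    // block so the timeout applies to connecting
	if err != nil {
		log.Fatalf("dial failed (timed out or unreachable): %v", err)
	}
	// Cancellation does not close the connection; do it explicitly.
	defer conn.Close()

	// For a rough liveness check, conn.GetState() reports the current
	// connectivity state (Ready, TransientFailure, ...).
	log.Printf("connection state: %v", conn.GetState())

	// Per-call timeouts work the same way: pass a context.WithTimeout as
	// the first argument to the generated client method; the RPC is
	// cancelled when the context expires (the client library selects on
	// ctx.Done() internally, as noted above).
}

Whether half-open connections linger then depends on conn.Close() being called on every exit path, not on the context itself.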