sipcapture / heplify-server

HEP Capture Server for HOMER
https://sipcapture.org
GNU Affero General Public License v3.0

too many metrics cause heplify-server output to hang #36

Closed: games130 closed this issue 6 years ago

games130 commented 6 years ago

I made a change to prometheus.go: I added a new counter (to capture all SIP messages with IP and port).

p.CounterVecMetrics["heplify_method_capture01"] = prometheus.NewCounterVec(prometheus.CounterOpts{Name: "heplify_method_capture01", Help: "All SIP message counter"}, []string{"method", "cseq_method", "source_ip", "source_port", "destination_ip", "destination_port"})

With the newly added counter it runs fine at first (heplify-server exposes the output, and Telegraf picks it up and stores it in InfluxDB), but after a few minutes it hangs / stops working.

The Telegraf log says "getsockopt: connection refused". Some of the log output is below. Any idea how to debug this hang?

I have tested outputting a smaller number of metrics and it runs fine with no hangs.

Apr 16 21:24:10 localhost telegraf: 2018-04-16T13:24:10Z D! Output [influxdb] wrote batch of 1000 metrics in 50.68262ms
Apr 16 21:24:10 localhost telegraf: 2018-04-16T13:24:10Z D! Output [influxdb] wrote batch of 1000 metrics in 41.738095ms
Apr 16 21:24:10 localhost telegraf: 2018-04-16T13:24:10Z D! Output [influxdb] wrote batch of 1000 metrics in 34.860311ms
Apr 16 21:24:10 localhost telegraf: 2018-04-16T13:24:10Z E! Error in plugin [inputs.prometheus]: took longer to collect than collection interval (1s)
Apr 16 21:24:10 localhost telegraf: 2018-04-16T13:24:10Z D! Output [influxdb] wrote batch of 1000 metrics in 37.493373ms
Apr 16 21:24:10 localhost telegraf: 2018-04-16T13:24:10Z D! Output [influxdb] wrote batch of 1000 metrics in 48.7213ms
Apr 16 21:24:10 localhost telegraf: 2018-04-16T13:24:10Z D! Output [influxdb] wrote batch of 1000 metrics in 36.79713ms
Apr 16 21:24:10 localhost telegraf: 2018-04-16T13:24:10Z D! Output [influxdb] wrote batch of 1000 metrics in 41.394166ms
Apr 16 21:24:10 localhost telegraf: 2018-04-16T13:24:10Z D! Output [influxdb] wrote batch of 1000 metrics in 29.940258ms
Apr 16 21:24:10 localhost telegraf: 2018-04-16T13:24:10Z D! Output [influxdb] wrote batch of 1000 metrics in 35.980791ms
Apr 16 21:24:10 localhost telegraf: 2018-04-16T13:24:10Z D! Output [influxdb] wrote batch of 1000 metrics in 28.388011ms
Apr 16 21:24:10 localhost telegraf: 2018-04-16T13:24:10Z D! Output [influxdb] wrote batch of 1000 metrics in 38.591658ms
Apr 16 21:24:10 localhost telegraf: 2018-04-16T13:24:10Z D! Output [influxdb] wrote batch of 1000 metrics in 40.697321ms
Apr 16 21:24:10 localhost telegraf: 2018-04-16T13:24:10Z D! Output [influxdb] wrote batch of 1000 metrics in 44.516893ms
Apr 16 21:24:10 localhost telegraf: 2018-04-16T13:24:10Z D! Output [influxdb] wrote batch of 1000 metrics in 37.445192ms
Apr 16 21:24:10 localhost telegraf: 2018-04-16T13:24:10Z D! Output [influxdb] wrote batch of 1000 metrics in 37.947831ms
Apr 16 21:24:11 localhost telegraf: 2018-04-16T13:24:11Z D! Output [influxdb] wrote batch of 1000 metrics in 45.27528ms
Apr 16 21:24:11 localhost telegraf: 2018-04-16T13:24:11Z D! Output [influxdb] wrote batch of 1000 metrics in 39.588898ms
Apr 16 21:24:11 localhost telegraf: 2018-04-16T13:24:11Z D! Output [influxdb] wrote batch of 1000 metrics in 41.734043ms
Apr 16 21:24:11 localhost telegraf: 2018-04-16T13:24:11Z E! Error in plugin [inputs.prometheus]: took longer to collect than collection interval (1s)
Apr 16 21:24:11 localhost telegraf: 2018-04-16T13:24:11Z D! Output [influxdb] wrote batch of 1000 metrics in 47.767747ms
Apr 16 21:24:11 localhost telegraf: 2018-04-16T13:24:11Z D! Output [influxdb] wrote batch of 1000 metrics in 35.308999ms
Apr 16 21:24:11 localhost telegraf: 2018-04-16T13:24:11Z D! Output [influxdb] wrote batch of 1000 metrics in 37.057446ms
Apr 16 21:24:11 localhost telegraf: 2018-04-16T13:24:11Z D! Output [influxdb] wrote batch of 1000 metrics in 38.844666ms
Apr 16 21:24:11 localhost telegraf: 2018-04-16T13:24:11Z D! Output [influxdb] wrote batch of 1000 metrics in 37.312039ms
Apr 16 21:24:11 localhost telegraf: 2018-04-16T13:24:11Z D! Output [influxdb] wrote batch of 1000 metrics in 38.181017ms
Apr 16 21:24:11 localhost telegraf: 2018-04-16T13:24:11Z E! Error in plugin [inputs.prometheus]: error making HTTP request to http://localhost:9999/metrics: Get http://localhost:9999/metrics: dial tcp [::1]:9999: getsockopt: connection refused
Apr 16 21:24:11 localhost telegraf: 2018-04-16T13:24:11Z D! Output [influxdb] wrote batch of 1000 metrics in 64.495392ms
Apr 16 21:24:12 localhost telegraf: 2018-04-16T13:24:12Z E! Error in plugin [inputs.prometheus]: error making HTTP request to http://localhost:9999/metrics: Get http://localhost:9999/metrics: dial tcp [::1]:9999: getsockopt: connection refused
Apr 16 21:24:13 localhost telegraf: 2018-04-16T13:24:13Z E! Error in plugin [inputs.prometheus]: error making HTTP request to http://localhost:9999/metrics: Get http://localhost:9999/metrics: dial tcp [::1]:9999: getsockopt: connection refused
Apr 16 21:24:14 localhost telegraf: 2018-04-16T13:24:14Z E! Error in plugin [inputs.prometheus]: error making HTTP request to http://localhost:9999/metrics: Get http://localhost:9999/metrics: dial tcp [::1]:9999: getsockopt: connection refused
Apr 16 21:24:15 localhost telegraf: 2018-04-16T13:24:15Z E! Error in plugin [inputs.prometheus]: error making HTTP request to http://localhost:9999/metrics: Get http://localhost:9999/metrics: dial tcp [::1]:9999: getsockopt: connection refused

negbie commented 6 years ago

Hi, Telegraf exposes some metrics of its own too. There are also some similar GitHub issues for Telegraf; maybe they can help.

I would suggest you rethink your additions. Every tag value creates a new series, so every new IP/port combination creates a new series. This might work in low-traffic environments, but with more traffic you will end up with such high cardinality in your data that it will slow down processing and eat your RAM.

So when you have real traffic, avoid tags with high cardinality like IPs, user agents, and phone numbers.
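To illustrate the cardinality point, here is a minimal, hypothetical sketch (not heplify-server code; metric names are invented) contrasting a bounded label set with one keyed on per-endpoint data:

    package main

    import "github.com/prometheus/client_golang/prometheus"

    func main() {
        // Bounded labels: only a handful of SIP methods exist, so the number of
        // series stays small and predictable.
        methodCounter := prometheus.NewCounterVec(
            prometheus.CounterOpts{Name: "sip_messages_total", Help: "SIP messages by method"},
            []string{"method"},
        )
        prometheus.MustRegister(methodCounter)
        methodCounter.WithLabelValues("INVITE").Inc()

        // Unbounded labels: every new IP/port combination creates a brand-new series,
        // so memory use and scrape size grow with the number of endpoints seen.
        endpointCounter := prometheus.NewCounterVec(
            prometheus.CounterOpts{Name: "sip_messages_by_endpoint_total", Help: "SIP messages by endpoint"},
            []string{"method", "source_ip", "source_port", "destination_ip", "destination_port"},
        )
        prometheus.MustRegister(endpointCounter)
        endpointCounter.WithLabelValues("INVITE", "10.0.0.1", "5060", "10.0.0.2", "5060").Inc()
    }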

games130 commented 6 years ago

Okay, I will try posting for help in those issues.

Also, are you working on a way to add new metrics without recompiling the whole codebase?

negbie commented 6 years ago

Hm, this would require integrating a scripting language like Lua or Python (Datadog recently embedded Python into their Go agent). I can't tell you right now whether I will address it; first I need to fix some more important things.

What kind of metrics would you like to add?

games130 commented 6 years ago

I am able to add new metrics; currently I am adding a metric to monitor individual SIP trunks, with these tags (which are working fine at the moment: I can generate statistics for calls incoming to and outgoing from the SIP trunk):

p.CounterVecMetrics["heplify_method_capture03"] = prometheus.NewCounterVec(prometheus.CounterOpts{Name: "heplify_method_capture03", Help: "All SIP message counter"}, []string{"method", "cseq_method", "source_ip", "destination_ip"})

I am only asking because the current approach requires recompiling heplify-server for every metric change. But it is not something that really needs to be addressed now.

negbie commented 6 years ago

Take a look at the promhunterip flags. Maybe they can help you.

negbie commented 6 years ago

You can specify an IP and a name, for example 192.168.1.1 and loadbalancer. Then you will get SIP method stats for this src/dst IP, tagged with loadbalancer.

negbie commented 6 years ago

Any news here @games130? I tried to reproduce it but I can't. What I did:

1) Add the following to Telegraf (please replace with your IP and port):

[[inputs.prometheus]]
  urls = ["http://192.168.11.1:9999/metrics"]

2) To get per-second values from counters (in this case, concurrent calls):

SELECT NON_NEGATIVE_DERIVATIVE(mean("counter"),1s) AS "mean_counter" FROM "telegraf"."autogen"."heplify_method_response" WHERE time > now() - 5m AND "method"='INVITE' AND "response"='200' GROUP BY time(:interval:)

games130 commented 6 years ago

I actually modified prometheus.go and added extra metric measurements.

p.CounterVecMetrics["heplify_method_capture01"] = prometheus.NewCounterVec(prometheus.CounterOpts{Name: "heplify_method_capture01", Help: "All SIP message counter"}, []string{"method", "cseq_method", "user-agent", "source_ip", "source_port", "destination_ip", "destination_port"})

Then I added this line for all IP addresses:

p.CounterVecMetrics["heplify_method_capture01"].WithLabelValues(pkt.SIP.StartLine.Method, pkt.SIP.CseqMethod, pkt.SIP.UserAgent, pkt.SrcIP, strconv.FormatUint(uint64(pkt.SrcPort), 10), pkt.DstIP, strconv.FormatUint(uint64(pkt.DstPort), 10)).Inc()

    if pkt.SIP != nil && pkt.ProtoType == 1 {
        if !p.TargetEmpty {
            ...
        } else {
I added the increment after the else statement so that it records all IP addresses. Since I capture end to end, this includes a lot of UE IP addresses. I understand now that it might not be the best way to do it, since there are too many unique tag values; it was writing at least 9k or more metrics every second, I think.
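Put together, the change described above would look roughly like this inside heplify-server's metric handling (a sketch reconstructed from the snippets posted in this thread, not the exact patch; pkt and p come from the surrounding code, and strconv must be imported):

    if pkt.SIP != nil && pkt.ProtoType == 1 {
        if !p.TargetEmpty {
            // existing per-target counting stays as it is
        } else {
            // Count every SIP message keyed by method, CSeq method, user agent and the
            // full source/destination IP:port pair. This works, but each new IP/port
            // combination creates a new series, which is what blows up the cardinality.
            p.CounterVecMetrics["heplify_method_capture01"].WithLabelValues(
                pkt.SIP.StartLine.Method,
                pkt.SIP.CseqMethod,
                pkt.SIP.UserAgent,
                pkt.SrcIP,
                strconv.FormatUint(uint64(pkt.SrcPort), 10),
                pkt.DstIP,
                strconv.FormatUint(uint64(pkt.DstPort), 10),
            ).Inc()
        }
    }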

games130 commented 6 years ago

Another question I have related to metrics: why use the Prometheus client? I have been playing around with adding more metrics to do calculations like ASR and NER, but I found the limitation that the Prometheus client can only expose one "field" per measurement.

Because of that limitation, you either have to add more measurements or tag by SessionAttempt, SuccessSession, and so on.
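For what it's worth, the "more measurements" route might look something like the following with the Prometheus client (a hypothetical sketch; the metric names are invented and do not exist in heplify-server):

    package main

    import "github.com/prometheus/client_golang/prometheus"

    // Two plain counters, one per quantity; the ratio is computed later at query time.
    var (
        sessionAttempts = prometheus.NewCounter(prometheus.CounterOpts{
            Name: "sip_session_attempts_total",
            Help: "Call attempts (INVITEs) seen",
        })
        sessionSuccesses = prometheus.NewCounter(prometheus.CounterOpts{
            Name: "sip_session_successes_total",
            Help: "Call attempts answered with a 200 OK",
        })
    )

    func main() {
        prometheus.MustRegister(sessionAttempts, sessionSuccesses)
        // Somewhere in the packet handling path:
        sessionAttempts.Inc()  // on INVITE
        sessionSuccesses.Inc() // on 200 OK to INVITE
        // ASR is then successes / attempts, derived at query time rather than stored.
    }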

Then comes the next problem: calculating the actual ASR and NER. Because InfluxDB is not a relational database, it is, as far as I understand, very difficult to compute them from all the metrics you have collected (whether stored as separate measurements or as tags).

I would like to hear your thoughts on implementing ASR and NER calculation.

I am trying to do it without the Prometheus client, but it is still in a testing phase; I only started today.

negbie commented 6 years ago

"p.CounterVecMetrics["heplify_method_capture01"].WithLabelValues(pkt.SIP.StartLine.Method, pkt.SIP.CseqMethod, pkt.SIP.UserAgent, pkt.SrcIP, strconv.FormatUint(uint64(pkt.SrcPort), 10), pkt.DstIP, strconv.FormatUint(uint64(pkt.DstPort), 10)).Inc()"

This hurts just by looking at it ;) It might work with your traffic, but in environments with more traffic it will need a lot of resources.

I use the Prometheus client because I use Prometheus ;) Calculating ASR and NER in Prometheus is just a matter of expressing the math. For Influx you will probably need IFQL, since InfluxQL might be too limited.
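For reference, the usual telecom definitions behind those two metrics (assuming the standard ITU-style definitions; this is not something heplify-server computes itself):

    ASR = 100 * answered calls / call attempts (seizures)
    NER = 100 * (answered + user busy + ring no answer + terminal rejects) / call attempts

Both are ratios of counters, which is why they map well onto query-time math over the exposed counter metrics.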

If you need native influxdb support please open a new issue and I will see what I can do for you.

games130 commented 6 years ago

Oh okay, that explains a lot. I have never used either of these (Prometheus, InfluxDB), so I have a lot of learning to do. I guess I will have to try Prometheus next.

Is there any shortcut you can give me to get something running? I see that you have a Prometheus Docker setup included, so if I run all the containers (homer-heplify and prometheus), what is the way to get the metrics into Prometheus?

lmangani commented 6 years ago

@games130 it's the other way around. Prometheus will pull from HEPlify (just like Telegraf does).

negbie commented 6 years ago

There is a lot of reading material about those two online. Many things have changed for the better in Prometheus >= 2.0 and on the Influx side with IFQL.

I don't know if the homer-heplify Docker container still works because some flags have changed. But to try out Prometheus you only need heplify-server and Prometheus. I uploaded the latest heplify-server binary under releases: https://github.com/sipcapture/heplify-server/releases

Just download it and make it executable. The Prometheus docker-compose file should work, but you need to change the targets under https://github.com/sipcapture/heplify-server/blob/master/docker/prometheus/prometheus/prometheus.yml

negbie commented 6 years ago

Run heplify-server with your configured IP and port:

./heplify-server -promaddr 192.168.1.1:9999

negbie commented 6 years ago

@games130 I changed the Prometheus docker-compose.yml a bit. From the prometheus base folder, simply run: docker-compose up

Go to http://localhost:9090 and login with admin admin

In this case heplify-server won't connect to a database because of the empty dbaddr environment variable inside the docker-compose.yml. It will only expose ports 9060 and 9999.

negbie commented 6 years ago

And keep in mind that the next Homer release is just around the corner. It can display metrics inside its own GUI. I would say beginners especially should wait for it and be a little more patient with us.

games130 commented 6 years ago

I will give it a try tomorrow when I am in the office. Any hint on when the release date will be? 😁

negbie commented 6 years ago

Lorenzo will speak at the OpenSIPS Summit (1st to 4th of May) and give you more details.

games130 commented 6 years ago

Okay, I got Prometheus working. Calculating ASR and NER is so much easier in Prometheus.

negbie commented 6 years ago

Now start using Grafana at localhost:3000 (admin / admin).

Enjoy